Some Thoughts on Teaching R to 50,000 Students

Two weeks ago I finished teaching my course Computing for Data Analysis through Coursera. Since then I’ve had some time to think about how it went, what I learned, and what I’d do differently.

First off, let me say that it was a lot of fun. Seeing thousands of people engaged with material you’ve developed is an incredible experience, unlike anything I’d encountered before. I initially had a number of fears about teaching this course, the primary one being that it would be a lot of work. Managing the needs of 50,000 students seemed like it would be a nightmare, and making sure everything worked for every single person seemed impossible.

These fears were ultimately unfounded. The Coursera platform is quite nice and is well designed to scale to very large MOOCs. Everything is run off of Amazon S3, so scalability is not an issue (although hurricanes are a different story!), and there are numerous tools provided to help with automatic grading. My quizzes were multiple choice, which gave students instant feedback, but there are also options to grade via regular expressions. For programming assignments, grading was done via unit tests, so students would feed pre-selected inputs into their R functions and the output would be checked on the Coursera server. Again, this allowed for automatic, instant feedback without any intervention on my part. Designing programming assignments that would be graded by unit tests was a bit restrictive for me, but I think that was mostly because I wasn’t used to it. On my end, I had to learn about video editing and screen capture, which wasn’t too bad. I mostly used Camtasia for Mac (highly recommended) for the lecture videos and occasionally used Final Cut Pro X.
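
To give a sense of how unit-test grading works, here is a rough sketch (this is not Coursera’s actual grading code, and the 'columnmean' assignment below is made up): the grader calls the student’s function on pre-selected inputs and compares the results against the expected outputs.

grade_submission <- function(student_fun, test_inputs, expected_outputs) {
    # call the student's function on each test input and check the result
    passed <- mapply(function(input, expected) {
        isTRUE(all.equal(student_fun(input), expected))
    }, test_inputs, expected_outputs)
    mean(passed)   # fraction of test cases passed
}

# a hypothetical student submission for a made-up 'columnmean' assignment
columnmean <- function(df) colMeans(df, na.rm = TRUE)
tests <- list(airquality[, 1:4], mtcars)
answers <- lapply(tests, function(d) colMeans(d, na.rm = TRUE))
grade_submission(columnmean, tests, answers)   # returns 1 if every test passes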

Coursera is working hard on their platform and so I imagine there will be many improvements in the near future (some of which were actually rolled out as the course was running). The system feels like it was designed and written by a bunch of Stanford CS grad students—and lo and behold it was! I think it’s a great platform for teaching computing, but I don’t know how well it’ll work for, say, Modern Poetry. But we’ll see, I guess.

Here is some of what I took away from this experience:

  • 50,000 students is in some ways easier than 50 students. When I teach my in-class version of this course, I try to make sure everyone’s keeping up and doing well. I learn everyone’s names. I read all their homeworks. With 50,000 students there’s no pretense of individual attention. Everyone’s either on their own or has to look to the community for help. I did my best to participate in the discussion forums, but the reality was that the class community was incredibly helpful, and participating in it was probably a better experience for some students than just having me to talk to.
  • Clarity and specificity are necessary. I’ve never taught a course online before, so I was used to the way I create assignments in-class. I just jot down some basic goals and problems and then clarify things in class if needed. But here, the programming assignments really had to be clear (akin to legal documents) because trying to clear up confusion afterwards often led to more confusion. The result is that it took a lot more time to write homework assignments for this class than for the same course in-person (even if it was the same homework) because I was basically writing a software specification.
  • Modularity is key to overcoming heterogeneity. This was a lesson that I didn’t figure out until the middle of the course when it was basically too late. In any course, there’s heterogeneity in the backgrounds of the students. In programming classes, some students have programmed in other languages before while some have never programmed at all. Handling heterogeneity is a challenge in any course. Now, just multiply that by 10,000 and that’s what this course was. Breaking everything down into very small pieces is key to letting people across the skill spectrum move at their own pace. I thought I’d done this but in reality I hadn’t broken things down into small enough pieces. The result was that the first homework was a beast of a problem for those who had little programming experience. 
  • Time and content are more loosely connected. Preparing for this course exposed a feature of in-class courses that I’d not thought about. In-class courses for me are very driven by the clock and the calendar. I teach twice a week, each period is 1.5 hours, and there are 8 weeks in the term. So I need to figure out how to fit material into exact 1.5 hour blocks. If something only takes 1 hour to cover then I need to cover part of the next topic, find a topic that’s short, or just fill for half an hour. While preparing for this course, I found myself just thinking about what content I wanted to cover and just doing it. I tried to target about 2 hours of video per week, but there was obviously some flexibility. In class, there’s no flexibility because usually the next class is trampling over you as the period ends. Not having to think about exact time was very liberating.

I’m grateful to all the students I had in this first offering of the course, and I thank them for putting up with my own learning process as I taught it. I’m hoping to offer this course again on Coursera, but I’m not sure when that will be. If you missed the Coursera version of Computing for Data Analysis, I will be offering a version of this course through the blog very shortly. Please check back here for details.

Tags: R MOOC

On weather forecasts, Nate Silver, and the politicization of statistical illiteracy

As you know, we have a thing for statistical literacy here at Simply Stats. So of course this column over at Politico got our attention (via Chris V. and others). The column is an attack on Nate Silver, who has a blog where he tries to predict the outcome of elections in the U.S., you may have heard of it…

The argument that Dylan Byers makes in the Politico column is that Nate Silver is likely to be embarrassed by the outcome of the election if Romney wins. The reason is that Silver’s predictions have recently given Obama about a 75% chance of winning the election, and that number has never dropped below 60% or so.

I don’t know much about Dylan Byers, but from reading this column and a quick scan of his twitter feed, it appears he doesn’t know much about statistics. Some people have gotten pretty upset at him on Twitter and elsewhere about this fact, but I’d like to take a different approach: education. So Dylan, here is a really simple example that explains how Nate Silver comes up with a number like the 75% chance of victory for Obama. 

Let’s pretend, just to make the example really simple, that if Obama gets more than 50% of the vote, he will win the election. Obviously, Silver doesn’t ignore the electoral college and all the other complications, but ignoring them makes our example simpler. Then assume that, based on averaging a bunch of polls, we estimate that Obama is likely to get about 50.5% of the vote.

Now, we want to know what is the “percent chance” Obama will win, taking into account what we know. So let’s run a bunch of “simulated elections” where on average Obama gets 50.5% of the vote, but there is variability because we don’t know the exact number. Since we have a bunch of polls and we averaged them, we can get an estimate of how variable the 50.5% number is. The usual measure of that variability is the standard deviation. Say we get a standard deviation of 1% for our estimate. That would be a pretty precise estimate, but it’s not totally unreasonable given the amount of polling data out there.

We can run 1,000 simulated elections like this in R* (a free software programming language; if you don’t know R, may I suggest Roger’s Computing for Data Analysis class?). Here is the code to do that. The last line of code calculates the percent of times, in our 1,000 simulated elections, that Obama wins (that is, gets more than 50% of the vote). This is the number that Nate would report on his site. When I run the code, I get an Obama win 68% of the time. But if you run it again that number will vary a little, since the elections are simulated at random.
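
Something along these lines (a minimal sketch, using the 50.5% average and the 1% standard deviation from the example above):

# 1,000 simulated elections: Obama's share of the vote in each one
sim <- rnorm(1000, mean = 50.5, sd = 1)

# fraction of simulated elections in which Obama gets more than 50% of the vote
mean(sim > 50)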

The interesting thing is that even though we only estimate that Obama leads by about 0.5%, he wins 68% of the simulated elections. The reason is that we are pretty confident in that number, with our standard deviation being so low (1%). But that doesn’t mean that Obama will win 68% of the vote in any of the elections! In fact, here is a histogram of the percent of the vote that Obama wins: 
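
The histogram itself isn’t reproduced here, but if you ran the simulation above you can draw it yourself:

hist(sim, xlab = "Obama's percent of the vote", main = "1,000 simulated elections")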

He never gets more than 54% or so and never less than 47% or so. So it is always a reasonably close election. Silver’s calculations are obviously more complicated, but the basic idea of simulating elections is the same. 

Now, this might seem like a goofy way to come up with a “percent chance” with simulated elections and all. But it turns out it is actually a pretty important thing to know and relevant to those of us on the East Coast right now. It turns out weather forecasts (and projected hurricane paths) are based on the same sort of thing - simulated versions of the weather are run and the “percent chance of rain” is the fraction of times it rains in a particular place. 

So Romney may still win and Obama may lose - and Silver may still get a lot of it right. But regardless, the approach taken by Silver is not based on politics, it is based on statistics. Hopefully we can move away from politicizing statistical illiteracy and toward evaluating the models for the real, underlying assumptions they make. 

* In this case, we could calculate the percent of times Obama would win with a formula (called an analytical calculation) since we have simplified so much. In Nate’s case it is much more complicated, so you have to simulate. 

Computing for Data Analysis (Simply Statistics Edition)

As the entire East Coast gets soaked by Hurricane Sandy, I can’t help but think that this is the perfect time to…take a course online! Well, as long as you have electricity, that is. I live in a heavily tree-lined area and so it’s only a matter of time before the lights cut out on me (I’d better type quickly!). 

I just finished teaching my course Computing for Data Analysis through Coursera. This was my first experience teaching a course online and definitely my first experience teaching a course to > 50,000 people. There were definitely some bumps along the road, but the students who participated were fantastic at helping me smooth the way. In particular, the interaction on the discussion forums was very helpful. I couldn’t have done it without the students’ help. So, if you took my course over the past 4 weeks, thanks for participating!

Here are a couple quick stats on the course participation (as of today) for the curious:

  • 50,899: Number of students enrolled
  • 27,900: Number of users watching lecture videos
  • 459,927: Total number of streaming views (over 4 weeks)
  • 414,359: Total number of video downloads (not all courses allow this)
  • 14,375: Number of users submitting the weekly quizzes (graded)
  • 6,420: Number of users submitting the bi-weekly R programming assignments (graded)
  • 6393+3291: Total number of posts+comments to the discussion forum
  • 314,302: Total number of views in the discussion forum

I’ve received a number of emails from people who signed up in the middle of the course or after the course finished. Given that it was a 4-week course, signing up in the middle of the course meant you missed quite a bit of material. I will eventually be closing down the Coursera version of the course—at this point it’s not clear when it will be offered again on that platform but I would like to do so—and so access to the course material will be restricted. However, I’d like to make that material more widely available even if it isn’t in the Coursera format.

So I’m announcing today that next month I’ll be offering the Simply Statistics Edition of Computing for Data Analysis. This will be a slightly simplified version of the course that was offered on Coursera since I don’t have access to all of the cool platform features that they offer. But all of the original content will be available, including some new material that I hope to add over the coming weeks.

If you are interested in taking this course or know of someone who is, please check back here soon for more details on how to sign up and get the course information.

Tags: R MOOC

A statistical project bleg (urgent-ish)

We all know that politicians can play it a little fast and loose with the truth. This is particularly true in debates, where politicians have to think on their feet and respond to questions from the audience or from each other. 

Usually, we find out how truthful politicians are in the “post-game show”. The discussion of the veracity of the claims is usually based on independent fact checkers such as PolitiFact. Some of these fact checkers (PolitiFact in particular) live-tweet their reports on many of the issues discussed during the debate. This is possible because both candidates have a pretty fixed set of talking points, so it amounts to near real-time fact-checking.

What would be awesome is if someone could write an R script that would scrape the live data off of PolitiFact’s Twitter account and create a truthfulness meter that looks something like CNN’s instant reaction graph (see #7) for independent voters. The line would show the moving average of how honest each politician was being. How cool would it be to show the two candidates and how truthful they are being? If you did this, tell me it wouldn’t be a feature one of the major news networks would pick up…
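
To make the idea a little more concrete, here is a minimal sketch of the “meter” part. It assumes the tweets have already been scraped into a data frame with a timestamp, the candidate, and a numeric truth score for each fact-checked claim; the scraping itself and the conversion of PolitiFact’s ratings into numbers are left out, and all of the names below are hypothetical.

# Hypothetical input: one row per fact-checked claim, e.g.
# claims <- data.frame(time = <POSIXct>, candidate = <character>,
#                      score = <numeric, say 1 = "True" down to 0 = "Pants on Fire">)

truth_meter <- function(claims, window = 5) {
    claims <- claims[order(claims$time), ]
    # for each candidate, a running average of the last 'window' truth scores
    lapply(split(claims, claims$candidate), function(d) {
        d$running_truth <- as.numeric(stats::filter(d$score, rep(1 / window, window), sides = 1))
        d
    })
}

# meter <- truth_meter(claims)
# plot(meter$Obama$time, meter$Obama$running_truth, type = "l", col = "blue")
# lines(meter$Romney$time, meter$Romney$running_truth, col = "red")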

Why we are teaching massive open online courses (MOOCs) in R/statistics for Coursera

Editor’s Note: This post written by Roger Peng and Jeff Leek. 

A couple of weeks ago, we announced that we would be teaching free courses in Computing for Data Analysis and Data Analysis on the Coursera platform. At the same time, a number of other universities also announced partnerships with Coursera leading to a large number of new offerings. That, coupled with a new round of funding for Coursera, led to press coverage in the New York Times, the Atlantic, and other media outlets.

There was an ensuing explosion of blog posts and commentaries from academics. The opinions ranged from dramatic, to negative, to critical, to um…hilariously angry. Rafa posted a few days ago that many of the folks freaking out are missing the point - the opportunity to reach a much broader audience of folks with our course content. 

[Before continuing, we’d like to make clear that at this point no money has been exchanged between Coursera and Johns Hopkins. Coursera has not given us anything and Johns Hopkins hasn’t given them anything. For now, it’s just a mutually beneficial partnership — we get their platform and they get to use our content. In the future, Coursera will need to figure out a way to make money, and they are currently considering a number of options.] 

Now that the initial wave of hype has died down, we thought we’d outline why we are excited about participating in Coursera. We think it is only fair to start by saying this is definitely an experiment. Coursera is a newish startup and as such is still figuring out its plan/business model. Similarly, our involvement so far has been a bit of a whirlwind, we haven’t actually taught the courses yet, and we are happy to collect data and see how things turn out. So ask us again in 6 months when we are both done teaching.

But for now, this is why we are excited.

  1. Open Access. As Rafa alluded to in his post, this is an opportunity to reach a broad and diverse audience. As academics devoted to open science, we also think that opening up our courses to the biggest possible audience is, in principle, a good thing. That is why we are both basing our courses on free software and teaching the courses for free to anyone with an internet connection. 
  2. Excitement about statistics. The data revolution means that there is a really intense interest in statistics right now. It’s so exciting that Joe Blitzstein’s stat class on iTunes U has been one of the top courses on that platform. Our local superstar John McGready has also put his statistical reasoning course up on iTunes U to a similar explosion of interest. Rafa recently put his statistics for genomics lectures up on YouTube and they have already been viewed thousands of times. As people who are super pumped about the power and importance of statistics, we want to get in on the game.
  3. We work hard to develop good materials. We put effort into building materials that our students will find useful. We want to maximize the impact of these efforts. We have over 30,000 students enrolled in our two courses so far. 
  4. It is an exciting experiment. Online teaching, including very very good online teaching, has been around for a long time. But the model of free courses at incredibly large scale is actually really new. Whether you think it is a gimmick or something here to stay, it is exciting to be part of the first experimental efforts to build courses at scale. Of course, this could flame out. We don’t know, but that is the fun of any new experiment. 
  5. Good advertising. Every professor at a research school is a start-up of one. This idea deserves its own blog post. But if you accept that premise, to keep the operation going you need good advertising. One way to do that is writing good research papers, another is having awesome students, and a third is giving talks at statistical and scientific conferences. This is an amazing new opportunity to showcase the cool things that we are doing.
  6. Coursera built some cool toys. As statisticians, we love new types of data. It’s like candy. Coursera has all sorts of cool toys for collecting data about drop out rates, participation, discussion board answers, peer review of assignments, etc. We are pretty psyched to take these out for a spin and see how we can use them to improve our teaching.
  7. Innovation is going to happen in education. The music industry spent years fighting a losing battle over music sharing. Mostly, this damaged their reputation and stopped them from developing new technology like iTunes/Spotify that became hugely influential/profitable. Education has been done the same way for hundreds (or thousands) of years. As new educational technologies develop, we’d rather be on the front lines figuring out the best new model than fighting to hold on to the old model. 

Finally, we’d like to say a word about why we think in-person education isn’t really threatened by MOOCs, at least for our courses. If you take one of our courses through Coursera you will get to see the lectures and do a few assignments. We will interact with students through message boards, videos, and tutorials. But there are only 2 of us and 30,000 people registered. So you won’t get much one-on-one interaction. On the other hand, if you come to the top Ph.D. program in biostatistics and take Data Analysis, you will get 16 weeks of one-on-one interaction with Jeff in a classroom, working on tons of problems together. In other words, putting our lectures online now means that at Johns Hopkins you get the most qualified TA you have ever had. Your professor.

Really Big Objects Coming to R

I noticed in the development version of R the following note in the NEWS file:

There is a subtle change in behaviour for numeric index values 2^31 and larger.  These used never to be legitimate and so were treated as NA, sometimes with a warning.  They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.

This is significant news indeed!

Some background: In the old days, when most of us worked on 32-bit machines, objects in R were limited to about 4GB in size (and practically a lot less) because memory addresses were indexed using 32-bit numbers. When 64-bit machines became more common in the early 2000s, that limit was removed. Objects could theoretically take up more memory because of the dramatically larger address space. For the most part, this turned out to be true, although there were some growing pains as R was transitioned to run on 64-bit systems (I remember many of those pains).

However, even with the 64-bit systems, there was a key limitation, which is that vectors, one of the fundamental objects in R, could only have a maximum of 2^31-1 elements, or roughly 2.1 billion elements. This was because array indices in R were stored internally as signed integers (specifically as ‘R_len_t’), which are 32 bits on most modern systems (take a look at .Machine$integer.max in R).

You might think that 2.1 billion elements is a lot, and for a single vector it still is. But you have to consider the fact that internally R stores all arrays, no matter how many dimensions they have, as just long vectors. So that would limit you, for example, to a square matrix that was no bigger than roughly 46,000 by 46,000. That might have seemed like a large matrix back in 2000 but it seems downright quaint now. And if you have a 3-way array, the limit gets even smaller.
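
You can check these numbers from any R session; the square-matrix figure is just the square root of the vector-length limit:

.Machine$integer.max    # 2147483647, i.e. 2^31 - 1: the old maximum vector length
2^31 - 1                # the same number
floor(sqrt(2^31 - 1))   # 46340: the side of the largest square matrix under the old limit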

Now it appears that change is a comin’. The details can be found in the R source starting at revision 59005 if you follow along on Subversion.

A new type called ‘R_xlen_t’ has been introduced with a maximum value of 4,503,599,627,370,496, which is 2^52. As they say where I grew up, that’s a lot of McNuggets. So if your computer has enough physical memory, you will soon be able to index vectors (and matrices) that are significantly longer than before.

Tags: R

A closer look at data suggests Johns Hopkins is still the #1 US hospital

The US News best hospitals 2012-2013 rankings are out. The big news is that Johns Hopkins has lost its throne. For 21 consecutive years Hopkins was ranked #1, but this year Mass General Hospital (MGH) took the top spot, displacing Hopkins to #2. However, Elisabet Pujadas, an MD-PhD student here at Hopkins, took a close look at the data used for the rankings and made this plot (by hand!). The plot shows histograms of the rankings by specialty and shows Hopkins outperforming MGH.

I reproduced Elisabet’s figure using R (see the plot on the left above… hers is way cooler). A quick look at the histograms shows that Hopkins has many more highly ranked specialties. For example, Hopkins has 5 specialties ranked #1 while MGH has none. Hopkins has 2 specialties ranked #2 while MGH has none. The median rank for Hopkins is 3 while for MGH it’s 5. The plot on the right shows Hopkins’ rank versus MGH’s for each specialty, and shows that Hopkins has a better ranking in 13 of the 16 specialties considered.
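
For the curious, the comparisons above amount to something like the following sketch. The file and column names here are hypothetical; the actual data and the script used for the plots are linked at the end of the post.

# hypothetical cleaned files: one row per specialty, aligned across hospitals,
# with a numeric 'rank' column
hopkins <- read.csv("hopkins-rankings.csv")
mgh     <- read.csv("mgh-rankings.csv")

# histograms of specialty ranks, side by side
par(mfrow = c(1, 2))
hist(hopkins$rank, main = "Johns Hopkins", xlab = "Specialty rank")
hist(mgh$rank, main = "MGH", xlab = "Specialty rank")

# the summaries quoted above
sum(hopkins$rank == 1)                   # specialties ranked #1
sum(mgh$rank == 1)
median(hopkins$rank); median(mgh$rank)   # median rank for each hospital
sum(hopkins$rank < mgh$rank)             # specialties where Hopkins outranks MGH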

So how does MGH get ranked higher than Hopkins? Here is U.S. News’ explanation of how they rank:

To make the Honor Roll, a hospital had to earn at least one point in each of six specialties. A hospital earned two points if it ranked among the top 10 hospitals in America in any of the 12 specialties in which the US News rankings are driven mostly by objective data, such as survival rates and patient safety. Being ranked in the next 10 in those specialties earned a hospital one point. In the other four specialties, where ranking is based on each hospital’s reputation among doctors who practice that specialty, the top five hospitals in the country received two Honor Roll points and the next five got one point.

This actually results in a tie of 30 points, but according to the table here, Hopkins was ranked in 15 specialties to MGH’s 16. This was the tiebreaker. But the data they put up show Hopkins ranked in all 16 specialties. Did the specialty ranked 17th do Hopkins in? In any case, a closer look at the data does suggest Hopkins is still #1.

Disclaimer: I am a professor at Johns Hopkins University.

The data for Hopkins is here and I cleaned it up and put it here. For MGH it’s here and here. The script used to make the plots is here. Thanks to Elisabet for the pointer and data.

Johns Hopkins Coursera Statistics Courses

Computing for Data Analysis

Data Analysis

Mathematical Biostatistics Bootcamp

Tags: R

This graph shows that President Obama’s proposed budget treats the NIH even worse than G.W. Bush - Sign the petition to increase NIH funding!

The NIH provides financial support for a large percentage of biological and medical research in the United States. This funding supports a large number of US jobs, creates new knowledge, and improves healthcare for everyone. So I am signing this petition:


NIH funding is essential to our national research enterprise, to our local economies, to the retention and careers of talented and well-educated people, to the survival of our medical educational system, to our rapidly fading worldwide dominance in biomedical research, to job creation and preservation, to national economic viability, and to our national academic infrastructure.


The current administration is proposing a flat $30.7 billion FY 2013 NIH budget. The graph below (left) shows how small the NIH budget is in comparison to the Defense and Medicare budgets in absolute terms. The difference between the administration’s proposal and the petition’s proposal ($33 billion) is barely noticeable.

The graph on the right shows how in 2003 growth in the NIH budget fell dramatically while Medicare and military spending kept growing. However, despite the decrease in its growth rate, the NIH budget did continue to increase under Bush. If we follow Bush’s post-2003 rate (dashed line), the 2013 budget would be about what the petition asks for: $33 billion.


If you agree that a relatively modest increase in the NIH budget is a small price to pay for the incredibly valuable biological, medical, and economic benefits this funding provides, please consider signing the petition before April 15.

A plot of my citations in Google Scholar vs. Web of Science

There has been some discussion about whether Google Scholar’s citation counts or those from one of the proprietary providers (like Web of Science) are better. I personally think Google Scholar is better for a number of reasons:

  1. Higher numbers, but consistently/adjustably higher :-)
  2. It’s free and the data are openly available. 
  3. It covers more ground (patents, theses, etc.) to give a better idea of global impact
  4. It’s easier to use

I haven’t seen a plot yet relating Web of Science citations to Google Scholar citations, so I made one for my papers.

GS has about 41% more citations per paper than Web of Science. That is consistent with what other people have found. It also looks reasonably linearish. I wonder what other people’s plots would look like? 

Here is the R code I used to generate the plot (the names are PubMed IDs for the papers):

library(ggplot2)

# PubMed IDs for the papers (not used in the plot itself)
names = c(16141318,16357033,16928955,17597765,17907809,19033188,19276151,19924215,20560929,20701754,20838408,21186247,21685414,21747377,21931541,22031444,22087737,22096506,22257669)

# Google Scholar citation counts
y = c(287,122,84,39,120,53,4,52,6,33,57,0,0,4,1,5,0,2,0)

# Web of Science (Web of Knowledge) citation counts
x = c(200,92,48,31,79,29,4,51,2,18,44,0,0,1,0,2,0,1,0)

# publication year of each paper
Year = c(2005,2006,2007,2007,2007,2008,2009,2009,2011,2010,2010,2011,2012,2011,2011,2011,2011,2011,2012)

# scatterplot of the two counts, colored by year, with the identity line for reference
q <- qplot(x, y, xlim = c(-20, 300), ylim = c(-20, 300), xlab = "Web of Knowledge", ylab = "Google Scholar") +
  geom_point(aes(colour = Year), size = 5) +
  geom_line(aes(x = y, y = y), size = 2)
print(q)