Some Thoughts on Teaching R to 50,000 Students

Two weeks ago I finished teaching my course Computing for Data Analysis through Coursera. Since then I’ve had some time to think about how it went, what I learned, and what I’d do differently.

First off, let me say that it was a lot of fun. Seeing thousands of people engaged in the material you’ve developed is an incredible experience and unlike any I’ve had before. I initially had a number of fears about teaching this course, the primary one being that it would be a lot of work. Managing the needs of 50,000 students seemed like it would be a nightmare and making sure everything worked for every single person seemed impossible.

These fears were ultimately unfounded. The Coursera platform is quite nice and is well-designed to scale to very large MOOCs. Everything is run off of Amazon S3, so scalability is not an issue (although hurricanes are a different story!), and there are numerous tools provided to help with automatic grading. My quizzes were multiple choice, which gave instant feedback to students, but there are also options to grade via regular expressions. For programming assignments, grading was done via unit tests: students would feed pre-selected inputs into their R functions and the output would be checked on the Coursera server. Again, this allowed for automatic instant feedback without any intervention on my part. Designing programming assignments that would be graded by unit tests was a bit restrictive for me, but I think that was mostly because I wasn’t that used to it. On my end, I had to learn about video editing and screen capture, which wasn’t too bad. I mostly used Camtasia for Mac (highly recommended) for the lecture videos and occasionally used Final Cut Pro X.
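To make the two grading styles concrete, here is a minimal sketch of how a regular-expression quiz check and a unit-test style check of pre-selected inputs might look. It is written in Python purely for illustration (the course itself graded R functions), and it is not Coursera's actual grader: the function names, the sample assignment, and the test cases are all hypothetical.

```python
import re


def grade_quiz_answer(answer, pattern):
    """Accept a free-text quiz answer if it fully matches the instructor's regex."""
    return re.fullmatch(pattern, answer.strip(), flags=re.IGNORECASE) is not None


def grade_submission(func, cases):
    """Run a submitted function on pre-selected inputs; return the fraction correct."""
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on a test input counts as a failed case
    return passed / len(cases)


# Hypothetical assignment: return the mean of the non-missing values.
def student_mean(values):
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


# Hypothetical pre-selected inputs paired with the expected outputs.
CASES = [
    (([1, 2, 3],), 2.0),
    (([10, None, 20],), 15.0),
    (([],), None),  # edge case: this student version divides by zero here
]

if __name__ == "__main__":
    print(grade_quiz_answer("  Data.Frame ", r"data\.?frame"))
    print(grade_submission(student_mean, CASES))
```

In this sketch the sample submission passes two of the three cases, because the empty-input case raises an error that the grader records as a failure. That is also why the assignment text has to spell out edge cases in advance: the grader can only compare outputs, not intentions.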

Coursera is working hard on their platform and so I imagine there will be many improvements in the near future (some of which were actually rolled out as the course was running). The system feels like it was designed and written by a bunch of Stanford CS grad students—and lo and behold it was! I think it’s a great platform for teaching computing, but I don’t know how well it’ll work for, say, Modern Poetry. But we’ll see, I guess.

Here is some of what I took away from this experience:

  • 50,000 students is in some ways easier than 50 students. When I teach my in-class version of this course, I try to make sure everyone’s keeping up and doing well. I learn everyone’s names. I read all their homework. With 50,000 students there’s no pretense of individual attention. Everyone’s either on their own or has to look to the community for help. I did my best to participate in the discussion forums, but the reality was that the class community was incredibly helpful and participating in it was probably a better experience for some students than just having me to talk to.
  • Clarity and specificity are necessary. I’ve never taught a course online before, so I was used to the way I create assignments in-class. I just jot down some basic goals and problems and then clarify things in class if needed. But here, the programming assignments really had to be clear (akin to legal documents) because trying to clear up confusion afterwards often led to more confusion. The result is that it took a lot more time to write homework assignments for this class than for the same course in-person (even if it was the same homework) because I was basically writing a software specification.
  • Modularity is key to overcoming heterogeneity. This was a lesson that I didn’t figure out until the middle of the course when it was basically too late. In any course, there’s heterogeneity in the backgrounds of the students. In programming classes, some students have programmed in other languages before while some have never programmed at all. Handling heterogeneity is a challenge in any course. Now, just multiply that by 10,000 and that’s what this course was. Breaking everything down into very small pieces is key to letting people across the skill spectrum move at their own pace. I thought I’d done this but in reality I hadn’t broken things down into small enough pieces. The result was that the first homework was a beast of a problem for those who had little programming experience. 
  • Time and content are more loosely connected. Preparing for this course exposed a feature of in-class courses that I’d not thought about. In-class courses for me are very driven by the clock and the calendar. I teach twice a week, each period is 1.5 hours, and there are 8 weeks in the term. So I need to figure out how to fit material into exact 1.5 hour blocks. If something only takes 1 hour to cover then I need to cover part of the next topic, find a topic that’s short, or just fill for half an hour. While preparing for this course, I found myself just thinking about what content I wanted to cover and just doing it. I tried to target about 2 hours of video per week, but there was obviously some flexibility. In class, there’s no flexibility because usually the next class is trampling over you as the period ends. Not having to think about exact time was very liberating.

I’m grateful for all the students I had in this first offering of the course, and I thank them for putting up with my own learning process as I taught it. I’m hoping to offer this course again on Coursera but I’m not sure when that’ll be. If you missed the Coursera version of Computing for Data Analysis, I will be offering a version of this course through the blog very shortly. Please check back here for details.

Tags: R MOOC

Computing for Data Analysis (Simply Statistics Edition)

As the entire East Coast gets soaked by Hurricane Sandy, I can’t help but think that this is the perfect time to…take a course online! Well, as long as you have electricity, that is. I live in a heavily tree-lined area and so it’s only a matter of time before the lights cut out on me (I’d better type quickly!). 

I just finished teaching my course Computing for Data Analysis through Coursera. This was my first experience teaching a course online and definitely my first experience teaching a course to > 50,000 people. There were definitely some bumps along the road, but the students who participated were fantastic at helping me smooth the way. In particular, the interaction on the discussion forums was very helpful. I couldn’t have done it without the students’ help. So, if you took my course over the past 4 weeks, thanks for participating!

Here are a couple quick stats on the course participation (as of today) for the curious:

  • 50,899: Number of students enrolled
  • 27,900: Number of users watching lecture videos
  • 459,927: Total number of streaming views (over 4 weeks)
  • 414,359: Total number of video downloads (not all courses allow this)
  • 14,375: Number of users submitting the weekly quizzes (graded)
  • 6,420: Number of users submitting the bi-weekly R programming assignments (graded)
  • 6,393 + 3,291: Total number of posts + comments to the discussion forum
  • 314,302: Total number of views in the discussion forum
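For readers who want the figures in the list above as proportions, here is a small sketch of the arithmetic. It assumes that each "number of users" count is a subset of the enrolled students, which the platform statistics do not state explicitly; the variable names are illustrative only.

```python
# Figures copied from the list above.
enrolled = 50_899
activity = {
    "watched lecture videos": 27_900,
    "submitted weekly quizzes": 14_375,
    "submitted programming assignments": 6_420,
}

# Assumption: every counted user is also an enrolled student.
rates = {name: 100 * count / enrolled for name, count in activity.items()}

# Posts and comments combined.
forum_contributions = 6_393 + 3_291

for name, pct in rates.items():
    print(f"{name}: {pct:.1f}% of enrolled students")
print(f"forum posts + comments: {forum_contributions:,}")
```

Under that assumption, a bit more than half of the enrolled students watched videos, a bit more than a quarter submitted quizzes, and roughly one in eight submitted programming assignments.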

I’ve received a number of emails from people who signed up in the middle of the course or after the course finished. Given that it was a 4-week course, signing up in the middle of the course meant you missed quite a bit of material. I will eventually be closing down the Coursera version of the course—at this point it’s not clear when it will be offered again on that platform but I would like to do so—and so access to the course material will be restricted. However, I’d like to make that material more widely available even if it isn’t in the Coursera format.

So I’m announcing today that next month I’ll be offering the Simply Statistics Edition of Computing for Data Analysis. This will be a slightly simplified version of the course that was offered on Coursera since I don’t have access to all of the cool platform features that they offer. But all of the original content will be available, including some new material that I hope to add over the coming weeks.

If you are interested in taking this course or know of someone who is, please check back here soon for more details on how to sign up and get the course information.

Tags: R MOOC

Sunday Data/Statistics Link Roundup (9/23/12)

  1. Harvard Business School is getting in on the fun, calling the data scientist the sexy profession for the 21st century. I am a little worried, though, that by the time it gets into a Harvard Business document, the hype may be outstripping the real promise of the discipline. Still, good news for statisticians! (via Rafa via Francesca D.’s Facebook feed).
  2. The counterpoint is this article which suggests that data scientists might be able to be replaced by tools/software. I think this is also a bit too much hype for my tastes. Certain things will definitely be automated and we may even end up with a deterministic statistical machine or two. But there will continually be new problems to solve which require the expertise of people with data analysis skills and good intuition (link via Samara K.)
  3. A bunch of websites are popping up where you can sign up and have people take your online courses for you. I’m not going to give them the benefit of a link, but they aren’t hard to find these days. The thing I don’t understand is, if it is a free online course, why have someone else take it for you? It’s free, it’s in your spare time, and the bar for passing is pretty low (links via Sherri R. redacted)….
  4. Maybe mostly useful for me, but for other people with Tumblr blogs, here is a way to insert LaTeX.
  5. Brian Caffo shares his impressions of the SAMSI massive data workshop.  He raises an important issue which definitely deserves more discussion: should we be focusing on specific or general problems? Worth a read. 
  6. For the people into self-tracking, Chris V. points to an app created by Indiana University that lets people track their sexual activity. The most interesting thing about that app is how it highlights a key and, I suppose, often overlooked issue with analyzing self-tracking data. Despite the size of these data sets, they are still definitely biased samples. It’s only a brave few who will tell Indiana University all about their sex life.

Sunday data/statistics link roundup (8/12/12)

  1. An interesting blog post about the top N reasons to do a Ph.D. in bioinformatics or computational biology. A couple of things that I find interesting and could actually be said of any program in biostatistics as well are: computing is the key skill of the 21st century and computational skills are highly transferrable. Via Andrew J. 
  2. Here is an interesting auto-complete map of the United States where the prompt was, “Why is [state] so”. It seems like using the Google auto-complete functions can lead to all sorts of humorous data; xkcd has used it as a data source a couple of times in the past. By the way, the person(s) who think Idaho is boring haven’t been to the right parts of Idaho. (via Rafa).
  3. One of my all-time favorite statistics quotes appears in this column by David Brooks: “…what God hath woven together, even multiple regression analysis cannot tear asunder.” It seems like the perfect quote for any study that attempts to build a predictive model for a complicated phenomenon where only limited knowledge of the underlying mechanisms is available.
  4. I’ve been reading up a lot on how to summarize and communicate risk. At the moment, I’ve been following a lot of David Spiegelhalter’s stuff, and really liked this 30,000 foot view summary.
  5. It is interesting how often you see R popping up in random places these days. Here is a blog post about predicting the stock market, with some clearly R-created plots, that appeared on Business Insider.
  6. Roger and I had a post on MOOCs this week from the perspective of faculty teaching the courses. For a more departmental/administrative level view, be sure to re-read Rafa’s post on the future of graduate education.

Why we are teaching massive open online courses (MOOCs) in R/statistics for Coursera

Editor’s Note: This post was written by Roger Peng and Jeff Leek.

A couple of weeks ago, we announced that we would be teaching free courses in Computing for Data Analysis and Data Analysis on the Coursera platform. At the same time, a number of other universities also announced partnerships with Coursera leading to a large number of new offerings. That, coupled with a new round of funding for Coursera, led to press coverage in the New York Times, the Atlantic, and other media outlets.

There was an ensuing explosion of blog posts and commentaries from academics. The opinions ranged from dramatic, to negative, to critical, to um…hilariously angry. Rafa posted a few days ago that many of the folks freaking out are missing the point - the opportunity to reach a much broader audience of folks with our course content. 

[Before continuing, we’d like to make clear that at this point no money has been exchanged between Coursera and Johns Hopkins. Coursera has not given us anything and Johns Hopkins hasn’t given them anything. For now, it’s just a mutually beneficial partnership — we get their platform and they get to use our content. In the future, Coursera will need to figure out a way to make money, and they are currently considering a number of options.] 

Now that the initial wave of hype has died down, we thought we’d outline why we are excited about participating in Coursera. We think it is only fair to start by saying this is definitely an experiment. Coursera is a newish startup and as such is still figuring out its plan/business model. Similarly, our involvement so far has been a little whirlwind and we haven’t actually taught courses yet, and we are happy to collect data and see how things turn out. So ask us again in 6 months when we are both done teaching.

But for now, this is why we are excited.

  1. Open Access. As Rafa alluded to in his post, this is an opportunity to reach a broad and diverse audience. As academics devoted to open science, we also think that opening up our courses to the biggest possible audience is, in principle, a good thing. That is why we are both basing our courses on free software and teaching the courses for free to anyone with an internet connection. 
  2. Excitement about statistics. The data revolution means that there is a really intense interest in statistics right now. It’s so exciting that Joe Blitzstein’s stat class on iTunes U has been one of the top courses on that platform. Our local superstar John McGready has also put his statistical reasoning course up on iTunes U to a similar explosion of interest. Rafa recently put his statistics for genomics lectures up on YouTube and they have already been viewed thousands of times. As people who are super pumped about the power and importance of statistics, we want to get in on the game.
  3. We work hard to develop good materials. We put effort into building materials that our students will find useful. We want to maximize the impact of these efforts. We have over 30,000 students enrolled in our two courses so far. 
  4. It is an exciting experiment. Online teaching, including very very good online teaching, has been around for a long time. But the model of free courses at incredibly large scale is actually really new. Whether you think it is a gimmick or something here to stay, it is exciting to be part of the first experimental efforts to build courses at scale. Of course, this could flame out. We don’t know, but that is the fun of any new experiment. 
  5. Good advertising. Every professor at a research school is a start-up of one. This idea deserves its own blog post. But if you accept that premise, to keep the operation going you need good advertising. One way to do that is writing good research papers, another is having awesome students, and a third is giving talks at statistical and scientific conferences. This is an amazing new opportunity to showcase the cool things that we are doing.
  6. Coursera built some cool toys. As statisticians, we love new types of data. It’s like candy. Coursera has all sorts of cool toys for collecting data about drop out rates, participation, discussion board answers, peer review of assignments, etc. We are pretty psyched to take these out for a spin and see how we can use them to improve our teaching.
  7. Innovation is going to happen in education. The music industry spent years fighting a losing battle over music sharing. Mostly, this damaged their reputation and stopped them from developing new technology like iTunes/Spotify that became hugely influential/profitable. Education has been done the same way for hundreds (or thousands) of years. As new educational technologies develop, we’d rather be on the front lines figuring out the best new model than fighting to hold on to the old model. 

Finally, we’d like to say a word about why we think in-person education isn’t really threatened by MOOCs, at least for our courses. If you take one of our courses through Coursera you will get to see the lectures and do a few assignments. We will interact with students through message boards, videos, and tutorials. But there are only 2 of us and 30,000 people registered. So you won’t get much one-on-one interaction. On the other hand, if you come to the top Ph.D. program in biostatistics and take Data Analysis, you will now get 16 weeks of one-on-one interaction with Jeff in a classroom, working on tons of problems together. In other words, putting our lectures online now means at Johns Hopkins you get the most qualified TA you have ever had. Your professor.