Pro-tips for graduate students (Part 4)

This is part of the ongoing series of pro tips for graduate students. Check out parts one, two, and three for the original installments.

  1. You can never underestimate how little your audience knows/cares about what you are talking about (so be clear and start with the “why”).
  2. Perfect is the enemy of good (so do something good and perfect it later).
  3. Learn about as many different areas as you can. You have to focus on one problem to get a Ph.D. (your dissertation) but the best way to get new ideas is to talk to people in areas with different problems than you have. This is the source of many of the “Big Impact” papers. Resources for talking about new ideas, ranked from most to least formal: seminars, working groups, meetings with faculty/other students, and going for a beer with some friends.
  4. Here are some ways to come up with a new method: (i) create a new method for a new data type, (ii) adapt an old/useful method to a new data type, (iii) solve an overlooked problem, (iv) change the assumptions of a current method, and (v) generalize a known method. Any can be impactful, but the highest probability of high impact in my experience is (ii).

Sunday Data/Statistics Link Roundup (10/21/12)

  1. This scientific variant on the #whatshouldwecallme meme isn’t exclusive to statistics, but it is hilarious.
  2. This is a really interesting post that is a follow-up to the XKCD password security comic. The thing I find most interesting about this is that researchers realized the key problem with passwords was that we were looking at them purely from a computer science perspective. But people use passwords, so we need a person-focused approach to maximize security. This is a very similar idea to our previous post on an experimental foundation for statistics. Looks like Di Cook and others are already way ahead of us on this idea. It would be interesting to redefine optimality incorporating the knowledge that most of the time it is a person running the statistics. 
  3. This is another fascinating article about the math education wars. It starts off as the typical dueling schools issue in academia - two different schools of thought who routinely go after the other side. But the interesting thing here is it sounds like one side of this math debate is being waged by a person collecting data and the other is being waged by a side that isn’t. It is interesting how many areas are being touched by data - including what kind of math we should teach. 
  4. I’m going to visit Minnesota in a couple of weeks. I was so pumped up to be an outlaw. Looks like I’m just a regular law abiding citizen though….
  5. Here are outstanding summaries of what went on at the Carl Morris Big Data conference this last week. Tons of interesting stuff there. See parts one, two, and three.

This is an awesome paper all students in statistics should read

The paper is a review of how to do software development for academics. I saw it via C. Titus Brown (who we have interviewed), who is also a co-author. How to write software (particularly for other people) is something that is underemphasized in many curricula. But it turns out this is also one of the more important components of disseminating your work in modern applied statistics. My only wish is that there was an accompanying website with resources/links for people to chase down.

Sunday Data/Statistics Link Roundup (9/9/12)

  1. Not necessarily statistics related, but pretty appropriate now that the school year is starting. Here is a little introduction to "how to google" (via Andrew J.). Being able to “just google it” and find answers for oneself without having to resort to asking folks is maybe the #1 most useful skill for a statistician.
  2. A really nice presentation on interactive graphics with the googleVis package. I think one of the most interesting things about the presentation is that it was built with markdown/knitr/slidy (see slide 53). I am seeing more and more of these web-based presentations. I like them for a lot of reasons (ability to incorporate interactive graphics, easy sharing, etc.), although it is still harder than building a Powerpoint. I also wonder, what happens when you are trying to present somewhere that doesn’t have a good internet connection?
  3. We talked a lot about the ENCODE project this week. We had an interview with Steven Salzberg, then Rafa followed it up with a discussion of top-down vs. bottom-up science. Tons of data from the ENCODE project is now available; there is even a virtual machine with all the software used in the main analysis of the data that was just published. But my favorite quote/tweet/comment this week came from Leonid K. about a flawed/over the top piece trying to make a little too much of the ENCODE discoveries: “that’s a clown post, bro”.
  4. Another breathless post from the Chronicle about how there are “dozens of plagiarism cases being reported on Coursera”. Given that tens of thousands of people are taking the course, it would be shocking if there wasn’t plagiarism, but my guess is it is about the same rate you see in in-person classes. I will be using peer grading in my course; hopefully plagiarism software will be in place by then.
  5. A New York Times article about a new book on visualizing data for scientists/engineers. I love all the attention data visualization is getting. I’ll take a look at the book for sure. I bet it says a lot of the same things Tufte said and a lot of the things Nathan Yau says in his book. This one may just be targeted at scientists/engineers. (link via Dan S.)
  6. Edo and co. are putting together a workshop on the analysis of social network data for NIPS in December. If you do this kind of stuff, it should be a pretty awesome crowd, so get your paper in by the Oct. 15th deadline!

A non-exhaustive list of things I have failed to accomplish

A few years ago I stumbled across a blog post that described a person’s complete cv. The idea was that the cv listed both the things they had accomplished and the things they had failed to accomplish. At the time, it really helped me to see that to be successful you have to be willing to fail over and over. 

I use my website to show the things I have accomplished career-wise. But I have also failed to achieve a lot of the things I set out to do. The reason was that there was strong competition for the awards/positions I was up for and other deserving people got them.   

  1. Applied to MIT undergrad in 1999 - rejected
  2. Donovan J. Thompson Award 2001 - did not receive
  3. Applied for Barry Goldwater scholarship 2002 - rejected
  4. Applied for NSF Pre-Doctoral Fellowship 2003 - rejected
  5. Applied for graduate school in math at MIT 2003 - rejected
  6. One of my first 3 papers rejected at PLoS Biology 2005
  7. Many subsequent rejections of papers - too many to list exhaustively but here is one example
  8. Applied for Youden Award 2010 - rejected
  9. Applied for Microsoft Faculty Fellowship 2012 - rejected
  10. Applied for Sloan Fellowship 2012 - rejected
  11. Many grants have been rejected, again too many to list exhaustively

On the relative importance of mathematical abstraction in graduate statistical education

Editor’s Note: This is the counterpoint in our series of posts on the value of abstraction in graduate education. See Brian’s defense of abstraction on Monday and the comments on his post, as well as the comments on our original teaser post for more. See below for a full description of the T-bone inside joke*.

Brian did a good job at defining abstraction. In a cagey debater’s move, he provided an incredibly broad definition of abstraction that includes the reason we call a :-) a smiley face, the reason why we can apply least squares to a variety of data types, and the reason we write functions when programming. At this very broad level, it is clear that abstract thinking is necessary for graduate students or any other data professional.

But our debate was inspired by a discussion of whether measure-theoretic probability was a key component of our graduate program. There was some agreement that for many biostatistics Ph.D. students, this exact topic may not be necessary for their research or careers. Brian suggested that measure-theoretic probability was a surrogate marker for something more important - abstract thinking and the ability to generalize ideas. This is a very specific form of generalization and abstraction that is used most commonly by statisticians: the ability that permits one to prove theorems and develop statistical models that can be applied to a variety of data types. I will therefore refocus the debate on the original topic. I have three main points:

  1. There is an overemphasis in statistical graduate programs on abstraction defined as the ability to prove mathematical theorems and develop general statistical methods.
  2. It is possible to create incredible statistical value without developing generalizable statistical methods
  3. While abstraction as defined generally is good, overemphasis on this specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.


There is an overemphasis in statistical graduate programs on abstraction defined as the ability to prove mathematical theorems and develop general statistical methods.

At a top program, you can expect to take courses in very theoretical statistics, measure-theoretic probability, and an applied (or methods) sequence. The first two courses are exclusively mathematical. The third (at the programs I have visited, graduated from, or taught in), despite its name, is most generally focused on mathematical details underlying statistical methods. The result is that most Ph.D. students are heavily trained in the mathematical theory behind statistics.

At the same time, there is a long list of skills necessary to develop a successful Ph.D. statistician. These include creativity in applications, statistical programming skills, grit to power through the boring/hard parts of research, interpretation of statistical results on real data, ability to identify the most important scientific problems, and a deep understanding of the scientific problems you are working on. Abstraction is on that list, but it is just one of many skills on that list. Graduate education is a zero-sum game over a finite period of time. Our strong focus on mathematical abstraction means there is less time for everything else.

Any hard quantitative course will measure the ability of a student to abstract in the general sense Brian defined. One of these courses would be very useful for our students. But it is not clear that we should focus on mathematical abstraction to the exclusion of other important characteristics of graduate students.

It is possible to create incredible statistical value without developing generalizable statistical methods

A major standard for success in academia is the ability to generate solutions to problems that are widely read, cited, and used. A graduate student who produces these types of solutions is likely to have a high-impact and well-respected career. In general, it is not necessary to be able to prove theorems, understand measure theory, or develop generalizable statistical models to have this type of success.

One example is one of the co-authors of our blog, best known for his work in genomics. In this field, data is noisy and full of systematic errors, and for several technologies, he invented methods to correct them. For example, he developed the most popular method for making measurements from different experiments comparable, for removing the dependence of measurements on the letters in a gene, and for reducing variability due to operators who run the machine or the ozone levels. Each of these discoveries involved: (1) deep understanding of the specific technology used, (2) a good intuition of what signals were due to biology and which were due to technology, (3) application/development of specific, somewhat ad-hoc, statistical procedures to correct the mistakes, and (4) the development and distribution of good software. His work has been hugely influential on genomics, has been cited thousands of times, and has substantially improved the quality of both biological and statistical results.

But the work did not result in knowledge that was generalizable to other areas of application; it deals with problems that are highly specialized to genomics. If these were his only contributions (they are not), he’d be a hugely successful Ph.D. statistician. But had he focused on general solutions he would never have solved the problems at hand, since the problems were highly specific to a single application. And this is just one example I know well because I work in the area. There are a ton more just like it.

While abstraction as defined generally is good, overemphasis on a specific type of abstraction limits our ability to include computing and real data analysis in our curriculum. It also takes away from the most important learning experience of graduate school: performing independent research.

One could argue that the choice of statistical techniques during data analysis is abstraction, or that one needs to abstract to develop efficient software. But the ability to abstract needed for these tasks can be measured by a wide range of classes, not just measure-theoretic probability. Some of these classes might teach practically applicable skills like writing fast and efficient algorithms. Many results of high statistical value do not require mathematical proofs, abstract inductive reasoning, or asymptotic theory. It is a good idea to have some people who can abstract away the science behind statistical methods to the core mathematical philosophy. But our current curriculum is too heavily weighted in this direction. In some cases, statisticians are even being left behind because they do not have sufficient time in their curriculum to develop the computational skills and amass the necessary subject matter knowledge needed to compete with the increasingly diverse set of engineers, computer scientists, data scientists, and computational biologists tackling the same scientific problems.

We need to reserve a larger portion of graduate education for diving deeply into specific scientific problems, even if it means students spend less time developing generalizable/abstract statistical ideas.

* Inside joke explanation: Two years ago at JSM I ran a footrace with this guy for the rights to the name “Jeff” in the department of Biostatistics at Hopkins for the rest of 2011. Unfortunately, we did not pro-rate for age and he nipped me by about a half-yard. True to my word, I went by Tullis (my middle name) for a few months, including on the title slide of my JSM talk. This was, of course, immediately subjected to all sorts of nicknaming and B-Caffo loves to use “T-bone”. I apologize on behalf of those that brought it up.

Online education: many academics are missing the point

Many academics are complaining about online education and warning us about how it can lead to a lower quality product. For example, the New York Times recently published this op-ed piece wondering if “online education [will] ever be education of the very best sort?”. Although pretty much every controlled experiment comparing online and in-class education finds that students learn just about the same under both approaches, I do agree that in-person lectures are more enjoyable to both faculty and students. But who cares? My enjoyment and the enjoyment of the 30 privileged students that physically sit in my classes seems negligible compared to the potential of reaching and educating thousands of students all over the world.  Also, using recorded lectures will free up time that I can spend on one-on-one interactions with tuition paying students.  But what most excites me about online education is the possibility of being part of the movement that redefines existing disciplines as the number of people learning grows by orders of magnitude. How many Ramanujans are out there eager to learn Statistics? I would love it if they learned it from me. 

Sunday data/statistics link roundup (6/24)

  1. We’ve got a new domain! You can still follow us on tumblr or here: http://simplystatistics.org/
  2. A cool article on MIT’s annual sports statistics conference (via @storeylab). I love how the guy they chose to highlight created what I would consider a pretty simple visualization with known tools - but it turns out it is potentially a really new way of evaluating the shooting range of basketball players. This is my favorite kind of creativity in statistics.
  3. This is an interesting article calling higher education a “credentials cartel”. I don’t know if I’d go quite that far; there are a lot of really good reasons for higher education institutions beyond credentialing like research, putting smart students together in classes and dorms, broadening experiences etc. But I still think there is room for a smart group of statisticians/computer scientists to solve the credentialing problem on a big scale and have a huge impact on the education industry. 
  4. Check out John Cook’s conjecture on statistical methods that get used: “The probability of a method being used drops by at least a factor of 2 for every parameter that has to be determined by trial-and-error.” I’m with you. I wonder if there is a corollary related to how easy the documentation is to read? 
  5. If you haven’t read Roger’s post on Statistics and the Science Club, I consider it a must-read for anyone who is affiliated with a statistics/biostatistics department. We’ve had feedback by email/on twitter from other folks who are moving toward a more science oriented statistical culture. We’d love to hear from more folks with this same attitude/inclination/approach. 
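
Written out, John Cook's conjecture in item 4 amounts to something like the bound below. The notation is an assumption for illustration, not Cook's: $p_k$ stands for the probability that a method with $k$ trial-and-error parameters gets used.

```latex
p_k \le \frac{p_0}{2^{k}}, \qquad k = 0, 1, 2, \dots
```

Under this reading, a method with three hand-tuned parameters would be used at most one eighth as often as a comparable method with none.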

An essay on why programmers need to learn statistics

This is awesome. There are a few places with some strong language, but overall I think the message is pretty powerful. (Via Tariq K.) I agree with Tariq that one of the gems is:

If you want to measure something, then don’t measure other sh**. 

Statistics project ideas for students

Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.

Happy data crunching!

Data Collection/Synthesis

  1. Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, sampling. The webpage should not use any math at all and should explain the concepts so a general audience could understand. Bonus points if you make short 30 second animated youtube clips that explain the concepts. (Difficulty: Lowish; Effort: Highish)
  2. Building an aggregator for statistics papers across disciplines that can be the central resource for statisticians. Journals ranging from PLoS Genetics to Neuroimage now routinely publish statistical papers. But there is no one central resource that aggregates all the statistics papers published across disciplines. Such a resource would be hugely useful to statisticians. You could build it using blogging software like Wordpress so articles could be tagged/you could put the resource in your RSS feeder. (Difficulty: Lowish; Effort: Mediumish)

Data Analyses

  1. Scrape the LivingSocial/Groupon sites for the daily deals and develop a prediction of how successful the deal will be based on location/price/type of deal. You could use either the RCurl R package or the XML R package to scrape the data. (Difficulty: Mediumish; Effort: Mediumish)
  2. You could use the data from your city (here are a few cities with open data) to: (a) identify the best and worst neighborhoods to live in based on different metrics like how many parks are within walking distance, crime statistics, etc. (b) identify concrete measures your city could take to improve different quality of life metrics like those described above - say where should the city put a park, or (c) see if you can predict when/where crimes will occur (like these guys did). (Difficulty: Mediumish; Effort: Highish)
  3. Download data on state of the union speeches from here and use the tm package in R to analyze the patterns of word use over time. (Difficulty: Lowish; Effort: Lowish)
  4. Use this data set from Donors Choose to determine the characteristics that make the funding of projects more likely. You could send your results to the Donors Choose folks to help them improve the funding rate for their projects. (Difficulty: Mediumish; Effort: Mediumish)
  5. Which basketball player would you want on your team? Here is a really simple analysis done by Rafa. But it doesn’t take into account things like defense. If you want to take on this project, you should take a look at this Dennis Rodman analysis which is the gold standard. (Difficulty: Mediumish; Effort: Highish).
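
To give a flavor of project 3 (word use over time), here is a minimal sketch. It is written in plain Python rather than with the tm package in R, so treat it as a stand-in for the idea, not the suggested toolchain. The texts are made-up placeholders, not real State of the Union transcripts, and the function names (tokenize, relative_frequency, term_trend) are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical placeholder texts keyed by year. These are NOT real
# speech transcripts; swap in the downloaded speeches for a real analysis.
speeches = {
    1990: "The economy is strong and the economy will grow.",
    2000: "Our schools need help. Education and the economy matter.",
    2010: "Jobs, jobs, and more jobs: the economy needs jobs.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency(text, term):
    """Return the share of tokens in `text` equal to `term` (0.0 if empty)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


def term_trend(speeches_by_year, term):
    """Map each year to the relative frequency of `term`, in year order."""
    return {
        year: relative_frequency(text, term)
        for year, text in sorted(speeches_by_year.items())
    }


if __name__ == "__main__":
    # Print the share of words equal to "economy" in each placeholder speech.
    for year, freq in term_trend(speeches, "economy").items():
        print(year, round(freq, 3))
```

Using relative rather than raw counts keeps long and short speeches comparable. A real analysis would probably also want stop-word removal and stemming, which is the kind of thing tm handles for you.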

Data visualization

  1. Creating an R package that wraps the svgAnnotation package. This package can be used to create dynamic graphics in R, but is still a bit too flexible for most people to use. Writing some wrapper functions that simplify the interface would be potentially high impact. Maybe something like svgPlot() to create simple, dynamic graphics with only a few options (Difficulty: Mediumish; Effort: Mediumish). 
  2. The same as project 1 but for D3.js. The impact could potentially be a bit higher, since the graphics are a bit more professional, but the level of difficulty and effort would also both be higher. (Difficulty: Highish; Effort: Highish)