Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)

I just finished teaching a Ph.D.-level applied statistical methods course here at Hopkins. As part of the course, I gave one “pro-tip” a day: something I wish I had learned in graduate school that has helped me become a practicing applied statistician. Here are the first three; more to come soon.
  1. A major component of being a researcher is knowing what’s going on in the research community. Set up an RSS feed with journal articles. Google Reader is a good one, but there are others. Here are some good applied stat journals: Biostatistics, Biometrics, Annals of Applied Statistics…
  2. Reproducible research is a hot topic, in part because of a couple of high-profile papers that were disastrously non-reproducible (see “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology”). When you write code for statistical analysis, try to make sure that: (a) it is neat and well-commented - liberal and specific comments are your friend; (b) it can be run by someone other than you to produce the same results that you report.
  3. In data analysis - particularly for complex high-dimensional data - it is frequently better to choose simple models for clearly defined parameters. With a lot of data, there is a strong temptation to go overboard with statistically complicated models; the danger of overfitting/over-interpreting is extreme. The most reproducible results are often produced by sensible and statistically “simple” analyses (Note: being sensible and simple does not always lead to higher-profile results).
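
As a rough illustration of tip 2, here is a minimal Python sketch of an analysis that someone else can rerun to get the same numbers. Everything in it is an assumption made for illustration: the function name `bootstrap_mean_ci`, the seed value, and the made-up measurements are not from the course. The point is only that the random seed is explicit and every step is commented.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_boot=2000, alpha=0.05, seed=20111201):
    """Percentile bootstrap confidence interval for the mean.

    The seed is an explicit argument so that anyone rerunning the
    analysis gets exactly the same interval that was reported.
    """
    rng = random.Random(seed)  # local generator: no hidden global state
    n = len(data)
    boot_means = []
    for _ in range(n_boot):
        # Resample the data with replacement and record the mean.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        boot_means.append(statistics.mean(resample))
    boot_means.sort()
    # Read the interval endpoints off the sorted bootstrap means.
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical measurements, made up for this illustration only.
measurements = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.2, 4.9]

first_run = bootstrap_mean_ci(measurements)
second_run = bootstrap_mean_ci(measurements)
print(first_run == second_run)  # same seed and same data on both runs
```

Passing the seed as an argument, rather than relying on global random state, is one simple way to let a reader reproduce a reported interval and also check how much it moves under a different seed.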

figshare and don’t trust celebrities stating facts

A couple of links:

  1. figshare is a site where scientists can share data sets/figures/code. One of the goals is to encourage researchers to share negative results as well. I think this is a great idea - I often find negative results and this could be a place to put them. It also uses a tagging system, like Flickr. I think this is a great idea for scientific research discovery. They give you unlimited public space and 1GB of private space. This could be big, a place to help make reproducible research efforts user-friendly. Via TechCrunch
  2. Don’t trust celebrities stating facts because they usually don’t know what they are talking about. I completely agree with this, particularly because I have serious doubts about the statisteracy of most celebrities. Nod to Alex for the link (our most active link finder!).

Where do you get your data?

Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”

My contention is that if someone asks you these questions, you should start looking for the exits.


Reproducible Research in Computational Science

First of all, thanks to Rafa for scooping me with my own article. Not sure if that’s reverse scooping or recursive scooping or….

The latest issue of Science has a special section on Data Replication and Reproducibility. As part of the section I wrote a brief commentary on the need for reproducible research in computational science. Science has a pretty tight word limit for its commentaries and so it was unfortunately necessary to omit a number of relevant topics.

The editorial introducing the special section, as well as a separate editorial in the same issue, seem to emphasize the errors/fraud angle. This might be because Science has once or twice been at the center of instances of scientific fraud. But as I’ve said previously (and a point I tried to make in the commentary), reproducibility is not needed solely to prevent fraud, although that is an important objective. Another important objective is getting ideas across and disseminating knowledge. I think this second objective often gets lost because there’s a sense that knowledge dissemination already happens and that it’s the errors that are new and interesting. While the errors are perhaps new, there is a problem of ideas not getting across as quickly as they could because of a lack of code and/or data. The lack of published code/data is arguably holding up the advancement of science (if not Science).

One important idea I wanted to get across was that we can ramp up to achieve the ideal scenario, if getting there immediately is not possible. People often get hung up on making the data available but I think a substantial step could be made by simply making code available. Why doesn’t every journal just require it? We don’t have to start with a grand strategy involving funding agencies and large consortia. We can start modestly and make useful improvements.

A final interesting question that came up as the issue was going to press was whether I was talking about “reproducibility” or “replication”. As I made clear in the commentary, I define “replication” as independent people going out and collecting new data and “reproducibility” as independent people analyzing the same data. Apparently, others have the reverse definitions for the two words. The confusion is unfortunate because one idea has a centuries-long history whereas the other has only recently become important. I’m going to stick to my guns here but we’ll have to see how the language evolves.

Reproducible Research and Turkey

Over the recent Thanksgiving break I naturally started thinking about reproducible research in between salting the turkey and making the turkey stock. Clearly, these things are all related.
