An experimental foundation for statistics

In a recent conversation with Brian (of abstraction fame) about the relationship between mathematics and statistics. Statistics, for historical reasons, has been treated as a mathematical sub-discipline (this is the NSF’s view).

One reason statistics is viewed as a sub-discipline of math is because the foundations of statistics are built on the basis of deductive reasoning, where you start with a few general propositions or foundations that you assume to be true and then systematically prove more specific results. A similar approach is taken in most mathematical disciplines. 

In contrast, scientific disciplines like biology are largely built on the basis of inductive reasoning and the scientific method. Specific individual discoveries are described and used as a framework for building up more general theories and principles. 

So the question Brian and I had was: what if you started over and built statistics from the ground up on the basis of inductive reasoning and experimentation? Instead of making mathematical assumptions and then proving statistical results, you would use experiments to identify core principals. This actually isn’t without precedent in the statistics community. Bill Cleveland and Robert McGill studied how people perceive graphical information and produced some general recommendations about the use of area/linear contrasts, common axes, etc. There has also been a lot of work on experimental understanding of how humans understand uncertainty

So what if we put statistics on an experimental, rather than on a mathematical foundation. We performed experiments to see what kind of regression models people were able to interpret most clearly, what were the best ways to evaluate confounding/outliers, or what measure of statistical significance people understood best? Basically, what if the “quality” of a statistical method did not rest on the mathematics behind the method, but on the basis of experimental results demonstrating how people used the methods? So, instead of justifying lowess mathematically, we justified it on the basis of its practical usefulness through specific, controlled experiments. Some of this is already happening when people do surveys of the most successful methods in Kaggle contests or with the MAQC.

I wonder what methods would survive the change in paradigm?

A proposal for a really fast statistics journal

I know we need a new journal like we need a good poke in the eye. But I got fired up by the recent discussion of open science (by Paul Krugman and others) and the seriously misguided Research Works Act- that aimed to make it illegal to deposit published papers funded by the government in Pubmed central or other open access databases.

I also realized that I spend a huge amount of time/effort on the following things: (1) waiting for reviews (typically months), (2) addressing reviewer comments that are unrelated to the accuracy of my work - just adding citations to referees papers or doing additional simulations, and (3) resubmitting rejected papers to new journals - this is a huge time suck since I have to reformat, etc. Furthermore, If I want my papers to be published open-access I also realized I have to pay at minimum $1,000 per paper

So I thought up my criteria for an ideal statistics journal. It would be accurate, have fast review times, and not discriminate based on how interesting an idea is. I have found that my most interesting ideas are the hardest ones to get published.  This journal would:
  • Be open-access and free to publish your papers there. You own the copyright on your work. 
  • The criteria for publication would be: (1) it has to do with statistics, computation, or data analysis, (2) is the work is technically correct. 
  • We would accept manuals, reports of new statistical software, and full length research articles. 
  • There would be no page limits/figure limits. 
  • The journal would be published exclusively online. 
  • We would guarantee reviews within 1 week and publication immediately upon review if criteria (1) and (2) are satisfied
  • Papers would receive a star rating from the editor - 0-5 stars. There would be a place for readers to also review articles
  • All articles would be published with a tweet/like button so they can be easily distributed
To achieve such a fast review time, here is how it would work. We would have a large group of Associate Editors (hopefully 30 or more). When a paper was received, it would be assigned to an AE. The AEs would agree to referee papers within 2 days. They would use a form like this:
  • Review of: Jeff’s Paper
  • Technically Correct: Yes
  • About statistics/computation/data analysis: Yes
  • Number of Stars: 3 stars

  • 3 Strengths of Paper (1 required): 
  • This paper revolutionizes statistics 

  • 3 Weakness of Paper (1 required): 
  • * The proof that this paper revolutionizes statistics is pretty weak
  • because he only includes one example.
That’s it, super quick, super simple, so it wouldn’t be hard to referee. As long as the answers to the first two questions were yes, it would be published. 
So now here’s my questions: 
  1. Would you ever consider submitting a paper to such a journal?
  2. Would you be willing to be one of the AEs for such a journal? 
  3. Is there anything you would change? 

The 5 Most Critical Statistical Concepts

It seems like everywhere we look, data is being generated - from politics, to biology, to publishing, to social networks. There are also diverse new computational tools, like GPGPU and cloud computing, that expand the statistical toolbox. Statistical theory is more advanced than its ever been, with exciting work in a range of areas. 

With all the excitement going on around statistics, there is also increasing diversity. It is increasingly hard to define “statistician” since the definition ranges from very mathematical to very applied. An obvious question is: what are the most critical skills needed by statisticians? 

Read More

The future of graduate education

Stanford is offering a free online course and more than 100,000 students have registered. This got the blogosphere talking about the future of universities. Matt Yglesias thinks that “colleges are the next newspaper and are destined for some very uncomfortable adjustments”. Tyler Cowen reminded us that since 2003 he has been saying that professors are becoming obsolete. His main point is that thanks to the internet, the need for lecturers will greatly diminish. He goes on to predict that

the market was moving towards superstar teachers, who teach hundreds at a time or even thousands online. Today, we have the Khan Academy, a huge increase in online education, electronic textbooks and peer grading systems and highly successful superstar teachers with Michael Sandel and his popular course Justice, serving as example number one.

I think this is particularly true for stat and biostat graduate programs, especially in hard money environments.

Read More

The p>0.05 journal

I want to start a journal called “P>0.05”. This journal will publish all the negative results in science. These would also be stored in a database. Think of all the great things we could do with this. We could, for example, plot p-value histograms for different disciplines. I bet most would have a flat distribution. We could also do it by specific association. A paper comes out saying chocolate is linked to weaker bones? Check the histogram and keep eating chocolate. Any publishers interested? 

Tags: Proposal

"Unoriginal genius"

"The world is full of texts, more or less interesting; I do not wish to add any more"

This quote is from an article in the Chronicle Review. I highly recommend reading the article, particularly check out the section on the author’s “Uncreative writing” class at UPenn. The article is about how there is a trend in literature toward combining/using other people’s words to create new content. 

Read More

25 minute seminars

Most Statistics and Biostatistics departments have weekly seminars. We usually invite outside speakers to share their knowledge via a 50 minute powerpoint (or beamer) presentation. This gives us the opportunity to meet colleagues from other Universities and pick their brains in small group meetings. This is all great. But, giving a good one hour seminar is hard. Really hard. Few people can pull it off. I propose to the statistical community that we cut the seminars to 25 minutes with 35 minutes for questions and further discussion. We can make exceptions of course. But in general, I think we would all benefit from shorter seminars. 

Tags: Rant Proposal

Getting email responses from busy people

I’ve had the good fortune of working with some really smart and successful people during my career. As a young person, one problem with working with really successful people is that they get a ton of email. Some only see the subject lines on their phone before deleting them. 

I’ve picked up a few tricks for getting email responses from important/successful people:  

The SI Rules

  1. Try to send no more than one email a day. 
  2. Emails should be 3 sentences or less. Better if you can get the whole email in the subject line. 
  3. If you need information, ask yes or no questions whenever possible. Never ask a question that requires a full sentence response.
  4. When something is time sensitive, state the action you will take if you don’t get a response by a time you specify. 
  5. Be as specific as you can while conforming to the length requirements. 
  6. Bonus: include obvious keywords people can use to search for your email. 

Anecdotally, SI emails have a 10-fold higher response probability. The rules are designed around the fact that busy people who get lots of email love checking things off their list. SI emails are easy to check off! That will make them happy and get you a response. 

It takes more work on your end when writing an SI email. You often need to think more carefully about what to ask, how to phrase it succinctly, and how to minimize the number of emails you write. A surprising side effect of applying SI principles is that I often figure out answers to my questions on my own. I have to decide which questions to include in my SI emails and they have to be yes/no answers, so I end up taking care of simple questions on my own. 

Here are examples of SI emails just to get you started: 

Example 1

Subject: Is my response to reviewer 2 ok with you?

Body: I’ve attached the paper/responses to referees.

Example 2

Subject: Can you send my letter of recommendation to


Keywords = recommendation, Jeff, John Doe.

Example 3

Subject: I revised the draft to include your suggestions about simulations and language

Revisions attached. Let me know if you have any problems, otherwise I’ll submit Monday at 2pm. 

Dongle communism

If you have a mac and give talks or teach, chances are you have embarrassed yourself by forgetting your dongle. Our lab meetings and classes were constantly delayed due to missing dongles. Communism solved this problem. We bought 10 dongles, sprinkled them around the department, and declared all dongles public property. All dongles, not just the 10. No longer do we have to ask to borrow dongles because they have no owner. Please join the revolution. ps -I think this should apply to pens too!

Tags: Humor Proposal

The Killer App for Peer Review

A little while ago, over at Genomes Unzipped, Joe Pickrell asked, “Why publish science in peer reviewed journals?" He points out the flaws with the current peer review system and suggests how we can do better. What he suggests is missing is the killer app for peer review. 

Well, PLoS has now developed an API, where you can easily access tons of data on the papers published in those journals including downloads, citations, number of social bookmarks, and mentions in major science blogs. Along with Mendeley a free reference manager, they have launched an competition to build cool apps with their free data. 

Seems like with the right statistical analysis/cool features a recommender system for say, PLoS One could have most of the features suggested by Joe in his article. One idea would be an RSS-feed based on an idea like the Pandora music sharing service. You input a couple of papers you like from the journal, then it creates an RSS feed with papers similar to that paper.