How important is abstract thinking for graduate students in statistics?

A recent lunchtime discussion here at Hopkins brought up the somewhat controversial topic of abstract thinking in our graduate program. We, like a lot of other biostatistics/statistics programs, require our students to take measure-theoretic probability as part of the curriculum. The discussion started as a conversation about whether we should require measure-theoretic probability for our students. It evolved into a discussion of the value of abstract thinking (and whether measure-theoretic probability is a good way to assess abstract thinking).

Brian Caffo and I decided it would be interesting to do a point-counterpoint with the prompt, “How important is abstract thinking for the education of statistics graduate students?” Next week we will each provide our side of the point-counterpoint based on that discussion.

In the meantime we’d love to hear your opinions!

Statistics is not math…

Statistics depends on math, like a lot of other disciplines (physics, engineering, chemistry, computer science). But just like those other disciplines, statistics is not math; math is just a tool used to solve statistical problems. Unlike those other disciplines, statistics gets lumped in with math in headlines. Whenever people use statistical analysis to solve an interesting problem, the headline reads:

"Math can be used to solve amazing problem X"

or

"The Math of Y" 

Here are some examples:

The Mathematics of Lego - Using data on legos to estimate a distribution

The Mathematics of War - Using data on conflicts to estimate a distribution

Usain Bolt can run faster with maths (Tweet) - It turns out they analyzed data on start times to reach that conclusion

The Mathematics of Beauty - Analysis of data relating dating profile responses and photo attractiveness

These are just a few off the top of my head, but I regularly see headlines like this. I think there are a few reasons statistics gets grouped with math: (1) many of the founders of statistics were mathematicians first (though not all of them), (2) many statisticians still identify themselves as mathematicians, and (3) in some cases statistics and statisticians define themselves pretty narrowly.

With respect to (3), consider the following list of disciplines:

  1. Biostatistics
  2. Data science
  3. Machine learning
  4. Natural language processing
  5. Signal processing
  6. Business analytics
  7. Econometrics
  8. Text mining
  9. Social science statistics
  10. Process control

All of these disciplines could easily be classified as “applied statistics”. But how many folks in each of those disciplines would classify themselves as statisticians? More importantly, how many would be claimed by statisticians? 

What is a major revision?

I posted a little while ago on a proposal for a fast statistics journal. It generated a bunch of comments and even a really nice follow-up post with some great ideas. Since then I’ve gotten reviews back on a couple of papers and I think I’ve pinpointed one of the key issues that drives me nuts about the current publishing model. It boils down to one simple question:

What is a major revision? 

I often get reviews back that suggest “major revisions” in one or many of the following categories:

  1. More/different simulations
  2. New simulations
  3. Re-organization of content
  4. Re-writing language
  5. More references
  6. Inclusion of a new method
  7. Implementation of someone else’s method for comparison

I don’t consider any of these major revisions, and personally I have stopped requesting them as such when I referee. In my opinion, major revisions should be reserved for issues with the manuscript that suggest it may be reporting incorrect results. Examples include:

  1. No simulations
  2. No real data
  3. The math/computations look incorrect
  4. The software didn’t work when I tried it
  5. The methods/algorithms are unreadable and can’t be followed

The first list is really a list of minor/non-essential revisions, in my opinion. They may improve my paper, but they won’t establish whether it is correct. I find that they are often subjective and up to the whims of referees. In my own refereeing I am making an effort to drop subjective major revisions and only raise issues that are critical for evaluating the correctness of a manuscript. I also try to separate the question of whether an idea is interesting from whether it is correct.

I’d be curious to hear other people’s definitions of major/minor revisions.

Why statisticians should join and launch startups

The tough economic times we live in, and the potential for big paydays, have made entrepreneurship cool. From the venture capitalist-in-chief to the JavaScript-coding mayor of New York, everyone is on board. No surprise there: successful startups lead to job creation, which can have a major positive impact on the economy.

The game has been dominated for a long time by the folks over in CS. But the value of many recent startups is either based on, or can be magnified by, good data analysis. Here are a few startups that are based on data/data analysis: 

  1. The Climate Corporation - analyzes climate data to sell farmers weather insurance.
  2. Flightcaster - uses public data to predict flight delays.
  3. Quid - uses data on startups to predict success, among other things.
  4. 100plus - a personalized health prediction startup, predicting health based on public data.
  5. Hipmunk - the main advantage of this travel site is better data visualization and an algorithm that shows you which flights have the worst “agony”.

To launch a startup you need just a couple of things: (1) a good, valuable source of data (there are lots of these on the web) and (2) a good idea about how to analyze them to create something useful. The second step is obviously harder than the first, but the companies above prove you can do it. Then, once it is built, you can outsource/partner with developers - web and otherwise - to implement your idea. If you can build it in R, someone can make it an app. 

These are just a few of the startups whose value is entirely derived from data analysis. But companies from LinkedIn to Bitly to Amazon to Walmart are trying to mine the data they are generating to increase value. Data is now being generated at unprecedented scale by computers, cell phones, even thermostats! With this onslaught of data, the need for people with analysis skills is becoming incredibly acute.

Statisticians, like computer scientists before them, are poised to launch, and make major contributions to, the next generation of startups. 

Help us rate health news reporting with citizen-science powered http://www.healthnewsrater.com

We here at Simply Statistics are big fans of science news reporting. We read newspapers, blogs, and the news sections of scientific journals to keep up with the coolest new research. 

But health science reporting, although exciting, can also be incredibly frustrating to read. Many articles have sensational titles, like "How using Facebook could raise your risk of cancer". The articles go on to describe some research and interview a few scientists, then typically make fairly large claims about what the research means. This isn’t surprising - eye-catching headlines are important in this era of short attention spans and information overload.

If just a few extra pieces of information were reported in news stories about science, it would be much easier to evaluate whether the cancer risk was serious enough to shut down our Facebook accounts. In particular, we thought any news story should report:

  1. A link back to the original research article where the study (or studies) being described was published. Not just a link to another news story. 
  2. A description of the study design (was it a randomized clinical trial? a cohort study? 3 mice in a lab experiment?)
  3. Who funded the study - if a study involving cancer risk was sponsored by a tobacco company, that might say something about the results.
  4. Potential financial incentives of the authors - if the study is reporting a new drug and the authors work for a drug company, that might say something about the study too. 
  5. The sample size - many health studies are based on a very small sample size, only 10 or 20 people in a lab. Results from these studies are much weaker than results obtained from a large study of thousands of people. 
  6. The organism - Many health science news reports are based on studies performed in lab animals and may not translate to human health. For example, here is a report with the headline "Alzheimers may be transmissible, study suggests". But if you read the story, you find that scientists injected Alzheimer’s-afflicted brain tissue from humans into mice.

So we created a citizen-science website for evaluating health news reporting called HealthNewsRater. It was built by Andrew Jaffe and Jeff Leek, with Andrew doing the bulk of the heavy lifting.  We would like you to help us collect data on the quality of health news reporting. When you read a health news story on the Nature website, at nytimes.com, or on a blog, we’d like you to take a second to report on the news. Just determine whether the 6 pieces of information above are reported and input the data at HealthNewsRater.

We calculate a score for each story based on the formula:

HNR-Score = (5 points for a link to the original article + 1 point for each of the other criteria) / 2

The score weights the link to the original article very heavily, since this is the best source of information about the actual science underlying the story. 
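
For concreteness, here is a minimal R sketch of that formula. The hnr_score function and its interface are our own illustration, not code from the HealthNewsRater site:

    # Illustrative only: a tiny helper implementing the HNR-Score formula above.
    # The function name and interface are ours, not part of HealthNewsRater.
    hnr_score <- function(links_to_original, other_criteria) {
      # links_to_original: TRUE/FALSE, does the story link to the original research article?
      # other_criteria: logical vector of length 5 for criteria 2 through 6
      stopifnot(length(other_criteria) == 5)
      (5 * links_to_original + sum(other_criteria)) / 2
    }

    # A story with the original link and three of the other five criteria scores 4 out of a possible 5
    hnr_score(TRUE, c(TRUE, TRUE, FALSE, FALSE, TRUE))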

In a future post we will analyze the data we have collected, make it publicly available, and let you know which news sources are doing the best job of reporting health science. 

Update: If you are a web developer with an interest in health news, contact us to help make HealthNewsRater better!

Sunday Data/Statistics Link Roundup

A few data/statistics related links of interest:

  1. Eric Lander Profile
  2. The math of lego (should be “The statistics of lego”)
  3. Where people are looking for homes.
  4. Hans Rosling’s Ted Talk on the Developing world (an oldie but a goodie)
  5. Elsevier is trying to make open-access illegal (not strictly statistics related, but a hugely important issue for academics who believe government funded research should be freely accessible), more here

Data Scientist vs. Statistician

There’s an interesting discussion over at reddit on the difference between a data scientist and a statistician. My crude summary of the discussion is that by and large they are the same, but the phrase “data scientist” is just the hip new name for statistician that will probably sound stupid 5 years from now.

My question is: why isn’t “statistician” hip? The comments don’t seem to address that much (although a few go in that direction). There are a few interesting comments about computing. For example, from ByteMining:

Statisticians typically don’t care about performance or coding style as long as it gets a result. A loop within a loop within a loop is all the same as an O(1) lookup.

Another more down-to-earth comment comes from marshallp:

There is a real distinction between data scientist and statistician

  • the statistician spent years banging his/her head against blackboards full of math notation to get a modestly paid job

  • the data scientist gets s—loads of cash after having learnt a scripting language and an api

More people should be encouraged into data science and not pointless years of stats classes

Not sure I fully agree, but I see where he’s coming from!

[Note: See also our post on how determine whether you are a data scientist.]

The History of Nonlinear Principal Components Analysis, a lecture given by Jan de Leeuw. For those who have ~45 minutes to spare, it’s a very nice talk given in Jan’s characteristic style.

Coarse PM and measurement error paper

Howard Chang, a former PhD student of mine now at Emory, just published a paper on a measurement error model for estimating the health effects of coarse particulate matter (PM). This is a cool paper that deals with the problem that coarse PM tends to be very spatially heterogeneous. Coarse PM is a bit of a hot topic now because there is currently no national ambient air quality standard specifically for coarse PM. There is a standard for fine PM, but the scientific evidence for health effects of coarse PM is less developed than the evidence for fine PM.

When you want to assign a coarse PM exposure level to people in a county (assuming you don’t have personal monitoring) there is a fair amount of uncertainty about the assignment because of the spatial variability. This is in contrast to pollutants like fine PM or ozone which tend to be more spatially smooth. Standard approaches essentially ignore the uncertainty which may lead to some bias in estimates of the health effects.

Howard developed a measurement error model that uses observations from multiple monitors to estimate the spatial variability and correct for it in time series regression models estimating the health effects of coarse PM. Another nice thing about his approach is that it avoids any complex spatial-temporal modeling to do the correction.
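
To give a rough sense of how an exposure measurement error correction of this flavor works, here is a toy regression-calibration-style sketch in R on simulated data. This is emphatically not the model from Howard’s paper (which handles the spatial and temporal structure properly); it just illustrates using the between-monitor spread to estimate and adjust for error in the exposure series.

    # Toy illustration only: a generic regression-calibration-style correction on
    # simulated data, not the model from the paper.
    set.seed(1)
    n        <- 365
    true_pm  <- rnorm(n, mean = 10, sd = 3)                         # unobserved "true" county-level coarse PM
    monitors <- sapply(1:4, function(i) true_pm + rnorm(n, sd = 4)) # four spatially noisy monitors
    temp     <- rnorm(n, mean = 20, sd = 5)
    counts   <- rpois(n, exp(2 + 0.02 * true_pm + 0.01 * temp))     # daily hospitalization counts

    # Naive exposure: the monitor average, ignoring measurement error
    x_naive <- rowMeans(monitors)

    # Use the between-monitor spread to estimate the error variance of the average
    sigma2_u <- mean(apply(monitors, 1, var)) / ncol(monitors)
    sigma2_x <- var(x_naive) - sigma2_u

    # Regression calibration: shrink the observed average toward its mean
    shrink <- sigma2_x / (sigma2_x + sigma2_u)
    x_cal  <- mean(x_naive) + shrink * (x_naive - mean(x_naive))

    # Compare naive and calibrated health-effect estimates (true log-relative-rate is 0.02)
    fit_naive <- glm(counts ~ x_naive + temp, family = poisson)
    fit_cal   <- glm(counts ~ x_cal + temp, family = poisson)
    c(naive = coef(fit_naive)["x_naive"], calibrated = coef(fit_cal)["x_cal"])

In the real setting the error structure varies over space and time, which is why a more careful model like Howard’s is needed, but the basic idea of quantifying the exposure uncertainty and propagating it into the regression is the same.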

Related Posts: Jeff on "Cool papers" and "Dissecting the genomics of trauma"

Do we really need applied statistics journals?

All statisticians in academia are constantly confronted with the question of where to publish their papers. Sometimes it’s obvious: A theoretical paper might go to the Annals of Statistics or JASA Theory & Methods or Biometrika. A more “methods-y” paper might go to JASA or JRSS-B or Biometrics or maybe even Biostatistics (where all three of us are or have been associate editors).

But where should the applied papers go? I think this is an increasingly large category of papers being produced by statisticians. These are papers that do not necessarily develop a brand new method or uncover any new theory, but apply statistical methods to an interesting dataset in a not-so-obvious way. Some papers might combine a set of existing methods that have never been combined before in order to solve an important scientific problem.

Well, there are some official applied statistics journals: JASA Applications & Case Studies or JRSS-C or the Annals of Applied Statistics. At least they have the word “application” or “applied” in their title. But the question we should be asking is: if a paper is published in one of those journals, will it reach the right audience?

What is the audience for an applied stat paper? Perhaps it depends on the subject matter. If the application is biology, then maybe biologists. If it’s an air pollution and health application, maybe environmental epidemiologists. My point is that the key audience is probably not a bunch of other statisticians.

The fundamental conundrum of applied stat papers comes down to this question: If your application of statistical methods is truly addressing an important scientific question, then shouldn’t the scientists in the relevant field want to hear about it? If the answer is yes, then we have two options: Force other scientists to read our applied stat journals, or publish our papers in their journals. There doesn’t seem to be much momentum for the former, but the latter is already being done rather frequently. 

Across a variety of fields we see statisticians making direct contributions to science by publishing in non-statistics journals. Some examples are this recent paper in Nature Genetics or a paper I published a few years ago in the Journal of the American Medical Association. I think there are two key features that these papers (and many others like them) have in common:

  • There was an important scientific question addressed. The first paper investigates variability of methylated regions of the genome and its relation to cancer tissue and the second paper addresses the problem of whether ambient coarse particles have an acute health effect. In both cases, scientists in the respective substantive areas were interested in the problem and so it was natural to publish the “answer” in their journals. 
  • The problem was well-suited to be addressed by statisticians. Both papers involved large and complex datasets for which training in data analysis and statistics was important. In the analysis of coarse particles and hospitalizations, we used a national database of air pollution concentrations and obtained health status data from Medicare. Linking these two databases together and conducting the analysis required enormous computational effort and statistical sophistication. While I doubt we were the only people who could have done that analysis, we were very well-positioned to do so. 

So when statisticians are confronted with scientific problems that are both (1) important and (2) well-suited for statisticians, what should we do? My feeling is we should skip the applied statistics journals and bring the message straight to the people who want/need to hear it.

There are two problems that come to mind immediately. First, sometimes the paper ends up being so statistically technical that a scientific journal won’t accept it. And of course, in academia, there is the sticky problem of how you get promoted in a statistics department when your CV is filled with papers in non-statistics journals. This entry is already long enough, so I’ll address these issues in a future post.

Related Posts: Rafa on "Where are the Case Studies?" and "Authorship Conventions"