Sunday Data/Statistics Link Roundup (10/14/12)

  1. A fascinating article about the debate on whether to regulate sugary beverages. One of the protagonists is David Allison, a statistical geneticist, among other things. It is interesting to see the interplay of statistical analysis and public policy, and it is yet another example of how statistics/data will drive some of the most important policy decisions going forward. 
  2. A related article is this one on the way risk is reported in the media. It is becoming increasingly clear that being an educated member of society now requires a basic understanding of statistical concepts. Both leaders and the general public bear responsibility for the danger of misinterpreting risk, or of misleading with it. 
  3. A press release from the Census Bureau about how the choice of college major can have a major impact on career earnings. More data breaking the results down by employment characteristics and major are here and here. These data update some of the data we have talked about before in calculating expected salaries by major. (via Scott Z.)
  4. An interesting article about Recorded Future that describes how they are using social media data etc. to try to predict events that will happen. I think this isn’t an entirely crazy idea, but the thing that always strikes me about these sorts of projects is how hard it is to measure success. It is highly unlikely you will ever exactly predict a future event, so how do you define how close you were? For instance, if you predicted an uprising in Egypt, but missed by a month, is that a good or a bad prediction? (One way to formalize this is sketched after this list.) 
  5. Seriously guys, this is getting embarrassing. An article appears in the New England Journal of Medicine "finding" an association between chocolate consumption and Nobel prize winners. This is, of course, a horrible statistical analysis, and unless it was published as a joke, it was irresponsible of the NEJM to run it. I’ll bet any student in Stat 101 could find the huge flaws with this analysis (the toy simulation after this list shows the kind of flaw I mean). If the editors of the major scientific journals want to continue publishing statistical papers, they should get serious about statistical editing.
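
On item 4, one way to make "how close was the prediction?" concrete is a scoring rule that gives full credit for an exact hit and partial credit that decays with the size of the miss. Here is a minimal sketch under my own assumptions; the function and the 30-day scale are made up for illustration, not anything Recorded Future describes:

```python
import math

def time_decayed_score(predicted_day: int, actual_day: int,
                       scale_days: float = 30.0) -> float:
    """Score in (0, 1]: 1.0 for an exact hit, ~0.37 for a miss of one scale."""
    miss = abs(predicted_day - actual_day)
    return math.exp(-miss / scale_days)

# Predicting the uprising a month early earns ~0.37 rather than a flat zero,
# so "missed by a month" becomes partial credit instead of total failure.
print(time_decayed_score(predicted_day=0, actual_day=30))  # ~0.368
```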
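
On item 5, here is a toy simulation of the kind of flaw I mean: two country-level variables that are both driven by a confounder (say, national wealth) will correlate strongly even when there is no direct link between them. All numbers here are made up:

```python
import numpy as np

rng = np.random.default_rng(42)
n_countries = 30

wealth = rng.normal(size=n_countries)                   # hidden confounder
chocolate = 2 * wealth + rng.normal(size=n_countries)   # driven by wealth
nobels = 3 * wealth + rng.normal(size=n_countries)      # also driven by wealth

r = np.corrcoef(chocolate, nobels)[0, 1]
print(f"correlation = {r:.2f}")  # ~0.85, despite no causal link at all
```

A scatterplot of chocolate consumption against Nobel prizes would look impressive; it just wouldn’t mean what the paper implies.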

Nature is hiring a data editor…how will they make sense of the data?

It looks like the journal Nature is hiring a Chief Data Editor (link via Hilary M.). The primary purpose of this editor is to develop tools for collecting, curating, and distributing data, with the goal of improving reproducible research.

The main duties of the editor, as described by the ad, are: 

Nature Publishing Group is looking for a Chief Editor to develop a product aimed at making research data more available, discoverable and interpretable.

The ad also mentions having an eye for commercial potential; I wonder if this move was motivated by companies like figshare that are already providing a data-sharing service for reproducible research. I haven’t used figshare, but the early reports from friends who have are that it is great. 

The thing that bothered me about the ad is that there is a strong focus on data collection/storage/management but absolutely no mention of the second component of the data science problem: making sense of the data. Making sense of piles of data requires training in applied statistics (called by whatever name you like best). The ad doesn’t mention any such qualifications. 

Even if the goal of the position is just to build a competitor to figshare, it seems like a good idea for the person collecting the data to have some idea of what researchers are going to do with it. When dealing with data, those researchers will frequently be statisticians by one name or another. 

Bottom line: I’m stoked Nature is recognizing the importance of data in this very prominent way. But I wish they’d realize that a data revolution also requires a revolution in statistics. 

An essay on why programmers need to learn statistics

This is awesome. There are a few places with some strong language, but overall I think the message is pretty powerful. Via Tariq K. I agree with Tariq that one of the gems is:

If you want to measure something, then don’t measure other sh**. 

Fundamentals of Engineering Review Question Oops

The Fundamentals of Engineering Exam is the first licensing exam for engineers. You have to pass it on your way to becoming a professional engineer (PE). I was recently shown a problem from a review manual: 

When it is operating properly, a chemical plant has a daily production rate that is normally distributed with a mean of 880 tons/day and a standard deviation of 21 tons/day. During an analysis period, the output is measured with random sampling on 50 consecutive days, and the mean output is found to be 871 tons/day. With a 95 percent confidence level, determine if the plant is operating properly. 

  1. There is at least a 5 percent probability that the plant is operating properly. 
  2. There is at least a 95 percent probability that the plant is operating properly. 
  3. There is at least a 5 percent probability that the plant is not operating properly. 
  4. There is at least a 95 percent probability that the plant is not operating properly. 

Whoops…seems to be a problem there. I’m glad that engineers are expected to know some statistics; hopefully the engineering students taking the exam can spot the problem…but then how do they answer? (A quick calculation of what the data actually support is below.) 
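
For reference, here is the one-sample z-test the question sets up, using the numbers from the problem statement (my own illustration, not part of the review manual):

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar = 880.0, 21.0, 50, 871.0  # from the problem statement

se = sigma / sqrt(n)        # standard error of the mean, ~2.97 tons/day
z = (xbar - mu0) / se       # test statistic, ~-3.03
p = 2 * norm.cdf(-abs(z))   # two-sided p-value, ~0.0024

print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

Since p is well below 0.05, the data are inconsistent with proper operation at the 95 percent confidence level. But notice that none of the four answer choices states this correctly: each assigns a probability to the hypothesis itself, which is exactly what a frequentist test cannot deliver.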

figshare and don’t trust celebrities stating facts

A couple of links:

  1. figshare is a site where scientists can share data sets/figures/code. One of the goals is to encourage researchers to share negative results as well. I think this is a great idea - I often find negative results and this could be a place to put them. It also uses a tagging system, like Flickr, which seems well suited to scientific research discovery. They give you unlimited public space and 1GB of private space. This could be big, a place to help make reproducible research efforts user-friendly. Via TechCrunch.
  2. Don’t trust celebrities stating facts because they usually don’t know what they are talking about. I completely agree with this. Particularly because I have serious doubts about the statisteracy of most celebrities. Nod to Alex for the link (our most active link finder!).  

Sunday Data/Statistics Link Roundup

  1. Statistics help for journalists (don’t forget to keep rating stories!). This is the kind of thing that could grow into a statisteracy page. The author also has a really nice plug for public schools.
  2. An interactive graphic to determine if you are in the 1% from the New York Times (I’m not…).
  3. Mike Bostock’s d3.js presentation; this is some really impressive visualization software. You have to change the slide numbers manually, but it is totally worth it. Check out slide 10 and slide 14. This is the future of data visualization. Here is a beginner’s tutorial to d3.js by Mike Dewar.
  4. An online diagnosis prediction start-up (Symcat) based on data analysis from two Hopkins Med students.

Finally, a bit of a bleg. I’m going to try to make this link roundup a regular post. If you have ideas for links I should include, tweet us @simplystats or send them to Jeff’s email. 

In the era of data what is a fact?

The Twitter universe is abuzz about this article in the New York Times. Arthur Brisbane, who responds to readers’ comments, asks 

I’m looking for reader input on whether and when New York Times news reporters should challenge “facts” that are asserted by newsmakers they write about.

He goes on to give a couple of examples of qualitative facts that reporters have used in stories without questioning the veracity of the claims. As many people pointed out in the comments, this is completely absurd. Of course reporters should check facts and report when the facts in their stories, or those stated by candidates, are not correct. That is the purpose of news reporting. 

But I think the question is a little more subtle when it comes to quantitative facts and statistics. Depending on what subsets of the data you look at, what summary statistics you pick, and the way you present information, you can say a lot of different things with the same data. As long as you report what you calculated, you are technically reporting a fact - but it may be deceptive. The classic example is calculating median vs. mean home prices. If Bill Gates is in your neighborhood, no matter what the other houses cost, the mean price is going to be pretty high! (The quick example below makes this concrete.) 
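
A toy calculation with made-up prices shows how one outlier moves the mean but barely touches the median:

```python
from statistics import mean, median

neighborhood = [250_000, 300_000, 320_000, 350_000, 400_000]
with_gates = neighborhood + [120_000_000]  # one extreme (hypothetical) outlier

print(f"mean without outlier:   {mean(neighborhood):>12,.0f}")    # 324,000
print(f"median without outlier: {median(neighborhood):>12,.0f}")  # 320,000
print(f"mean with outlier:      {mean(with_gates):>12,.0f}")      # 20,270,000
print(f"median with outlier:    {median(with_gates):>12,.0f}")    # 335,000
```

Both mean figures are "facts" computed from the same data, and both could appear in a news story; they just tell very different stories about the neighborhood.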

Two concrete things can be done to deal with the malleability of facts in the data age.

First, we need to require that our reporters, policy makers, politicians, and decision makers report the context of numbers they state. It is tempting to use statistics as blunt instruments, punctuating claims. Instead, we should demand that people using statistics to make a point embed them in the broader context. For example, in the case of housing prices, if a politician reports the mean home price in a neighborhood, they should be required to state that potential outliers may be driving that number up. How do we make this demand? By not believing any isolated statistics - statistics will only be believed when the source is quoted and the statistic is described.  

But this isn’t enough, since the context and statistics will be meaningless without raising overall statisteracy (statistical literacy, not to be confused with numeracy). In the U.S., literacy campaigns have been promoted by library systems. Statisteracy is becoming just as critical; the same level of social pressure and assistance should be applied to individuals who don’t know basic statistics as to those who lack basic reading skills. Statistical organizations, academic departments, and companies interested in analytics/data science/statistics all have a vested interest in raising the population’s statisteracy. Maybe a website dedicated to understanding the consequences of basic statistical concepts, rather than the concepts themselves?

And don’t forget to keep rating health news stories!