- An important article about anti-science sentiment in the U.S. (via David S.). The politicization of scientific issues such as global warming, evolution, and healthcare (think vaccination) makes the U.S. less competitive. I think the lack of statistical literacy and training in the U.S. is one of the sources of the problem. People use/skew/mangle statistical analyses and experiments to support their view, and without a statistically well-trained public, it all looks “reasonable and scientific”. But when science seems to contradict itself, it loses credibility. Another reason to teach statistics to everyone in high school.
- Scientific American was loaded this past week; here is another article on cancer screening. The article covers several of the issues that make it hard to convince people that screening isn’t always good. Confusion over the predictive value of a positive test is a huge one in cancer screening right now. The author of the piece is someone worth following on Twitter: @hildabast.
- A bunch of data on the use of Github. Always cool to see new data sets that are worth playing with for student projects, etc. (via Hilary M.).
- A really interesting post over at Stats Chat about why we study seemingly obvious things. Hint, the reason is that “obvious” things aren’t always true.
- A story on “sentiment analysis” by NPR that suggests that most of the variation in a stock’s price during the day can be explained by the number of Facebook likes. Obviously, this is an interesting correlation. It would probably be more interesting for hedge funders/stockpickers if the correlation were with the change in the stock price the next day. (via Dan S.)
- Yihui Xie visited our department this week. We had a great time chatting with him about knitr/animation and all the cool work he is doing. Here are his slides from the talk he gave. Particularly check out his idea for a fast journal. You are seeing the future of publishing.
- Bonus Link: R is a trendy open source technology for big data.
- A fascinating article about the debate on whether to regulate sugary beverages. One of the protagonists is David Allison, a statistical geneticist, among other things. It is fascinating to see the interplay of statistical analysis and public policy. Yet another example of how statistics/data will drive some of the most important policy decisions going forward.
- A related article is this one on the way risk is reported in the media. It is becoming more and more clear that to be an educated member of society now means that you absolutely have to have a basic understanding of the concepts of statistics. Both leaders and the general public share responsibility for the dangers of misinterpreting risk, or of using it to mislead.
- A press release from the Census Bureau about how the choice of college major can have a major impact on career earnings. More data breaking the results down by employment characteristics and major are here and here. These data update some of the data we have talked about before in calculating expected salaries by major. (via Scott Z.)
- An interesting article about Recorded Future that describes how they are using social media data etc. to try to predict events that will happen. I think this isn’t an entirely crazy idea, but the thing that always strikes me about these sorts of projects is how hard it is to measure success. It is highly unlikely you will ever exactly predict a future event, so how do you define how close you were? For instance, if you predicted an uprising in Egypt, but missed by a month, is that a good or a bad prediction?
- Seriously guys, this is getting embarrassing. An article appears in the New England Journal “finding” an association between chocolate consumption and Nobel prize winners. This is, of course, a horrible statistical analysis, and unless it was published as a joke, it was irresponsible of the NEJM to print it. I’ll bet any student in Stat 101 could find the huge flaws with this analysis. If the editors of the major scientific journals want to continue publishing statistical papers, they should get serious about statistical editing.
I finally got around to reading the IOM report on translational omics and it is very good. The report lays out problems with current practices and how these led to undesired results such as the now infamous Duke trials and the growth in retractions in the scientific literature. Specific recommendations are provided related to reproducibility and validation. I expect the report will improve things, although I think bigger improvements will come as a result of retirements.
In general, I think the field of genomics (a label that is used quite broadly) is producing great discoveries and I strongly believe we are just getting started. But we can’t help but notice that retractions and questionable findings are particularly high in this field. In my view most of the problems we are currently suffering stem from the fact that a substantial number of people in positions of power do not understand statistics and have no experience with computing. Nevins’s biggest mistake was not admitting to himself that he did not understand what Baggerly and Coombes were saying. The lack of reproducibility just exacerbated the problem. The same is true for the editors that rejected the letters written by this pair in their effort to expose a serious problem - a problem that was obvious to all the statistics-savvy biologists I talked to.
Unfortunately Nevins is not the only head of a large genomics lab that does not understand basic statistical principles and has no programming/data-management experience. So how do people without the statistical and computing skills necessary to be considered experts in genomics become leaders of the field? I think this is due to the speed at which Biology changed from a data-poor discipline to a data-intensive one. For example, before microarrays, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure A below). In the mid 90s this suddenly changed to sifting through tens of thousands of numbers (see Figure B).
Note that typically, statistics is not a requirement of the Biology graduate programs associated with genomics. At Hopkins neither of the two major programs (CMM and BCMB) requires it. And this is expected, since outside of genomics one can do great Biology without quantitative skills, and for most of the 20th century most Biology was like this. So when the genomics revolution first arrived, the great majority of powerful Biology lab heads had no statistical training whatsoever. Nonetheless, a few of them decided to delve into this “sexy” new field and, using their copious resources, were able to perform some of the first big experiments. Similarly, Biology journals that were not equipped to judge the data-analytic component of genomics papers were eager to publish work in this field, a fact that further compounded the problem.
But as I mentioned above, in general, the field of genomics is producing wonderful results. Several lab heads did have statistics and computational expertise, while others formed strong partnerships with quantitative types. Here I should mention that for these partnerships to be successful the statisticians also needed to expand their knowledge base. The quantitative half of the partnership needs to be biology- and technology-savvy or they too can make mistakes that lead to retractions.
Nevertheless, the field is riddled with problems; enough to prompt an IOM report. But although the present is somewhat grim, I am optimistic about the future. The new generation of biologists leading the genomics field are clearly more knowledgeable about, and appreciative of, statistics and computing than the previous ones. Natural selection helps, as these new investigators can’t rely on pre-genomics-revolution accomplishments, and those that do not possess these skills are simply outperformed by those that do. I am also optimistic because biology graduate programs are starting to incorporate statistics and computation into their curricula. For example, as of last year, our Human Genetics program requires our Biostats 615-616 course.
It looks like the journal Nature is hiring a Chief Data Editor (link via Hilary M.). It looks like the primary purpose of this editor is to develop tools for collecting, curating, and distributing data with the goal of improving reproducible research.
The main duties of the editor, as described by the ad are:
Nature Publishing Group is looking for a Chief Editor to develop a product aimed at making research data more available, discoverable and interpretable.
The ad also mentions having an eye for commercial potential; I wonder if this move was motivated by companies like figshare, which are already providing a reproducible data service. I haven’t used figshare, but the early reports from friends who have are that it is great.
The thing that bothered me about the ad is that there is a strong focus on data collection/storage/management but absolutely no mention of the second component of the data science problem: making sense of the data. To make sense of piles of data requires training in applied statistics (called by whatever name you like best). The ad doesn’t mention any such qualifications.
Even if the goal of the position is just to build a competitor to figshare, it seems like a good idea for the person collecting the data to have some idea of what researchers are going to do with it. When dealing with data, those researchers will frequently be statisticians by one name or another.
Bottom line: I’m stoked Nature is recognizing the importance of data in this very prominent way. But I wish they’d realize that a data revolution also requires a revolution in statistics.
Statistics depends on math, like a lot of other disciplines (physics, engineering, chemistry, computer science). But just like those other disciplines, statistics is not math; math is just a tool used to solve statistical problems. Unlike those other disciplines, statistics gets lumped in with math in headlines. Whenever people use statistical analysis to solve an interesting problem, the headline reads:
“Math can be used to solve amazing problem X”
“The Math of Y”
Here are some examples:
The Mathematics of Lego - Using data on legos to estimate a distribution
The Mathematics of War - Using data on conflicts to estimate a distribution
Usain Bolt can run faster with maths (Tweet) - Turns out they analyzed data on start times to come to the conclusion
The Mathematics of Beauty - Analysis of data relating dating profile responses and photo attractiveness
These are just a few off the top of my head, but I regularly see headlines like this. I think there are a couple of reasons for math being grouped with statistics: (1) many of the founders of statistics were mathematicians first (but not all of them), (2) many statisticians still identify themselves as mathematicians, and (3) in some cases statistics and statisticians define themselves pretty narrowly.
With respect to (3), consider the following list of disciplines:
- Data science
- Machine learning
- Natural language processing
- Signal processing
- Business analytics
- Text mining
- Social science statistics
- Process control
All of these disciplines could easily be classified as “applied statistics”. But how many folks in each of those disciplines would classify themselves as statisticians? More importantly, how many would be claimed by statisticians?
I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder’s Predicament, which, boiled down, essentially says that computer programming languages have grown too complex - so children/newbies can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer do things like play some music or make a video.
I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax.
Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:
- It would have to be easy to load data sets, one line of not complicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background.
- It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures.
- It would be awesome if the functions would include some sort of dynamic graphics (with svgAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn.
That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends.
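To make the two-line dream concrete, here is a toy sketch of the idea. The post imagines it as an R package; this sketch is in Python, and the dataset name comes from the post while its contents are invented for illustration. The point is the interface: a loader that hides all the scraping and cleaning behind one call, and a plot helper with zero required styling arguments.

```python
# Made-up bundled data standing in for what a real package would fetch
# and clean behind the scenes (the numbers here are purely illustrative).
_DATASETS = {
    "education-data": {
        "year": [2000, 2005, 2010],
        "graduation_rate": [0.72, 0.75, 0.78],
    },
}

def load(name):
    """One easy line: return a ready-to-use dataset by name."""
    return _DATASETS[name]

def quickplot(data, x, y):
    """One easy line: a text 'plot' with sensible defaults -
    no pch/cex/col to memorize."""
    for xi, yi in zip(data[x], data[y]):
        print(f"{xi}: " + "#" * int(yi * 40))

# The whole user experience is two lines:
edu = load("education-data")
quickplot(edu, "year", "graduation_rate")
```

A real package would swap the hard-coded dictionary for web scraping and formatting, and the text bars for pretty graphics, but the user-facing interface could stay this small.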
The Twitter universe is abuzz about this article in the New York Times. Arthur Brisbane, who responds to readers’ comments, asks
I’m looking for reader input on whether and when New York Times news reporters should challenge “facts” that are asserted by newsmakers they write about.
He goes on to give a couple of examples of qualitative facts that reporters have used in stories without questioning the veracity of the claims. As many people pointed out in the comments, this is completely absurd. Of course reporters should check facts and report when the facts in their stories, or stated by candidates, are not correct. That is the purpose of news reporting.
But I think the question is a little more subtle when it comes to quantitative facts and statistics. Depending on what subsets of data you look at, what summary statistics you pick, and the way you present information - you can say a lot of different things with the same data. As long as you report what you calculated, you are technically reporting a fact - but it may be deceptive. The classic example is calculating median vs. mean home prices. If Bill Gates is in your neighborhood, no matter what the other houses cost, the mean price is going to be pretty high!
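The home price example is easy to check directly. Here is a minimal sketch with made-up prices, where one Gates-sized outlier drags the mean far above what any typical house costs while leaving the median untouched:

```python
import statistics

# Hypothetical neighborhood home prices in dollars; the last entry
# plays the role of the Bill Gates mansion.
prices = [250_000, 275_000, 300_000, 310_000, 320_000, 120_000_000]

mean_price = statistics.mean(prices)      # pulled way up by the outlier
median_price = statistics.median(prices)  # robust to the outlier

print(f"mean:   ${mean_price:,.0f}")    # mean:   $20,242,500
print(f"median: ${median_price:,.0f}")  # median: $305,000
```

Both numbers are technically “the” price of a home in the neighborhood, which is exactly why a reported statistic without context can be deceptive.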
Two concrete things can be done to deal with the malleability of facts in the data age.
First, we need to require that our reporters, policy makers, politicians, and decision makers report the context of numbers they state. It is tempting to use statistics as blunt instruments, punctuating claims. Instead, we should demand that people using statistics to make a point embed them in the broader context. For example, in the case of housing prices, if a politician reports the mean home price in a neighborhood, they should be required to state that potential outliers may be driving that number up. How do we make this demand? By not believing any isolated statistics - statistics will only be believed when the source is quoted and the statistic is described.
But this isn’t enough, since the context and statistics will be meaningless without raising overall statisteracy (statistical literacy, not to be confused with numeracy). In the U.S. literacy campaigns have been promoted by library systems. Statisteracy is becoming just as critical; the same level of social pressure and assistance should be applied to individuals who don’t know basic statistics as those who don’t have basic reading skills. Statistical organizations, academic departments, and companies interested in analytics/data science/statistics all have a vested interest in raising the population statisteracy. Maybe a website dedicated to understanding the consequences of basic statistical concepts, rather than the concepts themselves?
And don’t forget to keep rating health news stories!
The Supreme Court’s interpretation of statistical correlation may determine the future of personalized medicine
The Supreme Court heard oral arguments last week in the case Mayo Collaborative Services vs. Prometheus Laboratories (No 10-1150). At issue is a patent Prometheus Laboratories holds for making decisions about the treatment of disease on the basis of a measurement of a specific, naturally occurring molecule and a corresponding calculation. The specific language at issue is a little technical, but the key claim from the patent under dispute is:
1. A method of optimizing therapeutic efficacy for treatment of an immune-mediated gastrointestinal disorder, comprising:
(a) administering a drug providing 6-thioguanine to a subject having said immune-mediated gastrointestinal disorder; and
(b) determining the level of 6-thioguanine in said subject having said immune-mediated gastrointestinal disorder,
wherein the level of 6-thioguanine less than about 230 pmol per 8x10^8 red blood cells indicates a need to increase the amount of said drug subsequently administered to said subject and
wherein the level of 6-thioguanine greater than about 400 pmol per 8x10^8 red blood cells indicates a need to decrease the amount of said drug subsequently administered to said subject.
So basically the patent is on a decision made about treatment on the basis of a statistical correlation. When the levels of a specific molecule (6-thioguanine) are too high, then the dose of a drug (thiopurine) should be decreased; if they are too low, then the dose of the drug should be increased. Here (and throughout the post) correlation is interpreted more loosely as a relationship between two variables, rather than in the strict sense of a linear relationship between two quantitative variables.
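Stripped of the legal language, the claimed method is just a threshold rule. A minimal sketch (the function name is mine; the cutoffs are the ~230 and ~400 pmol per 8x10^8 red blood cells stated in the claim) makes plain how little computation the patent covers:

```python
def dosing_decision(level_pmol: float) -> str:
    """Map a measured 6-thioguanine level (pmol per 8x10^8 red blood
    cells) to the dose adjustment described in the patent claim."""
    if level_pmol < 230:
        return "increase dose"   # below ~230: drug level too low
    elif level_pmol > 400:
        return "decrease dose"   # above ~400: drug level too high
    else:
        return "no change"       # in the target window

print(dosing_decision(150))  # increase dose
print(dosing_decision(300))  # no change
print(dosing_decision(450))  # decrease dose
```

That the entire "invention" reduces to two comparisons is precisely why the case matters: the dispute is over whether an observed correlation plus a simple decision rule is patentable subject matter.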
In today’s Wall Street Journal, Amy Marcus has a piece on the Citizen Science movement, focusing on citizen science in health in particular. I am fully in support of this enthusiasm and a big fan of citizen science - if done properly. There have already been some pretty big success stories. As more companies like Fitbit and 23andMe spring up, it is really easy to collect data about yourself (right Chris?). At the same time organizations like Patients Like Me make it possible for people with specific diseases or experiences to self-organize.
But the thing that struck me the most in reading the article is the importance of statistical literacy for citizen scientists, reporters, and anyone reading these articles. For example the article says:
The questions that most people have about their DNA—such as what health risks they face and how to prevent them—aren’t always in sync with the approach taken by pharmaceutical and academic researchers, who don’t usually share any potentially life-saving findings with the patients.
I think it’s pretty unlikely that any organization would hide life-saving findings from the public. My impression from reading the article is that this statement refers to keeping results blinded from patients/doctors during an experiment or clinical trial. Blinding is a critical component of clinical trials, which reduces many potential sources of bias in the results of a study. Obviously, once the trial/study has ended (or been stopped early because a treatment is effective) the results are quickly disseminated.
Several key statistical issues are then raised in bullet-point form without discussion:
Amateurs may not collect data rigorously, they say, and may draw conclusions from sample sizes that are too small to yield statistically reliable results.
Having individuals collect their own data poses other issues. Patients may enter data only when they are motivated, or feeling well, rendering the data useless. In traditional studies, both doctors and patients are typically kept blind as to who is getting a drug and who is taking a placebo, so as not to skew how either group perceives the patients’ progress.
The article goes on to describe an anecdotal example of citizen science - which suffers from a key statistical problem (small sample size):
Last year, Ms. Swan helped to run a small trial to test what type of vitamin B people with a certain gene should take to lower their levels of homocysteine, an amino acid connected to heart-disease risk. (The gene affects the body’s ability to metabolize B vitamins.)
Seven people—one in Japan and six, including herself, in her local area—paid around $300 each to buy two forms of vitamin B and Centrum, which they took in two-week periods followed by two-week “wash-out” periods with no vitamins at all.
The article points out the issue:
The scientists clapped politely at the end of Ms. Swan’s presentation, but during the question-and-answer session, one stood up and said that the data was not statistically significant—and it could be harmful if patients built their own regimens based on the results.
But doesn’t carefully explain the importance of sample size, suggesting instead that the only reason why you need more people is “insure better accuracy”.
It strikes me that statistical literacy is critical if the citizen science movement is going to go forward. Ideas like experimental design, randomization, blinding, placebos, and sample size need to be in the toolbox of any practicing citizen scientist.
One major drawback is that there are very few places where the general public can learn about statistics. Mostly statistics is taught in university courses. Resources like the Khan Academy and the Cartoon Guide to Statistics exist, but are only really useful if you are self-motivated and have some grounding in math/statistics to begin with.
Since knowledge of basic statistical concepts is quickly becoming indispensable for citizen science or even basic life choices like deciding on healthcare options, do we need “adult statistical literacy courses”? These courses could focus on the basics of experimental design and how to understand results in stories about science in the popular press. It feels like it might be time to add a basic understanding of statistics and data to reading/writing/arithmetic as critical life skills. I’m not the only one who thinks so.