Sunday Data/Statistics Link Roundup (10/28/12)

  1. An important article about anti-science sentiment in the U.S. (via David S.). The politicization of scientific issues such as global warming, evolution, and healthcare (think vaccination) makes the U.S. less competitive. I think the lack of statistical literacy and training in the U.S. is one of the sources of the problem. People use, skew, and mangle statistical analyses and experiments to support their views, and without a statistically well-trained public it all looks “reasonable and scientific”. But when science seems to contradict itself, it loses credibility. Another reason to teach statistics to everyone in high school.
  2. Scientific American was loaded this past week; here is another article on cancer screening. The article covers several of the issues that make it hard to convince people that screening isn’t always good. Confusion about the positive predictive value is a huge one in cancer screening right now. The author of the piece is someone worth following on Twitter: @hildabast.
  3. A bunch of data on the use of Github. Always cool to see new data sets that are worth playing with for student projects, etc. (via Hilary M.). 
  4. A really interesting post over at Stats Chat about why we study seemingly obvious things. Hint: the reason is that “obvious” things aren’t always true. 
  5. A story on “sentiment analysis” by NPR that suggests that most of the variation in a stock’s price during the day can be explained by the number of Facebook likes. Obviously, this is an interesting correlation. Probably more interesting for hedge funders/stockpickers if the correlation were with the change in stock price the next day. (via Dan S.)
  6. Yihui Xie visited our department this week. We had a great time chatting with him about knitr/animation and all the cool work he is doing. Here are his slides from the talk he gave. Particularly check out his idea for a fast journal. You are seeing the future of publishing.  
  7. Bonus Link: R is a trendy open source technology for big data
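The caveat in item 5 (same-day correlation versus next-day predictive power) is easy to illustrate with a simulation. Everything below is synthetic data of my own invention, not the NPR analysis: a "likes" series built to track today's price will correlate strongly with today's price while saying nothing about tomorrow's change.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 250  # synthetic trading days

# Price follows a random walk; likes co-move with today's price.
price = np.cumsum(rng.normal(0, 1, n)) + 100
likes = 50 * price + rng.normal(0, 100, n)

# Same-day correlation: likes vs. today's price level.
same_day = np.corrcoef(likes, price)[0, 1]

# Next-day correlation: today's likes vs. tomorrow's price *change* --
# the quantity a stock picker would actually care about.
next_day = np.corrcoef(likes[:-1], np.diff(price))[0, 1]

print(f"same-day correlation: {same_day:.2f}")
print(f"next-day correlation: {next_day:.2f}")
```

The same-day correlation comes out high by construction, while the next-day correlation hovers near zero, which is why the predictive version of the claim would be far more interesting.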

Sunday Data/Statistics Link Roundup (10/14/12)

  1. A fascinating article about the debate on whether to regulate sugary beverages. One of the protagonists is David Allison, a statistical geneticist, among other things. It is fascinating to see the interplay of statistical analysis and public policy. Yet another example of how statistics/data will drive some of the most important policy decisions going forward. 
  2. A related article is this one on the way risk is reported in the media. It is becoming more and more clear that being an educated member of society now means having a basic understanding of the concepts of statistics. Both leaders and the general public bear responsibility for the dangers of misinterpreting risk, or of misleading with it. 
  3. A press release from the Census Bureau about how the choice of college major can have a major impact on career earnings. More data breaking the results down by employment characteristics and major are here and here. These data update some of the data we have talked about before in calculating expected salaries by major. (via Scott Z.)
  4. An interesting article about Recorded Future that describes how they are using social media data etc. to try to predict events that will happen. I think this isn’t an entirely crazy idea, but the thing that always strikes me about these sorts of projects is how hard it is to measure success. It is highly unlikely you will ever exactly predict a future event, so how do you define how close you were? For instance, if you predicted an uprising in Egypt, but missed by a month, is that a good or a bad prediction? 
  5. Seriously guys, this is getting embarrassing. An article appears in the New England Journal "finding" an association between chocolate consumption and Nobel prize winners. This is, of course, a horrible statistical analysis, and unless it was a joke, it was irresponsible of the NEJM to publish it. I’ll bet any student in Stat 101 could find the huge flaws with this analysis. If the editors of the major scientific journals want to continue publishing statistical papers, they should get serious about statistical editing.
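To spell out one of the flaws a Stat 101 student would spot: at the country level, a lurking variable like wealth can drive both chocolate consumption and Nobel counts, manufacturing a strong correlation with no causal link. A quick simulation makes the point (all numbers are invented, and wealth enters both variables with a coefficient of 1 purely for simplicity):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30  # synthetic "countries"

wealth = rng.normal(0, 1, n)                 # lurking variable (e.g., GDP)
chocolate = wealth + rng.normal(0, 0.5, n)   # wealth drives consumption
nobels = wealth + rng.normal(0, 0.5, n)      # wealth also drives prizes

# Chocolate and Nobels are strongly correlated despite no causal arrow
# between them -- both are children of the confounder.
r = np.corrcoef(chocolate, nobels)[0, 1]
print(f"chocolate-Nobel correlation: {r:.2f}")

# Removing the confounder (possible here because we know the true model)
# sends the association back toward zero.
r_adj = np.corrcoef(chocolate - wealth, nobels - wealth)[0, 1]
print(f"after removing wealth:       {r_adj:.2f}")
```

Real ecological data never hand you the confounder so cleanly, which is exactly why country-level scatterplots like the NEJM one prove nothing about causation.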

Sunday Data/Statistics Link Roundup (7/15/12)

  1. A really nice list of journals’ software/data release policies from Titus’ blog. Interesting that he couldn’t find a data release policy for the New England Journal of Medicine. I wonder if that is because it publishes mostly clinical studies, where the data are often protected for privacy reasons? It seems like there is going to eventually be a big discussion of the relative importance of privacy and open data in the clinical world. 
  2. Some interesting software that can be used to build virtual workflows for computational science. It seems like a lot of data analysis is still done via “drag and drop” programs. I can’t help but wonder whether our effort should be focused on developing better drag-and-drop tools or on educating the next generation of scientists to have at least minimal scripting capabilities. 
  3. We added StatsChat by Thomas L. and company to our blogroll. Lots of good stuff there, for example, this recent post on when randomized trials don’t help. You can also follow them on Twitter.  
  4. A really nice post on processing public data with R. As more and more public data becomes available, from governments, companies, APIs, etc. the ability to quickly obtain, process, and visualize public data is going to be hugely valuable. 
  5. Speaking of public data, you could get it from APIs or from government websites. But beware those category 2 problems.

Sunday data/statistics link roundup (6/10)

  1.  Yelp put a data set online for people to play with, including reviews, star ratings, etc. This could be a really neat data set for a student project. The data they have made available focus on the area around 30 universities. My alma mater is one of them. 
  2. A sort of goofy talk about how to choose the optimal marriage partner when viewing the problem as an optimal stopping problem. The author suggests that you need to date around 196,132 partners to make sure you have made the optimal decision. Fortunately for the Simply Statistics authors, it took many fewer for us all to end up with our optimal matches. Via @fhuszar.
  3. An interesting article on the recent Kaggle contest that sought to identify statistical algorithms that could accurately match human scoring of written essays. Several students in my advanced biostatistics course competed in this competition and did quite well. I understand the need for these kinds of algorithms, since it takes a huge amount of human labor to score these essays well. But it also makes me a bit sad, since it seems even the best algorithms will still have a hard time scoring creativity. For example, this phrase from my favorite president doesn’t use big words, but it sure is clever: “I think there is only one quality worse than hardness of heart and that is softness of head.”
  4. A really good article by friend of the blog, Steven, on the perils of gene patents. This part sums it up perfectly, “Genes are not inventions. This simple fact, which no serious scientist would dispute, should be enough to rule them out as the subject of patents.” Simply Statistics has weighed in on this issue a couple of times before. But I think in light of 23andMe’s recent Parkinson’s patent it bears repeating. Here is an awesome summary of the issue from Genomics Lawyer.
  5. A proposal for a really fast statistics journal I wrote about a month or two ago. Expect more on this topic from me this week. 
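The optimal-stopping problem in item 2 is the classic secretary problem, and its famous answer is the 1/e rule: observe and reject the first n/e candidates, then take the first one who beats everyone seen so far. A quick simulation (my own sketch, not taken from the linked talk) confirms the rule picks the single best candidate about 37% of the time:

```python
import math
import random

def best_chosen(n, trials=20000, rng=random.Random(1)):
    """Fraction of trials in which the 1/e stopping rule picks the best."""
    cutoff = int(n / math.e)  # observe-only phase: first ~n/e candidates
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))  # 0 = worst, n - 1 = best
        rng.shuffle(ranks)
        benchmark = max(ranks[:cutoff], default=-1)
        # Take the first later candidate who beats everyone seen so far;
        # if nobody does, you are stuck with the last one.
        chosen = ranks[-1]
        for r in ranks[cutoff:]:
            if r > benchmark:
                chosen = r
                break
        wins += chosen == n - 1
    return wins / trials

print(f"P(pick the best) with n=100: {best_chosen(100):.3f}")  # ~ 1/e ≈ 0.368
```

Note the rule maximizes the chance of getting the single best candidate; if you would settle for a near-best partner, far less dating is required.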

Sunday data/statistics link roundup (5/27)

  1. Amanda Cox on the process they went through to come up with this graphic about the Facebook IPO. So cool to see how R is used in the development process. A favorite quote of mine, “But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue.” One of the more interesting things about posts like this is that you get to see how statistics works under a deadline. This is typically the role of the analyst, since they come in late and there is usually a deadline looming…
  2. An interview with Steve Blank about Silicon Valley and how venture capitalists (VC’s) are focused on social technologies since they can make a profit quickly. A depressing/fascinating quote from this one is, “If I have a choice of investing in a blockbuster cancer drug that will pay me nothing for ten years, at best, whereas social media will go big in two years, what do you think I’m going to pick? If you’re a VC firm, you’re tossing out your life science division.” He also goes on to say thank goodness for the NIH, NSF, and Google who are funding interesting “real science” problems. This probably deserves its own post later in the week: the difference between analyzing data because it will make money and analyzing data to solve a hard science problem. The latter usually takes way more patience and the data take much longer to collect. 
  3. An interesting post on how Obama’s analytics department ran an A/B test which improved the number of people who signed up for his mailing list. I don’t necessarily agree with their claim that they helped raise $60 million; there may be confounding factors, meaning that the individuals who sign up with the best combination of image/button don’t necessarily donate as much. But still, an interesting look into why Obama needs statisticians.
  4. A cute statistics cartoon from @kristin_linn via Chris V. Yes, we are now shamelessly reposting cute cartoons for retweets :-). 
  5. Rafa’s post inspired some interesting conversation both on our blog and on some statistics mailing lists. It seems to me that everyone is making an effort to understand the increasingly diverse field of statistics, but we still have a ways to go. I’m particularly interested in discussion on how we evaluate the contribution/effort behind making good and usable academic software. I think the strength of the Bioconductor community and the rise of Github among academics are a good start. For example, it is really useful that Bioconductor now tracks the number of package downloads.
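The statistical core of an email-signup A/B test like the one in item 3 is just a comparison of two proportions. Here is a minimal sketch with invented counts (the helper function and all numbers are mine, not the campaign's):

```python
import math

def two_prop_ztest(x_a, n_a, x_b, n_b):
    """z statistic for H0: signup rates of variants A and B are equal."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical counts: variant B (a new image/button combo) converts
# 4.6% of visitors versus 4.0% for variant A.
z = two_prop_ztest(x_a=1200, n_a=30000, x_b=1380, n_b=30000)
print(f"z = {z:.2f}")  # → z = 3.62; |z| > 1.96 means significant at 5%
```

Note that a significant z for signups says nothing by itself about downstream donation amounts, which is exactly the confounding worry about the $60 million claim.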

Sunday data/statistics link roundup (4/29)

  1. Nature genetics has an editorial on the Mayo and Myriad cases. I agree with this bit: “In our opinion, it is not new judgments or legislation that are needed but more innovation. In the era of whole-genome sequencing of highly variable genomes, it is increasingly hard to justify exclusive ownership of particularly useful parts of the genome, and method claims must be more carefully described.” Via Andrew J.
  2. One of Tech Review’s 10 emerging technologies from a February 2003 article? Data mining. I think doing interesting things with data has probably always been a hot topic, it just gets press in cycles. Via Aleks J. 
  3. An infographic in the New York Times compares the profits and taxes of Apple over time, here is an explanation of how they do it. (Via Tim O.)
  4. Saw this tweet via Joe B. I’m not sure if the frequentists or the Bayesians are winning, but it seems to me that the battle no longer matters to my generation of statisticians: there are too many data sets to analyze, so better to just use what works!
  5. Statistical and computational algorithms that write news stories. Simply Statistics remains 100% human written (for now). 
  6. The 5 most critical statistical concepts. 

Sunday data/statistics link roundup (4/8)

  1. This is a great article about the illusion of progress in machine learning. In part, I think it explains why the Leekasso (just using the top 10) isn’t a totally silly idea. I also love how he talks about sources of uncertainty in real prediction problems that aren’t part of the classical models when developing prediction algorithms. I think that this is a hugely underrated component of building an accurate classifier - just finding the quirks particular to a type of data. Via @chlalanne.
  2. An interesting post from Michael Eisen on a serious abuse of statistical ideas in the New York Times. The professor of genetics quoted in the story apparently wasn’t aware of the birthday problem. Lack of statistical literacy, even among scientists, is becoming critical. I would love it if the Khan Academy (or some enterprising students) would come up with a set of videos that just explained a bunch of basic statistical concepts - skipping all the hard math and focusing on the ideas. 
  3. TechCrunch finally caught up to our Mayo vs. Prometheus coverage. This decision is going to affect more than just personalized medicine. Speaking of the decision, stay tuned for more on that topic from the folks over here at Simply Statistics. 
  4. How much is a megabyte? I love this question. They asked people on the street how much data is in a megabyte. The answers were pretty far-ranging, it looks like. This question is hyper-critical for scientists in the new era, but the better question might be, “How much is a terabyte?”
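The birthday problem from item 2 takes only a few lines to compute exactly; with just 23 people the chance of a shared birthday already tops 50%, which is the intuition the quoted geneticist apparently missed. A quick sketch (assuming 365 equally likely birthdays):

```python
def p_shared_birthday(k, days=365):
    """Probability that at least two of k people share a birthday."""
    p_all_distinct = 1.0
    for i in range(k):
        # Each new person must avoid all birthdays taken so far.
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct

print(f"k=23: {p_shared_birthday(23):.3f}")  # → 0.507
print(f"k=50: {p_shared_birthday(50):.3f}")  # → 0.970
```

The lesson generalizes: the number of *pairs* grows quadratically, so "surprising" coincidences among many items are usually not surprising at all.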

Sunday data/statistics link roundup (3/25)

  1. The psychologist whose experiment didn’t replicate, and who then went off on the scientists who ran the replication experiment, is at it again. I don’t see a clear argument about the facts of the matter in his post, just more name calling. This seems to be a case study in what not to do when your study doesn’t replicate. More on “conceptual replication” in there too. 
  2. Berkeley is running a data science course with instructors Jeff Hammerbacher and Mike Franklin, I looked through the notes and it looks pretty amazing. Stay tuned for more info about my applied statistics class which starts this week. 
  3. A cool article about Factual, one of the companies whose sole mission in life is to collect and distribute data. We’ve linked to them before. We are so out ahead of the Times on this one…
  4. This isn’t statistics related, but I love this post about Jeff Bezos. If we all indulged our inner 11 year old a little more, it wouldn’t be a bad thing. 
  5. If you haven’t had a chance to read Reeves guest post on the Mayo Supreme Court decision yet, you should, it is really interesting. A fascinating intersection of law and statistics is going on in the personalized medicine world right now. 

Sunday data/statistics link roundup (3/18)

  1. A really interesting proposal by Rafa (in Spanish - we’ll get on him to write a translation) for the University of Puerto Rico. The post concerns changing the focus from simply teaching to creating knowledge and the potential benefits to both the university and to Puerto Rico. It also has a really nice summary of the benefits that the university system in the United States has produced. Definitely worth a read. The comments are also interesting, it looks like Rafa’s post is pretty controversial…
  2. An interesting article suggesting that the Challenger Space Shuttle disaster was at least in part due to bad data visualization. Via @DatainColour.
  3. The Snyderome is getting a lot of attention in genomics circles. He used as many new technologies as he could to measure a huge amount of molecular information about his body over time. I am really on board with the excitement about measurement technologies, but this poses a huge challenge for statistics and statistical literacy. If this kind of thing becomes commonplace, the potential for false positives and ghost diagnoses is huge without a really good framework for uncertainty. Via Peter S. 
  4. More news about the Nike API. Now that is how to unveil some data! 
  5. Add the Nike API to the list of potential statistics projects for students. 

Sunday Data/Statistics Link Roundup (3/11)

  1. This is the big one. ESPN has opened up access to their API! It looks like there may only be access to some of the data for the general public though, does anyone know more? 
  2. Looks like ESPN isn’t the only sports-related organization in the API mood, Nike plans to open up an API too. It would be great if they had better access to individual, downloadable data. 
  3. Via Leonid K.: a highly influential psychology study failed to replicate in a study published in PLoS One. The author of the original study went off on the author of the paper, on PLoS One, and on the reporter who broke the story (including personal attacks!). To me, it looks like the authors of the PLoS One paper actually did a more careful study than the original authors. The authors of the PLoS One paper, the reporter, and the editor of PLoS One all replied in a much more reasonable way. See this excellent summary for all the details. Here are a few choice quotes from the comments: 

1. But there’s a long tradition in social psychology of experiments as parables,

2. I’d love to write a really long response, but let’s just say: priming methods like these fail to replicate all the time (frequently in my own studies), and the news that one of Bargh’s studies failed to replicate is not surprising to me at all.

3. This distinction between direct and conceptual replication helps to explain why a psychologist isn’t particularly concerned whether Bargh’s finding replicates or not.

4. Reproducible != Replicable in scientific research. But Roger’s perspective on reproducible research still seems appropriate here.