Simply Statistics

September 2011

44 posts

Battling Bad Science

Here is a pretty awesome TED talk by epidemiologist Ben Goldacre where he highlights how science can be used to deceive/mislead. It’s sort of like epidemiology 101 in 15 minutes. 

This seems like a highly topical talk. Over on his blog, Steven Salzberg has pointed out that Dr. Oz has recently been engaging in some of these shady practices on his show. Too bad he didn’t check out the video first. 

In the comments section of the TED talk, one viewer points out that Dr. Goldacre doesn’t talk about the role of the FDA and other regulatory agencies. I think that regulatory agencies are under-appreciated and deserve credit for addressing many of these potential problems in the conduct of clinical trials. 

Maybe there should be an agency regulating how science is reported in the news? 

Sep 30, 2011 (1 note)
#salzberg #ted talk #link
Why does Obama need statisticians?

It’s worth following up a little on why the Obama campaign is recruiting statisticians (note to Karen: I am not looking for a new job!). Here’s the blurb for the position of “Statistical Modeling Analyst”:

The Obama for America Analytics Department analyzes the campaign’s data to guide election strategy and develop quantitative, actionable insights that drive our decision-making. Our team’s products help direct work on the ground, online and on the air. We are a multi-disciplinary team of statisticians, mathematicians, software developers, general analysts and organizers - all striving for a single goal: re-electing President Obama. We are looking for staff at all levels to join our department from now through Election Day 2012 at our Chicago, IL headquarters.

Statistical Modeling Analysts are charged with predicting electoral outcomes using statistical models. These models will be instrumental in helping the campaign determine how to most effectively use its resources.

I wonder if there’s a bonus for predicting the correct outcome, win or lose?

The Obama campaign didn’t invent the idea of heavy data analysis in campaigns, but they seem to be heavy adopters. There are 3 openings in the “Analytics” category as of today.

Now, can someone tell me why they don’t just call it simply “Statistics”?

Sep 29, 2011 (9 notes)
#obama #statistician #jobs
Kindle Fire and Machine Learning

Amazon released its new iPad competitor, the Kindle Fire, today. A quick read through the description shows it has some interesting features, including a custom-built web browser called Silk. One innovation that they claim is that the browser works in conjunction with Amazon’s EC2 cloud computing platform to speed up the web-surfing experience by doing some computing on your end and some on their end. Seems cool, if it really does make things faster.

Also there’s this interesting bit:

Machine Learning

Finally, Silk leverages the collaborative filtering techniques and machine learning algorithms Amazon has built over the last 15 years to power features such as “customers who bought this also bought…” As Silk serves up millions of page views every day, it learns more about the individual sites it renders and where users go next. By observing the aggregate traffic patterns on various web sites, it refines its heuristics, allowing for accurate predictions of the next page request. For example, Silk might observe that 85 percent of visitors to a leading news site next click on that site’s top headline. With that knowledge, EC2 and Silk together make intelligent decisions about pre-pushing content to the Kindle Fire. As a result, the next page a Kindle Fire customer is likely to visit will already be available locally in the device cache, enabling instant rendering to the screen.

That seems like a logical thing for Amazon to do. While the idea of pre-fetching pages is not particularly new, I haven’t yet heard of the idea of doing data analysis on web pages to predict which things to pre-fetch. One issue this raises in my mind is that, in order to do this, Amazon needs to combine information across browsers, which means your surfing habits will become part of one large mega-dataset. Is that what we want?

On the one hand, Amazon already does some form of this by keeping track of what you buy. But keeping track of every web page you go to and what links you click on seems like a much wider scope.

Sep 29, 2011 (15 notes)
#machine learning #kindle #ec2
Once in a lifetime collapse

Baseball Prospectus uses Monte Carlo simulation to predict which teams will make the postseason. According to this page, on Sept 1st, the probability of the Red Sox making the playoffs was 99.5%. They were ahead of the Tampa Bay Rays by 9 games. Before last night’s game, in September, the Red Sox had lost 19 of 26 games and were tied with the Rays for the wild card (the last spot for the playoffs). To make this event even more improbable, the Red Sox were up by one in the ninth with two outs and no one on for the last-place Orioles. In this situation the team that’s winning wins more than 95% of the time. The Rays were in exactly the same situation as the Orioles, losing to the first-place Yankees (well, their subs). So guess what happened? The Red Sox lost, the Rays won. But perhaps the most amazing event is that these two games, both lasting much longer than usual (one due to rain, the other to extra innings), ended within seconds of each other.
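To get a feel for what a Monte Carlo estimate like this involves, here is a minimal sketch in Python. It is not the Baseball Prospectus model. It assumes every game is an independent coin flip, that both teams have the same number of games left (27 is my guess from the numbers above), and that the two teams never play each other. The function name and parameters are made up for illustration.

```python
import random


def collapse_probability(lead, games_left, p_leader=0.5, p_chaser=0.5,
                         n_sims=20000, seed=1):
    """Estimate the chance that the chasing team ties or passes the leader.

    Assumption: each remaining game is an independent Bernoulli trial and
    both teams play the same number of remaining games, so the chaser
    erases the lead when it wins at least `lead` more games than the leader.
    """
    rng = random.Random(seed)
    caught = 0
    for _ in range(n_sims):
        leader_wins = sum(rng.random() < p_leader for _ in range(games_left))
        chaser_wins = sum(rng.random() < p_chaser for _ in range(games_left))
        if chaser_wins - leader_wins >= lead:
            caught += 1
    return caught / n_sims


if __name__ == "__main__":
    # Hypothetical inputs: a 9-game lead with 27 games left for each team.
    print(collapse_probability(lead=9, games_left=27))
```

Even with evenly matched coin-flip teams, a 9-game lead is rarely erased in a month, which is roughly why the published playoff probability was so close to 100%.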

Update: Nate Silver beat me to it. And has much more!

Sep 29, 2011
#baseball #nate silver
Obama recruiting analysts who know R → rdatamining.wordpress.com
Sep 28, 2011 (1 note)
#Link
The Open Data Movement

I’m not sure which of the categories this infographic on open data falls into, but I find it pretty exciting anyway. It shows the rise of APIs and how data are increasingly open. It seems like APIs are all over the place in the web development community, but less so in health statistics. Although, from the comments, John M. posts places to find free government data including some health data: 

1) CDC’s National Center for Health Statistics, http://www.cdc.gov/nchs/
2) NHANES (National Health and Nutrition Examination Survey)  http://www.cdc.gov/nchs/nhanes.htm
3) National Health Interview Survey: http://www.cdc.gov/nchs/nhis.htm
4) World Health Organization: www.who.gov
5) US Census Bureau: www.uscensus.gov
6) Emory maintains a repository of links related to stats/biostat including online databases 

http://www.sph.emory.edu/cms/departments_centers/bios/resources.html#govlist

Sep 28, 2011 (4 notes)
#Data
The future of graduate education

Stanford is offering a free online course and more than 100,000 students have registered. This got the blogosphere talking about the future of universities. Matt Yglesias thinks that “colleges are the next newspaper and are destined for some very uncomfortable adjustments”. Tyler Cowen reminded us that since 2003 he has been saying that professors are becoming obsolete. His main point is that thanks to the internet, the need for lecturers will greatly diminish. He goes on to predict that

the market was moving towards superstar teachers, who teach hundreds at a time or even thousands online. Today, we have the Khan Academy, a huge increase in online education, electronic textbooks and peer grading systems and highly successful superstar teachers with Michael Sandel and his popular course Justice, serving as example number one.

I think this is particularly true for stat and biostat graduate programs, especially in hard money environments.

Sep 28, 2011 (1 note)
#Proposal #Education
The p>0.05 journal

I want to start a journal called “P>0.05”. This journal will publish all the negative results in science. These would also be stored in a database. Think of all the great things we could do with this. We could, for example, plot p-value histograms for different disciplines. I bet most would have a flat distribution. We could also do it by specific association. A paper comes out saying chocolate is linked to weaker bones? Check the histogram and keep eating chocolate. Any publishers interested? 
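Here is a rough sketch of why I would expect a flat histogram when there is no real effect: under the null hypothesis, p-values are uniformly distributed. The simulation below is only an illustration. It assumes each "study" is a two-sided z-test on 30 standard normal observations with a true mean of zero, and all the names are invented for the example.

```python
import math
import random


def null_p_values(n_studies=10000, n_per_study=30, seed=0):
    """Simulate two-sided z-test p-values when the true effect is zero."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_study)]
        # z statistic for H0: mean = 0 with known variance 1
        z = (sum(sample) / n_per_study) * math.sqrt(n_per_study)
        # two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return p_values


def histogram(p_values, n_bins=10):
    """Return the share of p-values falling in each equal-width bin."""
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return [c / len(p_values) for c in counts]


if __name__ == "__main__":
    for i, share in enumerate(histogram(null_p_values())):
        print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {'#' * round(share * 100)}")
```

Each bin should hold roughly 10% of the p-values. A spike near zero in a real database would suggest that something other than chance is going on for that association.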

Sep 27, 2011
#Proposal
Some cool papers

  1. A cool article on the regulator’s dilemma. It turns out that the best risk profile to prevent one bank from failing is not the best risk profile to prevent all banks from failing. 
  2. Persistence of web resources for computational biology. I think this one is particularly relevant for academic statisticians since a lot of academic software/packages are developed by graduate students. Once they move on, a large chunk of “institutional knowledge” is lost. 
  3. Are private schools better than public schools? A quote from the paper: “Indeed when comparing the average score in the two types of schools after adjusting for the enrollment effects, we find quite surprisingly that public schools perform better on average.”
Sep 27, 2011
#LiteratureWatch
"Unoriginal genius"

“The world is full of texts, more or less interesting; I do not wish to add any more”

This quote is from an article in the Chronicle Review. I highly recommend reading the article; in particular, check out the section on the author’s “Uncreative writing” class at UPenn. The article is about a trend in literature toward combining and reusing other people’s words to create new content. 

Sep 26, 2011
#Proposal #Literature
25 minute seminars

Most Statistics and Biostatistics departments have weekly seminars. We usually invite outside speakers to share their knowledge via a 50-minute PowerPoint (or Beamer) presentation. This gives us the opportunity to meet colleagues from other universities and pick their brains in small group meetings. This is all great. But giving a good one-hour seminar is hard. Really hard. Few people can pull it off. I propose to the statistical community that we cut the seminars to 25 minutes with 35 minutes for questions and further discussion. We can make exceptions of course. But in general, I think we would all benefit from shorter seminars. 

Sep 26, 2011
#Rant #Proposal
By poring over statistics ignored by conventional scouts, - 05.12.03 - SI Vault → sportsillustrated.cnn.com

Sports Illustrated has (re)posted Michael Lewis’ original story on Billy Beane and the Oakland A’s (later made into the book Moneyball). Catch it while you can. It’s a fascinating read.

Sep 25, 2011
#Link
How do you spend your day?

I’ve seen visualizations of how people spend their time in a couple of places. Here is a good one over at Flowing Data. 

Sep 24, 2011
#Link
Getting email responses from busy people

I’ve had the good fortune of working with some really smart and successful people during my career. As a young person, one problem with working with really successful people is that they get a ton of email. Some only see the subject lines on their phone before deleting them. 

I’ve picked up a few tricks for getting email responses from important/successful people:  

The SI Rules

  1. Try to send no more than one email a day. 
  2. Emails should be 3 sentences or less. Better if you can get the whole email in the subject line. 
  3. If you need information, ask yes or no questions whenever possible. Never ask a question that requires a full sentence response.
  4. When something is time sensitive, state the action you will take if you don’t get a response by a time you specify. 
  5. Be as specific as you can while conforming to the length requirements. 
  6. Bonus: include obvious keywords people can use to search for your email. 

Anecdotally, SI emails have a 10-fold higher response probability. The rules are designed around the fact that busy people who get lots of email love checking things off their list. SI emails are easy to check off! That will make them happy and get you a response. 

It takes more work on your end when writing an SI email. You often need to think more carefully about what to ask, how to phrase it succinctly, and how to minimize the number of emails you write. A surprising side effect of applying SI principles is that I often figure out answers to my questions on my own. I have to decide which questions to include in my SI emails and they have to be yes/no answers, so I end up taking care of simple questions on my own. 

Here are examples of SI emails just to get you started: 

Example 1

Subject: Is my response to reviewer 2 ok with you?

Body: I’ve attached the paper/responses to referees.

Example 2

Subject: Can you send my letter of recommendation to john.doe@someplace.com?

Body:

Keywords = recommendation, Jeff, John Doe.

Example 3

Subject: I revised the draft to include your suggestions about simulations and language

Body: Revisions attached. Let me know if you have any problems, otherwise I’ll submit Monday at 2pm. 

Sep 23, 2011 (18 notes)
#DIY #Email #Proposal #Humor
Dongle communism

If you have a Mac and give talks or teach, chances are you have embarrassed yourself by forgetting your dongle. Our lab meetings and classes were constantly delayed due to missing dongles. Communism solved this problem. We bought 10 dongles, sprinkled them around the department, and declared all dongles public property. All dongles, not just the 10. No longer do we have to ask to borrow dongles because they have no owner. Please join the revolution. P.S. I think this should apply to pens too!

Sep 23, 2011 (2 notes)
#Humor #Proposal
Most popular infographics

Thanks to Karl Broman via Andrew Gelman.

Sep 22, 2011
#Link
The Killer App for Peer Review

A little while ago, over at Genomes Unzipped, Joe Pickrell asked, “Why publish science in peer reviewed journals?” He points out the flaws with the current peer review system and suggests how we can do better. What he suggests is missing is the killer app for peer review. 

Well, PLoS has now developed an API, where you can easily access tons of data on the papers published in those journals including downloads, citations, number of social bookmarks, and mentions in major science blogs. Along with Mendeley, a free reference manager, they have launched a competition to build cool apps with their free data. 

It seems like, with the right statistical analysis and some cool features, a recommender system for, say, PLoS One could have most of the features suggested by Joe in his article. One idea would be an RSS feed based on an idea like the Pandora music sharing service. You input a couple of papers you like from the journal, then it creates an RSS feed with papers similar to those papers. 
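To make the "papers similar to the ones you like" idea a bit more concrete, here is a toy sketch. It does not use the PLoS API or any real data. It assumes you already have titles and abstracts as plain strings, and it ranks candidates by cosine similarity of raw word counts, which is just one simple choice among many. All function names and example texts are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Bag-of-words representation: lowercase word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recommend(liked, candidates, top_n=3):
    """Rank candidate papers by similarity to the papers the user liked.

    liked: list of abstract strings; candidates: dict of title -> abstract.
    """
    profile = Counter()
    for text in liked:
        profile.update(word_counts(text))
    scored = [(cosine(profile, word_counts(text)), title)
              for title, text in candidates.items()]
    scored.sort(reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]


if __name__ == "__main__":
    # Made-up abstracts, for illustration only.
    liked = ["gene expression microarray normalization"]
    candidates = {
        "A": "normalization of gene expression data",
        "B": "bird migration patterns in europe",
        "C": "microarray batch effects",
    }
    print(recommend(liked, candidates))
```

A real version would pull abstracts and usage data from the API and would probably weight words (for example with TF-IDF) instead of using raw counts, but the ranked list is what would end up in the feed.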

Sep 22, 2011
#DIY #Proposal #Project Ideas
StatistiX

I think our field would attract more students if we changed the name to something ending with X or K. I’ve joked about this for years, but someone has actually done it (kind of):

http://www.bitlifesciences.com/AnalytiX2012/

Sep 22, 2011
#humor
Small ball is a bad strategy

Bill James pointed this out a long time ago. If you don’t know Bill James, you should look him up. I consider him to be one of the most influential statisticians of all time. This post relates to one of his first conjectures: sacrificing outs for runs, referred to as small ball, is a bad strategy. 

ESPN’s Gamecast, a web tool that gives you pitch-by-pitch updates of baseball games, also gives you a pitch-by-pitch “probability” of winning. Gamecast confirms the conjecture with data. How do they calculate this “probability”? I am pretty sure it is based only on historical data. No modeling. For example, if the away team is up 4-2 in the bottom of the 7th with no outs and runners on 1st and 2nd, they look at all the instances exactly like this one that have ever happened in the digitally recorded history of baseball and report the proportion of times the home team wins. Well, in this situation this proportion is 45%. If the next batter successfully bunts, moving the runners over, this proportion drops to 41%. Furthermore, if after the successful bunt, the runner from third scores on a sacrifice fly, the proportion drops again from 41% to 39%. The extra out hurts you more than the extra run helps you. That was Bill James’ intuition: you only have three outs so the last thing you want to do is give 33% away. 
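If my guess is right and the "probability" is just a historical proportion, the calculation is simple to sketch. The code below is a hypothetical illustration, not ESPN's implementation: the `GameState` fields and the tiny history list are invented, and a real version would read every recorded game situation from a play-by-play database.

```python
from collections import namedtuple

# Hypothetical description of a game situation, from the home team's view.
GameState = namedtuple("GameState", "inning half outs runners home_lead")


def empirical_win_probability(history, state):
    """Proportion of past games in exactly this state that the home team won.

    history: iterable of (GameState, home_team_won) pairs.
    Returns None when the state has never been observed.
    """
    outcomes = [won for past, won in history if past == state]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # Home team down 2, bottom of the 7th, no outs, runners on 1st and 2nd.
    state = GameState(inning=7, half="bottom", outs=0,
                      runners=("1st", "2nd"), home_lead=-2)
    # Made-up records, for illustration only.
    history = [(state, True), (state, False), (state, False), (state, False)]
    print(empirical_win_probability(history, state))
```

Comparing this proportion before and after a bunt or a sacrifice fly is exactly the comparison made above, with no model in between.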

Sep 21, 2011
#Baseball
MacArthur Fellow Shwetak Patel

The new MacArthur Fellows list is out and, as usual, they are an interesting bunch. One person that I thought was worth pointing out is Shwetak Patel. I had the privilege of meeting Shwetak at a National Research Council meeting on sustainability and computer science. Basically, he’s working on devices that you can install in your home to monitor your resource usage. He’s already spun off a startup company to make/sell some of these devices. 

In the writeup for the award, they mention

When coupled with a machine learning algorithm that analyzes patterns of activity and the signature noise produced by each appliance, the sensors enable users to measure and disaggregate their energy and water consumption and to detect inefficiencies more effectively.

Now that’s statistics at work!

Sep 20, 2011