This graph shows that President Obama’s proposed budget treats the NIH even worse than G.W. Bush - Sign the petition to increase NIH funding!

The NIH provides financial support for a large percentage of biological and medical research in the United States. This funding supports a large number of US jobs, creates new knowledge, and improves healthcare for everyone. So I am signing this petition


NIH funding is essential to our national research enterprise, to our local economies, to the retention and careers of talented and well-educated people, to the survival of our medical educational system, to our rapidly fading worldwide dominance in biomedical research, to job creation and preservation, to national economic viability, and to our national academic infrastructure.


The current administration is proposing a flat $30.7 billion FY 2013 NIH budget. The graph below (left) shows how small the NIH budget is in comparison to the Defense and Medicare budgets in absolute terms. The difference between the administration’s proposal and the petition’s proposal ($33 billion) are barely noticeable. 

The graph on the right shows how in 2003 growth in the NIH budget fell dramatically while medicare and military spending kept growing. However, despite the decrease in rate, the NIH budget did continue to increase under Bush. If we follow Bush’s post 2003 rate (dashed line), the 2013 budget will be about what the petition asks for: $33 billion.  


If you agree that the relatively modest increase in the NIH budget is worth the incredibly valuable biological, medical, and economic benefits this funding will provide, please consider signing the petition before April 15 

A plot of my citations in Google Scholar vs. Web of Science

There has been some discussion about whether Google Scholar or one of the proprietary software companies numbers are better for citation counts. I personally think Google Scholar is better for a number of reasons:

  1. Higher numbers, but consistently/adjustably higher :-)
  2. It’s free and the data are openly available. 
  3. It covers more ground (patents, theses, etc.) to give a better idea of global impact
  4. It’s easier to use

I haven’t seen a plot yet relating Web of Science citations to Google Scholar citations, so I made one for my papers.

GS has about 41% more citations per paper than Web of Science. That is consistent with what other people have found. It also looks reasonably linearish. I wonder what other people’s plots would look like? 

Here is the R code I used to generate the plot (the names are Pubmed IDs for the papers):

library(ggplot2)

names = c(16141318,16357033,16928955,17597765,17907809,19033188,19276151,19924215,20560929,20701754,20838408, 21186247,21685414,21747377,21931541,22031444,22087737,22096506,22257669) 

y = c(287,122,84,39,120,53,4,52,6,33,57,0,0,4,1,5,0,2,0)

x = c(200,92,48,31,79,29,4,51,2,18,44,0,0,1,0,2,0,1,0)

Year = c(2005,2006,2007,2007,2007,2008,2009,2009,2011,2010,2010,2011,2012,2011,2011,2011,2011,2011,2012)

q <- qplot(x,y,xlim=c(-20,300),ylim=c(-20,300),xlab=”Web of Knowledge”,ylab=”Google Scholar”) + geom_point(aes(colour=Year),size=5) + geom_line(aes(x = y, y = y),size=2)

Statistics project ideas for students

Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.

Happy data crunching!

Data Collection/Synthesis

  1. Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, sampling. The webpage should not use any math at all and should explain the concepts so a general audience could understand. Bonus points if you make short 30 second animated youtube clips that explain the concepts. (Difficulty: Lowish; Effort: Highish)
  2. Building an aggregator for statistics papers across disciplines that can be the central resource for statisticians. Journals ranging from PLoS Genetics to Neuroimage now routinely publish statistical papers. But there is no one central resource that aggregates all the statistics papers published across disciplines. Such a resource would be hugely useful to statisticians. You could build it using blogging software like Wordpress so articles could be tagged/you could put the resource in your RSS feeder. (Difficulty: Lowish; Effort: Mediumish)

Data Analyses

  1. Scrape the LivingSocial/Groupon sites for the daily deals and develop a prediction of how successful the deal will be based on location/price/type of deal. You could use either the RCurl R package or the XML R package to scrape the data. (Difficulty: Mediumish; Effort: Mediumish)
  2. You could use the data from your city (here are a few cities with open data) to: (a) identify the best and worst neighborhoods to live in based on different metrics like how many parks are within walking distance, crime statistics, etc. (b) identify concrete measures your city could take to improve different quality of life metrics like those described above - say where should the city put a park, or (c) see if you can predict when/where crimes will occur (like these guys did). (Difficulty: Mediumish; Effort: Highish)
  3. Download data on state of the union speeches from here and use the tm package in R to analyze the patterns of word use over time (Difficulty: Lowish; Effort: Lowish)
  4. Use this data set from Donors Choose to determine the characteristics that make the funding of projects more likely. You could send your results to the Donors Choose folks to help them improve the funding rate for their projects. (Difficulty: Mediumish; Effort: Mediumish
  5. Which basketball player would you want on your team? Here is a really simple analysis done by Rafa. But it doesn’t take into account things like defense. If you want to take on this project, you should take a look at this Denis Rodman analysis which is the gold standard. (Difficulty: Mediumish; Effort: Highish).

Data visualization

  1. Creating an R package that wraps the svgAnnotation package. This package can be used to create dynamic graphics in R, but is still a bit too flexible for most people to use. Writing some wrapper functions that simplify the interface would be potentially high impact. Maybe something like svgPlot() to create simple, dynamic graphics with only a few options (Difficulty: Mediumish; Effort: Mediumish). 
  2. The same as project 1 but for D3.js. The impact could potentially be a bit higher, since the graphics are a bit more professional, but the level of difficulty and effort would also both be higher. (Difficulty: Highish; Effort: Highish)

Prediction: the Lasso vs. just using the top 10 predictors

One incredibly popular tool for the analysis of high-dimensional data is the lasso. The lasso is commonly used in cases when you have many more predictors than independent samples (the n « p) problem. It is also often used in the context of prediction. 

Suppose you have an outcome Y and several predictors X1,…,XM, the lasso fits a model:

Y = B0 + B1 X1 + B2 X2 + … + BM XM + E

subject to a constraint on the sum of the absolute value of the B coefficients. The result is that: (1) some of the coefficients get set to zero, and those variables drop out of the model, (2) other coefficients are “shrunk” toward zero. Dropping some variables is good because there are a lot of potentially unimportant variables. Shrinking coefficients may be good, since the big coefficients might be just the ones that were really big by random chance (this is related to Andrew Gelman’s type M errors). 

I work in genomics, where n«p problems come up all the time. Whenever I use the lasso or when I read papers where the lasso is used for prediction, I always think: “How does this compare to just using the top 10 most significant predictors?” I have asked this out loud enough that some people around here started calling it the “Leekasso” to poke fun at me. So I’m going to call it that in a thinly veiled attempt to avoid Stigler’s law of eponymy (actually Rafa points out that using this name is a perfect example of this law, since this feature selection approach has been proposed before at least once). 

Here is how the Leekasso works. You fit each of the models:

Y = B0 + BkXk + E

take the 10 variables with the smallest p-values from testing the Bk coefficients, then fit a linear model with just those 10 coefficients. You never use 9 or 11, the Leekasso is always 10. 

For fun I did an experiment to compare the accuracy of the Leekasso and the Lasso.

Here is the setup:

  • I simulated 500 variables and 100 samples for each study, each N(0,1)
  • I created an outcome that was 0 for the first 50 samples, 1 for the last 50
  • I set a certain number of variables (between 5 and 50) to be associated with the outcome using the model Xi = b0i + b1iY + e (this is an important choice, more later in the post) 
  • I tried different levels of signal to the truly predictive features
  • I generated two data sets (training and test) from the exact same model for each scenario
  • I fit the Lasso using the lars package, choosing the shrinkage parameter as the value that minimized the cross-validation MSE in the training set
  • I fit the Leekasso and the Lasso on the training sets and evaluated accuracy on the test sets. 

The R code for this analysis is available here and the resulting data is here.

The results show that for all configurations, using the top 10 has a higher out of sample prediction accuracy than the lasso. A larger version of the plot is here

Interestingly, this is true even when there are fewer than 10 real features in the data or when there are many more than 10 real features ((remember the Leekasso always picks 10). 

Some thoughts on this analysis:

  1. This is only test-set prediction accuracy, it says nothing about selecting the “right” features for prediction. 
  2. The Leekasso took about 0.03 seconds to fit and test per data set compared to about 5.61 seconds for the Lasso.
  3. The data generating model is the model underlying the top 10, so it isn’t surprising it has higher performance. Note that I simulated from the model: Xi = b0i + b1iY + e, this is the model commonly assumed in differential expression analysis (genomics) or voxel-wise analysis (fMRI). Alternatively I could have simulated from the model: Y = B0 + B1 X1 + B2 X2 + … + BM XM + E, where most of the coefficients are zero. In this case, the Lasso would outperform the top 10 (data not shown). This is a key, and possibly obvious, issue raised by this simulation. When doing prediction differences in the true “causal” model matter a lot. So if we believe the “top 10 model” holds in many high-dimensional settings, then it may be the case that regularization approaches don’t work well for prediction and vice versa. 
  4. I think what may be happening is that the Lasso is overshrinking the parameter estimates, in other words, you give up too much bias for a gain in variance. Alan Dabney and John Storey have a really nice paper discussing shrinkage in the context of genomic prediction that I think is related. 


An R script for estimating future inflation via the Treasury market

One factor that is critical for any financial planning is estimating what future inflation will be. For example, if you’re saving money in an instrument that gains 3% per year, and inflation is estimated to be 4% per year, well then you’re losing money in real terms.

There are a variety of ways to estimate the rate of future inflation. You could, for example, use past rates as an estimate of future rates. However, the Treasury market provides an estimate of what the market thinks annual inflation will be over the next 5, 10, 20, and 30 years.

Basically, the Treasury issue two types of securities: nominal securities that pay a nominal interest rate (fixed percentage of your principal), and inflation-indexed securities (TIPS) that pay an interest rate that is applied to your principal adjusted by the consumer price index (CPI). As the CPI goes up and down, the payments for inflation-indexed securities go up and down (although they can’t go negative so you always get your principal back). As these securities trade throughout the day, their respective market-based interest rates go up and down continuously. The difference between the nominal interest rate and the real interest rate for a fixed period of time (5, 10, 20, years)  can be used as a rough estimate of annual inflation over that time period.

The Treasury publishes data for its auctions everyday on the yield for both nominal and inflation-indexed securities. There is an XML feed for nominal yields and for real yields. Using these, I used the XML R package and wrote an R script to scrape the data and calculate the inflation estimate.  

As of today, the market’s estimate of annual inflation is:

5-year Inflation: 1.88%
10-year Inflation: 2.18%
30-year Inflation: 2.38%

Basically, you just call the ‘inflation()’ function with no arguments and it produces the above print out.

Tags: R finance

Why don’t we hear more about Adrian Dantley on ESPN? This graph makes me think he was as good an offensive player as Michael Jordan.

In my last post I complained about efficiency not being discussed enough by NBA announcers and commentators. I pointed out that some of the best scorers have relatively low FG% or TS%. However, via the comments it was pointed out that top scorers need to take more difficult shots and thus are expected to have lower efficiency. The plot below (made with this R script) seems to confirm this (click image to enlarge) . The dashed line is from regression and the colors represent guards (green), forwards (orange) and centers (purple).

Among this group TS% does trend down with points per game and centers tend to have higher TS%. Forwards and guards are not very different. However, the plot confirms that some of the supposed all time greats are more ball hogs than good scorers. 

A couple of  further observations. First, Adrian Dantley was way better than I thought. Why isn’t he more famous? Second, Kobe is no Jordan. Also note Jordan played several seasons past his prime which lowered his career averages. So I added points for five of these players using only data from their prime years (ages 24-29). Here Jordan really stands out. But so does Dantley! 

pd - Note that these plots say nothing about defense, rebounding, or passing. This in-depth analysis makes a convincing argument that Dennis Rodman is one of the most valuable players of all time.

Tags: R Basketball

This graph makes me think Kobe is not that good, he just shoots a lot

I find it surprising that NBA commentators rarely talk about field goal percentage. Everybody knows that the more you shoot the more you score. But players that score a lot are admired without consideration of their FG%. Of course having a high FG% is not necessarily admirable as many players only take easy shots, while top-scorers need to take difficult ones. Regardless, missing is undesirable and players that miss more than usual are not criticized enough. Iverson, for example, had a lowly career FG% of 43 yet he regularly made the allstar team. But I am not surprised he never won an NBA championship: it’s hard to win when your top scorer misses so often.

Experts consider Kobe to be one of the all time greats and compare him to Jordan. They never mention that he is consistently among league leaders in missed shots. So far this year, Kobe has missed a whopping 279 times for a league leading 13.3 misses per game. In contrast, Lebron has missed 8.8 per game and has scored about the same per game. The plot above (made with this R script) shows career FG% for players considered to be superstars, top-scorers, and that have won multiple championships (red lines are 1st and 3rd quartiles). I also include Gasol, Lebron, Wade, and Dominique. Note that Kobe has the worst FG% in this group.  So how does he win 5 championships? Well perhaps Shaq and later Gasol made up for his misses. Note that the first year Kobe played without Shaq, the Lakers did not make the playoffs. Also, during Kobe’s career the Lakers’ record has been similar with and without him. Experts may compare Kobe to Jordan, but perhaps we should be comparing him to Dominique.

Update: Please see Brunsloe87’s comment for a much better analysis than mine. He/she points out that it’s too simplistic to look at FG%. Instead we should look at something closer to points scored per shot taken. This rewards players, like Kobe, that draw many fouls and has a high FT%. There is a weighted statistic called true scoring % (TS%) that tries to summarize this and below I include a plot of TS% for the same players. Kobe is no Jordan but he is not as bad as Dominique either. He is somewhere in the middle. 

The comment also points out that Magic didn’t shoot as much as other superstars so it’s unfair to include him. A better plot would plot TS% versus shots taken (e.g. FGA+FTA/2) but I’ll let someone with more time make that one. Anyways, this plot explains why the early 80s Lakers (Magic+Kareem) were so good.

Tags: R Basketball

A wordcloud comparison of the 2011 and 2012 #SOTU

I wrote a quick (and very dirty) R script for creating a comparison cloud and a commonality cloud for President Obama’s 2011 and 2012 State of the Union speeches*. The cloud on the left shows words that have different frequencies between the two speeches and the cloud on the right shows the words in common between the two speeches. Here is a higher resolution version. 

The focus on jobs hasn’t changed much. But it is interesting how the 2012 speech seems to focus more on practical issues (tax, pay, manufacturing, oil) versus more emotional issues in 2011 (future, schools, laughter, success, dream). 

*The wordcloud R package does all the heavy lifting.

Baltimore gun offenders and where academics don’t live

Jeff recently posted links to data from cities and states. He and I wrote R code that plots gun offender locations for Baltimore. Specifically we plot the locations that appear on this table. I added locations of the Baltimore neighborhoods where most of our Hopkins colleagues live as well as the location of the medical institutions where we work. Note the corridor with no points between the West side (Barksdale territory) and East side (Prop Joe territory). Not surprisingly, academics don’t live near the gun offenders. 

An R function to map your Twitter Followers

I wrote a little function to make a personalized map of who follows you or who you follow on Twitter. The idea for this function was inspired by some plots I discussed in a previous post. I also found a lot of really useful code over at flowing data here

The function uses the packages twitteR, maps, geosphere, and RColorBrewer. If you don’t have the packages installed, when you source the twitterMap code, it will try to install them for you. The code also requires you to have a working internet connection. 

One word of warning is that if you have a large number of followers or people you follow, you may be rate limited by Twitter and unable to make the plot.

To make your personalized twitter map, first source the function:

> source(“http://biostat.jhsph.edu/~jleek/code/twitterMap.R”)

The function has the following form: 

twitterMap <- function(userName,userLocation=NULL,fileName=”twitterMap.pdf”,nMax = 1000,plotType=c(“followers”,”both”,”following”))

with arguments:

  • userName - the twitter username you want to plot
  • userLocation - an optional argument giving the location of the user, necessary when the location information you have provided Twitter isn’t sufficient for us to find latitude/longitude data
  • fileName - the file where you want the plot to appear
  • nMax - The maximum number of followers/following to get from Twitter, this is implemented to avoid rate limiting for people with large numbers of followers. 
  • plotType - if “both” both followers/following are plotted, etc. 

Then you can create a plot with both followers/following like so: 

> twitterMap(“simplystats”)

Here is what the resulting plot looks like for our Twitter Account:

If your location can’t be found or latitude longitude can’t be calculated, you may have to chose a bigger city near you. The list of cities used by twitterMap can be found like so:

>library(maps)

>data(world.cities)

>grep(“Baltimore”, world.cities[,1])

If your city is in the database, this will return the row number of the world.cities data frame corresponding to your city. 

If you like this function you may also like our function to determine if you are a data scientist or to analyze your Google Scholar citations page.
Update: The bulk of the heavy lifting done by these functions is performed by Jeff Gentry’s very nice twitteR package and code put together by Nathan Yau over at FlowingData. This is really an example of standing on the shoulders of giants.