Schlep blindness in statistics

This is yet another outstanding post by Paul Graham, this time on “Schlep Blindness”. He talks about how there are great startup ideas that no one considers because they are too much of a “schlep” (a tedious unpleasant task). He talks about how most founders of startups want to put up a clever bit of code they wrote and just watch the money flow in. But of course it doesn’t work like that: you need to advertise, interact with customers, raise money, go out and promote your work, fix bugs at 3am, etc.

In academia there is a similar tendency to avoid projects that involve a big schlep. For example, it is relatively straightforward to develop a mathematical model, work out the parameter estimates, and write a paper. But it is a big schlep to then write fast code that implements that method, debug the code, dummy-proof the code, fix bugs submitted by users, etc. Rafa’s post, Hadley’s interview, and the discussion Rafa linked to all allude to this issue, particularly the fact that the schlep, the long slow slog of going through a new data type or writing a piece of usable software, is somewhat undervalued.

I think part of the problem is our academic culture and heritage, which has traditionally put a very high premium on being clever and a relatively low premium on being willing to go through the schlep. As applied statistics touches more areas and the number of users of statistical software and ideas grows, the schlep becomes just as important as the clever idea. If you aren’t willing to put in the time to code your methods up and make them accessible to other investigators, then who will be? 

To bring this back to the discussion inspired by Rafa’s post, I wonder if applied statistics journals could increase their impact, encourage more readership from scientific folks, and support a broader range of applied statisticians if there were a re-weighting of the importance of cleverness and schlep. As Paul points out:

 In addition to their intrinsic value, they’re like undervalued stocks in the sense that there’s less demand for them among founders. If you pick an ambitious idea, you’ll have less competition, because everyone else will have been frightened off by the challenges involved.

Sunday data/statistics link roundup (5/27)

  1. Amanda Cox on the process they went through to come up with this graphic about the Facebook IPO. So cool to see how R is used in the development process. A favorite quote of mine, “But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue.” One of the more interesting things about posts like this is you get to see how statistics versus a deadline works. This is typically the role of the analyst, since they come in late and there is usually a deadline looming…
  2. An interview with Steve Blank about Silicon Valley and how venture capitalists (VCs) are focused on social technologies since they can make a profit quickly. A depressing/fascinating quote from this one is, “If I have a choice of investing in a blockbuster cancer drug that will pay me nothing for ten years, at best, whereas social media will go big in two years, what do you think I’m going to pick? If you’re a VC firm, you’re tossing out your life science division.” He also goes on to say thank goodness for the NIH, NSF, and Google, who are funding interesting “real science” problems. This probably deserves its own post later in the week: the difference between analyzing data because it will make money and analyzing data to solve a hard science problem. The latter usually takes way more patience and the data take much longer to collect.
  3. An interesting post on how Obama’s analytics department ran an A/B test which improved the number of people who signed up for his mailing list. I don’t necessarily agree with their claim that they helped raise $60 million; there may be some confounding factors that mean that the individuals who sign up with the best combination of image/button don’t necessarily donate as much. But still, an interesting look into why Obama needs statisticians.
  4. A cute statistics cartoon from @kristin_linn  via Chris V. Yes, we are now shamelessly reposting cute cartoons for retweets :-). 
  5. Rafa’s post inspired some interesting conversation both on our blog and on some statistics mailing lists. It seems to me that everyone is making an effort to understand the increasingly diverse field of statistics, but we still have a ways to go. I’m particularly interested in discussion on how we evaluate the contribution/effort behind making good and usable academic software. I think the strength of the Bioconductor community and the rise of GitHub among academics are a good start. For example, it is really useful that Bioconductor now tracks the number of package downloads.

“How do we evaluate statisticians working in genomics? Why don’t they publish in stats journals?” Here is my answer

During the past couple of years I have been asked these questions by several department chairs and other senior statisticians interested in hiring or promoting faculty working in genomics. The main difficulty stems from the fact that we (statisticians working in genomics) publish in journals outside the mainstream statistical journals. This can be a problem during evaluation because a quick-and-dirty approach to evaluating an academic statistician is to count papers in the Annals of Statistics, JASA, JRSS and Biometrics. The evaluators feel safe counting these papers because they trust the fellow-statistician editors of these journals. However, statisticians working in genomics tend to publish in journals like Nature Genetics, Genome Research, PNAS, Nature Methods, Nucleic Acids Research, Genome Biology, and Bioinformatics. In general, these journals do not recruit statistical referees and a considerable number of papers with questionable statistics do get published in them. However, when the paper’s main topic is a statistical method or if it heavily relies on statistical methods, statistical referees are used. So, if the statistician is the corresponding or last author and it’s a stats paper, it is OK to assume the statistics are fine and you should go ahead and be impressed by the impact factor of the journal… it’s not easy getting statistics papers in these journals.

But we really should not be counting papers blindly. Instead we should be reading at least some of them. But here again the evaluators get stuck as we tend to publish papers with application/technology specific jargon and show-off by presenting results that are of interest to our potential users (biologists) and not necessarily to our fellow statisticians. Here all I can recommend is that you seek help. There are now a handful of us that are full professors and most of us are more than willing to help out with, for example, promotion letters.

So why don’t we publish in statistical journals? The fear of getting scooped due to the slow turnaround of stats journals is only one reason. New technologies that quickly became widely used (microarrays in 2000 and nextgen sequencing today) created a need for data analysis methods among large groups of biologists. Journals with large readerships and high impact factors, typically not interested in straight statistical methodology work, suddenly became amenable to publishing our papers, especially if they solved a data analytic problem faced by many biologists. The possibility of publishing in widely read journals is certainly seductive. 

While in several other fields, data analysis methodology development is restricted to the statistics discipline, in genomics we compete with other quantitative scientists capable of developing useful solutions: computer scientists, physicists, and engineers were also seduced by the possibility of gaining notoriety with publications in high impact journals. Thus, in genomics, the competition for funding, citation and publication in the top scientific journals is fierce. 

Then there is funding. Note that while most biostatistics methodology NIH proposals go to the Biostatistical Methods and Research Design (BMRD) study section, many of the genomics related grants get sent to other sections such as the Genomics Computational Biology and Technology (GCAT) and Biodata Management and Analysis (BDMA) study sections. BDMA and GCAT are much more impressed by Nature Genetics and Genome Research than JASA and Biometrics. They also look for citations and software downloads.

To be considered successful by our peers in genomics, those who referee our papers and review our grant applications, our statistical methods need to be delivered as software and garner a user base. Publications in statistical journals, especially those not appearing in PubMed, are not rewarded. This lack of incentive combined with how time consuming it is to produce and maintain usable software, has led many statisticians working in genomics to focus solely on the development of practical methods rather than generalizable mathematical theory. As a result, statisticians working in genomics do not publish much in the traditional statistical journals. You should not hold this against them, especially if they are developers and maintainers of widely used software.

Sunday data/statistics link roundup (5/20)

It’s grant season around here so I’ll be brief:
  1. I love this article in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at the high-level analysis and summaries can be misleading; you have to look at the raw data to see the potential problems. As data become more complex, I think it’s critical we stay in touch with the raw data, regardless of discipline. At least if I miss something in the raw data I don’t lose a couple billion. Spotted by Leonid K.
  2. On the other hand, this article in the Times drives me a little bonkers. It makes it sound like there is one mathematical model that will solve the obesity epidemic. Lines like this are ridiculous: “Because to do this experimentally would take years. You could find out much more quickly if you did the math.” The obesity epidemic is due to a complex interplay of cultural, sociological, economic, and policy factors. The idea you could “figure it out” with a set of simple equations is laughable. If you check out their model this is clearly not the answer to the obesity epidemic. Just another example of why statistics is not math. If you don’t want to hopelessly oversimplify the problem, you need careful data collection, analysis, and interpretation. For a broader look at this problem, check out this article on Science vs. PR. Via Andrew J. 
  3. Some cool applications of the raster package in R. This kind of thing is fun for student projects because analyzing images leads to results that are easy to interpret/visualize.
  4. Check out John C.’s really fascinating post on determining when a white-collar worker is great. Inspired by Roger’s post on knowing when someone is good at data analysis. 

The West Wing was always a favorite show of mine (at least, seasons 1-4, the Sorkin years) and I think this is a great scene which talks about the difference between evidence and interpretation. The topic is a 5-day waiting period for gun purchases and they’ve just received a poll in a few specific congressional districts showing weak support for this proposed policy.

A reminder that the data we analyze are almost always not the data we want.

Hans Rosling, a professor of international health at the Karolinska Institute in Sweden, believes embracing this uncertainty, rather than seeking to banish it, is vital when working with global health data.

“I’ve spent the last 10 years thinking about nothing else,” he said in a telephone interview. “And yes it’s absolutely essential that we continue to collect data and publish it.”

Rosling, who trained in statistics and medicine before starting the Gapminder Foundation, a non-profit venture which aims to depict the state of the world in numbers, says we must recognize that some things are easy to measure. He gives women’s fertility rates as an example. Others, like malaria deaths, are far more difficult.

“It’s not just that good data are hard to find, as that uncertainty is hard to assess,” he said.

As the company goes public, it has to figure out how to use its vault of information to enrich its shareholders.

UPDATE: As of today, it looks like that data may be worth about $104 billion.

Computational biologist blogger saves computer science department

People who read the news should be aware by now that we are in the midst of a big data era. The New York Times, for example, has been writing about this frequently. One of their most recent articles describes how UC Berkeley is getting $60 million for a new computer science center. Meanwhile, at the University of Florida the administration seems to be oblivious to all this and about a month ago announced it was dropping its computer science department to save money. Blogger Steven Salzberg, a computational biologist known for his work in genomics, wrote a post titled “University of Florida eliminates Computer Science Department. At least they still have football” ridiculing UF for their decisions. Here are my favorite quotes:

 in the midst of a technology revolution, with a shortage of engineers and computer scientists, UF decides to cut computer science completely? 

Computer scientist Carl de Boor, a member of the National Academy of Sciences and winner of the 2003 National Medal of Science, asked the UF president “What were you thinking?”

Well, his post went viral and days later UF reversed its decision! So my point is this: statistics departments, be nice to bloggers that work in genomics… one of them might save your butt some day.

Disclaimer: Steven Salzberg has a joint appointment in my department and we have joint lab meetings.

Sunday data/statistics link roundup (5/13)

  1. Patenting statistical sampling? I’m pretty sure the Supreme Court that threw out the Mayo Patent wouldn’t have much trouble tossing this patent either. The properties of sampling are a “law of nature”, right? Via Leonid K.
  2. This video has me all fired up; it’s called 23 1/2 hours and talks about how the best preventative health measure is getting 30 minutes of exercise - just walking - every day. He shows how in some cases this beats doing much more high-tech interventions. My favorite part of this video is how he uses a ton of statistical/epidemiological terms like “effect sizes”, “meta-analysis”, “longitudinal study”, “attributable fractions”, but makes them understandable to a broad audience. This is a great example of “statistics for good”.
  3. A very nice collection of 2-minute tutorials in R. This is a great way to teach the concepts, most of which don’t need more than 2 minutes, and it covers a lot of ground. One thing that drives me crazy is when I go into Rafa’s office with a hairy computational problem and he says, “Oh you didn’t know about function x?”. Of course this only happens after I’ve wasted an hour re-inventing the wheel. If more people put up 2-minute tutorials on all the cool tricks they know, we’d all be better off.
  4. A plot using ggplot2, developed by this week’s interviewee Hadley Wickham, appears in the Atlantic! Via David S.
  5. I’m refusing to buy into Apple’s hegemony, so I’m still running OS 10.5. I’m having trouble getting GitHub up and running. Anyone have this same problem/know a solution? I know, I know, I’m way behind the times on this…

Interview with Hadley Wickham - Developer of ggplot2

Hadley Wickham is the Dobelman Family Junior Chair of Statistics at Rice University. Prior to moving to Rice, he completed his Ph.D. in Statistics at Iowa State University. He is the developer of the wildly popular ggplot2 software for data visualization and a contributor to the GGobi project. He has developed a number of really useful R packages touching everything from data processing, to data modeling, to visualization.

Which term applies to you: data scientist, statistician, computer
scientist, or something else?

I’m an assistant professor of statistics, so I at least partly
associate with statistics :).  But the idea of data science really
resonates with me: I like the combination of tools from statistics and
computer science, data analysis and hacking, with the core goal of
developing a better understanding of data. Sometimes it seems like not
much statistics research is actually about gaining insight into data.


You have created/maintain several widely used R packages. Can you
describe the unique challenges to writing and maintaining packages
above and beyond developing the methods themselves?

I think there are two main challenges: turning ideas into code, and
documentation and community building.

Compared to other languages, the software development infrastructure
in R is weak, which sometimes makes it harder than necessary to turn
my ideas into code. Additionally, I get less and less time to do
software development, so I can’t afford to waste time recreating old
bugs, or releasing packages that don’t work. Recently, I’ve been
investing time in helping build better dev infrastructure; better
tools for documentation [roxygen2], unit testing [testthat], package development [devtools], and creating package websites [staticdocs]. Generally, I’ve
found unit tests to be a worthwhile investment: they ensure you never
accidentally recreate an old bug, and give you more confidence when
radically changing the implementation of a function.
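
As a rough illustration of the regression-test idea, here is a minimal sketch in Python rather than R. The function `column_mean` and the "old bug" it guards against are hypothetical, invented only for this example; this is not code from testthat or from any of the packages mentioned above.

```python
def column_mean(values):
    """Mean of the non-missing entries; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        # Hypothetical old bug: an all-missing column used to divide by zero.
        return None
    return sum(present) / len(present)


def test_ignores_missing_values():
    assert column_mean([1, None, 3]) == 2


def test_all_missing_returns_none():
    # Pins down the old bug so a later rewrite cannot quietly bring it back.
    assert column_mean([None, None]) is None


# Run the checks directly; a test runner would normally collect them.
test_ignores_missing_values()
test_all_missing_returns_none()
print("all regression tests passed")
```

Once a test like the second one exists, the body of `column_mean` can be rewritten freely: if the rewrite reintroduces the divide-by-zero case, the test fails immediately.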

Documenting code is hard work, and it’s certainly something I haven’t
mastered. But documentation is absolutely crucial if you want people
to use your work. I find the main challenge is putting yourself in the
mind of the new user: what do they need to know to use the package
effectively. This is really hard to do as a package author because
you’ve internalised both the motivating problem and many of the common
solutions.

Connected to documentation is building up a community around your
work. This is important to get feedback on your package, and can be
helpful for reducing the support burden. One of the things I’m most
proud of about ggplot2 is something that I’m barely responsible for:
the ggplot2 mailing list. There are now ggplot2 experts who answer far
more questions on the list than I do. I’ve also found GitHub to be
great: there’s an increasing community of users proficient in both R
and git who produce pull requests that fix bugs and add new features.

The flip side of building a community is that as your work becomes
more popular you need to be more careful when releasing new versions.
The last major release of ggplot2 (0.9.0) broke over 40 (!!) CRAN
packages, and forced me to rethink my release process. Now I advertise
releases a month in advance, and run `R CMD check` on all downstream
dependencies (`devtools::revdep_check` in the development version), so
I can pick up potential problems and give other maintainers time to
fix any issues.
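
To make the notion of "downstream dependencies" concrete, here is a small Python sketch that finds every package depending, directly or indirectly, on a given one. The package names and the dependency map are made up for illustration, and including indirect dependents is an assumption of this sketch; it does not describe how `R CMD check` or devtools actually work.

```python
# Hypothetical dependency map: package -> packages it imports directly.
depends_on = {
    "plotlib": [],
    "mapkit": ["plotlib"],
    "reportgen": ["mapkit"],
    "statsutil": [],
}


def downstream(package, depends_on):
    """Return every package that directly or indirectly depends on `package`."""
    found = set()
    frontier = [package]
    while frontier:
        current = frontier.pop()
        for name, deps in depends_on.items():
            if current in deps and name not in found:
                found.add(name)
                frontier.append(name)
    return sorted(found)


# Packages that would need re-checking before a new "plotlib" release.
print(downstream("plotlib", depends_on))
```

Each package in the returned list is a candidate for re-checking before the release goes out.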


Do you feel that the academic culture has caught up with and supports
non-traditional academic contributions (e.g. R packages instead of
papers)?

It’s hard to tell. I think it’s getting better, but it’s still hard to
get recognition that software development is an intellectual activity
in the same way that developing a new mathematical theorem is. I try
to hedge my bets by publishing papers to accompany my major packages:
I’ve also found the peer-review process very useful for improving the
quality of my software. Reviewers from both The R Journal and the
Journal of Statistical Software have provided excellent suggestions
for enhancements to my code.


You have given presentations at several start-up and tech companies.
Do the corporate users of your software have different interests than
the academic users?

By and large, no. Everyone, regardless of domain, is struggling to
understand ever larger datasets. Across both industry and academia,
practitioners are worried about reproducible research and thinking
about how to apply the principles of software engineering to data
analysis.


You gave one of my favorite presentations called Tidy Data/Tidy Tools
at the NYC Open Statistical Computing Meetup. What are the key
elements of tidy data that all applied statisticians should know?

Thanks! Basically, make sure you store your data in a consistent
format, and pick (or develop) tools that work with that data format.
The more time you spend munging data in the middle of an analysis, the
less time you have to discover interesting things in your data. I’ve
tried to develop a consistent philosophy of data that means when you
use my packages (particularly plyr and ggplot2), you can focus on the
data analysis, not on the details of the data format. The principles
of tidy data that I adhere to are that every column should be a
variable, every row an observation, and different types of data should
live in different data frames. (If you’re familiar with database
normalisation this should sound pretty familiar!). I expound these
principles in depth in my in-progress [paper on the
topic].
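
The "every column a variable, every row an observation" principle can be sketched in a few lines. The example below is in Python with only the standard library, not plyr or any package named above; the table, its column names, and the helper `to_tidy` are hypothetical and exist only to show the reshaping.

```python
# Hypothetical "wide" table: one row per patient, one column per treatment.
# The treatment names are really values of a variable, not variables themselves.
wide = [
    {"patient": "p1", "treatment_a": 5, "treatment_b": 7},
    {"patient": "p2", "treatment_a": 3, "treatment_b": 4},
]


def to_tidy(rows, id_key, variable_name, value_name):
    """Reshape wide rows so that each output row is a single observation."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy.append({id_key: row[id_key], variable_name: key, value_name: value})
    return tidy


tidy = to_tidy(wide, id_key="patient", variable_name="treatment", value_name="result")
for record in tidy:
    print(record)
```

After reshaping, `patient`, `treatment`, and `result` are each a column, and every row is one measurement, so the same filtering, grouping, or plotting code works no matter how many treatments the original table had.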


How do you decide what project to work on next? Is your work inspired
by a particular application or more general problems you are trying to
tackle?

Very broadly, I’m interested in the whole process of data analysis:
the process that takes raw data and converts it into understanding,
knowledge and insight. I’ve identified three families of tools
(manipulation, modelling and visualisation) that are used in every
data analysis, and I’m interested both in developing better individual
tools, but also smoothing the transition between them. In every good
data analysis, you must iterate multiple times between manipulation,
modelling and visualisation, and anything you can do to make that
iteration faster yields qualitative improvements to the final analysis
(that was one of the driving reasons I’ve been working on tidy data).

Another factor that motivates a lot of my work is teaching. I hate
having to teach a topic that’s just a collection of special cases,
with no underlying theme or theory. That drive led to [stringr] (for
string manipulation) and [lubridate] (with Garrett Grolemund for working
with dates). I recently released the [httr] package which aims to do a similar thing for http requests - I think this is particularly important as more and more data starts living on the web and must be accessed through an API.


What do you see as the biggest open challenges in data visualization
right now? Do you see interactive graphics becoming more commonplace?

I think one of the biggest challenges for data visualisation is just
communicating what we know about good graphics. The first article
decrying 3d bar charts was published in 1951! Many plots still use
rainbow scales or red-green colour contrasts, even though we’ve known
for decades that those are bad. How can we ensure that people
producing graphics know enough to do a good job, without making them
read hundreds of papers? It’s a really hard problem.

Another big challenge is balancing the tension between exploration and
presentation. For exploratory graphics, you want to spend five seconds
(or less) to create a plot that helps you understand the data, while you might spend
five hours on a plot that’s persuasive to an audience who
isn’t as intimately familiar with the data as you. To date, we have
great interactive graphics solutions at either end of the spectrum
(e.g. ggobi/iplots/manet vs d3) but not much that transitions from one
end of the spectrum to the other. This summer I’ll be spending some
time thinking about what ggplot2 + [d3] might
equal, and how we can design something like an interactive grammar of
graphics that lets you explore data in R, while making it easy to
publish interactive presentation graphics on the web.