Sunday Data/Statistics Link Roundup (10/7/12)

  1. Jack Welch got a little conspiracy-theory crazy with the job numbers. Thomas Lumley over at StatsChat makes a pretty good case for debunking the theory. I think the real take-home message of Thomas’ post, and one worth celebrating/highlighting, is that the agencies that produce the jobs report do so based on a fixed and well-defined study design. Careful efforts by government statistics agencies make it hard to fudge/change the numbers. This is an underrated and hugely important component of a well-run democracy. 
  2. On a similar note, Dan Gardner at the Ottawa Citizen points out that evidence-based policy making is actually not enough. He points out the critical problem with evidence: in the era of data, what is a fact? “Facts” can come from flawed or biased studies just as easily as from strong studies. He suggests that a true “evidence-based” administration would invest more money in research/statistical agencies. I think this is a great idea. 
  3. An interesting article by Ben Bernanke suggesting that an optimal approach (in baseball and in policy) is one based on statistical analysis, coupled with careful thinking about long-term versus short-term strategy. I think one of his arguments about letting players keep playing even when they are struggling in the short term is actually a case for letting the weak law of large numbers play out: if a player has skill/talent, their numbers will eventually converge to their “true” level (see the short simulation after this list). It’s also good for their confidence… (via David Santiago).
  4. Here is another interesting peer review dust-up. It explains why some journals “reject” papers when they really mean major/minor revision to be able to push down their review times. I think this highlights yet another problem with pre-publication peer review. The evidence is mounting, but I hear we may get a defense of the current system from one of the editors of this blog, so stay tuned…
  5. Several people (Sherri R., Alex N., many folks on Twitter) have pointed me to this article about gender bias in science. I was initially a bit skeptical of such a strong effect across a broad range of demographic variables, but after reading the supplemental material carefully, it is clear I was wrong. It is a very well designed/executed study and suggests that there is still a strong gender bias in science, across ages and disciplines. Interestingly, both men and women were biased against the female candidates. This is clearly a non-trivial problem to solve and needs a lot more work; maybe one step is to make recruitment packages more flexible (see the comment by Allison T. especially). 
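To make the law-of-large-numbers point in item 3 concrete, here is a minimal simulation sketch (mine, not from Bernanke’s article; the .300 “true” average is made up) showing how a hitter’s observed batting average settles toward the underlying talent level as at-bats accumulate:

```python
# A hitter with a made-up "true" .300 average: as at-bats pile up,
# the observed batting average converges to the underlying talent level.
import random

random.seed(1)

true_average = 0.300
hits = 0
for at_bat in range(1, 5001):
    hits += random.random() < true_average  # 1 if a hit, 0 otherwise
    if at_bat in (10, 100, 1000, 5000):
        print(f"after {at_bat:4d} at-bats: observed average = {hits / at_bat:.3f}")
```

Early on the observed average bounces around a lot, which is exactly why a short-term slump says little about a player’s underlying skill.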

Should we stop publishing peer-reviewed papers?

Nate Silver, everyone’s favorite statistician made good, just gave an interview where he said he thinks many journal articles should be blog posts. I have been thinking about this same issue for a while now, and I’m not the only one. This is a really interesting post suggesting that although scientific journals once facilitated dissemination of ideas, they now impede the flow of information and make it more expensive. 

Two recent examples really drove this message home for me. In the first example, I posted a quick idea called the Leekasso, which led to some discussion on the blog, has nearly 2,000 page views (a pretty decent number of downloads for a paper), and has been implemented in software by someone other than me. If this were one of my papers, it would be among my more reasonably high-impact ones. The second example is a post I put up about a recent Nature paper. The authors (who are really good sports) ended up writing to me to get my critiques. I wrote them out, and they responded. All of this happened after peer review and informally. All of the interaction also occurred over email, where no one but us could see it. 

It wouldn’t take much to go to a blog-based system. What if everyone who was publishing scientific results started a blog (free), and there was a site, run by PubMed, that aggregated the feeds (this would be super cheap to set up and maintain)? People could comment on blog posts and vote for the ones they liked if they had verified accounts. We’d skip peer review in favor of just producing results and discussing them, and the interesting results would get shared by email, Twitter, etc. 
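As a rough illustration of how little infrastructure the aggregation step would take, here is a minimal sketch that pulls a few blogs’ RSS feeds and merges the posts by date. The feed URLs are hypothetical placeholders and the third-party feedparser package is assumed; a real PubMed-run aggregator would of course need verified accounts, voting, and moderation on top of this.

```python
# Minimal feed aggregator sketch: fetch each blog's RSS/Atom feed,
# collect the posts, and list them newest first.
import calendar

import feedparser  # third-party package: pip install feedparser

FEEDS = [
    "https://lab-one.example.org/feed.xml",  # hypothetical blog feeds
    "https://lab-two.example.org/feed.xml",
]

posts = []
for url in FEEDS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        posts.append({
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published_parsed"),  # struct_time or None
        })

# Newest first; undated posts sort to the end.
posts.sort(
    key=lambda p: calendar.timegm(p["published"]) if p["published"] else 0,
    reverse=True,
)

for post in posts[:20]:
    print(post["title"], "-", post["link"])
```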

Why would we do this? Well, the current journal system: (1) significantly slows the publication of research, (2) costs thousands of dollars per paper, and (3) consumes significant labor that is not scientifically productive (such as resubmitting). 

Almost every paper I have had published has been rejected at least one place, including the “good” ones. This means that the results of even the good papers have been delayed by months, or, in the case of one paper, by a full year and a half. Any time I publish open access, it costs me at minimum around $1,500. I like open access because I think science funded by taxpayers should be free, but it is a significant drain on the resources of my group. Finally, most of the resubmission process is wasted labor. It generally doesn’t produce new science or improve the quality of the science; the effort goes into reformatting and re-inputting information about authors.

So why not have everyone just post results on their blog or figshare? They’d have a DOI that could be cited. We’d reduce everyone’s labor in reviewing/editing/resubmitting by an order of magnitude or two and save taxpayers a few thousand dollars per scientist per year in publication fees. We’d also increase the speed of updating and reacting to new ideas by an order of magnitude. 

I still maintain we should be evaluating people based on reading their actual work, not highly subjective and error-prone indices. But if the powers that be insisted, it would be easy to evaluate people based on likes/downloads/citations/discussion of papers rather than on the basis of journal titles and the arbitrary decisions of editors. 

So should we stop publishing peer-reviewed papers?

Edit: Titus points to a couple of good posts with interesting ideas about the peer review process that are worth reading, here and here. Also, Joe Pickrell et al. are already on this for population genetics, having set up the aggregator Haldane’s Sieve. It would be nice if this expanded to other areas (and people got credit for the papers published there, like they do for papers in journals). 

Pro-tips for graduate students (Part 3)

This is part of the ongoing series of pro tips for graduate students; check out parts one and two for the original installments. 

  1. Learn how to write papers in a very clear and simple style. Whenever you can, write in plain English, skip jargon as much as possible, and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals, but simple, clear language leads to much higher use/citation of your work. Examples of great writers are John Storey, Rob Tibshirani, Robert May, Martin Nowak, etc.
  2. It is a great idea to start reviewing papers as a graduate student. Don’t do too many (you should focus on your research), but doing a few will give you a lot of insight into how the peer-review system works. Ask your advisor/research mentor; they will generally have a review or two they could use help with. When doing reviews, keep in mind that a person spent a large chunk of time working on the paper you are reviewing. Also, don’t forget to use Google.
  3. Try to write your first paper as soon as you possibly can, and try to do as much of it on your own as you can. You don’t have to wait for faculty to give you ideas; read papers and think about what you would have done better (you might check with a faculty member first so you don’t repeat what has already been done, etc.). You will learn more writing your first paper than in almost any/all classes.

My worst (recent) experience with peer review

My colleagues and I just published a paper on validation of genomic results in BMC Bioinformatics. It is "highly accessed" and we are really happy with how it turned out. 

But it was brutal getting it published. Here is the line-up of places I sent the paper. 

  • Science: Submitted 10/6/10, rejected 10/18/10 without review. I know this seems like a long shot, but this paper on validation was published in Science not too long after. 
  • Nature Methods: Submitted 10/20/10, rejected 10/28/10 without review. Not much to say here, moving on…
  • Genome Biology: Submitted 11/1/10, rejected 1/5/11. Two of the three referees thought the paper was interesting, and few specific concerns were raised. I felt these could be addressed, so I appealed on 1/10/11; the appeal was accepted 1/20/11 and the paper resubmitted 1/21/11. Rejected again 2/25/11: two of the three referees were happy with the revisions, but one still didn’t like it. 
  • Bioinformatics: Submitted 3/3/11, rejected 3/13/11 without review. I appealed again; it turns out “I have checked with the editors about this for you and their opinion was that there was already substantial work in validating gene lists based on random sampling.” If anyone knows about one of those papers, let me know :-). 
  • Nucleic Acids Research: Submitted 3/18/11, rejected with an invitation for revision 3/22/11. Resubmitted 12/15/11 (got delayed by a few projects here), rejected 1/25/12. The reason for rejection seemed to be that one referee had major “philosophical issues” with the paper.
  • BMC Bioinformatics: Submitted 1/31/12, first review 3/23/12, resubmitted 4/27/12, second revision requested 5/23/12, revised version submitted 5/25/12, accepted 6/14/12. 
An interesting side note: the really brief reviews from the Genome Biology submission inspired me to do this paper. I had time to conceive the study, get IRB approval, build a web game for peer review, recruit subjects, collect the data, analyze the data, write the paper, submit it to three journals, and have it come out six months before the paper that inspired it was published! 
Ok, glad I got that off my chest.
What is your worst peer-review story?

Sunday data/statistics link roundup (6/17)

Happy Father’s Day!

  1. A really interesting read on randomized controlled trials (RCTs) and public policy. The examples in the boxes are fantastic. This seems to be one of the cases where the public policy folks are borrowing ideas from Biostatistics, which has been involved in randomized controlled trials for a long time. It’s a cool example of adapting good ideas in one discipline to the specific challenges of another. 
  2. Roger points to this link in the NY Times about the “Consumer Genome”, which is basically a collection of information about your purchases and consumer history. On Twitter, Leonid K. asks: ‘Since when has “genome” become a generic term for “a bunch of information”?’ I completely understand the reaction against the “genome of x” framing, which is an over-used analogy. But I actually think the analogy isn’t that unreasonable; like a genome, the information contained in your purchase/consumer history says something about you but doesn’t tell the whole story. I wonder how this information could be used for public health, since it is already being used for advertising….
  3. This PeerJ journal looks like it has the potential to be good. They even encourage open peer review, which has some benefits. I am not sure if it is sustainable; see, for example, this breakdown of the costs. I still think we can do better.
  4. Elon Musk is one of my favorite entrepreneurs. He tackles what I consider to be some of the most awe-inspiring and important problems around. This article about the Tesla S got me all fired up about how a person with vision can literally change the fuel we run on. Nothing to do with statistics, other than I think now is a similarly revolutionary time for our discipline. 
  5. There was some interesting discussion on Twitter of the usefulness of the Yelp dataset I posted for academic research. Not sure if this ever got resolved, but I think more and more as data sets from companies/startups become available, the terms of use for these data will be critical. 
  6. I’m still working on Roger’s puzzle from earlier this week. 

Peter Thiel on Peer Review/Science

Peter Thiel gives his take on science funding and peer review:

My libertarian views are qualified because I do think things worked better in the 1950s and 60s, but it’s an interesting question as to what went wrong with DARPA. It’s not like it has been defunded, so why has DARPA been doing so much less for the economy than it did forty or fifty years ago? Parts of it have become politicized. You can’t just write checks to the thirty smartest scientists in the United States. Instead there are bureaucratic processes, and I think the politicization of science—where a lot of scientists have to write grant applications, be subject to peer review, and have to get all these people to buy in—all this has been toxic, because the skills that make a great scientist and the skills that make a great politician are radically different. There are very few people who are both great scientists and great politicians. So a conservative account of what happened with science in the 20th century is that we had a decentralized, non-governmental approach all the way through the 1930s and early 1940s. At that point, the government could accelerate and push things tremendously, but only at the price of politicizing it over a series of decades. Today we have a hundred times more scientists than we did in 1920, but their productivity per capita is less than it used to be.

Thiel has a history of making controversial comments, and I don’t always agree with him, but I think that his point about the politicization of the grant process is interesting. 

Interview with Héctor Corrada Bravo


Héctor Corrada Bravo is an assistant professor in the Department of Computer Science and the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park. He moved to College Park after finishing his Ph.D. in computer science at the University of Wisconsin and a postdoc in biostatistics at the Johns Hopkins Bloomberg School of Public Health. He has done outstanding work at the intersection of molecular biology, computer science, and statistics. For more info check out his webpage.


Cooperation between Referees and Authors Increases Peer Review Accuracy

Jeff Leek and colleagues just published an article in PLoS ONE on the differences between anonymous (closed) and non-anonymous (open) peer review of research articles. They developed a “peer review game” as a model system to track authors’ and reviewers’ behavior over time under open and closed systems.

Under the open system, it was possible for authors to see who was reviewing their work, and they found that authors and reviewers tended to cooperate by reviewing each other’s work. Interestingly, they say:

It was not immediately clear that cooperation between referees and authors would increase reviewing accuracy. Intuitively, one might expect that players who cooperate would always accept each other’s solutions, regardless of whether they were correct. However, we observed that when a submitter and reviewer acted cooperatively, reviewing accuracy actually increased by 11%.


Free access publishing is awesome…but expensive. How do we pay for it?

I am a huge fan of open access journals. I think open access is good both for moral reasons (science should be freely available) and for more selfish ones (I want people to be able to read my work). If given the choice, I would publish all of my work in journals that distribute results freely. 

But it turns out that for most open/free access systems, the publishing charges are paid by the scientists publishing in the journals. I did a quick scan and compiled this little table of how much it costs to publish a paper in different journals (here is a bigger table): 

  • PLoS ONE: $1,350.00
  • PLoS Biology: $2,900.00
  • BMJ Open: $1,937.28
  • Bioinformatics (Open Access Option): $3,000.00
  • Genome Biology (Open Access Option): $2,500.00
  • Biostatistics (Open Access Option): $3,000.00


Do we really need applied statistics journals?

All statisticians in academia are constantly confronted with the question of where to publish their papers. Sometimes it’s obvious: A theoretical paper might go to the Annals of Statistics or JASA Theory & Methods or Biometrika. A more “methods-y” paper might go to JASA or JRSS-B or Biometrics or maybe even Biostatistics (where all three of us are or have been associate editors).

But where should the applied papers go? I think this is an increasingly large category of papers being produced by statisticians. These are papers that do not necessarily develop a brand new method or uncover any new theory, but apply statistical methods to an interesting dataset in a not-so-obvious way. Some papers might combine a set of existing methods that have never been combined before in order to solve an important scientific problem.

Well, there are some official applied statistics journals: JASA Applications & Case Studies or JRSS-C or the Annals of Applied Statistics. At least they have the word “application” or “applied” in their titles. But the question we should be asking is: if a paper is published in one of those journals, will it reach the right audience?

What is the audience for an applied stat paper? Perhaps it depends on the subject matter. If the application is biology, then maybe biologists. If it’s an air pollution and health application, maybe environmental epidemiologists. My point is that the key audience is probably not a bunch of other statisticians.

The fundamental conundrum of applied stat papers comes down to this question: If your application of statistical methods is truly addressing an important scientific question, then shouldn’t the scientists in the relevant field want to hear about it? If the answer is yes, then we have two options: Force other scientists to read our applied stat journals, or publish our papers in their journals. There doesn’t seem to be much momentum for the former, but the latter is already being done rather frequently. 

Across a variety of fields we see statisticians making direct contributions to science by publishing in non-statistics journals. Some examples are this recent paper in Nature Genetics or a paper I published a few years ago in the Journal of the American Medical Association. I think there are two key features that these papers (and many others like them) have in common:

  • There was an important scientific question addressed. The first paper investigates variability of methylated regions of the genome and its relation to cancer tissue and the second paper addresses the problem of whether ambient coarse particles have an acute health effect. In both cases, scientists in the respective substantive areas were interested in the problem and so it was natural to publish the “answer” in their journals. 
  • The problem was well-suited to be addressed by statisticians. Both papers involved large and complex datasets for which training in data analysis and statistics was important. In the analysis of coarse particles and hospitalizations, we used a national database of air pollution concentrations and obtained health status data from Medicare. Linking these two databases together and conducting the analysis required enormous computational effort and statistical sophistication. While I doubt we were the only people who could have done that analysis, we were very well-positioned to do so. 

So when statisticians are confronted with a scientific problem that is both (1) important and (2) well-suited for statisticians, what should we do? My feeling is we should skip the applied statistics journals and bring the message straight to the people who want/need to hear it.

There are two problems that come to mind immediately. First, sometimes the paper ends up being so statistically technical that a scientific journal won’t accept it. And of course, in academia, there is the sticky problem of how you get promoted in a statistics department when your CV is filled with papers in non-statistics journals. This entry is already long enough, so I’ll address these issues in a future post.

Related Posts: Rafa on "Where are the Case Studies?" and "Authorship Conventions"