Help me find the good JSM talks

I’m about to head out for JSM in a couple of weeks. The sheer magnitude of the conference means it is pretty hard to figure out what talks I should attend. One approach I’ve used in the past is to identify people who I know give good talks and go to their talks. But that isn’t a very good talk-discovery mechanism. So this year I’m trying a crowd-sourcing experiment. 

First, some background on what kind of talks I like.

  • I strongly prefer talks where someone is tackling a problem presented by a new kind of data, whether they got that data from a collaborator, they scraped it off the web, or they generated it themselves.
  •  I am 100% ok if they only used linear regression to analyze the data if it led to interesting exploratory analysis, surprising results, or a cool conclusion. 
  • Major bonus points if the method is being used to solve a real-world problem.
  • I also really like creative and informative plots.
  • I prefer pictures to text/equations in general

On the other hand, I really am not a fan of talks where someone developed a method, no matter how cool, then started looking around for a data set to apply it to. 

If you know of anyone who is going to give a talk like that can you post it in the comments or tweet it to @simplystats with the hashtag #goodJSMtalks?

Also, if you know anyone who gives posters like this, lemme know so I can drop by.  

Thanks!!!

Tags: Talks JSM data

Sunday Data/Statistics Link Roundup (7/15/12)

  1. A really nice list of journals software/data release policies from Titus’ blog. Interesting that he couldn’t find a data/release policy for the New England Journal of Medicine. I wonder if that is because it publishes mostly clinical studies, where the data are often protected for privacy reasons? It seems like there is going to eventually be a big discussion of the relative importance of privacy and open data in the clinical world. 
  2. Some interesting software that can be used to build virtual workflows for computational science. It seems like a lot of data analysis is still done via “drag and drop” programs. I can’t help but wonder if our effort should be focused on developing drag and drop or educating the next generation of scientists to have minimum scripting capabilities. 
  3. We added StatsChat by Thomas L. and company to our blogroll. Lots of good stuff there, for example, this recent post on when randomized trials don’t help. You can also follow them on twitter.  
  4. A really nice post on processing public data with R. As more and more public data becomes available, from governments, companies, APIs, etc. the ability to quickly obtain, process, and visualize public data is going to be hugely valuable. 
  5. Speaking of public data, you could get it from APIs or from government websites. But beware those category 2 problems

Motivating statistical projects

It seems like half of the battle in statistics is identifying an important/unsolved problem. In math, this is easy, they have a list. So why is it harder for statistics? Since I have to think up projects to work on for my research group, for classes I teach, and for exams we give, I have spent some time thinking about ways that research problems in statistics arise.

I borrowed a page out of Roger’s book and made a little diagram to illustrate my ideas (actually I can’t even claim credit, it was Roger’s idea to make the diagram). The diagram shows the rough relationship of science, data, applied statistics, and theoretical statistics. Science produces data (although there are other sources), the data are analyzed using applied statistical methods, and theoretical statistics concerns the math behind statistical methods. The dotted line indicates that theoretical statistics ostensibly generalizes applied statistical methods so they can be applied in other disciplines. I do think that this type of generalization is becoming harder and harder as theoretical statistics becomes farther and farther removed from the underlying science.

Based on this diagram I see three major sources for statistical problems: 

  1. Theoretical statistical problems One component of statistics is developing the mathematical and foundational theory that proves we are doing sensible things. This type of problem often seems to be inspired by popular methods that exists/are developed but lack mathematical detail. Not surprisingly, much of the work in this area is motivated by what is mathematically possible or convenient, rather than by concrete questions that are of concern to the scientific community. This work is important, but the current distance between theoretical statistics and science suggests that the impact will be limited primarily to the theoretical statistics community. 
  2. Applied statistics motivated by convenient sources of data. The best example of this type of problem are the analyses in Freakonomics.  Since both big data and small big data are now abundant, anyone with a laptop and an internet connection can download the Google n-gram data, a microarray from GEO data about your city, or really data about anything and perform an applied analysis. These analyses may not be straightforward for computational/statistical reasons and may even require the development of new methods. These problems are often very interesting/clever and so are often the types of analyses you hear about in newspaper articles about “Big Data”. But they may often be misleading or incorrect, since the underlying questions are not necessarily well founded in scientific questions. 
  3. Applied statistics problems motivated by scientific problems. The final category of statistics problems are those that are motivated by concrete scientific questions. The new sources of big data don’t necessarily make these problems any easier. They still start with a specific question for which the data may not be convenient and the math is often intractable. But the potential impact of solving a concrete scientific problem is huge, especially if many people who are generating data have a similar problem. Some examples of problems like this are: can we tell if one batch of beer is better than another, how are quantitative characteristics inherited from parent to child, which treatment is better when some people are censored, how do we estimate variance when we don’t know the distribution of the data, or how do we know which variable is important when we have millions

So this leads back to the question, what are the biggest open problems in statistics? I would define these problems as the “high potential impact” problems from category 3. To answer this question, I think we need to ask ourselves, what are the most common problems people are trying to solve with data but can’t with what is available right now? Roger nailed this when he talked about the role of statisticians in the science club

Here are a few ideas that could potentially turn into high-impact statistical problems, maybe our readers can think of better ones?

  1. How do we credential students taking online courses at a huge scale?
  2. How do we communicate risk about personalized medicine (or anything else) to a general population without statistical training? 
  3. Can you use social media as a preventative health tool?
  4. Can we perform randomized trials to improve public policy?
Image Credits: The Science Logo is the old logo for the USU College of Science, the R is the logo for the R statistical programming language, the data image is a screenshot of Gapminder, and the theoretical statistics image comes from the Wikipedia page on the law of large numbers.

Edit: I just noticed this paper, which seems to support some of the discussion above. On the other hand, I think just saying lots of equations = less citations falls into category 2 and doesn’t get at the heart of the problem. 

A plot of my citations in Google Scholar vs. Web of Science

There has been some discussion about whether Google Scholar or one of the proprietary software companies numbers are better for citation counts. I personally think Google Scholar is better for a number of reasons:

  1. Higher numbers, but consistently/adjustably higher :-)
  2. It’s free and the data are openly available. 
  3. It covers more ground (patents, theses, etc.) to give a better idea of global impact
  4. It’s easier to use

I haven’t seen a plot yet relating Web of Science citations to Google Scholar citations, so I made one for my papers.

GS has about 41% more citations per paper than Web of Science. That is consistent with what other people have found. It also looks reasonably linearish. I wonder what other people’s plots would look like? 

Here is the R code I used to generate the plot (the names are Pubmed IDs for the papers):

library(ggplot2)

names = c(16141318,16357033,16928955,17597765,17907809,19033188,19276151,19924215,20560929,20701754,20838408, 21186247,21685414,21747377,21931541,22031444,22087737,22096506,22257669) 

y = c(287,122,84,39,120,53,4,52,6,33,57,0,0,4,1,5,0,2,0)

x = c(200,92,48,31,79,29,4,51,2,18,44,0,0,1,0,2,0,1,0)

Year = c(2005,2006,2007,2007,2007,2008,2009,2009,2011,2010,2010,2011,2012,2011,2011,2011,2011,2011,2012)

q <- qplot(x,y,xlim=c(-20,300),ylim=c(-20,300),xlab=”Web of Knowledge”,ylab=”Google Scholar”) + geom_point(aes(colour=Year),size=5) + geom_line(aes(x = y, y = y),size=2)

Sunday data/statistics link roundup (1/29)

  1. A really nice D3 tutorial. I’m 100% on board with D3, if they could figure out a way to export the graphics as pdfs, I think this would be the best visualization tool out there. 
  2. A personalized calculator that tells you what number (of the 7 billion or so) that you are based on your birth day. I’m person 4,590,743,884. Makes me feel so special….
  3. An old post of ours, on dongle communism. One of my favorite posts, it came out before we had much traffic but deserves more attention. 
  4. This isn’t statistics/data related but too good to pass up. From the Bones television show, malware fractals shaved into a bone. I love TV science. Thanks to Dr. J for the link.
  5. Stats are popular

Sunday Data/Statistics Link Roundup

  1. Statistics help for journalists (don’t forget to keep rating stories!) This is the kind of thing that could grow into a statisteracy page. The author also has a really nice plug for public schools
  2. An interactive graphic to determine if you are in the 1% from the New York Times (I’m not…).
  3. Mike Bostock’s d3.js presentation, this is some really impressive visualization software. You have to change the slide numbers manually but it is totally worth it. Check out slide 10 and slide 14. This is the future of data visualization. Here is a beginners tutorial to d3.js by Mike Dewar.
  4. An online diagnosis prediction start-up (Symcat) based on data analysis from two Hopkins Med students.

Finally, a bit of a bleg. I’m going to try to make this link roundup a regular post. If you have ideas for links I should include, tweet us @simplystats or send them to Jeff’s email. 

In the era of data what is a fact?

The Twitter universe is abuzz about this article in the New York Times. Arthur Brisbane, who responds to reader’s comments, asks 

I’m looking for reader input on whether and when New York Times news reporters should challenge “facts” that are asserted by newsmakers they write about.

He goes on to give a couple of examples of qualitative facts that reporters have used in stories without questioning the veracity of the claims. As many people pointed out in the comments, this is completely absurd. Of course reporters should check facts and report when the facts in their stories, or stated by candidates, are not correct. That is the purpose of news reporting. 

But I think the question is a little more subtle when it comes to quantitative facts and statistics. Depending on what subsets of data you look at, what summary statistics you pick, and the way you present information - you can say a lot of different things with the same data. As long as you report what you calculated, you are technically reporting a fact - but it may be deceptive. The classic example is calculating median vs. mean home prices. If Bill Gates is in your neighborhood, no matter what the other houses cost, the mean price is going to be pretty high! 

Two concrete things can be done to deal with the malleability of facts in the data age.

First, we need to require that our reporters, policy makers, politicians, and decision makers report the context of numbers they state. It is tempting to use statistics as blunt instruments, punctuating claims. Instead, we should demand that people using statistics to make a point embed them in the broader context. For example, in the case of housing prices, if a politician reports the mean home price in a neighborhood, they should be required to state that potential outliers may be driving that number up. How do we make this demand? By not believing any isolated statistics - statistics will only be believed when the source is quoted and the statistic is described.  

But this isn’t enough, since the context and statistics will be meaningless without raising overall statisteracy (statistical literacy, not to be confused with numeracy).  In the U.S. literacy campaigns have been promoted by library systems. Statisteracy is becoming just as critical; the same level of social pressure and assistance should be applied to individuals who don’t know basic statistics as those who don’t have basic reading skills. Statistical organizations, academic departments, and companies interested in analytics/data science/statistics all have a vested interest in raising the population statisteracy. Maybe a website dedicated to understanding the consequences of basic statistical concepts, rather than the concepts themselves?

And don’t forget to keep rating health news stories!

Help us rate health news reporting with citizen-science powered http://www.healthnewsrater.com

We here at Simply Statistics are big fans of science news reporting. We read newspapers, blogs, and the news sections of scientific journals to keep up with the coolest new research. 

But health science reporting, although exciting, can also be incredibly frustrating to read. Many articles have sensational titles, like "How using Facebook could raise your risk of cancer". The articles go on to describe some research and interview a few scientists, then typically make fairly large claims about what the research means. This isn’t surprising - eye catching headlines are important in this era of short attention spans and information overload. 

If just a few extra pieces of information were reported in science stories about the news, it would be much easier to evaluate whether the cancer risk was serious enough to shut down our Facebook accounts. In particular we thought any news story should report:

  1. A link back to the original research article where the study (or studies) being described was published. Not just a link to another news story. 
  2. A description of the study design (was it a randomized clinical trial? a cohort study? 3 mice in a lab experiment?)
  3. Who funded the study - if a study involving cancer risk was sponsored by a tobacco company, that might say something about the results.
  4. Potential financial incentives of the authors - if the study is reporting a new drug and the authors work for a drug company, that might say something about the study too. 
  5. The sample size - many health studies are based on a very small sample size, only 10 or 20 people in a lab. Results from these studies are much weaker than results obtained from a large study of thousands of people. 
  6. The organism - Many health science news reports are based on studies performed in lab animals and may not translate to human health. For example, here is a report with the headline "Alzheimers may be transmissible, study suggests". But if you read the story, scientists injected Alzheimer’s afflicted brain tissue from humans into mice. 

So we created a citizen-science website for evaluating health news reporting called HealthNewsRater. It was built by Andrew Jaffe and Jeff Leek, with Andrew doing the bulk of the heavy lifting.  We would like you to help us collect data on the quality of health news reporting. When you read a health news story on the Nature website, at nytimes.com, or on a blog, we’d like you to take a second to report on the news. Just determine whether the 6 pieces of information above are reported and input the data at HealthNewsRater.

We calculate a score for each story based on the formula:

HNR-Score = (5 points for a link to the original article + 1 point each for the other criteria)/2

The score weights the link to the original article very heavily, since this is the best source of information about the actual science underlying the story. 

In a future post we will analyze the data we have collected, make it publicly available, and let you know which news sources are doing the best job of reporting health science. 

Update: If you are a web-developer with an interest in health news contact us to help make HealthNewsRater better! 

Sunday Data/Statistics Link Roundup

A few data/statistics related links of interest:

  1. Eric Lander Profile
  2. The math of lego (should be “The statistics of lego”)
  3. Where people are looking for homes.
  4. Hans Rosling’s Ted Talk on the Developing world (an oldie but a goodie)
  5. Elsevier is trying to make open-access illegal (not strictly statistics related, but a hugely important issue for academics who believe government funded research should be freely accessible), more here

Where do you get your data?

Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”

My contention is that if someone asks you these questions, start looking for the exits.

Read More