- Brian Caffo headlines the WaPo article about massive online open courses. He is the driving force behind our department’s involvement in offering these massive courses. I think this sums it up: `“I can’t use another word than unbelievable,” Caffo said. Then he found some more: “Crazy . . . surreal . . . heartwarming.”’
- A really interesting discussion of why “A Bet is a Tax on B.S.”. It nicely describes why intelligent betters must be disinterested in the outcome, otherwise they will end up losing money. The Nate Silver controversy just doesn’t seem to be going away, good news for his readership numbers, I bet. (via Rafa)
- An interesting article on how scientists are not claiming global warming is the sole cause of the extreme weather events we are seeing, but that it does contribute to them being more extreme. The key quote: “We can’t say that steroids caused any one home run by Barry Bonds, but steroids sure helped him hit more and hit them farther. Now we have weather on steroids.” —Eric Pooley. (via Roger)
- The NIGMS is looking for a Biomedical technology, Bioinformatics, and Computational Biology Director. I hope that it is someone who understands statistics! (via Karl B.)
- Here is another article that appears to misunderstand statistical prediction. This one is about the Italian scientists who were jailed for failing to predict an earthquake. No joke.
- We talk a lot about how much the data revolution will change industries from social media to healthcare. But here is an important reality check. Patients are not showing an interest in accessing their health care data. I wonder if part of the reason is that we haven’t come up with the right ways to explain, understand, and utilize what is inherently stochastic and uncertain information.
- The BMJ is now going to require all data from clinical trials published in their journal to be public. This is a brilliant, forward thinking move. I hope other journals will follow suit. (via Karen B.R.)
- An interesting article about the impact of retractions on citation rates, suggesting that papers in fields close to those of the retracted paper may show negative impact on their citation rates. I haven’t looked it over carefully, but how they control for confounding seems incredibly important in this case. (via Alex N.).
Amanda Palmer broke Twitter yesterday with her insurance poll. She started off just talking about how hard it is for musicians who rarely have health insurance, but then wandered into polling territory. She sent out a request for people to respond with the following information:
quick twitter poll. 1) COUNTRY?! 2) profession? 3) insured? 4) if not, why not, if so, at what cost per month (or covered by job)?
This quick little poll struck a nerve with people and her Twitter feed blew up. Long story short, tons of interesting information was gathered from folks. This information is frequently kept semi-obscured, particularly what is the cost of health insurance for folks in different places. This isn’t the sort of info that insurance companies necessarily publicize widely and isn’t the sort of thing people talk about.
The results were really fascinating and its worth reading the above blog post or checking out the hashtag: #insurancepoll. But the most fascinating thing for me as a statistician was thinking about how to analyze these data. @aubreyjaubrey is apparently collecting the data someplace, hopefully she’ll make it public.
At least two key issues spring to mind:
- This is a massive convenience sample.
- It is being collected through a social network
Although I’m sure there are more. If a student is looking for an amazingly interesting/rich data set and some seriously hard stats problems, they should get in touch with Aubrey and see if they can make something of it!
- Just got back from IBC 2012 in Kobe Japan. I was in an awesome session (organized by the inimitable Lieven Clement) with great talks by Matt McCall, Djork-Arne Clevert, Adetayo Kasim, and Willem Talloen. Willem’s talk nicely tied in our work and how it plays into the pharmaceutical development process and the bigger theme of big data. On the way home through SFO I saw this hanging in the airport. A fitting welcome back to the states. Although, as we talked about in our first podcast, I wonder how long the Big Data hype will last…
- Simina B. sent this link along for a masters program in analytics at NC State. Interesting because it looks a lot like a masters in statistics program, but with a heavier emphasis on data collection/data management. I wonder what role the stat department down there is playing in this program and if we will see more like it pop up? Or if programs like this with more data management will be run by stats departments other places. Maybe our friends down in Raleigh have some thoughts for us.
- If one set of weekly links isn’t enough to fill your procrastination quota, go check out NextGenSeek’s weekly stories. A bit genomics focused, but lots of cool data/statistics links in there too. Love the “extreme Venn diagrams”.
- This seems almost like the fast statistics journal I proposed earlier. Can’t seem to access the first issue/editorial board either. Doesn’t look like it is open access, so it’s still not perfect. But I love the sentiment of fast/single round review. We can do better though. I think Yihue X. has some really interesting ideas on how.
- My wife taught for a year at Grinnell in Iowa and loved it there. They just released this cool data set with a bunch of information about the college. If all colleges did this, we could really dig in and learn a lot about the American secondary education system (link via Hilary M.).
- From the way-back machine, a rant from Rafa about meetings. Stayed tuned this week for some Simply Statistics data about our first year on the series of tubes.
As Roger pointed out the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, where the title, “How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”.
StatWing looks super user-friendly and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from. It makes it really easy to screw up a data analysis. It will tell you when something is significant and if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent false positives that plague science and cost business big $$.
So I started thinking about what kind of software would prevent these sort of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine”. Here is how it works, you input a data set and then specify the question you are asking (is variable Y related to variable X? can i predict Z from W?) then, depending on your question, it uses a deterministic set of methods to analyze the data. Say regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result to figshare.
The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around.
The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better?
First off, a quick apology for missing last week, and thanks to Augusto for noticing! On to the links:
- Unbelievably the BRCA gene patents were upheld by the lower court despite the Supreme Court coming down pretty unequivocally against patenting correlations between metabolites and health outcomes. I wonder if this one will be overturned if it makes it back up to the Supreme Court.
- A really nice interview with David Spiegelhalter on Statistics and Risk. David runs the Understanding Uncertainty blog and published a recent paper on visualizing uncertainty. My favorite line from the interview might be: “There is a nice quote from Joel Best that “all statistics are social products, the results of people’s efforts”. He says you should always ask, “Why was this statistic created?” Certainly statistics are constructed from things that people have chosen to measure and define, and the numbers that come out of those studies often take on a life of their own.”
- For those of you who use Tumblr like we do, here is a cool post on how to put technical content into your blog. My favorite thing I learned about is the Github Gist that can be used to embed syntax-highlighted code.
- A few interesting and relatively simple stats for projecting the success of NFL teams. One thing I love about sports statistics is that they are totally willing to be super ad-hoc and to be super simple. Sometimes this is all you need to be highly predictive (see for example, the results of Football’s Pythagorean Theorem). I’m sure there are tons of more sophisticated analyses out there, but if it ain’t broke… (via Rafa).
- My student Hilary has a new blog that’s worth checking out. Here is a nice review of ProjectTemplate she did. I think the idea of having an organizing principle behind your code is a great one. Hilary likes ProjectTemplate, I think there are a few others out there that might be useful. If you know about them, you should leave a comment on her blog!
- This is ridiculously cool. Man City has opened up their data/statistics to the data analytics community. After registering, you have access to many of the statistics the club uses to analyze their players. This is yet another example of open data taking over the world. It’s clear that data generators can create way more value for themselves by releasing cool data, rather than holding it all in house.
- The Portland Public Library has created a website called Book Psychic, basically a recommender system for books. I love this idea. It would be great to have a recommender system for scientific papers.
Statisticians have not always been great self-promoters. I think in part this comes from our tendency to be arbiters rather than being involved in the scientific process. In some ways, I think this is a good thing. Self-promotion can quickly become really annoying. On the other hand, I think our advertising shortcomings are hurting our field in a number of different ways.
Here are a few:
- As Rafa points out even though statisticians are ridiculously employable right now it seems like statistics M.S. and Ph.D. programs are flying under the radar in all the hype about data/data science (here is an awesome one if you are looking). Computer Science and Engineering, even the social sciences, are cornering the market on “big data”. This potentially huge and influential source of students may pass us by if we don’t advertise better.
- A corollary to this is lack of funding. When the Big Data event happened at the White House with all the major funders in attendance to announce $200 million in new funding for big data, none of the invited panelists were statisticians.
- Our top awards don’t get the press they do in other fields. The Nobel Prize announcements are an international event. There is always speculation/intense interest in who will win. There is similar interest around the Fields medal in mathematics. But the top award in statistics, the COPSS award doesn’t get nearly the attention it should. Part of the reason is lack of funding (the Fields is $15k, the COPSS is $1k). But part of the reason is that we, as statisticians, don’t announce it, share it, speculate about it, tell our friends about it, etc. The prestige of these awards can have a big impact on the visibility of a field.
- A major component of visibility of a scientific discipline, for better or worse, is the popular press. The most recent article in a long list of articles at the New York Times about the data revolution does not mention statistics/statisticians. Neither do the other articles. We need to cultivate relationships with the media.
We are all busy solving real/hard scientific and statistical problems, so we don’t have a lot of time to devote to publicity. But here are a couple of easy ways we could rapidly increase the visibility of our field, ordered roughly by the degree of time commitment.
- All statisticians should have Twitter accounts and we should share/discuss our work and ideas online. The more we help each other share, the more visibility our ideas will get.
- We should make sure we let the ASA know about cool things that are happening with data/statistics in our organizations and they should spread the word through their Twitter account and other social media.
- We should start a conversation about who we think will win the next COPSS award in advance of the next JSM and try to get local media outlets to pick up our ideas and talk about the award.
- We should be more “big tent” about statistics. ASA President Robert Rodriguez nailed this in his speech at JSM. Whenever someone does something with data, we should claim them as a statistician. Sometimes this will lead to claiming people we don’t necessarily agree with. But the big tent approach is what is allowing CS and other disciplines to overtake us in the data era.
- We should consider setting up a place for statisticians to donate money to build up the award fund for the COPSS/other statistics prizes.
- We should try to forge relationships with start-up companies and encourage our students to pursue industry/start-up opportunities if they have interest. The less we are insular within the academic community, the more high-profile we will be.
- It would be awesome if we started a statistical literacy outreach program in communities around the U.S. We could offer free courses in community centers to teach people how to understand polling data/the census/weather reports/anything touching data.
Those are just a few of my ideas, but I have a ton more. I’m sure other people do too and I’d love to hear them. Let’s raise the tide and lift all of our boats!
Editor’s Note: This post written by Roger Peng and Jeff Leek.
A couple of weeks ago, we announced that we would be teaching free courses in Computing for Data Analysis and Data Analysis on the Coursera platform. At the same time, a number of other universities also announced partnerships with Coursera leading to a large number of new offerings. That, coupled with a new round of funding for Coursera, led to press coverage in the New York Times, the Atlantic, and other media outlets.
There was an ensuing explosion of blog posts and commentaries from academics. The opinions ranged from dramatic, to negative, to critical, to um…hilariously angry. Rafa posted a few days ago that many of the folks freaking out are missing the point - the opportunity to reach a much broader audience of folks with our course content.
[Before continuing, we’d like to make clear that at this point no money has been exchanged between Coursera and Johns Hopkins. Coursera has not given us anything and Johns Hopkins hasn’t given them anything. For now, it’s just a mutually beneficial partnership — we get their platform and they get to use our content. In the future, Coursera will need to figure out a way to make money, and they are currently considering a number of options.]
Now that the initial wave of hype has died down, we thought we’d outline why we are excited about participating in Coursera. We think it is only fair to start by saying this is definitely an experiment. Coursera is a newish startup and as such is still figuring out its plan/business model. Similarly, our involvement so far has been a little whirlwind and we haven’t actually taught courses yet, and we are happy to collect data and see how things turn out. So ask us again in 6 months when we are both done teaching.
But for now, this is why we are excited.
- Open Access. As Rafa alluded to in his post, this is an opportunity to reach a broad and diverse audience. As academics devoted to open science, we also think that opening up our courses to the biggest possible audience is, in principle, a good thing. That is why we are both basing our courses on free software and teaching the courses for free to anyone with an internet connection.
- Excitement about statistics. The data revolution means that there is a really intense interest in statistics right now. It’s so exciting that Joe Blitzstein’s stat class on iTunes U has been one of the top courses on that platform. Our local superstar John McGready has also put his statistical reasoning course up on iTunes U to a similar explosion of interest. Rafa recently put his statistics for genomics lectures up on Youtube and they have already been viewed thousands of times. As people who are super pumped about the power and importance of statistics, we want to get in on the game.
- We work hard to develop good materials. We put effort into building materials that our students will find useful. We want to maximize the impact of these efforts. We have over 30,000 students enrolled in our two courses so far.
- It is an exciting experiment. Online teaching, including very very good online teaching, has been around for a long time. But the model of free courses at incredibly large scale is actually really new. Whether you think it is a gimmick or something here to stay, it is exciting to be part of the first experimental efforts to build courses at scale. Of course, this could flame out. We don’t know, but that is the fun of any new experiment.
- Good advertising. Every professor at a research school is a start-up of one. This idea deserves it’s own blog post. But if you accept that premise, to keep the operation going you need good advertising. One way to do that is writing good research papers, another is having awesome students, a third is giving talks at statistical and scientific conferences. This is an amazing new opportunity to showcase the cool things that we are doing.
- Coursera built some cool toys. As statisticians, we love new types of data. It’s like candy. Coursera has all sorts of cool toys for collecting data about drop out rates, participation, discussion board answers, peer review of assignments, etc. We are pretty psyched to take these out for a spin and see how we can use them to improve our teaching.
- Innovation is going to happen in education. The music industry spent years fighting a losing battle over music sharing. Mostly, this damaged their reputation and stopped them from developing new technology like iTunes/Spotify that became hugely influential/profitable. Education has been done the same way for hundreds (or thousands) of years. As new educational technologies develop, we’d rather be on the front lines figuring out the best new model than fighting to hold on to the old model.
Finally, we’d like to say a word about why we think in-person education isn’t really threatened by MOOCs, at least for our courses. If you take one of our courses through Coursera you will get to see the lectures and do a few assignments. We will interact with students through message boards, videos, and tutorials. But there are only 2 of us and 30,000 people registered. So you won’t get much one on one interaction. On the other hand, if you come to the top Ph.D. program in biostatistics and take Data Analysis, you will now get 16 weeks of one-on-one interaction with Jeff in a classroom, working on tons of problems together. In other words, putting our lectures online now means at Johns Hopkins you get the most qualified TA you have ever had. Your professor.
- This paper is the paper describing how Uri Simonsohn identified academic misconduct using statistical analyses. This approach has received a huge amount of press in the scientific literature. The basic approach is that he calculates the standard deviations of mean/standard deviation estimates across groups being compared. Then he simulates from a Normal distribution and shows that under the Normal model, it is unlikely that the means/standard deviations are so similar. I think the idea is clever, but I wonder if the Normal model is the best choice here…could the estimates be similar because it was the same experimenter, etc.? I suppose the proof is in the pudding though, several of the papers he identifies have been retracted.
- This is an amazing rant by a history professor at Swarthmore over the development of massive online courses, like the ones Roger, Brian and I are teaching. I think he makes some important points (especially about how we could do the same thing with open access in a heart beat if universities/academics through serious muscle behind it), but I have to say, I’m personally very psyched to be involved in teaching one of these big classes. I think that statistics is a field that a lot of people would like to learn something about and I’d like to make it easier for them to do that because I love statistics. I also see the strong advantage of in-person education. The folks who enroll at Hopkins and take our courses will obviously get way more one-on-one interaction, which is clearly valuable. I don’t see why it has to be one or the other…
- An interesting discussion with Facebook’s former head of big data. I think the first point is key. A lot of the “big data” hype has just had to do with the infrastructure needed to deal with all the data we are collecting. The bigger issue (and where statisticians will lead) is figuring out what to do with the data.
- This is a great post about data smuggling. The two key points that I think are raised are: (1) how when the data get big enough, they have their own mass and aren’t going to be moved, and (2) how physically mailing harddrives is still the fastest way of transferring big data sets. That is certainly true in genomics where it is called “sneaker net” when a collaborator walks a hard drive over to our office. Hopefully putting data in physical terms will drive home the point that the new scientists are folks that deal with/manipulate/analyze data.
- Not statistics related, but here is a high-bar to hold your work to: the bus-crash test. If you died in a bus-crash tomorrow, would your discipline notice? Yikes. Via C.T. Brown.
Lauren Talbot is a quantitative analyst for the New York City Financial Crime Task Force. Before working for NYC she was an analyst at Acumen LLC and got her degree in economics from Stanford University. She is a key player turning spatial data in NYC into new tools for government management. We talked to Lauren about her work, how she is using open data to do things like predict where fires might occur, and how she got started in the Financial Crime Task Force.
SS: Do you consider yourself a statistician, computer scientist, or something else?
LT: A lot of us can’t call ourselves statisticians or computer scientists, even if that is a large part of what we do, because we never studied those fields formally. Quantitative or Data Analyst are popular job titles, but don’t really do justice to all the code infrastructure/systems you have to build and cultivate — you aren’t simply analyzing, you are matching and automating and illustrating, too. There is also a large creative aspect, because you have to figure out how to present the data in a way that is useful and compelling to people, many of whom have no prior experience working with data. So I am glad people have started using the term “Data Scientist,” even if makes me chuckle a little. Ideally I would call myself “Data Artist,” or “Data Whisperer,” but I don’t think people would take me seriously.
SS: How did you end up in the NYC Mayor’s Financial Crimes Task Force?
LT: I actually responded to a Craigslist posting. While I was still in the Bay Area (where I went to college), I was looking for jobs in NYC because I wanted to relocate back here, where I am originally from. I was searching for SAS programmer jobs, and finding a lot of stuff in healthcare that made me yawn a little. And then I had the idea to try the government jobs section. The Financial Crimes Task Force (now part of a broader citywide analytics effort under the Office of Policy and Strategic Planning) was one of two listings that popped up, and I read the description and immediately thought “dream job!” It has turned out to be even better than I imagined, because there is such a huge opportunity to make a difference — the Bloomberg administration is actually very interested in operationalizing insights from city data, so they are listening to the data people and using their work to inform agency resource allocation and even sometimes policy. My fellow are also just really fun and intelligent. I’m constantly impressed by how quickly they pick up new skills, get to the bottom of things, and jump through hoops to get things done. We also amuse and entertain each other throughout the day, which is awesome.
SS: Can you tell us about one of the more interesting cases you have tackled and how data analysis/statistics played into the case?
LT: Since this is the NYC Mayor’s Office, dealing with city data, almost of our analyses are in some way location-based. We are trying to answer questions like, “what locations are most likely to have a catastrophic event (e.g. fire) in the near future?” This involves combining many disparate datasets such as fire data, buildings data, emergency calls data, city planning data, even garbage data. We use the tax lot ID as a common identifier, but many of the datasets do not come with this variable - they only have a text address or intersection. In many cases, the address is entered manually and has spelling mistakes. In the beginning, we were using a point-and-click geocoding tool that the city provides that reads the text field and assigns the tax lot ID. However, it was taking a long time to prepare the data so it could be used by the program, and the program was returning many errors. When we visually inspected the errors, we saw that they were caused by minor spelling differences and naming conventions. Now, almost every week we get new datasets in different structures, and we need to geocode them immediately before we can really work with them. So we needed a geocoding program that was automated and flexible, as well as capable of geocoding addresses and intersections with spelling errors and different conventions. Over the past few months, using publicly available city planning datasets and regular expressions, my side project has been creating such a program in SAS. My first test case was self-reported data created solely through user entry. This dataset, which could only be 40% geocoded using the original tool, is now 93% geocoded using the program we developed. The program is constantly evolving and improving. Now it is assigning block faces, spellchecking street and city names, and accounting for the occasional gaps in the data. We use it for everything.
SS: What are the computational tools and ideas you use most frequently in your day to day work (R, databases, regression analysis, etc.)?
LT: In the beginning, all of the data was sent to us in SQL or Excel, which was not very efficient. Now we are building a multi-agency SAS platform that can be used by programmers and non-programmers. Since there are so many data sources that can work together, having a unified platform creates new discoveries that agencies can use to be more efficient or effective. For example, a building investigator can use 311 noise complaints to uncover vacated properties that are being illegally occupied. The platform employs Palantir, which is an excellent front-end tool for playing around with the data and exploring many-to-many relationships. Internally, my team has also used R, Python, Java, even VBA. Whatever gets the job done. We use a good mix of statistical tools. The bread and butter is usually manipulating and understanding new data sources, which is necessary before we can start trying to do something like run a multiple regression, for example. In the end, it’s really a mashup: text parsing, name matching, summarizing/describing/reporting using comparative statistics, geomapping, graphing, logistic regression, even kernel density, can all be part of the mix. Our guiding principle is to use the tool/analysis/strategy that has the highest return on investment of time and analyst resources for the city.
SS: What are the challenges of working as a quantitative analyst in a regulatory role? Is it hard to make your analyses/discoveries understandable?
LT: A lot of data analysts working in government have a difficult time getting agencies and policymakers to take action based on their work due to political priorities and organizational structures. We circumvent that issue by operating based on the needs and requests of the agencies, as well as paying attention to current events. An agency or official may come to us with a problem, and we figure out what we can deliver that will be of use to them. This starts a dialogue. It becomes an iterative process, and projects can grow and morph once we have feedback. Oftentimes, it is better to use a data-mining approach, which is more understandable to non-statisticians, rather than a regression, which can seem like a black box. For example, my colleague came up with an algorithm to target properties that were a high fire risk based on the presence of illegal conversion complaints and evidence that the property owner was under financial distress. He began with a simple list of properties for the Department of Buildings to focus on, and now they go out to inspect a list of places selected by his algorithm weekly. This video of the fire chief speaking about the project illustrates the challenges encountered and why the simpler approach was ultimately successful:http://www.youtube.com/watch?v=425QSx0U8lU&feature=youtube_gdata_player
SS: Do you have any advice for statisticians/data scientists who want to get involved with open government or government data analysis?
LT: I’ve found that people in government are actually very open to and interested in using data. The first challenge is that they don’t know that the data they have is of value. To be the most effective, you should get in touch with the people who have subject matter expertise (usually employees who have been working on the ground for some time), interview them, check your assumptions, and share whatever you’re seeing in the data on an ongoing basis. Not only will both parties learn faster, but it helps build a culture of interest in the data. Once people see what is possible, they will become more creative and start requesting deliverables that are increasingly actionable. The second challenge is getting data, and the legal and social/political issues surrounding that. The big secret is that so much useful data is actually publicly available. Do your research — you may find what you need without having to fight for it. If what you need is protected, however, consider whether the data would still be useful to you if scrubbed of personally identifiable information. Location-based data is a good example of this. If so, see whether you can negotiate with the data owner to obtain only the parts needed to do your analysis. Finally, you may find that the cohort of data scientists in government is all too sparse, and too few people “speak your language.” Reach out and align yourself with people in other agencies who are also working with data. This is a great way to gain new insight into the goals and issues of your administration, as well as friends to support and advise you as you navigate “the system.”