Sunday Data/Statistics Link Roundup (9/16/12)

  1. There has been a lot of talk about the Michael Lewis (of Moneyball fame) profile of Obama in Vanity Fair. One interesting quote I think deserves a lot more discussion is: “On top of all of this, after you have made your decision, you need to feign total certainty about it. People being led do not want to think probabilistically.” This is a key issue that is only going to get worse going forward. All of public policy is probabilistic - we are even moving toward clinical trials to evaluate public policy.
  2. It’s sort of amazing to me that I hadn’t heard about this before, but a UC Davis professor was threatened for discussing the reasons PSA screening may be overused. This same issue keeps coming up over and over - screening healthy populations for rare diseases is often not effective (you need a ridiculously high specificity or a treatment with almost no side effects; see the sketch after this list). What we need is John McGready to do a claymation public service video or something explaining the reasons screening might not be a good idea to the general public.
  3. A bleg - I sometimes have a good week finding links myself, and there are a few folks who regularly send links (Andrew J., Alex N., etc.). I’d love it if people would send me cool links when they see them with the email title “Sunday Links” - I’m sure there is more cool stuff out there.
  4. The ISCB has a competition to improve the coverage of computational biology on Wikipedia. Someone should write a surrogate variable analysis or robust multi-array average article.
  5. I had not heard of the ASA’s Stattrak until this week; it looks like there are some useful resources there for early career statisticians. With the onset of fall, it is closing in on a new recruiting season. If you are a postdoc/student on the job market and you haven’t read Rafa’s post on soft vs. hard money, now is the time to start brushing up! Stay tuned for more job market posts this fall from Simply Statistics.
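
To make the screening arithmetic in item 2 concrete, here is a minimal Python sketch of how the positive predictive value of a test collapses when a disease is rare. The prevalence, sensitivity, and specificity below are made-up illustrative numbers, not estimates for PSA or any real test.

```python
# Bayes' rule for screening: P(disease | positive test), i.e. the positive
# predictive value (PPV). All numbers are illustrative assumptions.

def ppv(prevalence, sensitivity, specificity):
    true_pos = prevalence * sensitivity               # diseased, test positive
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy, test positive
    return true_pos / (true_pos + false_pos)

# A rare disease (0.1% prevalence) with a 90%-sensitive test:
for spec in (0.90, 0.99, 0.999):
    print(f"specificity {spec:.1%} -> PPV {ppv(0.001, 0.90, spec):.1%}")
# specificity 90.0% -> PPV 0.9%   (99 of 100 positives are false alarms)
# specificity 99.0% -> PPV 8.3%
# specificity 99.9% -> PPV 47.4%
```

Even at 99.9% specificity, roughly half of the positive results are false alarms, which is why screening a healthy population for a rare disease is so often not worth it.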

Nature is hiring a data editor…how will they make sense of the data?

The journal Nature is hiring a Chief Data Editor (link via Hilary M.). The primary purpose of this editor appears to be developing tools for collecting, curating, and distributing data, with the goal of improving reproducible research.

The main duties of the editor, as described by the ad, are: 

Nature Publishing Group is looking for a Chief Editor to develop a product aimed at making research data more available, discoverable and interpretable.

The ad also mentions having an eye for commercial potential; I wonder if this move was motivated by companies like figshare that are already providing a reproducible data service. I haven’t used figshare, but the early reports from friends who have are that it is great. 

The thing that bothered me about the ad is that there is a strong focus on data collection/storage/management but absolutely no mention of the second component of the data science problem: making sense of the data. To make sense of piles of data requires training in applied statistics (called by whatever name you like best). The ad doesn’t mention any such qualifications. 

Even if the goal of the position is just to build a competitor to figshare, it seems like a good idea for the person collecting the data to have some idea of what researchers are going to do with it. When dealing with data, those researchers will frequently be statisticians by one name or another. 

Bottom line: I’m stoked Nature is recognizing the importance of data in this very prominent way. But I wish they’d realize that a data revolution also requires a revolution in statistics. 

Interview with Amy Heineike - Director of Mathematics at Quid


Amy Heineike is the Director of Mathematics at Quid, a startup that seeks to understand technology development and dissemination through data analysis. She was the first employee at Quid, where she helped develop their technology early on. She has been recognized as one of the top Big Data Scientists. As part of our ongoing interview series, we talked to Amy about data science, Quid, and how statisticians can get involved in the tech scene. 

Which term applies to you: data scientist, statistician, computer scientist, or something else?
Data Scientist fits better than any of the others, because it captures the mix of analytics, engineering, and product management that is my current day-to-day.  
When I started with Quid I was focused on R&D - developing the first prototypes of what are now our core analytics technologies, and working to define and QA new data streams.  This required the analysis of lots of unstructured data, like news articles and patent filings, as well as the end visualisation and communication of the results.  
After we raised VC funding last year I switched to building out our data science and engineering teams.  These days I jump from conversations with the team about ideas for new analysis, to defining refinements to our data model, to questions about scalable architecture and filling out Pivotal Tracker tickets.  The core challenge is translating the vision for the product back to the team so they can build it.
 
How did you end up at Quid?
In my previous work I’d been building models to improve our understanding of complex human systems - in particular the complex interaction of cities and their transportation networks (in order to evaluate the economic impacts of Crossrail, a new train line across London) and the implications of social networks for public policy.  Through this work it became clear that data was the biggest constraint - I became fascinated by the quest to find usable data for these questions - and that’s what led me to Silicon Valley.  I knew the founders of Quid from university, and approached them with the idea of analysing their data according to ideas I’d had - especially around network analysis - and the initial work we collaborated on became core to the founding technology of Quid.
Who were really good mentors to you? What were the qualities that helped you? 
I’ve been fortunate to work with some brilliant people in my career so far.  While I still worked in London I worked closely with two behavioural economists - Paul Ormerod, who’s written some fantastic books on the subject (most recently Why Things Fail), and Bridget Rosewell, until recently the Chief Economist to the Greater London Authority (the city government for London).  At Quid I’ve had a very productive collaboration with Sean Gourley, our CTO.
One unifying characteristic of these three is their ability to communicate complex ideas in a powerful way to a broad audience.  It’s an incredibly important skill: a core part of analytics work is taking the results to where they are needed, which is often beyond those who know the technical details, to those who care about the implications first.
 
How does Quid determine relationships between organizations and develop insight based on data? 
The core questions our clients ask us are around how technology is changing and how this impacts their business.  That’s a really fascinating and huge question that requires not just discovering a document with the answer in it, but organizing lots and lots of pieces of data to paint a picture of the emergent change.  What we can offer is not only a snapshot of that, but also the ability to track how it changes over time.
We organize the data firstly through the insight that much disruptive technology emerges in organizations, and that the events that occur between and to organizations are a fantastic way both to signal the traction of technologies and to observe strategic decision-making by key actors.
The first kind of relationship that’s important is transactional - who is acquiring, funding, or partnering with whom - and the second is an estimate of the technological clustering of organizations - what trends particular organizations represent.  Both of these can be discovered through documents about the organizations, including government filings, press releases, and news, but this requires analysis of unstructured natural language.  
 
We’ve experimented with some very engaging visualisations of the results, and have had particular success with network visualisations, which are a very powerful way of letting people interact with a large amount of data quite playfully.  You can see some of our analyses in the press links at http://quid.com/in-the-news.php
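
Quid’s actual pipeline is proprietary, but the general technique Amy describes - building a graph of organizations from transactional events and then clustering it - can be sketched in a few lines. The events, organization names, and choice of clustering algorithm below are all hypothetical stand-ins, not Quid’s methods.

```python
# A hedged sketch: build an organization network from (source, target, relation)
# events and find clusters. All names and events are invented for illustration.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

events = [
    ("BigCo", "TinyStartup", "acquired"),
    ("FundA", "TinyStartup", "funded"),
    ("FundA", "OtherStartup", "funded"),
    ("BigCo", "PartnerCorp", "partnered"),
]

G = nx.Graph()
for source, target, relation in events:
    G.add_edge(source, target, relation=relation)

# Modularity-based communities as a crude stand-in for "technological clustering".
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"cluster {i}: {sorted(community)}")
```

In practice the hard part sits upstream of this sketch: extracting the events themselves from unstructured filings, press releases, and news.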
What skills do you think are most important for statisticians/data scientists moving into the tech industry?
Technical statistical chops are the foundation. You need to be able to take a dataset and discover and communicate what’s interesting about it for your users.  To turn this into a product requires understanding how to turn one-off analysis into something reliable enough to run day after day, even as the data evolves and grows, and as different users experience different aspects of it.  A key part of that is being willing to engage with questions about where the data comes from (how it can be collected, stored, processed and QAed on an ongoing basis), how the analytics will be run (how will it be tested, distributed and scaled) and how people interact with it (through visualisations, UI features or static presentations?).  
For your ideas to become great products, you need to become part of a great team though!  One of the reasons that such a broad set of skills is associated with Data Science is that there are a lot of pieces that have to come together for it all to work out - and it really takes a team to pull it off.  Generally speaking, the earlier stage the company you join, the broader the range of skills you need, and the more scrappy you need to be about getting involved in whatever needs to be done.  Later-stage teams and big tech companies may have roles that are more purely statistics.
 
Do you have any advice for grad students in statistics/biostatistics on how to get involved in the start-up community or how to find a job at a start-up? 
There is a real opportunity for people who have good statistical and computational skills to get into the startup and tech scenes now.  Many people in Data Science roles have statistics and biostatistics backgrounds, so you shouldn’t find it hard to find kindred spirits.

We’ve always been especially impressed with people who have built software in a group and shared or distributed that software in some way.  Getting involved in an open source project, working with version control in a team, or sharing your code on GitHub are all good ways to start on this.
It’s really important to be able to show that you want to build products though.  Imagine the clients or users of the company and see if you get excited about building something that they will use.  Reach out to people in the tech scene, explore who’s posting jobs - and then be able to explain to them what it is you’ve done and why it’s relevant, and be able to think about their business and how you’d want to help contribute towards it.  Many companies offer internships, which could be a good way to contribute for a short period and find out if it’s a good fit for you.

Why statisticians should join and launch startups

The tough economic times we live in, and the potential for big paydays, have made entrepreneurship cool. From the venture capitalist-in-chief to the JavaScript-coding mayor of New York, everyone is on board. No surprise there: successful startups lead to job creation, which can have a major positive impact on the economy. 

The game has been dominated for a long time by the folks over in CS. But the value of many recent startups is either based on, or can be magnified by, good data analysis. Here are a few startups that are based on data/data analysis: 

  1. The Climate Corporation - analyzes climate data to sell farmers weather insurance.
  2. Flightcaster - uses public data to predict flight delays.
  3. Quid - uses data on startups to predict success, among other things.
  4. 100plus - a personalized health prediction startup, predicting health based on public data.
  5. Hipmunk - the main advantage of this site for travel is better data visualization and an algorithm to show you which flights have the worst “agony” (a sketch of what such a score might look like follows below).
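
Hipmunk hasn’t published its exact formula, but the flavor of an “agony” ranking is easy to sketch: score each flight by a weighted combination of price, duration, and stops, then sort. The weights and fields below are invented for illustration.

```python
# A hypothetical "agony" score in the spirit of Hipmunk's flight ranking.
# Hipmunk's real formula is not public; these weights are made up.
from dataclasses import dataclass

@dataclass
class Flight:
    carrier: str
    price: float     # dollars
    duration: float  # hours
    stops: int

def agony(f, w_price=1.0, w_duration=40.0, w_stops=75.0):
    return w_price * f.price + w_duration * f.duration + w_stops * f.stops

flights = [
    Flight("Airline A", price=250, duration=5.0, stops=0),  # agony 450
    Flight("Airline B", price=180, duration=9.5, stops=2),  # agony 710
]
for f in sorted(flights, key=agony):  # least agonizing first
    print(f.carrier, round(agony(f)))
```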

To launch a startup you need just a couple of things: (1) a good, valuable source of data (there are lots of these on the web) and (2) a good idea about how to analyze them to create something useful. The second step is obviously harder than the first, but the companies above prove you can do it. Then, once it is built, you can outsource/partner with developers - web and otherwise - to implement your idea. If you can build it in R, someone can make it an app. 

These are just a few of the startups whose value is entirely derived from data analysis. But companies from LinkedIn, to Bitly, to Amazon, to Walmart are trying to mine the data they are generating to increase value. Data is now being generated at unprecedented scale by computers, cell phones, even thermostats! With this onslaught of data, the need for people with analysis skills is becoming incredibly acute.

Statisticians, like computer scientists before them, are poised to launch, and make major contributions to, the next generation of startups. 

Grad students in (bio)statistics - do a postdoc!

Up until about 20 years ago, postdocs were scarce in Statistics. In contrast, during the same time period, it was rare for a Biology PhD to go straight into a tenure track position.

Driven mostly by the availability of research funding for those working in applied areas,  postdocs are becoming much more common in our field and I think this is great. It is great for PhD students to expand their horizons during two years in which they don’t have to worry about teaching, committee meetings, or grant writing. It is also great for those of us fortunate enough to work with well-trained, independent, energetic, bright, and motivated fresh PhDs. Many of our best graduates are electing to postpone their entry into tenure track jobs in favor of postdocs. Also students from other fields, computer science and engineering in particular, are taking postdocs with statisticians. I think these are both good trends. If they continue, the result will be that, as a field, we will become more well-rounded and productive. 

This trend has been particularly beneficial for me. Most of the postdocs I have hired have come to me with a CV worthy of a tenure track job. They have been independent and worked more as collaborators than advisees. So why pass up the extra money and prestige of going straight into a faculty job? A PhD in Statistics/Computer Science/Engineering can be on a very specific topic, and students may not gain any collaborative experience whatsoever. A postdoc at Hopkins Biostat provides a new experience in a highly collaborative environment, with access to world leaders in the biomedical sciences, and a focus on the development of applied tools. The experience can also improve a student’s visibility and job prospects, while delaying the tenure clock until they have more publications under their belts.

An important thing to be aware of is that in many departments you can negotiate the start date of a tenure track position. So seriously consider taking 1-2 years of almost 100% research time before commencing the grind of a tenure track job. 

I’m not the only one who thinks postdocs are a good thing for our field and for biostatistics students. The column below was written by Terry Speed in November 2003 and is reprinted with permission from the IMS Bulletin (http://bulletin.imstat.org).

In Praise of Postdocs

I don’t know what proportion of IMS members have PhDs (or an equivalent) in probability or statistics, but I’d guess it’s fairly high. I don’t know what proportion of those that do have PhDs would also have formal post-doctoral research experience, but here I’d guess it’s rather low.

Why? One possible reason is that for much of the last 40 years, anyone completing a PhD in prob or stat and wanting a research career could go straight into one. Prospective employers of people with PhDs in our field—be they universities, research institutes, national labs or companies—don’t require their novices to have completed a postdoc, and most graduating PhDs are only too happy to go straight into their first job.

This is in sharp contrast with the biological and physical sciences, where it is rare to appoint someone to a tenure-track faculty or research scientist position without their having completed one or more postdocs.

The number of people doing postdocs in probability or statistics has been growing over the last 15 years. This is in part due to the arrival on the scene of institutes such as the MSRI, IMA, IPAM, NISS, NCAR, and recently the MBI and SAMSI in the US, the Newton Institute in the UK, the Fields Institute in Canada, the Institut Henri Poincaré in France, and others elsewhere around the world. In such institutes short-term postdoc positions go with their current research programs, and there are usually a smaller number continuing for longer periods.

It is also the case that an increasing number of senior researchers are being awarded research funds to support postdocs in prob or stat, often in the newer, applied areas such as computational biology.

And finally, it has long been the case that many countries (Germany, Sweden, Switzerland, and the US, to name a few) have national grants supporting postdoctoral research in their own or, even better, another country. I think all of this is great, and would like to see this trend continue and strengthen.

Why do I think postdocs are a good thing? And why do I think young probabilists and statisticians should do one, even when they can get a good job without having done so?

For most of us, doing a PhD means getting totally absorbed in some relatively narrow research area for 2–3 years, treating that as the most important part of science for that time, and trying to produce some of the best work in that area. This is fine, and we get a PhD for our efforts, but is it good training for a lifelong research career? While it is obviously good preparation for doing more of the same, I don’t think it is adequate for research in general. I regard the successful completion of a PhD as (at least) evidence that the person in question can do research, but it doesn’t follow that they can go on and successfully do research in a new area, or in a different environment, or without close supervision.

Postdocs give you the chance to broaden, to learn new technical skills, to become acquainted with new areas, and to absorb the culture of a new institution, all at a time when your professional responsibilities are far fewer than they would have been had you taken that first “real” job. The postdoc period can be a wonderful time in your scientific life, one which sees you blossom, building on the confidence you gained by having completed your PhD, in what is still essentially a learning environment, but one where you can follow your own interests, explore new areas, and still make mistakes. At the worst, you have delayed your entry into the workforce by two or three years, and you can still keep on working in your PhD area if you wish. The number of openings for researchers in prob or stat doesn’t fluctuate so much on this time scale, so you are unlikely to be worse off beyond the earnings foregone. At best, you will move into a completely new area of research, one much better suited to your personal interests and skills, perhaps also better suited to market demand, but either way, one chosen with your PhD experience behind you. This can greatly enhance your long-term career prospects and more than compensate for your delayed entry into the workforce.

Students: the time to think about this is now [November], not just as you are about to file your dissertation. And the choice is not necessarily one between immediate security and career development: you might be able to have both. You shouldn’t shy from applying for tenure-track jobs and postdocs at the same time, and if offered the job you want, requesting (say) two years’ leave of absence to do the postdoc you want. Employers who care about your career development are unlikely to react badly to such a request.

Why does Obama need statisticians?

It’s worth following up a little on why the Obama campaign is recruiting statisticians (note to Karen: I am not looking for a new job!). Here’s the blurb for the position of “Statistical Modeling Analyst”:

The Obama for America Analytics Department analyzes the campaign’s data to guide election strategy and develop quantitative, actionable insights that drive our decision-making. Our team’s products help direct work on the ground, online and on the air. We are a multi-disciplinary team of statisticians, mathematicians, software developers, general analysts and organizers - all striving for a single goal: re-electing President Obama. We are looking for staff at all levels to join our department from now through Election Day 2012 at our Chicago, IL headquarters.

Statistical Modeling Analysts are charged with predicting electoral outcomes using statistical models. These models will be instrumental in helping the campaign determine how to most effectively use its resources.

I wonder if there’s a bonus for predicting the correct outcome, win or lose?
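
The ad doesn’t say what the models look like, but one standard approach from that election cycle is to turn state-level win probabilities into a distribution over Electoral College outcomes by simulation. Here is a toy sketch; the states, probabilities, and electoral vote counts are invented, and real campaign models (built on voter files, turnout scores, and polling) are far richer.

```python
# Toy Monte Carlo sketch of "predicting electoral outcomes": simulate an
# invented Electoral College from assumed state-level win probabilities.
import random

# state: (electoral votes, P(candidate wins the state)) -- invented numbers
states = {"State A": (29, 0.80), "State B": (18, 0.55),
          "State C": (13, 0.45), "State D": (9, 0.30)}
NEEDED = 35  # majority of the 69 invented electoral votes

def simulate_once():
    return sum(ev for ev, p in states.values() if random.random() < p)

n_sims = 100_000
wins = sum(simulate_once() >= NEEDED for _ in range(n_sims))
print(f"estimated P(win) = {wins / n_sims:.1%}")
```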

The Obama campaign didn’t invent the idea of heavy data analysis in campaigns, but they seem to be heavy adopters. There are 3 openings in the “Analytics” category as of today.

Now, can someone tell me why they don’t just call it simply “Statistics”?