Interview with Tom Louis - New Chief Scientist at the Census Bureau

Tom Louis


Tom Louis is a professor of Biostatistics at Johns Hopkins and will be joining the Census Bureau through an interagency personnel agreement as the new associate director for research and methodology and chief scientist. Tom has an impressive history of accomplishment in developing statistical methods for everything from environmental science to genomics. We talked to Tom about his new role at the Census, how it relates to his impressive research career, and how young statisticians can get involved in the statistical work at the Census. 


SS: How did you end up being invited to lead the research branch of the Census?

TL: Last winter, then-director Robert Groves (now Provost at Georgetown University) asked if I would be interested in  the possibility of becoming the next Associate Director of Research and Methodology (R&M) and Chief Scientist, succeeding  Rod Little (Professor of Biostatistics at the University of Michigan) in these roles.  I expressed interest and after several discussions with Bob and Rod, decided that if offered, I would accept.  It was offered and I did accept.  

As background, components of my research, especially Bayesian methods, are Census-relevant.  Furthermore, during my time as a member of the National Academies Committee on National Statistics I served on the panel that recommended improvements in small area income and poverty estimates, chaired the panel that evaluated methods for allocating federal and state program funds by formula, and chaired a workshop on facilitating innovation in the Federal statistical system.

Rod and I noted that it’s interesting and possibly not coincidental that with my appointment the first two associate directors are both former chairs of Biostatistics departments.  It is the case that R&M’s mission is quite similar to that of a Biostatistics department: methods and collaborative research, consultation and education.  And, there are many statisticians at the Census Bureau who are not in the R&M directorate, a sociology quite similar to that in a School of Public Health or a Medical campus.


SS: What made you interested in taking on this major new responsibility?

TL: I became energized by the opportunity for national service, and excited by the scientific, administrative, and sociological responsibilities and challenges.  I’ll be engaged in hiring and staff development, and increasing the visibility of the bureau’s pre- and post-doctoral programs.  The position will provide the impetus to take a deep dive into finite-population statistical approaches, and contribute to the evolving understanding of the strengths and weaknesses of design-based, model-based and hybrid approaches to inference.  That I could remain a Hopkins employee by working via an Interagency Personnel Agreement sealed the deal.  I will start in January 2013 and serve through 2015, and will continue to participate in some Hopkins-based activities.

In addition to activities within the Census Bureau, I’ll be increasing connections among statisticians in other federal statistical agencies and will have a role in relations with researchers funded through the NSF to conduct census-related research.



SS: What are the sorts of research projects the Census is involved in? 

TL: The Census Bureau designs and conducts the decennial Census, the Current Population Survey, the American Community Survey, many, many other surveys for other Federal Statistical Agencies including the Bureau of Labor Statistics, and a quite extraordinary portfolio of others. Each identifies issues in design and analysis that merit attention, many entail “Big Data” and many require combining information from a variety of sources.  I give a few examples, and encourage exploration of www.census.gov/research.

You can get a flavor of the types of research from the titles of the six current centers within R&M: The Center for Adaptive Design, The Center for Administrative Records Research and Acquisition, The Center for Disclosure Avoidance Research, The Center for Economic Studies, The Center for Statistical Research and Methodology and The Center for Survey Measurement.  Projects include multi-mode survey approaches, stopping rules for household visits, methods of combining information from surveys and administrative records, provision of focused estimates while preserving identity protection, improved small area estimates of income and of limited English skills (used to trigger provision of election ballots in languages other than English), and continuing investigation of issues related to model-based and design-based inferences.



 
SS: Are those projects related to your research?

TL: Some are, some will be, some will never be.  Small area estimation, hierarchical modeling with a Bayesian formalism, some aspects of adaptive design, some of combining evidence from a variety of sources, and general statistical modeling are in my power zone.  I look forward to getting involved in these and contributing to other projects.



SS: How does research performed at the Census help the American Public?

TL: Research innovations enable the bureau to produce more timely and accurate information at lower cost, improve validity (for example, new approaches have at least maintained respondent participation in surveys), and enhance the reputation of the Census Bureau as a trusted source of information.  Estimates developed by Census are used to allocate billions of dollars in school aid, and they provide key planning information for businesses and governments.



SS: How can young statisticians get more involved in government statistical research?

TL: The first step is to become aware of the wide variety of activities and their high impact.  Visiting the Census website and those of other federal and state agencies, and the Committee on National Statistics (http://sites.nationalacademies.org/DBASSE/CNSTAT/) and the National Institute of Statistical Sciences (http://www.niss.org/) is a good start.   Make contact with researchers at the JSM and other meetings and be on the lookout for pre- and post-doctoral positions at Census and other federal agencies.

Not just one statistics interview…John McGready is the Jon Stewart of statistics

Editor’s Note: We usually reserve Fridays for posting Simply Statistics Interviews. This week, we have a special guest post by John McGready, a colleague of ours who has been doing interviews with many of us in the department and has some cool ideas about connecting students in their first statistics class with cutting edge researchers wrestling with many of the same concepts applied to modern problems. I’ll let him explain…

I teach a two quarter course in introductory biostatistics to master’s students in public health at Johns Hopkins.  The majority of the class is composed of MPH students, but there are also students doing professional master’s degrees in environmental health, molecular biology, health policy and mental health. Despite the short length of the course, it covers the “greatest hits” of biostatistics, encompassing everything from exploratory data analysis up through and including multivariable proportional hazards regression.  The course focus is more conceptual and less mathematical/computing-centric than the other two introductory sequences taught at Hopkins: as such it has earned the unfortunate nickname “baby biostatistics” from some at the School.  This, in my opinion, is a misnomer: statistical reasoning is often the most difficult part of learning statistics.  We spend a lot of time focusing on the current literature, and making sense of or critiquing research by considering not only the statistical methods employed and the numerical findings, but also the study design and the logic of the substantive conclusions made by the study authors.

Via the course, I always hope to demonstrate the importance of biostatistics as a core driver of public health discovery, the importance of statistical reasoning in the research process, and how the fundamentals that are covered are the framework for more advanced methodology. At some point it dawned on me that the best approach for doing this was to have my colleagues speak to my students about these ideas.  Because of timing and scheduling constraints, this proved difficult to do in a live setting.  However, in June of 2012 a video recording studio opened here at the Hopkins Bloomberg School. At this point, I knew that I had to get my colleagues on video so that I could share their wealth of experiences and expertise with my students, and give the students multiple perspectives. To my delight, my colleagues are very amenable to being interviewed and have been very generous with their time. I plan to continue doing the interviews so long as my colleagues are willing and the studio is available.

I have created a YouTube channel for these interviews.  At some point in the future, I plan to invite the biostatistics community as a whole to participate.  This will include interviews with visitors to my department, and submissions by biostatistics faculty and students from other schools. (I realize I am very lucky to have these facilities and video expertise at Hopkins, but many folks are tech-savvy enough to film their own videos on their cameras, phones etc… in fact you have seen such creativity by the editors of this here blog). With the help of some colleagues I plan on making a complementary website that will allow for easy submission of videos for posting, so stay tuned!

Interview with C. Titus Brown - Computational biologist and open access champion

C. Titus Brown 


C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering at Michigan State University. He develops computational software for next generation sequencing and is the author of the blog, "Living in an Ivory Basement". We talked to Titus about open access (he publishes his unfunded grants online!), improving the reputation of PLoS One, his research in computational software development, and work-life balance in academia.


Interview with Lauren Talbot - Quantitative analyst for the NYC Financial Crime Task Force

Lauren Talbot


Lauren Talbot is a quantitative analyst for the New York City Financial Crime Task Force. Before working for NYC, she was an analyst at Acumen LLC and got her degree in economics from Stanford University. She is a key player in turning spatial data in NYC into new tools for government management. We talked to Lauren about her work, how she is using open data to do things like predict where fires might occur, and how she got started in the Financial Crime Task Force.

SS: Do you consider yourself a statistician, computer scientist, or something else?

LT: A lot of us can’t call ourselves statisticians or computer scientists, even if that is a large part of what we do, because we never studied those fields formally. Quantitative or Data Analyst are popular job titles, but don’t really do justice to all the code infrastructure/systems you have to build and cultivate — you aren’t simply analyzing, you are matching and automating and illustrating, too. There is also a large creative aspect, because you have to figure out how to present the data in a way that is useful and compelling to people, many of whom have no prior experience working with data. So I am glad people have started using the term “Data Scientist,” even if it makes me chuckle a little. Ideally I would call myself “Data Artist,” or “Data Whisperer,” but I don’t think people would take me seriously.

SS: How did you end up in the NYC Mayor’s Financial Crimes Task Force?

LT: I actually responded to a Craigslist posting. While I was still in the Bay Area (where I went to college), I was looking for jobs in NYC because I wanted to relocate back here, where I am originally from. I was searching for SAS programmer jobs, and finding a lot of stuff in healthcare that made me yawn a little. And then I had the idea to try the government jobs section. The Financial Crimes Task Force (now part of a broader citywide analytics effort under the Office of Policy and Strategic Planning) was one of two listings that popped up, and I read the description and immediately thought “dream job!” It has turned out to be even better than I imagined, because there is such a huge opportunity to make a difference — the Bloomberg administration is actually very interested in operationalizing insights from city data, so they are listening to the data people and using their work to inform agency resource allocation and even sometimes policy. My fellow analysts are also just really fun and intelligent. I’m constantly impressed by how quickly they pick up new skills, get to the bottom of things, and jump through hoops to get things done. We also amuse and entertain each other throughout the day, which is awesome.

SS: Can you tell us about one of the more interesting cases you have tackled and how data analysis/statistics played into the case?

LT: Since this is the NYC Mayor’s Office, dealing with city data, almost all of our analyses are in some way location-based. We are trying to answer questions like, “what locations are most likely to have a catastrophic event (e.g. fire) in the near future?” This involves combining many disparate datasets such as fire data, buildings data, emergency calls data, city planning data, even garbage data. We use the tax lot ID as a common identifier, but many of the datasets do not come with this variable - they only have a text address or intersection. In many cases, the address is entered manually and has spelling mistakes. In the beginning, we were using a point-and-click geocoding tool that the city provides that reads the text field and assigns the tax lot ID. However, it was taking a long time to prepare the data so it could be used by the program, and the program was returning many errors. When we visually inspected the errors, we saw that they were caused by minor spelling differences and naming conventions. Now, almost every week we get new datasets in different structures, and we need to geocode them immediately before we can really work with them. So we needed a geocoding program that was automated and flexible, as well as capable of geocoding addresses and intersections with spelling errors and different conventions. Over the past few months, using publicly available city planning datasets and regular expressions, my side project has been creating such a program in SAS. My first test case was self-reported data created solely through user entry. This dataset, which could only be 40% geocoded using the original tool, is now 93% geocoded using the program we developed. The program is constantly evolving and improving. Now it is assigning block faces, spellchecking street and city names, and accounting for the occasional gaps in the data. We use it for everything.
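As an illustration of the idea (and only that: the actual program is written in SAS against NYC city planning data, and none of these patterns come from it), here is a minimal sketch in R of the kind of regular-expression normalization that makes fuzzy address matching possible:

```r
# Hypothetical sketch: normalize free-text addresses before joining them to a
# reference table keyed by tax lot ID. The patterns are illustrative, not NYC's.
normalize_address <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("\\.", "", x)                        # drop periods ("ST." -> "ST")
  x <- gsub("\\bSTREET\\b", "ST", x)             # standardize street-type words
  x <- gsub("\\bAVENUE\\b|\\bAVE\\b", "AV", x)
  x <- gsub("\\bEAST\\b", "E", x)                # standardize directionals
  x <- gsub("\\bWEST\\b", "W", x)
  gsub("\\s+", " ", x)                           # collapse repeated whitespace
}

normalize_address(c("123 East 42nd Street", "123 E 42ND ST."))
# both collapse to "123 E 42ND ST", so they can be matched to the same lot
```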

SS: What are the computational tools and ideas you use most frequently in your day to day work (R, databases, regression analysis, etc.)?

LT: In the beginning, all of the data was sent to us in SQL or Excel, which was not very efficient. Now we are building a multi-agency SAS platform that can be used by programmers and non-programmers. Since there are so many data sources that can work together, having a unified platform creates new discoveries that agencies can use to be more efficient or effective. For example, a building investigator can use 311 noise complaints to uncover vacated properties that are being illegally occupied. The platform employs Palantir, which is an excellent front-end tool for playing around with the data and exploring many-to-many relationships.  Internally, my team has also used R, Python, Java, even VBA. Whatever gets the job done. We use a good mix of statistical tools. The bread and butter is usually manipulating and understanding new data sources, which is necessary before we can start trying to do something like run a multiple regression, for example. In the end, it’s really a mashup: text parsing, name matching, summarizing/describing/reporting using comparative statistics, geomapping, graphing, logistic regression, even kernel density, can all be part of the mix. Our guiding principle is to use the tool/analysis/strategy that has the highest return on investment of time and analyst resources for the city.

SS: What are the challenges of working as a quantitative analyst in a regulatory role? Is it hard to make your analyses/discoveries understandable?

LT: A lot of data analysts working in government have a difficult time getting agencies and policymakers to take action based on their work due to political priorities and organizational structures. We circumvent that issue by operating based on the needs and requests of the agencies, as well as paying attention to current events. An agency or official may come to us with a problem, and we figure out what we can deliver that will be of use to them. This starts a dialogue. It becomes an iterative process, and projects can grow and morph once we have feedback. Oftentimes, it is better to use a data-mining approach, which is more understandable to non-statisticians, rather than a regression, which can seem like a black box. For example, my colleague came up with an algorithm to target properties that were a high fire risk based on the presence of illegal conversion complaints and evidence that the property owner was under financial distress. He began with a simple list of properties for the Department of Buildings to focus on, and now they go out to inspect a list of places selected by his algorithm weekly. This video of the fire chief speaking about the project illustrates the challenges encountered and why the simpler approach was ultimately successful: http://www.youtube.com/watch?v=425QSx0U8lU&feature=youtube_gdata_player

SS: Do you have any advice for statisticians/data scientists who want to get involved with open government or government data analysis?

LT: I’ve found that people in government are actually very open to and interested in using data. The first challenge is that they don’t know that the data they have is of value. To be the most effective, you should get in touch with the people who have subject matter expertise (usually employees who have been working on the ground for some time), interview them, check your assumptions, and share whatever you’re seeing in the data on an ongoing basis. Not only will both parties learn faster, but it helps build a culture of interest in the data. Once people see what is possible, they will become more creative and start requesting deliverables that are increasingly actionable. The second challenge is getting data, and the legal and social/political issues surrounding that. The big secret is that so much useful data is actually publicly available. Do your research — you may find what you need without having to fight for it. If what you need is protected, however, consider whether the data would still be useful to you if scrubbed of personally identifiable information. Location-based data is a good example of this. If so, see whether you can negotiate with the data owner to obtain only the parts needed to do your analysis. Finally, you may find that the cohort of data scientists in government is all too sparse, and too few people “speak your language.” Reach out and align yourself with people in other agencies who are also working with data. This is a great way to gain new insight into the goals and issues of your administration, as well as friends to support and advise you as you navigate “the system.”

Interview with Amanda Cox - Graphics Editor at the New York Times

Amanda Cox 



Amanda Cox received her M.S. in statistics from the University of Washington in 2005. She then moved to the New York Times, where she is a graphics editor. She, and the graphics team at the New York Times, are responsible for many of the cool, informative, and interactive graphics produced by the Times. For example, this, this and this (the last one, Olympic Symphony, is one of my all time favorites). 

You have a background in statistics, do you consider yourself a statistician? Do you consider what you do statistics?

I don’t deal with uncertainty in a formal enough way to call what I do statistics, or myself a statistician. (My technical title is “graphics editor,” but no one knows what this means. On the good days, what we do is “journalism.”) Mark Hansen, a statistician at UCLA, has possibly changed my thinking on this a little bit though, by asking who I want to be the best at visualizing data, if not statisticians.

How did you end up at the NY Times?

In the middle of my first year of grad school (in statistics at the University of Washington), I started applying for random things. One of them was to be a summer intern in the graphics department at the Times.

How are the graphics and charts you develop different than producing graphs for a quantitative/scientific audience?


"Feels like homework" is a really negative reaction to a graphic or a story here. In practice, that means a few things: we don’t necessarily assume our audience already cares about a topic. We try to get rid of jargon, which can be useful shorthand for technical audiences, but doesn’t belong in a newspaper. Most of our graphics can stand on their own, meaning you shouldn’t need to read any accompanying text to understand the basic point. Finally, we probably pay more attention to things like typography and design, which, done properly, are really about hierarchy and clarity, and not just about making things cute. 


How do you use R to prototype graphics? 

I sketch in R, which mostly just means reading data, and trying on different forms or subsets or levels of aggregation. It’s nothing fancy: usually just points and lines and text from base graphics. For print, I will sometimes clean up a pdf of R output in Illustrator. You can see some of that in practice at chartsnthings.tumblr.com, which is where one of my colleagues, Kevin Quealy, posts some of the department’s sketches. (Kevin and I are the only regular R users here, so the amount of R used on chartsnthings is not at all representative of NYT graphics as a whole.)
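For readers curious what that kind of sketching looks like in practice, here is a generic, hypothetical example (not an NYT script) of trying a few quick forms on a built-in dataset with base graphics:

```r
# Quick sketching with base graphics: read the data, then try a few forms
# and levels of aggregation; polish comes later, if at all.
d <- airquality                                           # stand-in dataset
ok <- complete.cases(d$Temp, d$Ozone)

plot(d$Temp[ok], d$Ozone[ok], pch = 16, col = "grey40")   # raw scatter
lines(lowess(d$Temp[ok], d$Ozone[ok]), lwd = 2)           # quick smooth

monthly <- aggregate(Ozone ~ Month, data = d, FUN = median)
plot(monthly$Month, monthly$Ozone, type = "b")            # coarser aggregation

# pdf("sketch.pdf"); ...; dev.off()  # a print sketch then gets cleaned up in Illustrator
```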

Do you have any examples where the R version and the eventual final web version are nearly identical?


Real interactivity changes things, so my use of R for web graphics is mostly just a proof-of-concept thing. (Sometimes I will also generate “poor-man’s interactivity,” which means hitting the pagedown key on a pdf of charts made in a for loop.) But here are a couple of proof-of-concept sketches, where the initial R output doesn’t look so different from the final web version.

The Jobless Rate for People Like You



How Different Groups Spend Their Day

You consistently produce arresting and informative graphics about a range of topics. How do you decide on which topics to tackle?

News value and interestingness are probably the two most important criteria for deciding what to work on. In an ideal world, you get both, but sometimes, one is enough (or the best you can do).

Are your project choices motivated by availability of data?

Sure. The availability of data also affects the scope of many projects. For example, the guys who work on our live election results will probably map them by county, even though precinct-level results are so much better. But precinct-level data isn’t generally available in real time.

What is the typical turn-around time from idea to completed project?

The department is most proud of some of its one-day, breaking news work, but very little of that is what I would think of as data-heavy.  The real answer to “how long does it take?” is “how long do we have?” Projects always find ways to expand to fill the available space, which often ranges from a couple of days to a couple of weeks.


Do you have any general principles for how you make complicated data understandable to the general public?

I’m a big believer in learning by example. If you annotate three points in a scatterplot, I’m probably good, even if I’m not super comfortable reading scatterplots. I also think the words in a graphic should highlight the relevant pattern, or an expert’s interpretation, and not merely say “Here is some data.” The annotation layer is critical, even in a newspaper (where the data is not usually super complicated).
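A tiny, hypothetical base-R example of that annotation idea, labeling a few telling points rather than leaving the reader to decode the whole cloud:

```r
# Annotation layer: name a handful of points that carry the story.
plot(mtcars$wt, mtcars$mpg, pch = 16, col = "grey60",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
pick <- c("Toyota Corolla", "Cadillac Fleetwood", "Lotus Europa")
points(mtcars[pick, "wt"], mtcars[pick, "mpg"], pch = 16)
text(mtcars[pick, "wt"], mtcars[pick, "mpg"],
     labels = pick, pos = c(4, 2, 4), cex = 0.8)
```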

What do you consider to be the most informative graphical elements or interactive features that you consistently use?

I like sliders, because there’s something about them that suggests story (beginning-middle-end), even if the thing you’re changing isn’t time. Using movement in a way that means something, like this or this, is still also fun, because it takes advantage of one of the ways the web is different from print.

Interview with Hadley Wickham - Developer of ggplot2

Hadley Wickham



Hadley Wickham is the Dobelman Family Junior Chair of Statistics at Rice University. Prior to moving to Rice, he completed his Ph.D. in Statistics from Iowa State University. He is the developer of the wildly popular ggplot2 software for data visualization and a contributor to the GGobi project. He has developed a number of really useful R packages touching everything from data processing, to data modeling, to visualization.

Which term applies to you: data scientist, statistician, computer
scientist, or something else?

I’m an assistant professor of statistics, so I at least partly
associate with statistics :).  But the idea of data science really
resonates with me: I like the combination of tools from statistics and
computer science, data analysis and hacking, with the core goal of
developing a better understanding of data. Sometimes it seems like not
much statistics research is actually about gaining insight into data.


You have created/maintain several widely used R packages. Can you
describe the unique challenges to writing and maintaining packages
above and beyond developing the methods themselves?

I think there are two main challenges: turning ideas into code, and
documentation and community building.

Compared to other languages, the software development infrastructure
in R is weak, which sometimes makes it harder than necessary to turn
my ideas into code. Additionally, I get less and less time to do
software development, so I can’t afford to waste time recreating old
bugs, or releasing packages that don’t work. Recently, I’ve been
investing time in helping build better dev infrastructure; better
tools for documentation [roxygen2], unit testing [testthat], package development [devtools], and creating package websites [staticdocs]. Generally, I’ve
found unit tests to be a worthwhile investment: they ensure you never
accidentally recreate an old bug, and give you more confidence when
radically changing the implementation of a function.
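For readers who haven't used it, a minimal flavor of such a regression test with [testthat] (a generic example, not taken from any of Hadley's packages):

```r
library(testthat)

# A tiny function plus a test that locks in behaviour an old bug once broke.
# Real packages keep files like this under tests/testthat/ so R CMD check runs them.
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

test_that("whitespace is stripped from both ends", {
  expect_equal(trim("  a b  "), "a b")
  expect_equal(trim(character(0)), character(0))  # the edge case that once failed
})
```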

Documenting code is hard work, and it’s certainly something I haven’t
mastered. But documentation is absolutely crucial if you want people
to use your work. I find the main challenge is putting yourself in the
mind of the new user: what do they need to know to use the package
effectively. This is really hard to do as a package author because
you’ve internalised both the motivating problem and many of the common
solutions.

Connected to documentation is building up a community around your
work. This is important to get feedback on your package, and can be
helpful for reducing the support burden. One of the things I’m most
proud of about ggplot2 is something that I’m barely responsible for:
the ggplot2 mailing list. There are now ggplot2 experts who answer far
more questions on the list than I do. I’ve also found github to be
great: there’s an increasing community of users proficient in both R
and git who produce pull requests that fix bugs and add new features.

The flip side of building a community is that as your work becomes
more popular you need to be more careful when releasing new versions.
The last major release of ggplot2 (0.9.0) broke over 40 (!!) CRAN
packages, and forced me to rethink my release process. Now I advertise
releases a month in advance, and run `R CMD check` on all downstream
dependencies (`devtools::revdep_check` in the development version), so
I can pick up potential problems and give other maintainers time to
fix any issues.
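Roughly, that pre-release routine looks something like the sketch below (illustrative only; the exact arguments of these devtools functions have changed across versions):

```r
library(devtools)

check("~/ggplot2")         # R CMD check on the package itself
revdep("ggplot2")          # list the CRAN packages that depend on it
revdep_check("~/ggplot2")  # run R CMD check on all of those downstream packages
```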


Do you feel that the academic culture has caught up with and supports
non-traditional academic contributions (e.g. R packages instead of
papers)?

It’s hard to tell. I think it’s getting better, but it’s still hard to
get recognition that software development is an intellectual activity
in the same way that developing a new mathematical theorem is. I try
to hedge my bets by publishing papers to accompany my major packages:
I’ve also found the peer-review process very useful for improving the
quality of my software. Reviewers from both the R journal and the
Journal of Statistical Software have provided excellent suggestions
for enhancements to my code.


You have given presentations at several start-up and tech companies.
Do the corporate users of your software have different interests than
the academic users?

By and large, no. Everyone, regardless of domain, is struggling to
understand ever larger datasets. Across both industry and academia,
practitioners are worried about reproducible research and thinking
about how to apply the principles of software engineering to data
analysis.


You gave one of my favorite presentations called Tidy Data/Tidy Tools
at the NYC Open Statistical Computing Meetup. What are the key
elements of tidy data that all applied statisticians should know?

Thanks! Basically, make sure you store your data in a consistent
format, and pick (or develop) tools that work with that data format.
The more time you spend munging data in the middle of an analysis, the
less time you have to discover interesting things in your data. I’ve
tried to develop a consistent philosophy of data that means when you
use my packages (particularly plyr and ggplot2), you can focus on the
data analysis, not on the details of the data format. The principles
of tidy data that I adhere to are that every column should be a
variable, every row an observation, and different types of data should
live in different data frames. (If you’re familiar with database
normalisation this should sound pretty familiar!). I expound these
principles in depth in my in-progress [paper on the
topic].
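A tiny, hypothetical example of moving a “messy” table into that layout with reshape2’s melt(), so that every column is a variable and every row an observation:

```r
library(reshape2)

# Messy: one row per subject, one column per treatment.
messy <- data.frame(subject = c("a", "b"),
                    treatment_1 = c(4.2, 5.1),
                    treatment_2 = c(6.3, 7.0))

# Tidy: columns are subject, treatment, value; each row is one observation.
tidy <- melt(messy, id.vars = "subject",
             variable.name = "treatment", value.name = "value")
tidy
```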


How do you decide what project to work on next? Is your work inspired
by a particular application or more general problems you are trying to
tackle?

Very broadly, I’m interested in the whole process of data analysis:
the process that takes raw data and converts it into understanding,
knowledge and insight. I’ve identified three families of tools
(manipulation, modelling and visualisation) that are used in every
data analysis, and I’m interested both in developing better individual
tools, but also smoothing the transition between them. In every good
data analysis, you must iterate multiple times between manipulation,
modelling and visualisation, and anything you can do to make that
iteration faster yields qualitative improvements to the final analysis
(that was one of the driving reasons I’ve been working on tidy data).

Another factor that motivates a lot of my work is teaching. I hate
having to teach a topic that’s just a collection of special cases,
with no underlying theme or theory. That drive led to [stringr] (for
string manipulation) and [lubridate] (with Garrett Grolemund for working
with dates). I recently released the [httr] package which aims to do a similar thing for http requests - I think this is particularly important as more and more data starts living on the web and must be accessed through an API.


What do you see as the biggest open challenges in data visualization
right now? Do you see interactive graphics becoming more commonplace?

I think one of the biggest challenges for data visualisation is just
communicating what we know about good graphics. The first article
decrying 3d bar charts was published in 1951! Many plots still use
rainbow scales or red-green colour contrasts, even though we’ve known
for decades that those are bad. How can we ensure that people
producing graphics know enough to do a good job, without making them
read hundreds of papers? It’s a really hard problem.

Another big challenge is balancing the tension between exploration and
presentation. For exploratory graphics, you want to spend five seconds
(or less) to create a plot that helps you understand the data, while you might spend
five hours on a plot that’s persuasive to an audience who
isn’t as intimately familiar with the data as you. To date, we have
great interactive graphics solutions at either end of the spectrum
(e.g. ggobi/iplots/manet vs d3) but not much that transitions from one
end of the spectrum to the other. This summer I’ll be spending some
time thinking about what ggplot2 + [d3] might
equal, and how we can design something like an interactive grammar of
graphics that lets you explore data in R, while making it easy to
publish interactive presentation graphics on the web.

Interview with Drew Conway - Author of “Machine Learning for Hackers”

Drew Conway

Drew Conway is a Ph.D. student in Politics at New York University and the co-ordinator of the New York Open Statistical Programming Meetup. He is the creator of the famous (or infamous) data science Venn diagram, the basis for our R function to determine if you’re a data scientist. He is also the co-author of Machine Learning for Hackers, a book of case studies that illustrates data science from a hacker’s perspective.

Which term applies to you: data scientist, statistician, computer
scientist, or something else?
Technically, my undergraduate degree is in computer science, so that term can be applied.  I was actually double-major in CS and political science, however, so it wouldn’t tell the whole story.  I have always been most interested in answering social science problems with the tools of computer science, math and statistics.
I have struggled a bit with the term “data scientist.”  About a year ago, when it seemed to be gaining a lot of popularity, I bristled at it.  Like many others, I complained that it was simply a corporate rebranding of other skills, and that the term “science” was appended to give some veil of legitimacy.  Since then, I have warmed to the term, but—as is often the case—only when I can define what data science is in my own terms.  Now, I do think of what I do as being data science, that is, the blending of technical skills and tools from computer science, with the methodological training of math and statistics, and my own substantive interest in questions about collective action and political ideology.
I think the term is very loaded, however, and when many people invoke it they often do so as a catch-all for talking about working with a certain set of tools: R, map-reduce, data visualization, etc.  I think this actually hurts the discipline a great deal, because if it is meant to actually be a science the majority of our focus should be on questions, not tools.
 
You are in the department of politics? How is it being a “data
person” in a non-computational department?
Data has always been an integral part of the discipline, so in that sense many of my colleagues are data people.  I think the difference between my work and the work that many other political scientists do is simply a matter of where and how I get my data.
For example, a traditional political science experiment might involve a small set of undergraduates taking a survey or playing a simple game on a closed network.  That data would then be collected and analyzed as a controlled experiment.  Alternatively, I am currently running an experiment wherein my co-authors and I are attempting to code text documents (political party manifestos) with ideological scores (very liberal to very conservative).  To do this we have broken down the documents into small chunks of text and are having workers on Mechanical Turk code single chunks—rather than the whole document at once.  In this case the data scale up very quickly, but by aggregating the results we are able to have a very different kind of experiment with much richer data.
At the same time, I think political science—and perhaps the social sciences more generally—suffers from a tradition of undervaluing technical expertise. In that sense, it is difficult to convince colleagues that developing software tools is important.
 
Is that what inspired you to create the New York Open Statistical Meetup?
I actually didn’t create the New York Open Statistical Meetup (formerly the R meetup).  Joshua Reich was the original founder, back in 2008, and shortly after the first meeting we partnered and ran the Meetup together.  Once Josh became fully consumed by starting / running BankSimple I took it over by myself.  I think the best part about the Meetup is how it brings people together from a wide range of academic and industry backgrounds, and we can all talk to each other in a common language of computational programming.  The cross-pollination of ideas and talents is inspiring.
We are also very fortunate in that the community here is so strong, and that New York City is a well traveled place, so there is never a shortage of great speakers.
 
You created the data science Venn diagram. Where do you fall on the diagram?
Right at the center, of course! Actually, before I entered graduate school, which is long before I drew the Venn diagram, I fell squarely in the danger zone.  I had a lot of hacking skills, and my work (as an analyst in the U.S. intelligence community) afforded me a lot of substantive expertise, but I had little to no formal training in statistics.  If you could describe my journey through graduate school within the framework of the data science Venn diagram, it would be about me trying to pull myself out of the danger zone by gaining as much math and statistics knowledge as I can.  
 
I see that a lot of your software (including R packages) are on Github. Do you post them on CRAN as well? Do you think R developers will eventually move to Github from CRAN?
I am a big proponent of open source development, especially in the context of sharing data and analyses and creating reproducible results.  I love Github because it creates a great environment for following the work of other coders, and participating in the development process.  For data analysis, it is also a great place to upload data and R scripts and allow the community to see how you did things and comment.  I also think, however, that there is a big opportunity for a new site—like Github—to be created that is more tailored for data analysis, and storing and disseminating data and visualizations.
I do post my R packages to CRAN, and I think that CRAN is one of the biggest strengths of the R language and community.  I think ideally more package developers would open their development process, on Github or some other social coding platform, and then push their well-vetted packages to CRAN.  This would allow for more people to participate, but maintain the great community resource that CRAN provides. 
 
What inspired you to write, “Machine Learning for Hackers”? Who
was your target audience?
A little over a year ago John Myles White (my co-author) and I were having a lot of conversations with other members of the data community in New York City about what a data science curriculum would look like.  During these conversations people would always cite the classic texts: Elements of Statistical Learning, Pattern Recognition and Machine Learning, etc., which are excellent and deep treatments of the foundational theories of machine learning.  From these conversations it occurred to us that there was not a good text on machine learning for people who thought more algorithmically.  That is, there was not a text for “hackers,” people who enjoy learning about computation by opening up black-boxes and getting their hands dirty with code.
It was from this idea that the book, and eventually the title, were born.  We think the audience for the book is anyone who wants to get a relatively broad introduction to some of the basic tools of machine learning, and do so through code—not math.  This can be someone working at a company with data who wants to add some of these tools to their belt, or it can be an undergraduate in a computer science or statistics program who can relate to the material more easily through this presentation than the more theoretically heavy texts they’re probably already reading for class.

Interview with Amy Heineike - Director of Mathematics at Quid

Amy Heineike

Amy Heineike is the Director of Mathematics at Quid, a startup that seeks to understand technology development and dissemination through data analysis. She was the first employee at Quid, where she helped develop their technology early on. She has been recognized as one of the top Big Data Scientists. As a part of our ongoing interview series, we talked to Amy about data science, Quid, and how statisticians can get involved in the tech scene.

Which term applies to you: data scientist, statistician, computer scientist, or something else?
Data Scientist fits better than any, because it captures the mix of analytics, engineering and product management that is my current day to day.  
When I started with Quid I was focused on R&D - developing the first prototypes of what are now our core analytics technologies, and working to define and QA new data streams.  This required the analysis of lots of unstructured data, like news articles and patent filings, as well as the end visualisation and communication of the results.  
After we raised VC funding last year I switched to building our data science and engineering teams out.  These days I jump from conversations with the team about ideas for new analysis, to defining refinements to our data model, to questions about scalable architecture and filling out pivotal tracker tickets.  The core challenge is translating the vision for the product back to the team so they can build it.
 
 How did you end up at Quid?
In my previous work I’d been building models to improve our understanding of complex human systems - in particular the complex interaction of cities and their transportation networks, in order to evaluate the economic impacts of Crossrail, a new train line across London, and the implications of social networks on public policy.  Through this work it became clear that data was the biggest constraint - I became fascinated by a quest to find usable data for these questions - and that’s what led me to Silicon Valley.  I knew the founders of Quid from University, and approached them with the idea of analysing their data according to ideas I’d had - especially around network analysis - and the initial work we collaborated on became core to the founding technology of Quid.
Who were really good mentors to you? What were the qualities that helped you? 
I’ve been fortunate to work with some brilliant people in my career so far.  While I still worked in London I worked closely with two behavioural economists - Paul Ormerod, who’s written some fantastic books on the subject (most recently Why Things Fail), and Bridget Rosewell, until recently the Chief Economist to the Greater London Authority (the city government for London).  At Quid I’ve had a very productive collaboration with Sean Gourley, our CTO.
One unifying characteristic of these three is their ability to communicate complex ideas in a powerful way to a broad audience.  It’s an incredibly important skill; a core part of analytics work is taking the results to where they are needed, which is often beyond those who know the technical details, to those who care about the implications first.
 
How does Quid determine relationships between organizations and develop insight based on data? 
The core questions our clients ask us are around how technology is changing and how this impacts their business.  That’s a really fascinating and huge question that requires not just discovering a document with the answer in it, but organizing lots and lots of pieces of data to paint a picture of the emergent change.  What we can offer is not only being able to find a snapshot of that, but also being able to track how it changes over time.
We organize the data firstly through the insight that much disruptive technology emerges in organizations, and that the events that occur between and to organizations are a fantastic way to signal both the traction of technologies and to observe strategic decision making by key actors.
The first kind of relationship that’s important is the transactional type: who is acquiring, funding or partnering with whom. The second is an estimate of the technological clustering of organizations: what trends particular organizations represent.  Both of these can be discovered through documents about them, including government filings, press releases and news, but this requires analysis of unstructured natural language.
 
We’ve experimented with some very engaging visualisations of the results, and have had particular success with network visualisations, which are a very powerful way of allowing people to interact with a large amount of data in a quite playful way.  You can see some of our analyses in the press links at http://quid.com/in-the-news.php
What skills do you think are most important for statisticians/data scientists moving into the tech industry?
Technical statistical chops are the foundation. You need to be able to take a dataset and discover and communicate what’s interesting about it for your users.  To turn this into a product requires understanding how to turn one-off analysis into something reliable enough to run day after day, even as the data evolves and grows, and as different users experience different aspects of it.  A key part of that is being willing to engage with questions about where the data comes from (how it can be collected, stored, processed and QAed on an ongoing basis), how the analytics will be run (how will it be tested, distributed and scaled) and how people interact with it (through visualisations, UI features or static presentations?).  
For your ideas to become great products, you need to become part of a great team though!  One of the reasons that such a broad set of skills are associated with Data Science is that there are a lot of pieces that have to come together for it to all work out - and it really takes a team to pull it off.  Generally speaking, the earlier stage the company that you join, the broader the range of skills you need, and the more scrappy you need to be about getting involved in whatever needs to be done.  Later stage teams, and big tech companies may have roles that are purer statistics.
 
Do you have any advice for grad students in statistics/biostatistics on how to get involved in the start-up community or how to find a job at a start-up? 
There is a real opportunity for people who have good statistical and computational skills to get into the startup and tech scenes now.  Many people in Data Science roles have statistics and biostatistics backgrounds, so you shouldn’t find it hard to find kindred spirits.

We’ve always been especially impressed with people who have built software in a group and shared or distributed that software in some way.  Getting involved in an open source project, working with version control in a team, or sharing your code on github are all good ways to start on this.
It’s really important to be able to show that you want to build products though.  Imagine the clients or users of the company and see if you get excited about building something that they will use.  Reach out to people in the tech scene, explore who’s posting jobs - and then be able to explain to them what it is you’ve done and why it’s relevant, and be able to think about their business and how you’d want to help contribute towards it.  Many companies offer internships, which could be a good way to contribute for a short period and find out if it’s a good fit for you.

Interview With Joe Blitzstein

Joe Blitzstein
Joe Blitzstein is Professor of the Practice in Statistics at Harvard University and co-director of the graduate program. He moved to Harvard after obtaining his Ph.D. with Persi Diaconis at Stanford University. Since joining the faculty at Harvard, he has been immortalized in YouTube prank videos, been awarded a “favorite professor” distinction four times, and performed interesting research on the statistical analysis of social networks. Joe was also the first person to discover our blog on Twitter. You can find more information about him on his personal website. Or check out his Stat 110 class, now available from iTunes!
Which term applies to you: data scientist/statistician/analyst?

Statistician, but that should and does include working with data! I
think statistics at its best interweaves modeling, inference,
prediction, computing, exploratory data analysis (including
visualization), and mathematical and scientific thinking. I don’t
think “data science” should be a separate field, and I’m concerned
about people working with data without having studied much statistics
and conversely, statisticians who don’t consider it important ever to
look at real data. I enjoyed the discussions by Drew Conway and on
your blog (at http://www.drewconway.com/zia/?p=2378 and
http://simplystatistics.tumblr.com/post/11271228367/datascientist )
and think the relationships between statistics, machine learning, data
science, and analytics need to be clarified.

How did you get into statistics/data science (e.g. your history)?

I always enjoyed math and science, and became a math major as an
undergrad at Caltech partly because I love logic and probability and
partly because I couldn’t decide which science to specialize in. One
of my favorite things about being a math major was that it felt so
connected to everything else: I could often help my friends who were
doing astronomy, biology, economics, etc. with problems, once they had
explained enough so that I could see the essential pattern/structure
of the problem. At the graduate level, there is a tendency for math to
become more and more disconnected from the rest of science, so I was
very happy to discover that statistics let me regain this, and have
the best of both worlds: you can apply statistical thinking and tools
to almost anything, and there are so many opportunities to do things
that are both beautiful and useful.

Who were really good mentors to you? What were the qualities that really
helped you?

I’ve been extremely lucky that I have had so many inspiring
colleagues, teachers, and students (far too numerous to list), so I
will just mention three. My mother, Steffi, taught me at an early age
to love reading and knowledge, and to ask a lot of “what if?”
questions. My PhD advisor, Persi Diaconis, taught me many beautiful
ideas in probability and combinatorics, about the importance of
starting with a simple nontrivial example, and to ask a lot of “who
cares?” questions. My colleague Carl Morris taught me a lot about how
to think inferentially (Brad Efron called Carl a “natural”
statistician in his interview at
http://www-stat.stanford.edu/~ckirby/brad/other/2010Significance.pdf ,
by which I think he meant that valid inferential thinking does not
come naturally to most people), about parametric and hierarchical
modeling, and to ask a lot of “does that assumption make sense in the
real world?” questions.

How do you get students fired up about statistics in your classes?

Statisticians know that their field is both incredibly useful in the
real world and exquisitely beautiful aesthetically. So why isn’t that
always conveyed successfully in courses? Statistics is often
misconstrued as a messy menagerie of formulas and tests, rather than a
coherent approach to scientific reasoning based on a few fundamental
principles. So I emphasize thinking and understanding rather than
memorization, and try to make sure everything is well-motivated and
makes sense both mathematically and intuitively. I talk a lot about
paradoxes and results which at first seem counterintuitive, since
they’re fun to think about and insightful once you figure out what’s
going on.

And I emphasize what I call “stories,” by which I mean an
application/interpretation that does not lose generality. As a simple
example, if X is Binomial(m,p) and Y is Binomial(n,p) independently,
then X+Y is Binomial(m+n,p). A story proof would be to interpret X as
the number of successes in m Bernoulli trials and Y as the number of
successes in n different Bernoulli trials, so X+Y is the number of
successes in the m+n trials. Once you’ve thought of it this way,
you’ll always understand this result and never forget it. A
misconception is that this kind of proof is somehow less rigorous than
an algebraic proof; actually, rigor is determined by the logic of the
argument, not by how many fancy symbols and equations one writes out.
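A quick way to see the story (and check the algebra) is a short simulation; this is a generic illustration, not a Stat 110 example:

```r
# X ~ Bin(m, p) and Y ~ Bin(n, p), independent: X + Y should match Bin(m + n, p).
set.seed(1)
m <- 10; n <- 15; p <- 0.3
sums <- rbinom(1e5, m, p) + rbinom(1e5, n, p)
mean(sums == 7)      # simulated P(X + Y = 7)
dbinom(7, m + n, p)  # exact Binomial(25, 0.3) probability; both are about 0.17
```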

My undergraduate probability course, Stat 110, is now worldwide
viewable for free on iTunes U at
http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=495213607
with 34 lecture videos and about 250 practice problems with solutions.
I hope that will be a useful resource, but in any case looking through
those materials says more about my teaching style than anything I can
write here does.

What are your main research interests these days?

I’m especially interested in the statistics of networks, with
applications to social network analysis and in public health. There is
a tremendous amount of interest in networks these days, coming from so
many different fields of study, which is wonderful but I think there
needs to be much more attention devoted to the statistical issues.
Computationally, most network models are difficult to work with since
the space of all networks is so vast, and so techniques like Markov
chain Monte Carlo and sequential importance sampling become crucial;
but there remains much to do in making these algorithms more efficient
and in figuring out whether one has run them long enough (usually the
answer is “no” to the question of whether one has run them long
enough). Inferentially, I am especially interested in how to make
valid conclusions when, as is typically the case, it is not feasible
to observe the full network. For example, respondent-driven sampling
is a link-tracing scheme being used all over the world these days to
study so-called “hard-to-reach” populations, but much remains to be
done to know how best to analyze such data; I’m working on this with
my student Sergiy Nesterko. With other students and collaborators I’m
working on various other network-related problems. Meanwhile, I’m also
finishing up a graduate probability book with Carl Morris,
"Probability for Statistical Science," which has quite a few new
proofs and perspectives on the parts of probability theory that are
most useful in statistics.

You have been immortalized in several YouTube videos. Do you think this
helped make your class more “approachable”?

There were a couple strange and funny pranks that occurred in my first
year at Harvard. I’m used to pranks since Caltech has a long history
and culture of pranks, commemorated in several “Legends of Caltech”
volumes (there’s even a movie in development about this), but pranks
are quite rare at Harvard. I try to make the class approachable
through the lectures and by making sure there is plenty of support,
help, and encouragement available from the teaching assistants and
me, not through YouTube, but it’s fun having a few interesting
occasions from the history of the class commemorated there.

Interview with Nathan Yau of FlowingData

Nathan Yau

Nathan Yau is a graduate student in statistics at UCLA and the author of the extremely popular data visualization blog flowingdata.com. He recently published a book Visualize This - a really nice guide to modern data visualization using R, Illustrator and Javascript - which should be on the bookshelf of any statistician working on data visualization. 
