How do I know if my figure is too complicated?

One of the key things every statistician needs to learn is how to create informative figures and graphs. Sometimes it is easy to use off-the-shelf plots like barplots, histograms, or, if one is truly desperate, a pie chart.

But sometimes the information you are trying to communicate requires the development of a new graphic. I am currently working on a project with a graduate student where the standard illustrations are Venn diagrams - including complicated Venn diagrams with 5 or 10 circles. 

As we were thinking about different ways of illustrating our data, I started thinking about what the key qualities of a graphic are and how I would know if one is too complicated. I realized that:

  1. Ideally, one can intuitively understand what is going on just by looking at the graphic, but for more technical/involved displays this sometimes isn’t possible.
  2. Alternatively, I think a good plot should be explainable in two sentences or fewer. I think that is true for pretty much every plot I use regularly. 
  3. That doesn’t include describing what different colors/sizes/shapes specifically represent in any particular version of the graphic. 

I feel like there is probably something to this in the Grammar of Graphics or in some of William Cleveland’s work. But this is one of the first times I’ve come up with a case where a new, generalizable, type of graph needs to be developed. 

Sunday data/statistics link roundup (4/22)

  1. Now we know who is to blame for the pie chart. I had no idea it had been around, straining our ability to compare relative areas, since 1801. However, the same guy (William Playfair) apparently also invented the bar chart. So he wouldn’t be totally shunned by statisticians. (via Leonid K.)
  2. A nice article in the Guardian about the current group of scientists that are boycotting Elsevier. I have to agree with the quote that leads the article, “All professions are conspiracies against the laity.” On the other hand, I agree with Rafa that academics are partially to blame for buying into the closed access hegemony. I think more than a boycott of a single publisher is needed; we need a change in culture. (first link also via Leonid K)
  3. A blog post on how to add a transparent image layer to a plot. For some reason, I have wanted to do this several times over the last couple of weeks, so the serendipity of seeing it on R Bloggers merited a mention. 
  4. I agree the Earth Institute needs a better graphics advisor. (via Andrew G.)
  5. A great article on why multiple choice tests are used - they are an easy way to collect data on education. But that doesn’t mean they are the right data. This reminds me of the Tukey quote: “The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data”. It seems to me if you wanted to have a major positive impact on education right now, the best way would be to develop a new experimental design that collects the kind of data that really demonstrates mastery of reading/math/critical thinking. 
  6. Finally, a bit of a bleg…what is the best way to do the SVD of a huge (think 1e6 x 1e6), sparse matrix in R? Preferably without loading the whole thing into memory…
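
On that last question, one direction that may be worth a look: iterative methods only ever need the product of the matrix with a vector, so the matrix can stay in sparse form the whole time. Below is a toy sketch using the Matrix package, assuming only the leading singular value and vectors are wanted. The function name top_singular_triplet is made up for illustration, and plain power iteration is a stand-in for the more capable methods in packages such as irlba. It does not address the part about keeping the matrix out of memory.

```r
library(Matrix)

# Hypothetical helper: power iteration for the largest singular triplet.
# Only sparse matrix-vector products are used; A is never made dense.
top_singular_triplet <- function(A, max_iter = 500, tol = 1e-12) {
  v <- rnorm(ncol(A))
  v <- v / sqrt(sum(v^2))
  sigma <- 0
  for (iter in seq_len(max_iter)) {
    u <- as.vector(A %*% v)           # u is proportional to A v
    u <- u / sqrt(sum(u^2))
    w <- as.vector(crossprod(A, u))   # w = t(A) u
    sigma_new <- sqrt(sum(w^2))       # current estimate of the singular value
    v <- w / sigma_new
    converged <- abs(sigma_new - sigma) <= tol * sigma_new
    sigma <- sigma_new
    if (converged) break
  }
  list(d = sigma, u = u, v = v, iterations = iter)
}

# Small deterministic demo: a 1000 x 1000 sparse matrix with three nonzeros.
set.seed(1)
A <- sparseMatrix(i = 1:3, j = 1:3, x = c(5, 3, 1), dims = c(1000, 1000))
fit <- top_singular_triplet(A)
fit$d
```

Power iteration converges slowly when the top two singular values are close, so this is a starting point for experimenting rather than something to run on the real 1e6 x 1e6 problem.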

Sunday data/statistics link roundup (4/15)

  1. Incredibly cool, dynamic real-time maps of wind patterns in the United States. (Via Flowing Data)
  2. A d3.js coding tool that updates automatically as you update the code. This is going to be really useful for beginners trying to learn about D3. Real time coding (Via Flowing Data)
  3. An interesting blog post describing why the winning algorithm in the Netflix prize hasn’t actually been implemented! It looks like it was too much of an engineering hassle. I wonder if this will make others think twice before offering big sums for prizes like this. Unless the real value is advertising…(via Chris V.)
  4. An article about a group at USC that plans to collect all the information from apps that measure heart beats. Their project is called everyheartbeat. I think this is a little bit premature, given the technology, but certainly the quantified self field is heating up. I wonder how long until the target audience for these sorts of projects isn’t just wealthy young technophiles? 
  5. A really good deconstruction of a recent paper suggesting that the mood on Twitter could be used to game the stock market. The author illustrates several major statistical flaws, including not correcting for multiple testing, an implausible statistical model, and not using a big enough training set. The scary thing is apparently a hedge fund is teaming up with this group of academics to try to implement their approach. I wouldn’t put my money anywhere they can get their hands on it. This is just one more in the accelerating line of results that illustrate the critical need for statistical literacy both among scientists and in the general public.

R and the little data scientist’s predicament

I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder’s Predicament, which, boiled down, essentially says that computer programming languages have grown too complex - so children/newbies can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer do things like play some music or make a video. 

I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax. 

Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:

  1. It would have to be easy to load data sets, in one line of uncomplicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load("education-data") and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background. 
  2. It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures. 
  3. It would be awesome if the functions would include some sort of dynamic graphics (with SVGAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn. 

That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends. 
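
To make the idea concrete, here is a rough sketch of what the two-line experience might look like. Everything in it is hypothetical: the function names load_data and pretty_plot, the registry, and the data set names are made up, and built-in R data sets stand in for the web APIs a real package would wrap. I used load_data rather than load because base R already has a load function for .RData files.

```r
# Hypothetical registry: friendly names mapped to loader functions.
# Built-in data sets stand in for the scraping/formatting a real package would do.
data_registry <- list(
  "car-data"    = function() datasets::mtcars,
  "flower-data" = function() datasets::iris
)

# One-line loading: look up the name and run its loader.
load_data <- function(name) {
  loader <- data_registry[[name]]
  if (is.null(loader)) {
    stop("Unknown data set '", name, "'. Try one of: ",
         paste(names(data_registry), collapse = ", "))
  }
  loader()
}

# One-line plotting: scatterplot of the first two numeric columns,
# with pch, cex, col, and labels chosen behind the scenes.
pretty_plot <- function(data) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  if (length(numeric_cols) < 2) {
    stop("Need at least two numeric columns to plot.")
  }
  x <- numeric_cols[1]
  y <- numeric_cols[2]
  plot(data[[x]], data[[y]], pch = 19, cex = 1.2, col = "steelblue",
       xlab = x, ylab = y, main = paste(y, "versus", x), bty = "n")
  invisible(c(x = x, y = y))
}

# The two lines a kid would actually type:
cars <- load_data("car-data")
pretty_plot(cars)
```

The design choice that matters here is that every default lives inside the functions, so the user never sees pch or cex unless they go looking.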

Sunday data/statistics link roundup (3/4)

  1. A cool article on Github by the folks at Wired. I’m starting to think the fact that I’m not on Github is a serious dent in my nerd cred. 
  2. Datawrapper - a less intensive, but less flexible open source data visualization creator. I have seen a few of these types of services starting to pop up. I think that some statistics training should be mandatory before people use them. 
  3. An interesting blog post with the provocative title, “Why bother publishing in a journal”. The story he describes works best if you have a lot of people who are interested in reading what you put on the internet. 
  4. A post on stackexchange comparing the machine learning and statistics cultures. 
  5. Stackoverflow is a great place to look for R answers. It is the R mailing list, minus the flames…
  6. Roger’s posts on Beijing air pollution are worth another read if you missed them. Particularly this one, where he computes the cigarette equivalent of the air pollution levels. 

A wordcloud comparison of the 2011 and 2012 #SOTU

I wrote a quick (and very dirty) R script for creating a comparison cloud and a commonality cloud for President Obama’s 2011 and 2012 State of the Union speeches*. The cloud on the left shows words that have different frequencies between the two speeches and the cloud on the right shows the words in common between the two speeches. Here is a higher resolution version. 

The focus on jobs hasn’t changed much. But it is interesting how the 2012 speech seems to focus more on practical issues (tax, pay, manufacturing, oil) versus more emotional issues in 2011 (future, schools, laughter, success, dream). 

*The wordcloud R package does all the heavy lifting.
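
This is not the script itself, just a minimal sketch of the kind of input those two clouds need. As far as I know, comparison.cloud and commonality.cloud in the wordcloud package both take a term matrix with one row per word and one column per document. The two strings below are toy word lists, not the text of the speeches, and tokenize and draw_clouds are hypothetical helper names.

```r
# Toy stand-ins for the two documents (NOT the actual speech text).
speeches <- c(
  "2011" = "jobs future schools dream jobs success",
  "2012" = "jobs tax pay manufacturing oil jobs tax"
)

# Hypothetical helper: lowercase the text and split on anything that is not a letter.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[nchar(words) > 0]
}

tokens <- lapply(speeches, tokenize)
vocab <- sort(unique(unlist(tokens)))

# Term matrix: rows are words, columns are documents, entries are counts.
term_matrix <- sapply(tokens, function(w) {
  as.numeric(table(factor(w, levels = vocab)))
})
rownames(term_matrix) <- vocab
term_matrix

# Hypothetical wrapper that draws both clouds; requires the wordcloud package.
draw_clouds <- function(term_matrix) {
  if (!requireNamespace("wordcloud", quietly = TRUE)) {
    stop("The wordcloud package is needed to draw the clouds.")
  }
  wordcloud::comparison.cloud(term_matrix)   # words that differ between documents
  wordcloud::commonality.cloud(term_matrix)  # words the documents share
}
```

With real speech text in place of the toy strings, calling draw_clouds(term_matrix) should give something in the spirit of the two clouds above, although a real script would also want to drop stop words first.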

An R function to map your Twitter Followers

I wrote a little function to make a personalized map of who follows you or who you follow on Twitter. The idea for this function was inspired by some plots I discussed in a previous post. I also found a lot of really useful code over at FlowingData here.

The function uses the packages twitteR, maps, geosphere, and RColorBrewer. If you don’t have the packages installed, when you source the twitterMap code, it will try to install them for you. The code also requires you to have a working internet connection. 

One word of warning is that if you have a large number of followers or people you follow, you may be rate limited by Twitter and unable to make the plot.

To make your personalized twitter map, first source the function:

> source("http://biostat.jhsph.edu/~jleek/code/twitterMap.R")

The function has the following form: 

twitterMap <- function(userName, userLocation=NULL, fileName="twitterMap.pdf", nMax=1000, plotType=c("followers","both","following"))

with arguments:

  • userName - the twitter username you want to plot
  • userLocation - an optional argument giving the location of the user, necessary when the location information you have provided Twitter isn’t sufficient for us to find latitude/longitude data
  • fileName - the file where you want the plot to appear
  • nMax - the maximum number of followers/following to get from Twitter. This is implemented to avoid rate limiting for people with large numbers of followers. 
  • plotType - if “both” both followers/following are plotted, etc. 

Then you can create a plot with both followers/following like so: 

> twitterMap("simplystats")

Here is what the resulting plot looks like for our Twitter Account:

If your location can’t be found or the latitude/longitude can’t be calculated, you may have to choose a bigger city near you. The list of cities used by twitterMap can be found like so:

> library(maps)
> data(world.cities)
> grep("Baltimore", world.cities[,1])

If your city is in the database, this will return the row number of the world.cities data frame corresponding to your city. 
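
If you want the coordinates rather than the row numbers, a lookup along these lines might be handy. This is only a sketch: find_city is a hypothetical helper, and toy_cities is a tiny hand-made stand-in that borrows four of the column names from world.cities (name, country.etc, lat, long) with approximate coordinates, so the example runs without the maps package.

```r
# Hand-made stand-in for world.cities (approximate coordinates, illustration only).
toy_cities <- data.frame(
  name = c("Baltimore", "Boston", "Seattle"),
  country.etc = c("USA", "USA", "USA"),
  lat = c(39.3, 42.4, 47.6),
  long = c(-76.6, -71.1, -122.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the matching rows instead of just their indices.
find_city <- function(pattern, cities) {
  hits <- grep(pattern, cities$name, ignore.case = TRUE)
  cities[hits, c("name", "country.etc", "lat", "long"), drop = FALSE]
}

find_city("baltimore", toy_cities)
```

With the maps package loaded and data(world.cities) run, find_city("Baltimore", world.cities) should work the same way, and an empty result means the city is not in the database.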

If you like this function you may also like our function to determine if you are a data scientist or to analyze your Google Scholar citations page.

Update: The bulk of the heavy lifting done by these functions is performed by Jeff Gentry’s very nice twitteR package and code put together by Nathan Yau over at FlowingData. This is really an example of standing on the shoulders of giants. 

Interview with Nathan Yau of FlowingData

Nathan Yau is a graduate student in statistics at UCLA and the author of the extremely popular data visualization blog flowingdata.com. He recently published a book Visualize This - a really nice guide to modern data visualization using R, Illustrator and Javascript - which should be on the bookshelf of any statistician working on data visualization. 

Visualizing Yahoo Email

Here is a cool page where Yahoo shows you the email it is processing in real time. It includes a visualization of the most popular words in emails at a given time. A pretty neat tool and definitely good for procrastination, but I’m not sure what else it is good for…

Spectacular Plots Made Entirely in R

When doing data analysis, I often create a set of plots quickly just to explore the data and see what the general trends are. Later I go back and fiddle with the plots to make them look pretty for publication. But some people have taken this to the next level. Here are two plots made entirely in R:

The descriptions of how they were created are here and here.

Related: Check out Roger’s post on R colors and my post on APIs