Today, I was reminded about a major gotcha in ggplot2 (by making the mistake).
How do you set limits on plots with ggplot2?
The most common way is to set them on the scale, e.g. with xlim() or the limits argument of scale_x_continuous().
The problem with this is that ggplot2 removes the data outside the range before anything is computed or drawn.
This is often fine – say you are just doing a simple scatter plot.
The trouble starts when your plot requires some calculation on top – for example boxplots. In that case: (1) data gets removed, (2) quantiles, IQR and outliers get calculated, (3) the plot is created.
So all statistics apply only to the data inside the limits, not to the whole data set. This is rarely what you want and can be deeply misleading.
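To make the order of operations concrete, here is a minimal base R sketch that mimics the three steps by hand. The data and limits are made up for illustration, and the code only imitates the behaviour described above, it does not call ggplot2 itself.

```r
# Hypothetical data: five values inside the range [0, 1], five outside it.
y <- c(0.1, 0.2, 0.3, 0.4, 0.5, 2, 3, 4, 5, 6)
limits <- c(0, 1)

# Step 1: values outside the limits are dropped before anything else happens.
y_kept <- y[y >= limits[1] & y <= limits[2]]

# Step 2: the statistics are computed on whatever is left ...
stats_kept <- quantile(y_kept, c(0.25, 0.5, 0.75))

# ... which is not the same as the statistics of the whole data set.
stats_full <- quantile(y, c(0.25, 0.5, 0.75))

print(stats_kept)  # quartiles of the truncated data
print(stats_full)  # quartiles of all the data
```

The median of the truncated data is 0.3, while the median of the full data is 1.25, so a boxplot built on the truncated data tells a different story.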
Most of the time you just want to “zoom” in on a range.
The proper command is:
coord_cartesian(xlim = c(0, 1))
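Here is a small sketch of how you could check the difference yourself. The data frame is made up, and I put the values on the y axis (so ylim instead of xlim) because that is the usual orientation for a boxplot. layer_data() returns the statistics ggplot2 computed for a layer, and its middle column is the median drawn in the box.

```r
library(ggplot2)

# Hypothetical data: nine small values and three large ones.
df <- data.frame(
  group = "a",
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50, 100, 200)
)

p <- ggplot(df, aes(x = group, y = value)) + geom_boxplot()

# Limits on the scale: values above 10 are removed before the stats are computed.
p_limits <- p + scale_y_continuous(limits = c(0, 10))

# Limits on the coordinate system: stats use all the data, only the view is zoomed.
p_zoom <- p + coord_cartesian(ylim = c(0, 10))

# Median used for each boxplot (the first version warns about removed rows).
med_limits <- suppressWarnings(layer_data(p_limits)$middle)
med_zoom <- layer_data(p_zoom)$middle

print(c(limits = med_limits, zoom = med_zoom))
```

With these numbers the scale-limited plot reports a median of 5 (the median of 1 to 9), while the zoomed plot reports 6.5, the median of all twelve values.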
I like ggplot, but it never stopped feeling odd.
It “feels” alien – I think the reason is that Hadley Wickham is an alien. That’s the only possible explanation for his out-of-this-world productivity and genius. Unfortunately, he wrote ggplot shortly after he landed and therefore it still bears strong marks of his home planet.
Contrast this with dplyr, one of the most ergonomic designs around. In order to fit in, Hadley studied the human race and got inspired by anthropological studies of early paleolithic Unix designs. Our ancestors used piping as the main data manipulation model in small hunter-gatherer data science teams. Piping is hard-wired in our brains and therefore feels so natural (as with other results of evolutionary psychology – caveat lector).
Jokes aside, removing data silently is a really, really dangerous choice of default.