Big data is all the rage right now. Everything that wasn’t called big data before the catchphrase became a catchphrase is now big data. While the size of a database is certainly part of what makes data “big,” size alone doesn’t make large data the phenomenon that the collective consciousness has coined “big data.” I think one of the defining characteristics of the big data revolution is the ability to repurpose the large amounts of data that exist all around us into meaningful and actionable insight. The scale of the data is just one factor that helps us turn the seemingly cacophonous into such insight. While some of this insight is correlational (seriously, play with Google Correlate if you haven’t already), real actionable insights generally result from a causal relationship uncovered in the data.
It goes without saying that drawing causal inferences from big data is a tall order. Just to draw correlational conclusions requires gathering, munging, reshaping, merging, cleansing, and analyzing raw, unkempt data-like things – and perhaps a prayer just to be safe. But how do we make the leap from correlation to causation without being able to design and execute an experiment? This generally requires a mixture of ingenuity, a deep understanding of the data generating process and institutional insights, and familiarity with a few standard empirical causality arguments. In this and the next blog post, I will present three examples of different types of causality arguments, all drawn from current cutting-edge research on online reviews.
Example 1 – Reviews, reputation, and revenue: the case of Yelp.com, Michael Luca (2011 HBS working paper series)
This interesting working paper by Michael Luca investigates the effect of Yelp on restaurant revenues. The two primary datasets are a history of Yelp reviews of Seattle-area restaurants and restaurant revenue data from the Washington State Department of Revenue. So if we run a regression like the following, why can’t we interpret the effect of Yelp on revenue, the parameter b, as a causal effect?
Revenue_jt = u_j + q_t + b*Yelp_jt + e_jt
Despite controlling for the restaurant fixed effect, u_j, and seasonal variation, q_t, we still cannot interpret this regression as a causal statement, because something unobserved could simultaneously affect both Yelp ratings and revenue, e.g., reputation. We can’t say that changes in Yelp ratings actually cause revenue gains, because it could be reputation that increases revenues, with Yelp merely measuring that underlying reputation. Sounds like we’re splitting hairs, right? No! This is an important distinction, because if we can make a causal statement, we know that we just have to increase Yelp ratings to increase revenues. On the other hand, if revenues are actually caused by some latent reputation, we could not increase revenues even if we wrote 1 million fake positive reviews on Yelp (not that I would ever advocate this as a strategy, more on this next time…).
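To make the setup concrete, here is a minimal sketch of what this naive fixed-effects regression could look like using pandas and statsmodels. The file name and column names (restaurant, quarter, revenue, yelp_rating) are hypothetical placeholders for a restaurant-by-quarter panel; this is not Luca’s actual code or data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical restaurant-by-quarter panel with columns:
# 'restaurant', 'quarter', 'revenue', 'yelp_rating' (the raw average rating).
df = pd.read_csv("seattle_restaurant_panel.csv")  # placeholder file name

# Naive specification: revenue on the Yelp rating, with restaurant (u_j)
# and quarter (q_t) fixed effects entered as categorical dummies, and
# standard errors clustered by restaurant.
naive = smf.ols(
    "revenue ~ yelp_rating + C(restaurant) + C(quarter)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["restaurant"]})

# b: the coefficient we cannot read causally if reputation is omitted.
print(naive.params["yelp_rating"])
```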
So what can we do? We need to identify something that changes Yelp ratings but is uncorrelated with changes in reputation. Luca’s idea is based on his observation that Yelp rounds average ratings to the nearest half star. For example, a restaurant rated 4.24 will be rounded down to a 4, while a restaurant rated 4.26 will be rounded up to a 4.5. While the raw average might be representative of the restaurant’s reputation, the rounding is certainly idiosyncratic to Yelp. Now, if we see big differences in revenue for restaurants clustered around the rounding thresholds on Yelp, then we can more plausibly make the causal statement that Yelp increases restaurant revenues. This is known in econometrics as a regression discontinuity design, or RDD. Luca creates a dummy variable to signal whether a restaurant is just above or below a rounding threshold like 3.25, 3.75, or 4.25. Adding this variable to the specification, he is able to estimate a statistically significant effect of rounding on restaurant revenues.
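As a rough illustration of how the rounding variable might be constructed (continuing the hypothetical panel from the sketch above, not Luca’s actual specification):

```python
# Yelp displays the average rounded to the nearest half star:
# a raw average of 4.24 displays as 4.0, while 4.26 displays as 4.5.
df["displayed_stars"] = (df["yelp_rating"] * 2).round() / 2

# Dummy for having been rounded up: the displayed half-star exceeds the
# raw average, i.e. the restaurant sits just above a threshold like
# 3.25, 3.75, or 4.25.
df["rounded_up"] = (df["displayed_stars"] > df["yelp_rating"]).astype(int)

# RDD-flavored specification: the rounding dummy alongside the continuous
# raw average, so the coefficient on rounded_up reflects only the
# discontinuity introduced by Yelp's rounding.
rdd = smf.ols(
    "revenue ~ rounded_up + yelp_rating + C(restaurant) + C(quarter)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["restaurant"]})

print(rdd.params["rounded_up"])
```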
When done properly, RDD can be about as definitive an empirical causality test as there is. But can we poke holes in this argument? While it is pretty convincing that the rounding of stars is a nice idiosyncratic feature of Yelp, the argument isn’t bullet-proof. First, the revenue data from the state is recorded at the quarterly level. We can’t actually identify revenue changes within a quarter (which is likely when most of the rounding occurs, given the frequency of reviews). In other words, Yelp ratings get rounded up or down more frequently than restaurants report their sales to the government. One possible way to address this is to calculate the within-quarter percentage of days that a restaurant’s rating is within the threshold of rounding up or down; in other words, convert the discontinuity into a continuous variable. This isn’t ideal, but given the data, it might be the only actionable way to address the issue. Luca actually does something similar by only including a dummy for rounding up or down within a quarter if half the days of that quarter are within the rounding threshold.
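Here is a sketch of that continuous alternative, assuming a hypothetical daily panel daily with each restaurant’s running average rating on each day (again, illustrative column names, not the paper’s actual data):

```python
# For each restaurant-quarter, compute the share of days on which the
# displayed rating was rounded up from the running average.
daily["date"] = pd.to_datetime(daily["date"])
daily["displayed"] = (daily["running_avg"] * 2).round() / 2
daily["rounded_up"] = (daily["displayed"] > daily["running_avg"]).astype(int)
daily["quarter"] = daily["date"].dt.to_period("Q").astype(str)

pct_rounded_up = (
    daily.groupby(["restaurant", "quarter"])["rounded_up"]
    .mean()  # fraction of days in the quarter spent above a rounding threshold
    .rename("pct_days_rounded_up")
    .reset_index()
)

# Merge back into the quarterly panel (quarter keys assumed to line up)
# and use pct_days_rounded_up in place of the single quarter-level dummy.
df = df.merge(pct_rounded_up, on=["restaurant", "quarter"], how="left")
```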
Second, as restaurants receive more reviews, their average rating stabilizes and crossings of the rounding thresholds become less frequent. In other words, most of the variation in rounding occurs at the beginning of a restaurant’s life-cycle. One can imagine that this period is also the most volatile in a restaurant’s quality and reputation. Combined with the fact that we only have quarterly sales data, it isn’t implausible that, for much of the data, the quarterly star-rounding dummies are correlated with actual discontinuous jumps in quality and reputation. Luca controls for this somewhat by including the continuous quarterly average of ratings. While controlling for Yelp ratings does a bit to quell the issue, it also assumes that Yelp is an accurate measure of quality and reputation (which it may be). More importantly, this control does not account for the impact of critic reviews and publicity through other outlets (e.g., advertising), which is more likely to occur at the beginning of a restaurant’s life-cycle, likely to be correlated with revenue growth, and likely to be correlated with the probability of being in the rounding threshold.
Despite these shortcomings (and perhaps others?), this is one of the papers on online reviews that I think most definitively shows a causal impact of reviews on sales. Next time, I will talk about causal inference with big data in the context of fake reviews and management response.