Sunday, January 25, 2015

Remain skeptical when looking at slick data-mining reports.

Nobel Prize-winning economist Ronald Coase warned, "If you torture the data long enough, it will confess."  Our current infatuation with Big Data sometimes leads us down this road of unjustified exuberance. Large data sets are more accessible than ever. And tinkering with them can be really fun!

The Sunday Review section of today's New York Times provides a timely illustration. Researchers for the Times went out looking for correlations between street name and home value.¹  They package their findings behind slick, interactive graphics. We can type in our street name and see how the prices of homes located on streets with that name compare with national averages.

This topic entices us immediately. Our home — for many of us — stores most of our household wealth. We are also less than a dozen years removed from the roller-coaster mass hysteria of a nationwide housing bubble. Many of us might still be more nervous than ever as we watch our prize assets' unsteady recovery of pre-crash value. 

The authors' first example gives us a red flag. They assert, "On average, homes on named streets are 2 percent more valuable than those on numbered streets." We should immediately ask ourselves, "How significant is this?" Should the street name influence my next home-buying decision?

The street-name to home-value analysis stumbles into three common Big-Data pitfalls. I mostly focus here on the first: They only tell us about averages. Averages (or means) are among the most abused of all statistics. They can become outright obfuscatory when offered in the absence of other statistics. "What is the standard deviation?" should be our reflexive response whenever someone offers an average during a serious conversation.
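To see why the spread matters, here is a minimal sketch. It assumes house prices in both groups are normally distributed with the same standard deviation. The price index values (100 and 102) and the standard deviations are hypothetical, since the Times article reports none. The function name is mine.

```python
from statistics import NormalDist


def prob_named_beats_numbered(mean_numbered, mean_named, sd):
    """Chance that a random named-street home is worth more than a random
    numbered-street home, assuming independent normal prices with equal sd."""
    # The difference of two independent normals is normal, with variance 2*sd^2.
    difference = NormalDist(mu=mean_named - mean_numbered, sigma=sd * 2 ** 0.5)
    return 1.0 - difference.cdf(0.0)


# Hypothetical spreads: the same 2 percent gap in means, very different stories.
wide_spread = prob_named_beats_numbered(100.0, 102.0, sd=30.0)
narrow_spread = prob_named_beats_numbered(100.0, 102.0, sd=0.5)
```

With a wide spread the result sits barely above a coin flip. With a very narrow spread the same two-percent gap becomes nearly decisive. The average alone cannot tell us which world we are in.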

The Times authors don't give us any standard deviations. So what can we do?  We can ask, "How significant is a 2% difference in home values in your study?"  This is easy to test. We can get a pretty good idea on our own, even without access to their data.

I did exactly this to produce Figure 1. I am trying to answer the question, "At what home-price difference might homes on different street names become statistically different?" To do this, I assumed that the house prices fit a normal distribution and that their standard deviations are the same. These assumptions can in themselves be perilous. But they give me a good sanity check.

I used a statistical test called a "Kolmogorov-Smirnov" test for my little experiment here. This is a statistical method that tells whether two data sets are from the same or different statistical distributions. Rascoff and Humphries, the authors of the Times article, give us an example in which houses on numbered streets are, on average, worth two percent less than those on named streets.

I test this with two 1,000-point random normal data sets. One data set has a mean of zero. The other has a mean of 0.02. I run this test 1,000 times. I calculate the Kolmogorov-Smirnov statistic for each of these 1,000 trials. I then calculate the average for the 1,000 trials. I also do this for 999 other cases in which the second data set has a different mean. This amounts to a million tests using a total of two billion data points.
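The experiment can be sketched roughly as follows. This is not my original script. It is a standard-library illustration that assumes unit standard deviations and uses the asymptotic Kolmogorov distribution for the p-value. The function names, the seed, and the 0.2 cutoff are choices made for this sketch.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:  # the largest gap always occurs at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_p_value(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the series converges poorly here; the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def mean_p_value(shift, n=1000, trials=1000, seed=0):
    """Average KS p-value comparing N(0, 1) samples with N(shift, 1) samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(shift, 1.0) for _ in range(n)]
        total += ks_p_value(ks_statistic(a, b), n, n)
    return total / trials
```

Calling `mean_p_value(0.02)` gives the average p-value for the two-percent case. Looping the same call over a grid of shifts traces out a curve of the kind plotted in Figure 1. The full million-test run is slow in pure Python, so a smaller `trials` value is a reasonable first pass.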

Figure 1 shows the result. I believe that street name might make a difference if the "p-value" from my Kolmogorov-Smirnov test is less than about 0.05. What does Figure 1 show? At a difference of two percent my p-value is much closer to 0.5 than to 0.05. 

Street name should only really matter if average house prices differ by about 14%. Otherwise, the street name does not matter!  There is no evidence that the statistical distribution for house prices on named streets is different from that for house prices on numbered streets. A two-percent difference is well within what I expect from random chance!

Figure 1: p-value results from Kolmogorov-Smirnov test comparing random normal samples. Each plotted point is the average p-value over 1,000 trials at one of 1,000 distinct means.

So, Rascoff and Humphries commit three primary Big-Data offenses. I show here their abuse of means. Their tool also doesn't account for problems with small samples. I type in my street name and receive the happy news that houses on streets with that name are worth on average 105% more than the national average!  But there are only two homes per 100,000 with that address. A sample that small cannot support a meaningful conclusion.
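The small-sample problem is easy to demonstrate. The sketch below draws repeated samples from one and the same population and measures how much the sample averages wander. The lognormal price model, its parameters, and the function name are assumptions chosen only for illustration.

```python
import random
import statistics


def spread_of_sample_means(sample_size, repeats=2000, seed=1):
    """Standard deviation of the sample mean across many repeated samples,
    all drawn from the same hypothetical lognormal population of prices."""
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        sample = [rng.lognormvariate(0.0, 0.5) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


# Same population, different sample sizes.
tiny_sample_spread = spread_of_sample_means(2, repeats=300)
large_sample_spread = spread_of_sample_means(200, repeats=300)
```

Averages built from a couple of homes scatter far more widely than averages built from hundreds, even though every home comes from the same population. A street name with only a handful of homes can post a dramatic premium by luck alone.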

Finally, the authors' admittedly fun exercise fails to consider other factors. No rational home buyer would use street name as a determining factor when shopping. They consider school districts, crime rates, tax rates, and a myriad of other factors. While their exercise is entertaining, it illustrates the pitfalls we must consider when using analytics for more serious work.






References

¹ S. Rascoff and S. Humphries, "The Secrets of Street Names and Home Values," Sunday Review, The New York Times, January 24, 2015, http://www.nytimes.com/2015/01/25/opinion/sunday/the-secrets-of-street-names-and-home-values.html?ref=opinion.


© The Quant's Prism, 2015
