Sunday, January 25, 2015

Remain skeptical when looking at slick data-mining reports.

Nobel Prize-winning economist Ronald Coase warned, "If you torture the data long enough, it will confess." Our current infatuation with Big Data sometimes leads us down this road of unjustified exuberance. Large data sets are more accessible than ever. And tinkering with them can be really fun!

The Sunday Review section of today's New York Times provides a timely illustration. Researchers for the Times went looking for correlations between street name and home value.¹ They package their findings in slick, interactive graphics. We can type in our street name and see how the prices of homes located on streets with that name compare with national averages.

This topic entices us immediately. Our home — for many of us — stores most of our household wealth. We are also less than a dozen years removed from the roller-coaster mass hysteria of a nationwide housing bubble. Many of us remain nervous as we watch our prized assets' unsteady recovery of pre-crash value.

The authors' first example gives us a red flag. They assert, "On average, homes on named streets are 2 percent more valuable than those on numbered streets." We should immediately ask ourselves, "How significant is this?" Should the street name influence my next home-buying decision?

The street-name to home-value analysis stumbles into three common Big-Data pitfalls. I mostly focus here on the first: The authors only tell us about averages. Averages — or means — are among the most-abused of all statistics. They can become outright obfuscatory when offered in the absence of other statistics. "What is the standard deviation?" should be our reflexive response whenever someone offers an average during a serious conversation.

The Times authors don't give us any standard deviations. So what can we do?  We can ask, "How significant is a 2% difference in home values in your study?"  This is easy to test. We can get a pretty good idea on our own, even without access to their data.

I did exactly this to produce Figure 1. I am trying to answer the question, "At what difference in average home price does street name begin to matter statistically?" To do this, I assumed that the house prices fit a normal distribution and that their standard deviations are the same. These assumptions can in themselves be perilous. But they give me a good sanity check.

I used the "Kolmogorov-Smirnov" test for my little experiment here. This is a statistical method that tells whether two data sets come from the same distribution or from different ones. Rascoff and Humphries — authors of the Times article — give us an example in which houses on numbered streets are — on average — worth two percent less than those on named streets.

I test this with two 1,000-point random normal data sets. One data set has a mean of zero. The other has a mean of 0.02. I run this test 1,000 times. I calculate the Kolmogorov-Smirnov statistic for each of these 1,000 trials. I then calculate the average for the 1,000 trials. I also do this for 999 other cases in which the second data set has a different mean. This amounts to a million tests using a total of two billion data points.
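
For readers who want to tinker along, a rough R sketch of this simulation follows. The sample sizes and trial counts come from the description above; the unit standard deviation and the 0-to-20% grid of mean differences are my own assumptions.

    # Sketch of the simulation described above. Assumptions of mine: unit standard
    # deviation and mean differences swept from 0 to 20% of home value.
    set.seed(1)

    mean_diffs <- seq(0, 0.20, length.out = 1000)   # 1,000 distinct mean differences
    n_trials   <- 1000                              # trials per mean difference
    n_points   <- 1000                              # points per sample

    avg_p <- sapply(mean_diffs, function(d) {
      mean(replicate(n_trials, {
        x <- rnorm(n_points, mean = 0, sd = 1)
        y <- rnorm(n_points, mean = d, sd = 1)
        ks.test(x, y)$p.value                       # two-sample Kolmogorov-Smirnov test
      }))
    })

    # The full sweep runs a million tests and takes a while; shrink n_trials for a quick look.
    plot(mean_diffs, avg_p, type = "l",
         xlab = "Difference in means", ylab = "Average p-value")
    abline(h = 0.05, lty = 2)                       # conventional significance threshold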

Figure 1 shows the result. I believe that street name might make a difference if the "p-value" from my Kolmogorov-Smirnov test is less than about 0.05. What does Figure 1 show? At a difference of two percent my p-value is much closer to 0.5 than to 0.05. 

Street name could only really matter, under my assumptions, if average house prices differed by about 14% or more. There is no evidence that the statistical distribution for house prices on named streets is different from that for house prices on numbered streets. A two-percent difference is well within what I expect from random chance!
Figure 1 — p-value results from Kolmogorov-Smirnov test comparing random normal samples. The plot contains the averages of 1,000 trials each at 1,000 distinct means.

So, Rascoff and Humphries commit three primary Big-Data offenses. I show here their abuse of means. Their tool also doesn't account for problems with small samples. I type in my street name and receive the happy news that houses on streets with that name are worth on average 105% more than the national average!  But there are only two homes per 100,000 on streets with that name. A sample that small cannot support a statistically meaningful comparison.

Finally, the authors' admittedly fun exercise fails to consider other factors. No rational home buyer would use street name as a determining factor when shopping. Buyers weigh school districts, crime rates, tax rates, and a myriad of other factors. While the exercise is entertaining, it illustrates the pitfalls we must consider when using analytics for more serious work.

References

¹ S. Rascoff and S. Humphries, "The Secrets of Street Names and Home Values," Sunday Review, The New York Times, January 24, 2015, http://www.nytimes.com/2015/01/25/opinion/sunday/the-secrets-of-street-names-and-home-values.html?ref=opinion.


© The Quant's Prism, 2015

Saturday, January 17, 2015

Clustering analysis demonstration: Analytics results sometimes require us to examine our presuppositions.

Business managers often encounter situations that require them to group things together. Typical questions include:

  • How can I segment or sub-segment my customers so that I can appropriately tailor my offerings to them?
  • Which students in my school or district share similar learning challenges requiring specialized approaches to help them learn?
  • Which of my customer-service cases share things in common that might result from some common root cause that I haven't yet discovered?

Cluster analysis is a family of Big-Data analytics methods that seeks to address precisely these kinds of questions.

I provide a practical demonstration in this installment. I write as usual for business users of analytics. Arming them with enough knowledge to ask good questions of their analysts is my objective. 

I continue here my tinkering with a data set from the U.S. Department of Labor's O*Net service. O*Net provides free, online career counseling to the general public. O*Net users complete a brief survey about interests, experiences, and preferences. O*Net turns these inputs into recommendations about occupations.

O*Net uses a model based on Holland's Occupational Themes. This model uses 258 features to predict how well-matched an O*Net user might be to each of 923 distinct occupations. The 258 features measure interests, knowledge, abilities, and skills.


I showed in a previous Quant's Prism installment that these 258 features can be boiled down to about four "abstract" features containing unique information. Most of the Holland model's 258 features are redundant to each other. I show here that most of the 923 occupations fit within one of about 22 clusters.


The following discussion begins by elaborating on this result. I then provide an overview of cluster analysis, its challenges, and pitfalls. Finally, I offer suggestions as to the significance of the results.


Occupational clusters from Holland's model.

Figure 1 shows graphically the results of a cluster analysis of the 923 occupations in O*Net's Holland model. Figure 1(a) shows how the data are scattered. Figure 1(b) shows the shapes, sizes, and locations of the clusters into which they fit. The 923 occupations fit mostly into 22 occupational clusters. These clusters are distinguished by abilities, skills, interests, and knowledge.


Figure 1 — Best-practice clustering techniques group the 923 distinct occupations from the Holland model underlying O*Net 18.1 into 22 occupational clusters. The plots were constructed using the RGL package¹ in R.² 


This is the same data set I studied previously. I performed clustering using four abstract features derived from the 258 knowledge, skill, ability, and interest features. The four features result from a principal-component analysis (PCA). These four features — or dimensions — account for about 95% of all of the variation across the original 258 features.
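
For readers who want to reproduce this step, here is a rough R sketch. The matrix name holland is a placeholder of mine for the 923-by-258 table of O*Net feature values; the post does not publish its data-preparation code.

    # Hypothetical sketch: distill the 258 Holland-model features into four
    # principal components ('holland' is an assumed 923 x 258 numeric matrix).
    pca <- prcomp(holland, center = TRUE, scale. = TRUE)
    summary(pca)               # cumulative proportion of variance explained
    scores <- pca$x[, 1:4]     # keep the first four principal-component scores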

The cluster regions in Figure 1(b) were produced using a model-fitting cluster method. The method is implemented in the mclust package³ for the open-source statistical tool R.² The mclust tool attempts to fit the data to geometric models. Well-known linear regression works in much the same way.
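
In R, that fit can be sketched as follows. The range of cluster counts searched (1 to 30) is my assumption; the post only reports that 22 clusters won out.

    # Sketch of the model-based clustering step (arguments are assumptions of mine).
    library(mclust)
    fit <- Mclust(scores, G = 1:30)   # search over cluster counts and covariance shapes
    summary(fit)                      # best model: number of clusters and geometry
    head(fit$classification)          # cluster assignment for each occupation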

Figure 2 zooms in on two adjacent clusters from Figure 1. It shows the scatter points superimposed on the ellipsoidal regions. These data correspond to the sixth and seventh clusters. We see that the cluster regions partially overlap. Some of the data points also fall outside of the cluster regions. Handling outliers is a challenge for cluster analysis as much as it is for other statistical techniques.

Figure 2 — Scatter points superimposed on ellipsoidal regions from the model give some insight into how well the cluster data fits the model. The data correspond to the sixth and seventh clusters.

Table 1 summarizes the results for the sixth cluster. This cluster contains the statisticians occupation. It includes a total of 48 distinct occupations. These assignments are the statistically best assignments to the statistically best set of clusters the algorithm identified. The algorithm makes its assignments based on the 258 features characterizing interests, abilities, knowledge, and skills, as distilled into the four abstract dimensions described above.

Table 1 — The clustering model assigns occupations to clusters with varying degrees of confidence.

Ten of the 48 occupations in Table 1 are highlighted in red. These are occupations for which the cluster-assignment confidence is less than or equal to 80%. Not every occupation fits well into the 22 clusters depicted in Figure 1(b). This set of clusters and assignments is, statistically speaking, the best choice. It is nonetheless not perfect. Adding more clusters does not improve things.
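
The assignment confidences behind Table 1 fall straight out of the fitted model. A short sketch, continuing from the fit object above (the 80% cutoff is the one used in Table 1):

    # Per-occupation assignment confidence (sketch).
    conf <- apply(fit$z, 1, max)                        # probability of the assigned cluster
    cluster6 <- which(fit$classification == 6)          # occupations in the sixth cluster
    low_confidence <- cluster6[conf[cluster6] <= 0.80]  # rows that would be flagged, as in Table 1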

Cluster analysis — Approaches and pitfalls.

Cluster analysis represents one of the less scientific families of big-data methods. "Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures...."⁴ Researchers — many working within a field referred to as "machine learning" — have used common sense to derive algorithms that group things together. Many of these algorithms rely heavily on geometry to identify clusters.

Statistical methods were introduced to cluster analysis more recently. "Model-based" clustering involves fitting the data to a model — in this case a set of clusters. This works much like linear regression, where we fit data to curves. Fitting data to clusters is, of course, more complicated.

Analysts must first answer a number of particularly tricky questions, whether they use a machine-learning or a model-based approach. These include:

  • How many clusters are there in my data?
  • What are the shape, size, and orientation of each cluster? 
  • Are they the same?  Or different?
  • What is the best way to measure the "goodness of fit" to my set of clusters?

There are no known algorithms that directly answer these questions. Analysts start by making educated guesses. They then do lots of iteration and trial and error. Most packaged clustering algorithms perform a blind search for the best cluster model. They compute a score for each candidate and then report back which one provides the best score.
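
In mclust, that score is the Bayesian Information Criterion (BIC), and the search can be inspected directly. A brief sketch, again using my assumed scores matrix and search range:

    # Sketch: inspect mclust's blind search over candidate cluster models.
    bic <- mclustBIC(scores, G = 1:30)   # BIC for each cluster count and covariance model
    plot(bic)                            # one curve per covariance model
    summary(bic)                         # the top-ranked candidate models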

In the end, my clusters are simply a model for what reality might look like. A famous statistician wrote, "Essentially, all models are wrong, but some are useful."⁵ Much of the time there may be no natural, underlying grouping that corresponds to a model. Many points in a data set simply might not fit any cluster model very well.

Figure 2 and Table 1 highlight some of these outlier data points. The algorithm I used here tried 110 different cluster models. Each trial involved 50 different guesses about which cluster each occupation belonged to. The 110 models differ in terms of numbers of clusters, shapes, and varieties of shapes. The best result from these 110 trials simply doesn't do a perfect job of assigning all of the points to my clusters.

The "mclust"³ algorithm tells me the probability that each of the 923 occupations might fit into each of the 22 clusters. Figure 3 summarizes these results with histograms.  I show the probabilities that data points fit in the most likely, second-most likely, or other cluster. Figure 4 shows the same information using a scatter plot.
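
Figures 3 and 4 can be built directly from those membership probabilities (the z matrix of the fitted model). A sketch of my reconstruction:

    # Sketch: membership probabilities for the most-likely, second-most-likely,
    # and remaining clusters, per occupation.
    p_sorted <- t(apply(fit$z, 1, sort, decreasing = TRUE))  # sort each row's probabilities
    p_best   <- p_sorted[, 1]                 # most-likely cluster
    p_second <- p_sorted[, 2]                 # second-most-likely cluster
    p_other  <- 1 - p_best - p_second         # all remaining clusters combined

    hist(p_best,   main = "Most-likely cluster",        xlab = "Membership probability")
    hist(p_second, main = "Second-most-likely cluster", xlab = "Membership probability")
    hist(p_other,  main = "All other clusters",         xlab = "Membership probability")

    # Figure 4 in the post also encodes the "other" probability; omitted here for brevity.
    plot(p_best, p_second, xlab = "P(most likely)", ylab = "P(second-most likely)")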

Figure 3 — Histograms showing the probabilities that data points fit the most-likely, second-most-likely, and other clusters.

Figure 4 — Scatter plot showing the probability that each of my 923 data points fits in the most-likely, second-most-likely, and other of the 22 occupational clusters in my model.

Most of the data points fit in the best or second-best cluster with high probabilities. Figure 3(c) reveals nonetheless that lots of the O*Net data points don't fit well with any of my clusters. My model — the best one I can come up with using the best methods — is not a perfectly accurate characterization of the data.

What this all means.

Three obvious takeaways from this exercise occur to me. There are undoubtedly more. I offer these to non-quants seeking to become more-astute users of analytics.

Formidably large data sets don't always tell us as much as we expect them to.  I started with 923 data points spanning 258 dimensions. I previously showed that those 258 dimensions really only contain 4 dimensions of useful data. I show here that these 923 data points are not even as diverse as we might expect. Most of the points fit into one of 22 categories.

Managers should not be intimidated by the apparent size and complexity of data sets. Analysts sometimes get lost in data sets' complexities. Users of analysts' work products should keep them focused on the questions for which answers are sought.  Leading analytics consultant Davenport⁶ makes specific recommendations on how to keep analysts on track and honest.

HR managers may have more flexibility than they realize.  Many occupations — characterized in terms of interests, abilities, knowledge, and skills — are more similar than we suspect. Opportunities may exist to move employees — or prospective hires — between occupations within the same or adjacent clusters. Employees weighing their next career move enjoy the same flexibility.

This analysis in no way repudiates the Holland model on which the data are based. The Holland model contains 258 distinct features. Differences between these features may be significant within the occupational-psychology and labor-economics disciplines that defined them. That 923 occupations fit into just 22 clusters may in fact reflect limitations in the system of measurement.

Cluster analysis — in the end — is an essential part of the big-data analytics toolkit. It suffers, however, from the same limitations as all analytics methods. We seek through analytics to fit data to a model that only approximates reality. The fit is probabilistic. A good model will — at best — fit most of the data pretty well. Managers using analytics products must still decide how valid the model is, as well as how to deal with the outliers.

References

¹ D. Adler, et al, "RGL: A R-library for 3D visualization with OpenGL," http://rgl.neoscientists.org/arc/doc/RGL_INTERFACE03.pdf
² A. Vance, "Data Analysts Captivated by R’s Power," New York Times, January 6, 2009, http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2&
³ C. Fraley, et al, "mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation," The University of Washington, June 2012, https://www.stat.washington.edu/.../tr597.pdf.
⁴ C. Fraley and A. E. Raftery, "Model-Based Clustering, Discriminant Analysis, and Density Estimation," Journal of the American Statistical Association, June 2002, p. 611.
⁵ Box, G. E. P., and Draper, N. R., (1987), Empirical Model Building and Response Surfaces, John Wiley & Sons, New York, NY, p. 424, http://en.wikiquote.org/wiki/George_E._P._Box.
⁶  T. H. Davenport, "Keep Up with Your Quants," Harvard Business Review, July 2013, http://goo.gl/BrWpD1.


© The Quant's Prism, 2015