Saturday, January 17, 2015

Clustering analysis demonstration: Analytics results sometimes require us to examine our presuppositions.

Business managers often encounter situations that require them to group things together. Typical questions include:

  • How can I segment or sub-segment my customers so that I can appropriately tailor my offerings to them?
  • Which students in my school or district share similar learning challenges requiring specialized approaches to help them learn?
  • Which of my customer-service cases share things in common that might result from some common root cause that I haven't yet discovered?

Cluster analysis is a family of big-data analytics methods that seeks precisely to address these kinds of questions.

I provide a practical demonstration in this installment. I write as usual for business users of analytics. Arming them with enough knowledge to ask good questions of their analysts is my objective. 

I continue here my tinkering with a data set from the Bureau of Labor Statistics (BLS) O*Net service. O*Net provides free, online career counseling to the general public. O*Net users complete a brief survey about interests, experiences, and preferences. It turns these inputs into recommendations about occupations.

O*Net uses a model based on Holland's Occupational Themes. This model uses 258 features to predict how well-matched an O*Net user might be to one of 923 distinct occupations. The 258 features measure interests, knowledge, abilities, and skills.


I showed in a previous Quant's Prism installment that these 258 features can be boiled down to about four "abstract" features containing unique information. Most of the Holland model's 258 features are redundant with one another. I show here that most of the 923 occupations fit within one of about 22 clusters.


The following discussion begins by elaborating on this result. I then provide an overview of cluster analysis, its challenges, and pitfalls. Finally, I offer suggestions as to the significance of the results.


Occupational clusters from Holland's model.

Figure 1 shows graphically the results of a cluster analysis of the 923 occupations in O*Net's Holland model. Figure 1(a) shows how the data are scattered. Figure 1(b) shows the shapes, sizes, and locations of the clusters into which they fit. The 923 occupations fit mostly into 22 occupational clusters. These clusters are distinguished by abilities, skills, interests, and knowledge.


Figure 1 — Best-practice clustering techniques group the 923 distinct occupations from the Holland model underlying O*Net 18.1 into 22 occupational clusters. The plots were constructed using the RGL package¹ in R.² 


This is the same data set I studied previously. I performed clustering using four of the 258 knowledge, skills, abilities, and interests features. The four features result from a principal-component analysis (PCA). These four features — or dimensions — account for about 95% of all of the variation among the 258 features.
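The PCA step above can be sketched in a few lines. The analysis itself was done in R; the Python below is purely illustrative, with a random matrix standing in for the real 923 × 258 O*Net feature matrix (so the component count it finds will be much larger than the roughly four that the real, highly redundant data yields):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 923 x 258 O*Net feature matrix (hypothetical data).
X = rng.normal(size=(923, 258))

# Center the data; the singular values give each principal
# component's share of the total variance.
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / np.sum(s ** 2)

# How many components are needed to reach 95% cumulative variance?
# (About 4 for the real O*Net data; far more for random noise.)
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
print(k)
```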

The cluster regions in Figure 1(b) were produced using a model-fitting cluster method, implemented in the mclust package³ for the open-source statistical tool R.² The mclust tool attempts to fit the data to geometric models. Well-known linear regression works in much the same way.
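Model-based clustering of this kind fits the data to a mixture of ellipsoidal (Gaussian) components. A minimal sketch, using scikit-learn's GaussianMixture as a stand-in for R's mclust and two made-up two-dimensional clusters in place of the occupation data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two made-up ellipsoidal clusters standing in for occupation data.
X = np.vstack([
    rng.normal([0, 0], [1.0, 0.3], size=(100, 2)),
    rng.normal([4, 4], [0.5, 1.2], size=(100, 2)),
])

# Fit a two-component Gaussian mixture; each component is an
# ellipsoid with its own shape, size, and orientation.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)
labels = gmm.predict(X)
print(np.bincount(labels))
```

The fitted components correspond to the ellipsoidal regions drawn in Figure 1(b).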

Figure 2 zooms in on two adjacent clusters from Figure 1. It shows the scatter points superimposed on the ellipsoidal regions. These data correspond to the sixth and seventh clusters. We see that the cluster regions partially overlap. Some of the data points also fall outside of the cluster regions. Handling outliers is a challenge for cluster analysis as much as it is for other statistical techniques.

Figure 2 — Scatter points superimposed on ellipsoidal regions from the model give some insight into how well the cluster data fits the model. The data correspond to the sixth and seventh clusters.

Table 1 summarizes the results for the sixth cluster. This cluster contains the statisticians occupation and includes a total of 48 distinct occupations. These are the statistically best assignments to the statistically best set of clusters identified by the algorithm. The algorithm makes assignments based on the 258 features characterizing interests, abilities, knowledge, and skills.

Table 1 — The clustering model assigns occupations to clusters with varying degrees of confidence.

Ten of the 48 occupations in Table 1 are highlighted in red. These are occupations for which the cluster-assignment confidence is less than or equal to 80%. Not every occupation fits well into the 22 clusters depicted in Figure 1(b). This set of clusters and assignments is, statistically speaking, the best choice. They are nonetheless not perfect. Adding more clusters does not improve things.
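Flagging low-confidence assignments, as in the red-highlighted rows of Table 1, amounts to thresholding the model's posterior probabilities. A sketch under the same stand-in assumptions as before (Python/scikit-learn rather than mclust, toy data, and the 80% cutoff used above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two overlapping made-up clusters, so some points sit near the boundary.
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(2, 1, size=(150, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Posterior probability of each point's most likely cluster.
confidence = gmm.predict_proba(X).max(axis=1)

# Points whose best assignment is <= 80% confident would be
# highlighted in red, as in Table 1.
flagged = np.flatnonzero(confidence <= 0.80)
print(len(flagged))
```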

Cluster analysis — Approaches and pitfalls.

Cluster analysis represents one of the less scientific families of big-data methods. "Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures...."⁴ Researchers — many working within a field referred to as "machine learning" — used common sense to derive algorithms that group things together. Many of them extensively use geometry to identify clusters.

Statistical methods were introduced to cluster analysis more recently. "Model-based" clustering involves fitting the data to a model — in this case a set of clusters. This works much like linear regression, where we fit data to curves. Fitting data to clusters is, of course, more complicated.

Analysts must first answer a number of particularly tricky questions, whether they use a machine-learning or a model-based approach. These include:

  • How many clusters are there in my data?
  • What are the shape, size, and orientation of each cluster? 
  • Are they the same?  Or different?
  • What is the best way to measure the "goodness of fit" to my set of clusters?

There are no known algorithms that directly answer these questions. Analysts start by making educated guesses. They then do lots of iteration and trial and error. Most packaged clustering algorithms perform a blind search for the best set of clusters. They compute a score for each trial and then report back which candidate clustering provides the best score.
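That blind search can be sketched as a loop over candidate models, scoring each with a criterion such as BIC (the score mclust uses). Again an illustrative Python analogue on toy data, not the original R code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Three well-separated made-up clusters.
X = np.vstack([rng.normal(c, 0.4, size=(80, 2)) for c in (0, 3, 6)])

# Try several cluster counts and covariance shapes; keep the
# candidate model with the best (lowest) BIC score.
best_bic, best_model = np.inf, None
for k in range(1, 7):
    for cov in ("spherical", "diag", "full"):
        gmm = GaussianMixture(n_components=k, covariance_type=cov,
                              random_state=0).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_bic, best_model = bic, gmm

print(best_model.n_components)  # expected: 3 for this toy data
```

The mclust run described below works the same way, only over many more candidate models.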

In the end, my clusters are simply a model for what reality might look like. A famous statistician wrote that, "Essentially, all models are wrong, but some are useful."⁵ Much of the time there may be no natural, underlying grouping that corresponds to a model. Many points in a data set simply might not fit any cluster model very well.

Figure 2 and Table 1 highlight some of these outlier data points. The algorithm I used here tried 110 different cluster models. Each trial involved 50 different guesses about which cluster each occupation belonged to. The 110 models differ in terms of numbers of clusters, shapes, and varieties of shapes. The best result from these 110 trials simply doesn't do a perfect job of assigning all of the points to my clusters.

The "mclust"³ algorithm tells me the probability that each of the 923 occupations might fit into each of the 22 clusters. Figure 3 summarizes these results with histograms.  I show the probabilities that data points fit in the most likely, second-most likely, or other cluster. Figure 4 shows the same information using a scatter plot.

Figure 3 — Histograms showing the probabilities that data points fit the most-likely, second-most-likely, and other clusters.

Figure 4 — Scatter plot showing probability that each of my 923 data points fit in the most likely, versus second-most likely, versus other of the 22 occupational clusters in my model.
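The per-cluster probabilities behind Figures 3 and 4 come from the mixture model's posterior matrix: one row per data point, one column per cluster. Sorting each row gives the most-likely and second-most-likely probabilities. A sketch in the same illustrative Python, on three overlapping toy clusters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Three overlapping toy clusters standing in for the O*Net data.
X = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in (0, 2, 4)])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
post = gmm.predict_proba(X)   # one row per point, one column per cluster

# Sort each row descending: column 0 = most-likely cluster's
# probability, column 1 = second-most-likely, and so on.
ranked = np.sort(post, axis=1)[:, ::-1]
best, second = ranked[:, 0], ranked[:, 1]
print(best.mean(), second.mean())
```

Histograms of `best` and `second` would correspond to the panels of Figure 3.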

Most of the data points fit in the best or second-best cluster with high probabilities. Figure 3(c) reveals nonetheless that lots of the O*Net data points don't fit well with any of my clusters. My model — the best one I can come up with using the best methods — is not a perfectly accurate characterization of the data.

What this all means.

Three obvious takeaways from this exercise occur to me. There are undoubtedly more. I offer these to non-quants seeking to become more-astute users of analytics.

Formidably large data sets don't always tell us as much as we expect them to. I started with 923 data points spanning 258 dimensions. I previously showed that those 258 dimensions really only contain 4 dimensions of useful data. I show here that these 923 data points are not even as diverse as we might expect. Most of the points fit into one of 22 categories.

Managers should not be intimidated by the apparent size and complexity of data sets. Analysts sometimes get lost in data sets' complexities. Users of analysts' work products should keep them focused on the questions for which answers are sought.  Leading analytics consultant Davenport⁶ makes specific recommendations on how to keep analysts on track and honest.

HR managers may have more flexibility than they realize. Many occupations — characterized in terms of interests, abilities, knowledge, and skills — are more similar than we suspect. Opportunities may exist to move employees — or prospective hires — between occupations within the same or adjacent clusters. This flexibility can benefit employees as well as employers.

This analysis in no way repudiates the Holland model on which the data are based. The Holland model contains 258 distinct features. Differences between these features may be significant within the occupational-psychology and labor-economics disciplines that defined them. That 923 occupations fit into 22 clusters may in fact reflect limitations in the system of measurement.

Cluster analysis — in the end — is an essential part of the big-data analytics toolkit. It however suffers from the same limitations as all analytics methods. We seek through analytics to fit data to a model that only approximates reality. The fit is probabilistic. A good model will — at best — fit most of the data pretty well. Managers using analytics products must still decide how valid the model is, as well as how to deal with the outliers.

References

¹ D. Adler, et al, "RGL: A R-library for 3D visualization with OpenGL," http://rgl.neoscientists.org/arc/doc/RGL_INTERFACE03.pdf
² A. Vance, "Data Analysts Captivated by R’s Power," New York Times, January 6, 2009, http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2&
³ C. Fraley, et al, "mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation," The University of Washington, June 2012, https://www.stat.washington.edu/.../tr597.pdf.
⁴ C. Fraley and A. E. Raftery, "Model-Based Clustering, Discriminant Analysis, and Density Estimation," Journal of the American Statistical Association, June 2002, p. 611.
⁵ Box, G. E. P., and Draper, N. R., (1987), Empirical Model Building and Response Surfaces, John Wiley & Sons, New York, NY, p. 424, http://en.wikiquote.org/wiki/George_E._P._Box.
⁶  T. H. Davenport, "Keep Up with Your Quants," Harvard Business Review, July 2013, http://goo.gl/BrWpD1.


© The Quant's Prism, 2015
