Sunday, May 24, 2015

Ivy-league payoffs: An illustration of causality and determinism

We assume that Ivy-League graduates have it made. Admission to an Ivy leaves one set for life. Not getting into an Ivy consigns a young person to a life of mediocrity. One's alma mater is one's fate, after all. 

But is this really true?  Do the data support this assertion? Research into questions like these lands us smack dab into the middle of two conundrums central to Big-Data Analytics. First, how do we tell what causes what? And second, which factors are the strongest determinants of an outcome?


Figure 1 — A Harvard diploma represents the ultimate capstone for a successful youth. (Source: World Press.) 

This time of year provides an opportune case study for considering these Big-Data analytics challenges.¹ April is college-admissions season, and high school seniors were informed of their destinies last month. Acceptance and rejection letters fulfilled dreams and shattered hopes for many an aspiring collegian.

Meanwhile, college seniors graduate this month. They are in the midst of realizing their destinies. Corporate recruiters, networking, and the formidable quest for employment now beckon. Those fortunate enough to stroll across an Ivy commencement stage are certainly predestined for greatness. Are they not?

Scientific research fortunately allows us to examine these widely held presuppositions. We consider them here from the perspective of our two conundrums. The first is causality. To what extent does the name of the university on one's diploma really determine future earnings? Is the relationship causal? Does one thing inevitably lead to the other?

Secondly, we care about determinism. Even if alma mater causally determines future earnings, are other factors more important? Do other factors have a bigger say?

Causality:  Is it the school or is it the applicant?

Researchers Stacy Berg Dale and Alan B. Krueger looked into exactly this question of alma mater versus earnings.² They took an innovative tack on the question, however. They mined a data set for both cognitive and non-cognitive indicators of students' competitiveness for elite colleges. The cognitive factors included the usual suspects: grade-point averages and SAT scores.

The non-cognitive characteristics included the "soft" skills: indications of motivation, ambition, and maturity. These attributes are evidenced in interviews, essays, and letters of recommendation. They are obviously more difficult to measure.

Scientific evidence showing that "soft" skills are stronger predictors of life outcomes than book knowledge is now a couple of decades old.³ Education research⁴ demonstrates their unsurprising importance to learning.

Figure 2 highlights Dale and Krueger's research. Figure 2(a) represents the "alma-mater-is-destiny" hypothesis we commonly assume. Figure 2(b) shows the model our researchers applied to their data. They used regression analysis. Figure 2(a) presumes causality. The researchers' method — regression analysis — only uses correlation.


Figure 2 — Dale and Krueger² used a predictive-modeling technique based solely on correlations to try to understand the causal relationship between alma mater and earnings.

Figure 2(c) shows an alternative view. Dale and Krueger hypothesize that this model might explain the data better than Figure 2(a). It's the applicant, not the university! The traits that make one competitive for admission to an elite university tend to translate into higher earnings after college. Alma-mater affiliation only marginally contributes to increased earnings.

The alma-mater-is-destiny presupposition reverses the actual causalities underlying students' prospects. It's not the school. It's the competitive traits of the applicant. We suspect that being immersed in an environment replete with intelligent, ambitious peers probably confers some advantages. But the tenacity and talent required to become competitive for that peer group — qualified, even if not selected — may largely offset the advantages of actually joining it.

Note that the researchers do not explicitly demonstrate the causality in Figure 2(c). They use correlation — the model in Figure 2(b) — to cast serious doubt on the presumption in Figure 2(a). They then offer an alternative hypothesis. The alternative explanation in Figure 2(c) remains a hypothesis! This is an unavoidable limitation of their regression-based approach.


Determinism: Just how much does the school really matter?

Determinism — just how much each factor contributes — is equally important. Dale and Krueger found that alma mater makes a weak relative contribution to post-college earnings. I use another, closely related study to illustrate determinism.

Gaertner and McClarty⁵ studied factors that predict college readiness. They wanted to diagnose the traits that lead to college success. Educators seek to anticipate students' successes and challenges starting as early as the eighth grade.

Figure 3 summarizes results from Gaertner and McClarty's research. Like Dale and Krueger, they used regression analysis. They looked at four indicators of college readiness. They also considered four factors believed to determine college readiness. "Middle-school indicators" in Figure 3 refers to a set of non-cognitive factors, like those in Figure 2.


Figure 3 — We explain college readiness in terms of multiple factors, each of which explains a fraction of the overall readiness. (After Gaertner and McClarty.⁵)

Quants evaluate their regression models using a quantity called the "coefficient of determination," or R². Figure 3 shows this quantity from Gaertner and McClarty's analysis. R² measures the proportion of the total variation explained by each factor in the model. We see in Figure 3 that SAT and high-school GPA (HSGPA) contributed most to explaining students' college readiness. The "middle-school indicators" made the second-largest contribution.

There is a problem with R², however: the R²'s for all the factors in a model can sum to a total of at most one. So the more factors you mix into your model, the smaller the relative contribution each can make.

This is particularly true if the factors are correlated with each other. Correlated factors "divide the credit" in regression analysis. It is hard to tell which really contributes the most. Regression analysis cannot by itself tell you that one factor causes another! We probably want to assign a higher weight to a root cause than to some intermediate effect. Regression cannot do this.
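A minimal simulation sketches both points: R²'s that refuse to add up, and correlated factors splitting the credit. The variable names and numbers below are illustrative assumptions, not Gaertner and McClarty's data.

```r
# Two correlated measurements (e.g., SAT and HSGPA) of one underlying trait.
set.seed(7)
trait <- rnorm(500)                    # unobserved student ability
sat   <- trait + rnorm(500, sd = 0.5)  # noisy measurement #1
hsgpa <- trait + rnorm(500, sd = 0.5)  # noisy measurement #2
readiness <- trait + rnorm(500)        # outcome driven by the trait

summary(lm(readiness ~ sat))$r.squared           # alone: about 0.4
summary(lm(readiness ~ hsgpa))$r.squared         # alone: about 0.4
summary(lm(readiness ~ sat + hsgpa))$r.squared   # together: ~0.45, far below 0.4 + 0.4
```

The two predictors carry nearly the same information, so adding the second one barely raises the model's total R² and shrinks each factor's apparent share.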


The limits of regression analysis.

Regression analysis is the most widely used technique in the Big-Data Analytics toolbox. We see here two indications of its limitations. First, it does not account for — or indicate — causality. It simply uses knowledge about how things are correlated with each other.

This can be a very big deal! Root causes are vitally important to many business decisions. Analytics results based on regression analysis alone have limited value for root-cause analysis. By themselves, they tell us very little about cause and effect.

Figure 4 illustrates this. Figure 4(a) shows a notional scenario comprised of a chain of cause-and-effect relationships. We are interested in the "Effect of Primary Interest" — the primary effect. This effect mostly occurs because of a single root cause. The root cause does not directly lead to the primary effect. It acts through a pair of intermediate effects.

Figure 4(b) illustrates how regression analysis treats this scenario. Our primary effect is directly correlated with the root cause and with the two intermediate effects. Regression analysis cannot distinguish between the three. There is no guarantee that the regression model will even assign the highest weight to the root cause!
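A tiny simulation makes the point. The chain below is a made-up stand-in for Figure 4(a), not data from either study.

```r
# Simulate the causal chain in Figure 4(a):
# root cause -> intermediate 1 -> intermediate 2 -> primary effect.
set.seed(11)
root   <- rnorm(1000)
inter1 <- root   + rnorm(1000, sd = 0.5)
inter2 <- inter1 + rnorm(1000, sd = 0.5)
effect <- inter2 + rnorm(1000, sd = 0.5)

cor(data.frame(root, inter1, inter2, effect))["effect", ]  # all three correlate
coef(lm(effect ~ root + inter1 + inter2))
# The regression loads almost entirely on inter2, the nearest intermediate
# effect; the root cause receives a coefficient near zero.
```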


Figure 4 — Analytics methods based on regression analysis cannot detect root causes in chains of causal events. 

Many applicants to elite universities suffer acutely from not understanding this kind of causality. They experience enormous stress during college-admissions season. Lots of them succumb to a winner-take-all view of university selections. 

This is a problem for at least two reasons. First, an ambitious, disciplined, talented young person tends to come out okay irrespective of where she goes to school. Second, seventeen-year-olds simply don't know enough about themselves — or about life — to make rest-of-their-life decisions anyway. The stakes are simply not that high!

Determinism represents the second issue with regression analysis. Regression models use weighted sums. We're simply adding things up, and the pieces can only add up to so much. A regression model may not even assign the highest weight to the root cause in a causal chain of events! This oversimplifies the world. It also provides an incomplete view.

The real world consists of complex, causal interrelationships. Life is chock-full of vicious and virtuous cycles! Correlation alone doesn't explain these very well. It only tells us that some things tend to go together — that they are "co-related."

Leaders using analytics to make decisions must understand how their analysts formulate recommendations.⁶ Big-data analytics contains no alchemy. It offers neither sorcery nor black magic. Decision makers must apply extreme caution to analytics results that cannot be explained apart from, "This is what the tool says."



References

¹ D. Thompson, "'It Doesn't Matter Where You Go to College': Inspirational, but Wrong," The Atlantic, April 2, 2015, http://www.theatlantic.com/business/archive/2015/04/the-3-percent-crisis/389396/.
² S. B. Dale and A. B. Krueger, "Estimating the payoff to attending a more selective college: An application of selection on observables and unobservables," Quarterly Journal of Economics, Oxford Journals, December 2002, http://qje.oxfordjournals.org/content/117/4/1491.short.
³ D. Goleman, Emotional Intelligence, Random House, 2002, http://goo.gl/PD5sJf.
⁴ C. A. Farrington, et al., "Teaching Adolescents To Become Learners. The Role of Noncognitive Factors in Shaping School Performance: A Critical Literature Review," University of Chicago Consortium on Chicago School Research (CCSR), June 2012, https://goo.gl/TJpXGl.
⁵ M. N. Gaertner and K. L. McClarty, "Performance, Perseverance, and the Full Picture of College Readiness", Educational Measurement: Issues and Practice, National Council on Measurements in Education, April 2015, http://onlinelibrary.wiley.com/doi/10.1111/emip.12066/abstract.
⁶ T. H. Davenport, "Keeping up with your quants," Harvard Business Review, July-Aug 2013, http://goo.gl/rtkq8G.


Saturday, March 21, 2015

Sports analytics illustrates cultural and practical factors for analytics adoption.

Sports analytics is really sexy. Michael Lewis' 2004 best-seller Moneyball¹ put it on the map. A subsequent movie starring Brad Pitt didn't hurt things much, either. The MIT Sloan School of Management even holds an annual conference on the topic.

We might expect — given all the hype — that analytics' reign in sports is unchallenged. The Moneyball movie, after all, places words to that effect in the mouth of Boston Red Sox owner John Henry. The Henry character suggests that any organization not reinventing itself after the Billy Beane model would become a dinosaur.

Analytics does not, in fact, rule sports. Sports network ESPN's website recently published a survey ranking major sports teams by their extent of analytics adoption.² Adoption varied widely both within and between sports.



Figure 1 — Sports website espn.com recently published a survey² on the adoption of analytics in sports. The extent of adoption varied not only between sports but also within them.

What explains this disparity? At least two factors come into play: culture, and susceptibility to mathematical analysis. Much has been written about the role of culture in analytics adoption. And some endeavors just lend themselves more readily to analytics. I briefly explore both factors here.


Cultures of data.

ESPN's survey shows that analytics plays a bigger role in baseball than in other major U.S. sports. ESPN characterized each team in each of the four major U.S. professional sports leagues — Baseball, Basketball, Football, Hockey — in terms of five tiers of adoption:
  • All-in;
  • Believers; 
  • One foot in;
  • Skeptics; and
  • Non-believers.
Sixteen of 30 professional-baseball teams fell into one of the top two categories. Basketball came in second, with twelve of 30 major franchises in the top two categories. 

Lavalle³ and Davenport⁴ consider extensively the role of culture in the adoption of analytics. Lavalle writes, "The adoption barriers that organizations face are managerial and cultural, rather than related to data and technology." Davenport's Analytics at Work devotes an entire chapter to "Building an analytics culture."


Baseball has a long, rich history of statistics. A college dormitory roommate comes to mind as an example. This young man often had three things going on at once when I entered the room. The television was on, with the volume turned all the way down. The radio was playing music. And he had a six-inch-thick encyclopedia of baseball statistics open on his lap. He showed no evidence of athletic participation himself. But he devoured baseball statistics.

A casual search of Amazon.com brings up volumes on baseball statistics. The legendary Bill James baseball abstract⁵ contains more than 1,000 pages of baseball history, rich with statistics. An annual update⁶ contains nearly 600 pages describing analytical methodologies. Baseball America publishes an annual edition to a baseball almanac⁷ presenting "a comprehensive statistical review from the majors all the way through to youth baseball." For the quant each of us knows and loves, there is even a book⁸ on using the R programming language to analyze baseball statistics.

Theories underlying basketball statistics have arrived more recently. Amazon yields fewer hits overall. A math professor and a software engineer collaborated on a recent book⁹ describing statistical methods for basketball. That this work was not picked up by a major commercial publisher might suggest something about the market for basketball statistics.


Susceptibility to mathematical analysis.

Some activities just lend themselves better to mathematical analysis. The business literature recognizes that some business activities should remain more flexible and less structured.¹⁰ Human endeavors can be characterized by their locations along a continuum spanning from "pure art" to "pure science."

Activities that are more scientific easily lend themselves to mathematical specification. Hypothesis testing lies at the heart of the scientific method. I must specify an activity with mathematical precision before I apply analytics to it. Robin Williams' character in the 1989 movie Dead Poets Society gives us an illustrative mockery of the problems with applying scientific methods to artistic endeavors.

Among major U.S. professional sports, baseball happens to lend itself to mathematical analysis. The game is highly structured. The flow of a baseball game is characterized by discrete states. These states are characterized by innings and outs, balls and strikes, runners on base, and runs scored.

Each play — commencing with the pitcher's release of the ball towards the batter — creates the opportunity to move from one state to the next. Individual players' abilities — hitting, running, throwing, catching — determine how well they can help move the game to a next state more favorable to their team than the previous one. Player statistics are largely based on how well a player's play moves the game to a more advantageous state.

Baseball enjoys an underlying structure that is inherently mathematical. A set of mathematical methods exists focused on characterizing transitions from one discrete state to another. This set of methods is referred to as Markov chains.¹¹ Baseball-statistics aficionados may not think about Markov chains. But they are there.
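To make the connection concrete, here is a toy sketch of an inning as an absorbing Markov chain in R. The transition probabilities are invented for illustration, and base runners are ignored entirely.

```r
# Toy absorbing Markov chain for one inning, ignoring base runners.
# Each row gives the probabilities of moving between out-states on one
# plate appearance; the numbers are illustrative assumptions only.
P <- matrix(c(
  0.30, 0.70, 0.00, 0.00,   # 0 outs: batter reaches base (0.30) or is out (0.70)
  0.00, 0.30, 0.70, 0.00,   # 1 out
  0.00, 0.00, 0.30, 0.70,   # 2 outs
  0.00, 0.00, 0.00, 1.00),  # inning over: the absorbing state
  nrow = 4, byrow = TRUE,
  dimnames = rep(list(c("0 outs", "1 out", "2 outs", "over")), 2))

# Expected plate appearances until the inning ends, from each state:
# the row sums of the fundamental matrix N = (I - Q)^-1.
Q <- P[1:3, 1:3]
N <- solve(diag(3) - Q)
rowSums(N)   # about 4.3 plate appearances from "0 outs" with these numbers
```

A real baseball Markov model tracks all 24 base-out states, but the machinery is the same: discrete states, and a transition probability for every play.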

Tennis — not one of the "Big Four" U.S. professional sports — enjoys a similarly mathematical structure. The flows of tennis games progress through a discrete set of states. A statistical researcher recently described a model for tennis matches.¹²

The remaining three of the "Big Four" U.S. professional sports lack the underlying mathematical structure of baseball and tennis. Football is arguably the most structured. Beyond ball possession and points scored, scoring drives progress through states characterized by down, yards to go, and the location of the line of scrimmage.

Two of these state variables, however, are continuous: the line of scrimmage and yards to go are represented by continuous numbers. Transitions from one state to the next are also continuous in nature. These factors lend themselves to statistical analysis less easily than the discrete states and transitions of baseball.

Basketball and hockey are even less structured. Game flows result from continuous, random interactions between players. ESPN's survey finds that analytics adoption in basketball is more advanced than in hockey. This may be because basketball teams are smaller. Contributions by individual players have a greater bearing on the flow of play.


The path to analytics adoption.

Becoming an analytics organization involves both cultural and practical aspects of change. Analytics-driven cultures enjoy the distinct quality of thinking quantitatively. Members of such organizations habitually measure aspects of their work. They benchmark measurements of the most important aspects of their activities. They also measure performance against those benchmarks.

Baseball, among major U.S. professional sports, enjoys a culture that is substantially analytical. Baseball statistics occupy a prominent place in the sport's fandom. Statistics are sufficiently important to the business of baseball for major commercial publishers to release large volumes.

Explaining the prominence of statistics in baseball in particular may be a "chicken-or-the-egg" question. That the structure of the game is fundamentally mathematical in nature certainly does not introduce any obstacles to a popular baseball-statistics sub-culture. 

Organizational leaders seeking to inculcate analytics more deeply into their organizations must manage cultural change, first and foremost. Cultural change is perhaps the most formidable of management challenges. Management guru Peter Drucker's dictum "Culture eats strategy for breakfast" states the challenge well.

Leaders must also recognize the limitations of analytics. In order to apply analytics to a problem, we first must describe it with mathematical precision. Not everything does — or necessarily should — lend itself to this degree of precision.¹⁰

Assertions by exuberant advocates suggest that anything can be measured.¹³ Many such improvisational measurements require the selection of proxies — substitutes for quantities that cannot be directly observed. Such proxies may lack the precision — or a sufficiently direct correlation to the desired quantity — to usefully reduce uncertainty.

Methods from the economics of information give us a rigorous approach to quantify the value of marginal uncertainty. Measurements and reporting based on proxies may not yield information-economic returns worthy of the required investment. Sports analytics gives us practical illustrations. 




References

¹ M. Lewis, Moneyball — The Art of Winning an Unfair Game, Norton, 2004, http://goo.gl/f7N2Yu.
² "The great analytics rankings," espn.com, February 25, 2015, http://espn.go.com/espn/feature/story/_/id/12331388/the-great-analytics-rankings.
³ S. Lavalle, et al., "Big data, analytics and the path from insights to value," MIT Sloan Management Review, Winter 2011, http://goo.gl/8RSn5H.
⁴ T. H. Davenport, Analytics at Work, Boston: Harvard Business Review Press, 2010, http://goo.gl/olZkKm.
⁵ B. James, The New Bill James Historical Baseball Abstract, New York: Free Press, June 13, 2003, http://goo.gl/Q7a0iA.
⁶ B. James, The Bill James Handbook 2015, Chicago: ACTA Publications, October 31, 2014, http://goo.gl/OEHinJ.
⁷ Baseball America 2015 Almanac: A Comprehensive Review of the 2014 Season, Baseball America, January 6, 2015, http://goo.gl/xCenv9.
⁸ M. Marchi and J. Albert, Analyzing Baseball Data with R, Boca Raton, FL: CRC Press, October 29, 2013, http://goo.gl/gx4J7r.
⁹ S. M. Shea and C. E. Baker, Basketball Analytics: Objective and Efficient Strategies for Understanding How Teams Win, CreateSpace Independent Publishing Platform, November 5, 2013, http://goo.gl/rE36pI.
¹⁰ J. M. Hall and M. E. Johnson, "When should a process be art, not science?" Harvard Business Review, March 2009, http://goo.gl/q6rGKB.
¹¹ J. R. Norris, Markov Chains, Cambridge: Cambridge University Press, July 28, 1998, http://goo.gl/J0nV8S.
¹² C. Gray, "Game, set and stats," Significance, Royal Statistical Society, February 3, 2015, http://goo.gl/8wdgH7.
¹³ D. W. Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business, New York: Wiley, 2014, http://goo.gl/cFuFTb.

Thursday, February 19, 2015

The theory underlying "The End of Theory" in Big Data. (updated)

Periods of social and economic change ignite our imaginations. New technologies and socioeconomic phenomena elicit both promise and dread.¹ We can, at the same time, see promises of utopia and specters of dystopian "Brave New Worlds."

Reality usually falls somewhere in between these extremes. Many anticipated that the Internet would "kill" distance and make the world flat.² Our planet nonetheless remains round³ — both geographically and socioeconomically. Modern transportation networks may reduce distance's relative importance. Other dimensions of culture, polity, and economics still matter as much as ever.⁴

The Internet was to lead to a "new economy." Website traffic rates supplanted net free cash flow as the principal basis for valuing many firms. At its peak, America Online's market capitalization exceeded that of Boeing. Corrections in the end returned familiar approaches to business valuation to a more conventional place.⁵,⁶


Genomics promised to revolutionize medicine. We now face, however, the reality of that science's limitations. A recent study "confirms that genetics are a poor, if not purposeless, prognostic of the chance of getting a disease."⁷ Combined effects of nature and nurture prove difficult to unwind.

Big-Data analytics is following a similar pattern. Industries — particularly those based on technology — tend to follow lifecycle trajectories.⁸ Gartner uses a "hype-cycle" method to track these trajectories. The consultancy's most recent analysis — in Figure 1 — shows gravity's effect on the Big-Data movement.¹⁰,¹¹ The mainstream press even recognizes this trend.¹²




Figure 1 — Gartner's 2014 "Hype cycle for emerging technologies" report shows Big Data sliding down from the "Peak of Inflated Expectations" towards the "Trough of Disillusionment."¹⁰

The inaugural installment of Quant's Prism showed Big Data at the "peak of inflated expectations" on Gartner's curve. This blog's narrative consistently emphasizes the importance of scientific discipline to business analytics. Value is derived through data science. Big Data's slide may be attributable in part to an occasional lack of scientific rigor in its application.


Naïveté underlies the irrational exuberance accompanying highly-hyped, "cool" new technologies. This naïveté often involves discounting fundamental principles. Mother Nature nonetheless gets to vote — and hers counts. Mathematics — particularly statistics — is the science to which Big Data is subject.

Financial Times columnist Tim Harford — the "Undercover Economist" — describes four fundamental precepts of statistics that Big Data enthusiasts sometimes set aside.¹³ This installment of Quant's Prism summarizes Harford's observations. I provide my own illustrations.



Setting science aside.

Harford summarizes four points of view that comprise an integrated thought system. They appear in different places, but each idea tends to be interconnected with the others.

Theory doesn't matter.

Exuberant claims about applying machine learning to large, complex data sets largely originate from the tech industry. Tech-industry evangelist Chris Anderson first suggested the "end of theory" hypothesis in Wired.¹⁴ Anderson grounds his rationale on statistician George Box's observation, "All models are wrong, but some are useful."

Google's experiment¹⁶ in the use of its search engine to track the 2008 flu season is offered as the demonstration that proves the "end-of-theory" assertion. This anecdote regrettably failed the repeatability criterion. Scientific conclusions are only validated if they can be independently verified. Google's search-engine experiment failed to deliver the same success during subsequent years.¹⁷ Its results were neither repeatable nor independently verifiable.

The absence of an underlying theory likely explains the absence of repeatability with Google's flu-tracking experiment. Google assumed relationships between users' medical conditions and their Internet searches. These relationships might change over time. Any such tracking model requires validation of the relationships and precise tracking of their change over time. In short, it takes a theory!

The sample can include the whole population (N → all).

"N →all" assertions largely occur related to Internet and social-media data. They use what Harford calls data "exhaust." Data exhaust includes things observed by tools that track website and smartphone use. These are the footprints connected users leave wherever they go in cyberspace. 

Some "N →all" enthusiasts conveniently assume that observations about online-exhaust data serve as a proxy for the population as a whole. These data in fact provide a biased view of the population. Adoption of exhaust-producing online technologies is far from uniform. 

Figure 2¹⁸ shows the unevenness of home broadband-access penetration within the U.S. Penetration by smartphones — another source of online-exhaust data — similarly varies by age²⁰ and ethnicity.¹⁹ Analyzing exhaust data only tells us about users of particular exhaust-producing technologies. Exhaust data tell us nothing about non-users. This is not equivalent to "N → all."

Figure 2 — Large samples of data derived from exhaust from home broadband internet will be biased according to the proportion of households with the service. (From Financial Times.¹⁸)


Analyzing the data alone can explain it all.

Leading statistician David Spiegelhalter observes, “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.”¹³ This blog has already briefly explored the issue of dimensionality in large data sets. Analytics practitioners frequently battle "the curse of dimensionality."²¹

Simply throwing a pile of data at a canned algorithm rarely produces valuable insight. Systematic methods are required. Understanding and preparing the data typically requires 75% to 80% of the effort in a modeling project. Preparation is always guided by analysis of the business problem, stated as hypotheses that test a theory about what the data mean.²² These hypotheses systematically link possible explanations of the business problem to a mathematical description.²³

Correlation is enough, and causality doesn't matter.

The "causality doesn't matter" assertion appears in the May 2013 article of Foreign Affairs.¹⁵ Authors Cukier and Mayer-Schoenberger similarly base their arguments in the culture of the Internet. Anderson's, Cukier's, and Mayer-Schoenberger's common link to British news weekly Economist is interesting. This correlation, incidentally, does not predict that all Economist staffers hold this view.

Ars Technica Senior Science Editor John Timmer puts correlations in their place: "Correlations are a way of catching a scientist's attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications."²⁴ Correlations provide opportunities for further exploration involving hypothesis testing to infer underlying causalities.


"False-positive" correlations frequently occur, particularly in high-dimensional data sets.²¹ An unexplained correlation at best provides a tenuous basis for action. This is especially true when substantial value is at risk. Basing big decisions on causal understandings is the safest path.

The health sciences offer a current example of the shortcomings of correlation-only decision-making. The federal government's nutritional policies have been based largely on epidemiological studies, which are "observational" in nature.²⁶ That is, they attempt to draw conclusions solely from correlations. The continually changing nutritional guidelines illustrate the challenge of Big-Data analysis that doesn't explain causality.

Approaching Big Data scientifically.

The bloom may be falling off the Big-Data rose. This is a completely normal — and healthy — phase of Big-Data evolution. Gartner's "Hype-Cycle" methodology anticipates this. A "trough of disillusionment" following a "peak of inflated expectations" corresponds to an "awkward adolescent" stage of life.

Moore's lifecycle⁸ (Geoffrey Moore, not to be confused with Gordon E. Moore of "Moore's Law" fame) uses the terms "chasm," "tornado," and "bowling alley" to characterize the volatility of the early stages of technology lifecycles. Would-be gold-rushers exuberantly rush in, seeking alchemic payoffs. Most are disappointed.

Business leaders looking for competitive differentiation from Big Data must remain patient. Big Data will reach a "plateau of productivity." Proven, repeatable practices will emerge. These practices will be based — like all mature business practices — on systematic, scientific approaches. Early adopters must keep their heads. Sketches of pragmatic, deliberate approaches — exemplified by Davenport²² and Lavalle²⁵ — have already been in the literature for years.



References

¹ D. Bollier, The Promise and Peril of Big Data, The Aspen Institute, January 1, 2010, http://www.aspeninstitute.org/publications/promise-peril-big-data.
² T. L. Friedman, The World Is Flat, New York: Macmillan, 2007, http://goo.gl/q2UdPd.
³ L. Prusak, "The world is round," Harvard Business Review, April 2006, https://hbr.org/2006/04/the-world-is-round. 
⁴ P. Ghemawat, World 3.0: Global Prosperity and How to Achieve It, Boston: Harvard Business Press, 2011, http://goo.gl/QLkVOK.
⁵ M. Porter, "Strategy and the Internet," Harvard Business Review, March 2001, https://hbr.org/2001/03/strategy-and-the-internet.
⁶ C. M. Reinhart and K. Rogoff, This Time Is Different: Eight Centuries of Financial Folly, Princeton, NJ: Princeton University Press, 2009, http://goo.gl/C5Zfnc.
⁷ D. Shenk, "The Limits of Genetic Testing," The Atlantic, April 3, 2013, http://www.theatlantic.com/health/archive/2012/04/the-limits-of-genetic-testing/255416/.
⁸ G. Moore, "Darwin and the demon: Innovating within established enterprises," Harvard Business Review, July 2004, https://hbr.org/2004/07/darwin-and-the-demon-innovating-within-established-enterprises.
¹⁰ J. Rivera, "Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business," The Gartner Group, http://www.gartner.com/newsroom/id/2819918.
¹¹ M. Wheatley, "Gartner’s Hype Cycle: Big Data’s on the slippery slope," SiliconANGLE, August 19, 2014, http://siliconangle.com/blog/2014/08/19/gartners-hype-cycle-big-datas-on-the-slippery-slope/.
¹² G. Langer, "Growing Doubts About Big Data," ABC News, April 8, 2014, http://abcnews.go.com/blogs/politics/2014/04/growing-doubts-about-big-data/.
¹³ T. Harford, "Big data: are we making a big mistake?", Significance, The Royal Statistical Society, December 2014, pp. 14-19, http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2014.00778.x/abstract; originally published in Financial Times, March 28, 2014, http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html.
¹⁴ C. Anderson, "The end of theory: The data deluge makes the scientific method obsolete," Wired, June 23, 2008, http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory.
¹⁵ K. N. Cukier and V. Mayer-Schoenberger, "The rise of big data," Foreign Affairs, May/June 2013, pp. 28-36, http://goo.gl/3tGMZc.
¹⁶ J. Ginsberg, et al., "Detecting influenza epidemics using search engine query data," Nature, November 19, 2008, http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html (republished by Google to http://goo.gl/Ly0bwR).
¹⁷ H. Hodson, "Google Flu Trends gets it wrong three years running," NewScientist — Health, March 13, 2014, http://www.newscientist.com/article/dn25217-google-flu-trends-gets-it-wrong-three-years-running.html#.VOQZq7DF-oU.
¹⁸ D. Crow, "Digital divide exacerbates US inequality," Financial Times, October 28, 2014, http://www.ft.com/intl/cms/s/2/b75d095a-5d76-11e4-9753-00144feabdc0.html#axzz3J94dMqob.
¹⁹ "US Smartphone Penetration by Ethnicity, Q12012," Neilson, June 1, 2012, accessed from Beehive Group, http://beehive.me/2012/06/us-smartphone-penetration-by-ethnicity/
²⁰ "U.S. smartphone penetration by age group, 2Q2014," MarketingCharts, http://goo.gl/VCvjDt.
²¹ T. Hastie, et al., The Elements of Statistical Learning, second edition, Springer, 2009, http://goo.gl/ipNrMU, http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf.
²² T. H. Davenport, "Keep up with your quants," Harvard Business Review, July 2013, http://goo.gl/BrWpD1.
²³ D. R. Cox and D. V. Hinkley, Theoretical Statistics, Boca Raton, FL: CRC Press, 1996, http://goo.gl/1zw8H0.
²⁴ J. Timmer, "Why the cloud cannot obscure the scientific method," Ars Technica, June 25, 2008, http://arstechnica.com/uncategorized/2008/06/why-the-cloud-cannot-obscure-the-scientific-method/.
²⁵ S. Lavalle, et al., "Big data, analytics and the path from insights to value," MIT Sloan Management Review, Winter 2011, http://goo.gl/8RSn5H.
²⁶ N. Teicholz, "The government’s bad diet advice," Washington Post, February 20, 2015, http://goo.gl/aeyNwC.







© The Quant's Prism, 2015

Sunday, January 25, 2015

Remain skeptical when looking at slick data-mining reports.

Nobel Prize-winning economist Ronald Coase warned, "If you torture the data long enough, it will confess." Our current infatuation with Big Data sometimes leads us down this road of unjustified exuberance. Large data sets are more accessible than ever. And tinkering with them can be really fun!

The Sunday Review section of today's New York Times provides a timely illustration. Researchers for the Times went looking for correlations between street name and home value.¹ They package their findings behind slick, interactive graphics. We can type in our street name and see how prices of homes located on streets with that name compare with national averages.

This topic entices us immediately. Our home — for many of us — stores most of our household wealth. We are also less than a dozen years removed from the roller-coaster mass hysteria of a nationwide housing bubble. Many of us might still be more nervous than ever as we watch our prized assets' unsteady recovery of pre-crash value.

The authors' first example gives us a red flag. They assert, "On average, homes on named streets are 2 percent more valuable than those on numbered streets." We should immediately ask ourselves, "How significant is this?" Should the street name influence my next home-buying decision?

The street-name-to-home-value analysis stumbles into three common Big-Data pitfalls. I mostly focus here on the first: they only tell us about averages. Averages — or means — are among the most-abused of all statistics. They can become outright obfuscatory when offered in the absence of other statistics. "What is the standard deviation?" should be our reflexive response whenever someone offers an average during a serious conversation.

The Times authors don't give us any standard deviations. So what can we do?  We can ask, "How significant is a 2% difference in home values in your study?"  This is easy to test. We can get a pretty good idea on our own, even without access to their data.

I did exactly this to produce Figure 1. I am trying to answer the question, "At what average price difference does the difference between street names become statistically significant?" To do this, I assumed that house prices fit a normal distribution and that their standard deviations are the same. These assumptions can in themselves be perilous. But they give me a good sanity check.

I used a statistical test called the "Kolmogorov-Smirnov" test for my little experiment here. This statistical method tells us whether two data sets are likely drawn from the same statistical distribution or from different ones. Rascoff and Humphries — authors of the Times article — give us an example in which houses on numbered streets are, on average, worth two percent less than those on named streets.

I test this with two 1,000-point random normal data sets. One data set has a mean of zero. The other has a mean of 0.02. I run this test 1,000 times. I calculate the Kolmogorov-Smirnov statistic for each of these 1,000 trials. I then calculate the average for the 1,000 trials. I also do this for 999 other cases in which the second data set has a different mean. This amounts to a million tests using a total of two billion data points.
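A condensed sketch of this experiment in R follows. It uses a coarser grid and fewer trials than the full million-test run, but it reproduces the shape of Figure 1.

```r
# Average Kolmogorov-Smirnov p-values comparing two 1,000-point normal
# samples whose means differ by "shift" (in units of standard deviation).
set.seed(2015)
shifts <- seq(0, 0.25, by = 0.01)   # coarser grid than the full experiment
avg.p <- sapply(shifts, function(shift) {
  mean(replicate(100, ks.test(rnorm(1000), rnorm(1000, mean = shift))$p.value))
})
plot(shifts, avg.p, type = "l")     # cf. Figure 1
shifts[min(which(avg.p < 0.05))]    # smallest "significant" mean difference
```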

Figure 1 shows the result. I believe that street name might make a difference if the "p-value" from my Kolmogorov-Smirnov test is less than about 0.05. What does Figure 1 show? At a difference of two percent my p-value is much closer to 0.5 than to 0.05. 

Street name should only really matter if average house prices differ by about 14%. Otherwise, the street name does not matter! There is no evidence that the statistical distribution for house prices on named streets is different from that for house prices on numbered streets. A two-percent difference is well within what I expect from random chance!

Figure 1 — p-value results from Kolmogorov-Smirnov test comparing random normal samples. The plot contains the averages of 1,000 trials each at 1,000 distinct means.

So, Rascoff and Humphries commit three primary Big-Data offenses. I show here their abuse of means. Their tool also doesn't account for problems with small samples. I type in my street name and receive the happy news that houses on streets with my street's name are worth on average 105% more than the national average! But there are only two homes per 100,000 with that street name. A sample that small cannot yield a statistically significant result.

Finally, the authors' admittedly fun exercise fails to consider other factors. No rational home buyer would use street name as a determining factor when shopping. They consider school districts, crime rates, tax rates, and a myriad of other factors. While their exercise is entertaining, it illustrates the pitfalls we must consider when using analytics for more serious work.






References

¹ S. Rascoff and S. Humphries, "The Secrets of Street Names and Home Values," Sunday Review, The New York Times, January 24, 2015, http://www.nytimes.com/2015/01/25/opinion/sunday/the-secrets-of-street-names-and-home-values.html?ref=opinion.


© The Quant's Prism, 2015

Saturday, January 17, 2015

Clustering analysis demonstration: Analytics results sometimes require us to examine our presuppositions.

Business managers often encounter situations that require them to group things together. Typical questions include:

  • How can I segment or sub-segment my customers so that I can appropriately tailor my offerings to them?
  • Which students in my school or district share similar learning challenges requiring specialized approaches to help them learn?
  • Which of my customer-service cases share things in common that might result from some common root cause that I haven't yet discovered?

Cluster analysis is a family of Big-Data Analytics methods that seeks to address precisely these kinds of questions.

I provide a practical demonstration in this installment. I write as usual for business users of analytics. Arming them with enough knowledge to ask good questions of their analysts is my objective. 

I continue here my tinkering with a data set from the U.S. Department of Labor's O*Net service. O*Net provides free, online career counseling to the general public. O*Net users complete a brief survey about interests, experiences, and preferences. O*Net turns these inputs into recommendations about occupations.

O*Net uses a model based on Holland's Occupational Themes. This model uses 258 features to predict how well-matched an O*Net user might be to one of 923 distinct occupations. The 258 features measure interests, knowledge, abilities, and skills.


I showed in a previous Quant's Prism installment that these 258 features can be boiled down to about four "abstract" features containing unique information. Most of the Holland model's 258 features are redundant with one another. I show here that most of the 923 occupations fit within one of about 22 clusters.


The following discussion begins by elaborating on this result. I then provide an overview of cluster analysis, its challenges, and pitfalls. Finally, I offer suggestions as to the significance of the results.


Occupational clusters from Holland's model.

Figure 1 shows graphically the results of a cluster analysis of the 923 occupations in O*Net's Holland model. Figure 1(a) shows how the data are scattered. Figure 1(b) shows the shapes, sizes, and locations of the clusters into which they fit. The 923 occupations fit mostly into 22 occupational clusters. These clusters are distinguished by abilities, skills, interests, and knowledge.


Figure 1 — Best-practice clustering techniques group the 923 distinct occupations from the Holland model underlying O*Net 18.1 into 22 occupational clusters. The plots were constructed using the RGL package¹ in R.² 


This is the same data set I studied previously. I performed clustering using four features derived from the 258 knowledge, skill, ability, and interest features. The four features result from a principal-component analysis (PCA). These four features — or dimensions — account for about 95% of all of the variation across the 258 original features.

The cluster regions in Figure 1(b) were produced using a model-fitting cluster method. The method is implemented in the mclust package³ for the open-source statistical tool R.² The mclust tool attempts to fit the data to geometric models. Well-known linear regression works in much the same way.
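A minimal sketch of this workflow appears below. The object name occ.features is my assumption for a 923-by-258 table of Holland-model scores; it is not part of the published O*Net data.

```r
# Reduce the 258 features to four principal components, then fit
# Gaussian-mixture cluster models with mclust.
library(mclust)

pca <- prcomp(occ.features, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:4]            # first four components: ~95% of the variation

fit <- Mclust(scores, G = 1:30)   # search over candidate numbers of clusters
fit$G                             # best-fitting number of clusters
head(fit$classification)          # cluster assignment for each occupation
```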

Figure 2 zooms in on two adjacent clusters from Figure 1. It shows the scatter points superimposed on the ellipsoidal regions. These data correspond to the sixth and seventh clusters. We see that the cluster regions partially overlap. Some of the data points also fall outside of the cluster regions. Handling outliers is as much a challenge for cluster analysis as it is for other statistical techniques.

Figure 2 — Scatter points superimposed on ellipsoidal regions from the model give some insight into how well the cluster data fits the model. The data correspond to the sixth and seventh clusters.

Table 1 summarizes the results for the sixth cluster. This cluster contains the "statisticians" occupation. It includes a total of 48 distinct occupations. These are the statistically best assignments to the statistically best set of clusters the algorithm identified. The algorithm makes assignments based on the information in the 258 features characterizing interests, abilities, knowledge, and skills.

Table 1 — The clustering model assigns occupations to clusters with varying degrees of confidence.

Ten of the 48 occupations in Table 1 are highlighted in red. These are occupations for which the cluster-assignment confidence is less than or equal to 80%. Not every occupation fits well into the 22 clusters depicted in Figure 1(b). This set of clusters and assignments is, statistically speaking, the best choice. They are nonetheless not perfect. Adding more clusters does not improve things.

Cluster analysis — Approaches and pitfalls.

Cluster analysis represents one of the less scientific families of big-data methods. "Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures...."⁴ Researchers — many working within a field referred to as "machine learning" — used common sense to derive algorithms that group things together. Many of them extensively use geometry to identify clusters.

Statistical methods were introduced to cluster analysis more recently. "Model-based" clustering involves fitting the data to a model — in this case a set of clusters. This works much like linear regression, where we fit data to curves. Fitting data to clusters is, of course, more complicated.

Analysts must first answer a number of particularly tricky questions, whether they use a machine-learning or a model-based approach. These include:

  • How many clusters are there in my data?
  • What are the shape, size, and orientation of each cluster? 
  • Are they the same?  Or different?
  • What is the best way to measure the "goodness of fit" to my set of clusters?
There are no known algorithms that directly answer these questions. Analysts start by making educated guesses. They then do lots of iteration and trial and error. Most packaged clustering algorithms perform a blind search for the best set of clusters. They compute a score for each candidate model and then report back which one provides the best score.
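With mclust, for example, that score is the Bayesian information criterion (BIC). Continuing the sketch from earlier in this post:

```r
# mclust scores every candidate model (each combination of cluster count
# and cluster geometry) with BIC and keeps the best-scoring one.
plot(fit, what = "BIC")   # score for each model family vs. number of clusters
summary(fit$BIC)          # the top-scoring candidate models
```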

In the end, my clusters are simply a model for what reality might look like. A famous statistician wrote that, "Essentially, all models are wrong, but some are useful."⁵ Much of the time there may be no natural, underlying grouping that corresponds to a model. Many points in a data set simply might not fit any cluster model very well.

Figure 2 and Table 1 highlight some of these outlier data points. The algorithm I used here tried 110 different cluster models. Each trial involved 50 different guesses about which cluster each occupation belonged to. The 110 models differ in terms of numbers of clusters, shapes, and varieties of shapes. The best result from these 110 trials simply doesn't do a perfect job of assigning all of the points to my clusters.

The "mclust"³ algorithm tells me the probability that each of the 923 occupations might fit into each of the 22 clusters. Figure 3 summarizes these results with histograms.  I show the probabilities that data points fit in the most likely, second-most likely, or other cluster. Figure 4 shows the same information using a scatter plot.

Figure 3 — Histograms showing the probabilities that data points fit the most-likely, second-most-likely, and other clusters.

Figure 4 — Scatter plot showing the probability that each of my 923 data points fits in the most-likely, versus second-most-likely, versus other of the 22 occupational clusters in my model.

Most of the data points fit in the best or second-best cluster with high probabilities. Figure 3(c) reveals nonetheless that lots of the O*Net data points don't fit well with any of my clusters. My model — the best one I can come up with using the best methods — is not a perfectly accurate characterization of the data.

What this all means.

Three obvious takeaways from this exercise occur to me. There are undoubtedly more. I offer these to non-quants seeking to become more-astute users of analytics.

Formidably large data sets don't always tell us as much as we expect them to. I started with 923 data points spanning 258 dimensions. I previously showed that those 258 dimensions really only contain about four dimensions of useful data. I show here that these 923 data points are not even as diverse as we might expect. Most of the points fit into one of 22 categories.

Managers should not be intimidated by the apparent size and complexity of data sets. Analysts sometimes get lost in data sets' complexities. Users of analysts' work products should keep them focused on the questions for which answers are sought.  Leading analytics consultant Davenport⁶ makes specific recommendations on how to keep analysts on track and honest.

HR managers may have more flexibility than they realize. Many occupations — characterized in terms of interests, abilities, knowledge, and skills — are more similar than we suspect. Opportunities may exist to move employees — or prospective hires — between occupations within the same or adjacent clusters. Employees seeking new roles enjoy the same flexibility.

This analysis in no way repudiates the Holland model on which the data are based. The Holland model contains 258 distinct features. Differences between these features may be significant within the occupational-psychology and labor-economics disciplines that defined them. That 923 occupations fit into 22 clusters may in fact reflect limitations in the system of measurement.

Cluster analysis — in the end — is an essential part of the big-data analytics toolkit. It suffers, however, from the same limitations as all analytics methods. We seek through analytics to fit data to a model that only approximates reality. The fit is probabilistic. A good model will — at best — fit most of the data pretty well. Managers using analytics products must still decide how valid the model is, as well as how to deal with the outliers.

References

¹ D. Adler, et al, "RGL: A R-library for 3D visualization with OpenGL," http://rgl.neoscientists.org/arc/doc/RGL_INTERFACE03.pdf
² A. Vance, "Data Analysts Captivated by R’s Power," New York Times, January 6, 2009, http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2&
³ C. Fraley, et al, "mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation," The University of Washington, June 2012, https://www.stat.washington.edu/.../tr597.pdf.
⁴ C. Fraley and A. E. Raftery, "Model-Based Clustering, Discriminant Analysis, and Density Estimation," Journal of the American Statistical Association, June 2002, p. 611.
⁵ Box, G. E. P., and Draper, N. R., (1987), Empirical Model Building and Response Surfaces, John Wiley & Sons, New York, NY, p. 424, http://en.wikiquote.org/wiki/George_E._P._Box.
⁶ T. H. Davenport, "Keep Up with Your Quants," Harvard Business Review, July 2013, http://goo.gl/BrWpD1.


© The Quant's Prism, 2015