Thursday, February 19, 2015

The theory underlying "The End of Theory" in Big Data. (updated)

Periods of social and economic change ignite our imaginations. New technologies and socioeconomic phenomena elicit both promise and dread.¹ We can, at the same time, see promises of utopia and specters of dystopian "Brave New Worlds."

Reality usually falls somewhere in between these extremes. Many anticipated that the Internet would "kill" distance and make the world flat.² Our planet nonetheless remains round³ — both geographically and socioeconomically. Modern transportation networks may reduce distance's relative importance. Other dimensions of culture, polity, and economics still matter as much as ever.⁴

The Internet was to lead to a "new economy." Website traffic rates supplanted net free cash flow as the principal basis for valuing many firms. At its peak, America Online's market capitalization exceeded that of Boeing. Corrections eventually returned business valuation to more conventional approaches.⁵,⁶


Genomics promised to revolutionize medicine. We now face, however, the reality of that science's limitations. A recent study "confirms that genetics are a poor, if not purposeless, prognostic of the chance of getting a disease."⁷ The combined effects of nature and nurture prove difficult to unwind.

Big-Data analytics is following a similar pattern. Industries, particularly those based on technology, tend to follow lifecycle trajectories.⁸ Gartner uses a "hype-cycle" method to track these trajectories. The consultancy's most recent analysis, shown in Figure 1, illustrates gravity's effect on the Big-Data movement.¹¹,¹² Even the mainstream press recognizes this trend.¹²




Figure 1: Gartner's 2014 "Hype Cycle for Emerging Technologies" report shows Big Data sliding down from the "Peak of Inflated Expectations" toward the "Trough of Disillusionment."¹⁰

The inaugural installment of Quant's Prism showed Big Data at the "peak of inflated expectations" on Gartner's curve. This blog's narrative consistently emphasizes the importance of scientific discipline to business analytics. Value is derived through data science. Big Data's slide may be attributable in part to an occasional lack of scientific rigor in its application.


Naïveté underlies the irrational exuberance accompanying highly hyped, "cool" new technologies. This naïveté often involves discounting fundamental principles. Mother Nature nonetheless gets to vote, and hers counts. Mathematics, particularly statistics, is the science to which Big Data is subject.

Financial Times columnist Tim Harford, the "Undercover Economist," describes four fundamental precepts of statistics that Big Data enthusiasts sometimes set aside.¹³ This installment of Quant's Prism summarizes Harford's observations. I provide my own illustrations.



Setting science aside.

Harford summarizes four points of view that together make up an integrated thought system. They appear in different places. Each idea, however, tends to be interconnected with the others.

Theory doesn't matter.

Exuberant claims about applying machine learning to large, complex data sets largely originate from the tech industry. Tech-industry evangelist Chris Anderson first suggested "The end of theory" hypothesis in Wired.¹⁴ Anderson grounds his rationale on statistician George Box's observation, "All models are wrong, but some are useful."

Google's experiment¹⁶ in using its search engine to track the 2008 flu season is offered as the demonstration that proves the "end-of-theory" assertion. This anecdote regrettably failed the repeatability criterion. Scientific conclusions are validated only if they can be independently verified. Google's search-engine experiment failed to deliver the same success during subsequent years.¹⁷ Its results were neither repeatable nor independently verifiable.

The absence of an underlying theory likely explains the absence of repeatability with Google's flu-tracking experiment. Google assumed relationships between users' medical conditions and their Internet searches. These relationships might change over time. Any such tracking model requires validation of the relationships and precise tracking of their change over time. In short, it takes a theory!
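The drift problem described above can be sketched with a small simulation. This is a minimal illustration on synthetic data, not a reconstruction of Google's method. The helper names (fit_line, simulate_season, mean_abs_error), the cases-per-search ratios of 0.5 and 0.3, and the noise level are all assumptions chosen for illustration. A line fitted to one season is scored on a second season where the search-to-illness relationship still holds, and on a third where it has shifted.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope


def simulate_season(rng, cases_per_search, weeks=30):
    """Synthetic weekly search volumes and flu cases (hypothetical numbers)."""
    searches = [rng.uniform(100, 1000) for _ in range(weeks)]
    cases = [cases_per_search * s + rng.gauss(0, 10) for s in searches]
    return searches, cases


def mean_abs_error(a, b, x, y):
    """Average absolute gap between the fitted line and observed cases."""
    return sum(abs(a + b * xi - yi) for xi, yi in zip(x, y)) / len(x)


rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Fit on a season in which 0.5 cases accompany each search (assumed ratio).
train_x, train_y = simulate_season(rng, cases_per_search=0.5)
a, b = fit_line(train_x, train_y)

# Score on a season with the same relationship, then on one where the
# ratio has drifted to 0.3 (for example, more searches by healthy people).
stable_x, stable_y = simulate_season(rng, cases_per_search=0.5)
drift_x, drift_y = simulate_season(rng, cases_per_search=0.3)

error_stable = mean_abs_error(a, b, stable_x, stable_y)
error_drifted = mean_abs_error(a, b, drift_x, drift_y)

print(f"fitted slope:             {b:.3f}")
print(f"error, same relationship: {error_stable:.1f}")
print(f"error, after drift:       {error_drifted:.1f}")
```

Under these assumed parameters the error on the drifted season is many times larger than on the stable one, although nothing about the fitting procedure changed. A theory of why searches relate to illness is what tells the modeler that the relationship needs to be revisited.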

The sample can include the whole population (N → all).

"N → all" assertions mostly arise in connection with Internet and social-media data. They use what Harford calls data "exhaust." Data exhaust includes things observed by tools that track website and smartphone use. These are the footprints connected users leave wherever they go in cyberspace.

Some "N → all" enthusiasts conveniently assume that observations drawn from online-exhaust data serve as a proxy for the population as a whole. These data in fact provide a biased view of the population. Adoption of exhaust-producing online technologies is far from uniform.

Figure 2¹⁸ shows the unevenness of home broadband-access penetration within the U.S. Penetration by smartphones, another source of online-exhaust data, similarly varies by age²⁰ and ethnicity.¹⁹ Analyzing exhaust data only tells us about users of particular exhaust-producing technologies. Exhaust data tell us nothing about non-users. This is not equivalent to "N → all."

Figure 2: Large samples of data derived from home-broadband exhaust will be biased according to the proportion of households with the service. (From Financial Times.¹⁸)
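The bias described above can be sketched with a toy simulation. Everything in it is assumed for illustration: the two age groups, the adoption rates (90% and 40%), the trait rates (70% and 30%), and the names simulate_population and trait_rate. It compares a very large sample of broadband users only with a small random sample of everyone.

```python
import random


def simulate_population(n, rng):
    """Return (uses_broadband, has_trait) pairs for n hypothetical people."""
    people = []
    for _ in range(n):
        young = rng.random() < 0.5
        # Assumed rates: adoption and the trait of interest both vary by age.
        uses_broadband = rng.random() < (0.9 if young else 0.4)
        has_trait = rng.random() < (0.7 if young else 0.3)
        people.append((uses_broadband, has_trait))
    return people


def trait_rate(people):
    """Share of people in the list who have the trait."""
    return sum(has_trait for _, has_trait in people) / len(people)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = simulate_population(200_000, rng)

true_rate = trait_rate(population)

# "Exhaust" sample: everyone who uses broadband, and nobody else.
exhaust = [person for person in population if person[0]]
exhaust_rate = trait_rate(exhaust)

# Small simple random sample drawn from the whole population.
random_sample = rng.sample(population, 1_000)
sample_rate = trait_rate(random_sample)

print(f"true rate in population:      {true_rate:.3f}")
print(f"exhaust sample (n={len(exhaust)}): {exhaust_rate:.3f}")
print(f"random sample (n={len(random_sample)}):    {sample_rate:.3f}")
```

Under these assumptions the exhaust sample is more than a hundred times larger than the random sample, yet its estimate sits farther from the true rate. Collecting more exhaust does not shrink that gap, because the non-users never enter the data.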


Analyzing the data alone can explain it all.

Leading statistician David Spiegelhalter observes, “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.”¹³ This blog has already briefly explored the issue of dimensionality in large data sets. Analytics practitioners frequently battle "the curse of dimensionality."²¹

Simply throwing a pile of data at a canned algorithm rarely produces valuable insight. Systematic methods are required. Understanding and preparing the data typically requires 75% to 80% of the effort in a modeling project. Preparation is always guided by an analysis of the business problem, stated as a hypothesis that tests a theory about what the data mean.²² These hypotheses systematically link possible explanations of the business problem to a mathematical description.²³

Correlation is enough, and causality doesn't matter.

The "causality doesn't matter" assertion appears in a May/June 2013 Foreign Affairs article.¹⁵ Authors Cukier and Mayer-Schoenberger similarly ground their arguments in the culture of the Internet. The common link of Anderson, Cukier, and Mayer-Schoenberger to the British news weekly The Economist is interesting. This correlation, incidentally, does not predict that all Economist staffers hold this view.

Ars Technica Senior Science Editor John Timmer puts correlations in their place: "Correlations are a way of catching a scientist's attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications."²⁴ Correlations provide opportunities for further exploration, in which hypothesis testing is used to infer underlying causalities.


"False-positive" correlations frequently occur, particularly in high-dimensional data sets.²¹ An unexplained correlation at best provides a tenuous basis for action. This is especially true when substantial value is at risk. Basing big decisions on causal understandings is the safest path.

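The frequency of chance correlations in wide data sets can be sketched in a few lines. This is a minimal illustration under stated assumptions: 30 observations, 1,000 candidate variables, and a target that are all independent random noise, so every "significant" correlation is a false positive by construction. The cutoff of 0.361 is the approximate two-sided 5% critical value of Pearson's r for 30 observations. The names pearson, R_CRITICAL, and false_positives are my own.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_obs, n_features = 30, 1000

# The target and every feature are pure noise, unrelated by construction.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
features = [[rng.gauss(0, 1) for _ in range(n_obs)]
            for _ in range(n_features)]

correlations = [pearson(feature, target) for feature in features]

# Approximate two-sided 5% critical value of r for 30 observations.
R_CRITICAL = 0.361
false_positives = [r for r in correlations if abs(r) > R_CRITICAL]
strongest = max(abs(r) for r in correlations)

print(f"features tested:           {n_features}")
print(f"'significant' by chance:   {len(false_positives)}")
print(f"strongest |r| among noise: {strongest:.2f}")
```

Roughly one in twenty noise variables should clear the cutoff, so a search across a thousand of them is expected to turn up dozens of apparent relationships. None of them would survive a test against a causal hypothesis on fresh data.
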
The health sciences offer a current example of the shortcomings of correlation-only decision-making. The federal government's nutritional policies have been based largely on epidemiological studies, which are "observational" in nature.²⁶ That is, they attempt to draw conclusions solely from correlations. The continually changing nutritional guidelines illustrate the challenge of Big-Data analysis that doesn't explain causality.

Approaching Big Data scientifically.

The bloom may be falling off the Big-Data rose. This is a completely normal, and healthy, phase of Big-Data evolution. Gartner's "Hype-Cycle" methodology anticipates this. A "trough of disillusionment" following a "peak of inflated expectations" corresponds to an "awkward adolescent" stage of life.

Geoffrey Moore (not to be confused with Gordon E. Moore of "Moore's Law" fame) uses the terms "chasm," "tornado," and "bowling alley" in his lifecycle model⁸ to characterize the volatility of the early stages of technology life-cycles. Would-be gold-rushers exuberantly rush in, seeking alchemic payoffs. Most are disappointed.

Business leaders looking for competitive differentiation from Big Data must remain patient. Big Data will reach a "plateau of productivity." Proven, repeatable practices will emerge. These practices will be based — like all mature business practices — on systematic, scientific approaches. Early adopters must keep their heads. Sketches of pragmatic, deliberate approaches — exemplified by Davenport²² and Lavalle²⁵ — have already been in the literature for years.



References

¹ D. Bollier, The Promise and Peril of Big Data, The Aspen Institute, January 1, 2010, http://www.aspeninstitute.org/publications/promise-peril-big-data.
² T. L. Friedman, The World Is Flat, New York: Macmillan, 2007, http://goo.gl/q2UdPd
³ L. Prusak, "The world is round," Harvard Business Review, April 2006, https://hbr.org/2006/04/the-world-is-round. 
⁴ P. Ghemawat, World 3.0: Global Prosperity and How to Achieve It, Boston: Harvard Business Press, 2011, http://goo.gl/QLkVOK.
⁵ M. Porter, "Strategy and the Internet," Harvard Business Review, March 2001, https://hbr.org/2001/03/strategy-and-the-internet.
⁶ C. M. Reinhart and K. Rogoff, This Time Is Different: Eight Centuries of Financial Folly, Princeton, NJ: Princeton University Press, http://goo.gl/C5Zfnc
⁷ D. Shenk, "The Limits of Genetic Testing," The Atlantic, April 3, 2012, http://www.theatlantic.com/health/archive/2012/04/the-limits-of-genetic-testing/255416/.
⁸ G. Moore, "Darwin and the demon: Innovating within established enterprises," Harvard Business Review, July 2004, https://hbr.org/2004/07/darwin-and-the-demon-innovating-within-established-enterprises.
¹⁰ J. Rivera, "Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business," The Gartner Group, http://www.gartner.com/newsroom/id/2819918.
¹¹ M. Wheatley, "Gartner’s Hype Cycle: Big Data’s on the slippery slope," SiliconANGLE, August 19, 2014, http://siliconangle.com/blog/2014/08/19/gartners-hype-cycle-big-datas-on-the-slippery-slope/.
¹² G. Langer, "Growing Doubts About Big Data," ABC News, April 8, 2014, http://abcnews.go.com/blogs/politics/2014/04/growing-doubts-about-big-data/.
¹³ T. Harford, "Big data: are we making a big mistake?" Significance, The Royal Statistical Society, December 2014, pp. 14-19, http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2014.00778.x/abstract; Financial Times, March 28, 2014, http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html.
¹⁴ C. Anderson, "The end of theory: The data deluge makes the scientific method obsolete," Wired, June 23, 2008, http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory.
¹⁵ K. N. Cukier and V. Mayer-Schoenberger, "The rise of big data," Foreign Affairs, May/June 2013, pp. 28-36, http://goo.gl/3tGMZc.
¹⁶ J. Ginsberg, et al., "Detecting influenza epidemics using search engine query data," Nature, November 19, 2008, http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html (republished by Google to http://goo.gl/Ly0bwR).
¹⁷ H. Hodson, "Google Flu Trends gets it wrong three years running," NewScientist — Health, March 13, 2014, http://www.newscientist.com/article/dn25217-google-flu-trends-gets-it-wrong-three-years-running.html#.VOQZq7DF-oU.
¹⁸ D. Crow, "Digital divide exacerbates US inequality," Financial Times, October 28, 2014, http://www.ft.com/intl/cms/s/2/b75d095a-5d76-11e4-9753-00144feabdc0.html#axzz3J94dMqob.
¹⁹ "US Smartphone Penetration by Ethnicity, Q1 2012," Nielsen, June 1, 2012, accessed from Beehive Group, http://beehive.me/2012/06/us-smartphone-penetration-by-ethnicity/.
²⁰ "U.S. smartphone penetration by age group, 2Q2014," MarketingCharts, http://goo.gl/VCvjDt.
²¹ T. Hastie, et al., The Elements of Statistical Learning, second edition, Springer, 2009, http://goo.gl/ipNrMU, http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf.
²² T. H. Davenport, "Keep up with your quants," Harvard Business Review, July 2013, http://goo.gl/BrWpD1.
²³ D. R. Cox and D. V. Hinkley, Theoretical Statistics, Boca Raton, FL: CRC Press, 1996, http://goo.gl/1zw8H0.
²⁴ J. Timmer, "Why the cloud cannot obscure the scientific method," Ars Technica, June 25, 2008, http://arstechnica.com/uncategorized/2008/06/why-the-cloud-cannot-obscure-the-scientific-method/.
²⁵ S. Lavalle, et al., "Big data, analytics and the path from insights to value," MIT Sloan Management Review, Winter 2011, http://goo.gl/8RSn5H.
²⁶ N. Teicholz, "The government’s bad diet advice," Washington Post, February 20, 2015, http://goo.gl/aeyNwC.







© The Quant's Prism, 2015