Saturday, March 12, 2016

The role of analytics in competitive strategy.

Successful competitive differentiation rarely occurs by accident. It comes from careful deliberation and disciplined execution. Big Data and analytics offer considerable opportunities to improve both competitive intelligence and operational differentiation.

For example, McKinsey¹ surveyed twelve emerging technological trends likely to effect profound disruption of industries and cultures. Analytics was not explicitly mentioned. Analytics-based capabilities are however enabling contributors to most of them.

Analytics similarly enables Porter's² innovation "modes" leading to competitive advantage. Teece's³ dynamic-capabilities "microfoundations" of sustainable competitive advantage are likewise aided by analytics. I defer elaboration on these assertions to subsequent installments.

Although caution should be applied in treating analytics as "just another technology"⁴, frameworks for strategic alignment of technology in fact lend themselves to many aspects of analytics. Ross' method for treating enterprise architecture as an essential aspect of strategy⁵ exemplifies this. Merrifield's method for capabilities analysis⁶ similarly lends itself to identifying hypotheses testable by analytics.⁷

My discussion here considers yet another framework. Nolan⁸ poses fundamental questions about the role of IT in corporate strategy. He uses a strategic-impact grid to derive answers. His framework lends itself to answering similar questions about analytics.

A strategic-impact grid for analytics.

Figure 1 shows Nolan's strategic-impact grid adapted for analytics.  Nolan, et al, focus on two dimensions of IT's contribution to a business. Figure 1 paraphrases these as:

  • Nature of analytics' contribution to competitive advantage; and
  • Tolerance for shortfalls in information availability and accuracy.
Following Nolan, I discretize this two-dimensional space into four regions. This obviously represents a simplification. For instance, analytics' contribution to a particular firm's competitive approach may fall somewhere on a continuum between "Offensive" and "Defensive". Consistent however with Box'⁹ philosophy on modeling, a model such as ours need not necessarily capture all aspects of a phenomenon with arbitrary precision in order to be useful for explanation.

Figure 1 — Strategic-impact grid for analytics:  Organizational modes and role of analytics in their operations.  After Nolan.⁸



The "nature-of-contribution" dimension answers the question, "How does analytics contribute to my competitive differentiation?" We say that our analytics is competitively offensive if it provides the basis for competitive differentiation.  If alternatively analytics is necessary simply to maintain competitive parity, its contribution is categorized as defensive.

The "tolerance for shortfall" pertains to the business-criticality or -essentiality of the analytics capability. I borrow here a useful concept from U.S. Federal policy for the acquisition of IT.¹⁰ Low-tolerance analytics capabilities contribute directly to the delivery of the organization's primary offering to customers. The organization's tolerance for lapses in analytics-capability availability or accuracy is low if such lapses adversely impact meeting commitments to customers.


The four modes of analytics contribution to strategy.

Subdividing our two-dimensional strategy-impact model in Figure 1 into four quadrants yields four modes of strategic contribution by analytics.
Strategic Mode. The strategic mode applies to firms for which analytics provides a basis for competitive differentiation. Analytics-derived information contributes directly to differentiation of the organization's product or service. Strategic-mode firms develop and exploit information asymmetries.¹¹ Three obvious approaches to information asymmetry occur:
  • Access to data not available to competitors (e.g., Google);
  • Extraction of superior economic utility from information (e.g., Walmart); or
  • Differentiated ability to extract information from data that are generally available to others (e.g., Kaspersky Lab).
Factory Mode.  Factory-mode organizations rely upon analytics for business-essential functions. They do not however necessarily achieve information asymmetries. Factory-mode firms seek to maintain at least information parity.

The distinction between information asymmetry and information parity is the essential point here. Analytics-based capabilities are effectively "table stakes" for survival in many industries. Business-criticality may lead such firms to claim that they "compete on analytics". Analytics-based capabilities are not however a basis for competitive differentiation. This logic mirrors the argument from Carr's controversial 2003 article "IT doesn't matter."¹²

On what do factory-mode organizations base their analytics capabilities? They tend to apply mainstream, out-of-the-box tools to generally available data sources. Cloud-based analytics tools¹⁵ commoditize many powerful data-mining capabilities. Consequently, maintaining business-essential information parity is no longer a luxury in many industries. 

Turnaround Mode. Turnaround-mode organizations use analytics to reposition themselves strategically.¹³,¹⁴ Despite the location of the turnaround-mode quadrant in Figure 1, these organizations may pursue either information parity or information asymmetries.

Introspective diagnostics or extrospective competitive baselining may employ capabilities yielding information parity. Experimenting with new offerings or delivery models¹⁶ may entail exploration of operational approaches based on information asymmetries.

Support Mode. Not all industries necessarily lend themselves to differentiation based on information asymmetries. Nolan⁸ singled out industries with a creative focus as support-mode in IT terms. This often applies to analytics as well.

Recent research¹⁷ regarding the fashion industry suggests weaknesses in the ability of mainstream analytics to anticipate demand for fashion products. Limitations in the abilities of predictive algorithms to anticipate movements by financial markets similarly exemplify scenarios in which analytics-derived information asymmetries remain problematic.
 

Analytics-based competitive differentiation comes down to information asymmetry.

Analytics capabilities are becoming increasingly ubiquitous. This follows the near-pervasive penetration of information technologies a decade and a half ago. Technology-based capabilities — having achieved ubiquity — yield only tenuous bases for competitive differentiation. They become necessary for competitive parity. But in terms of differentiation, they cease to matter.

Really, how do we achieve information asymmetry? The next Quant's Prism installment addresses this through a variation on a framework by Davenport.¹⁸ I introduce a strategic-alignment framework focused on linking data science to fundamental phenomenology of the business.
 


References

¹ "Disruptive technologies:  Advances that will transform life, business, and the global economy," McKinsey Global Institute, McKinsey & Company, May 2013, http://goo.gl/HWV6ly.
² M. Porter, Competitive Advantage of Nations, New York:  Free Press (Macmillan), 1990, pp. 45 - 47, http://goo.gl/ntAQL8.
³ D. Teece, Dynamic Capabilities, Oxford, UK: Oxford University Press, 2009, Kindle edition, Loc 62 - 639, http://goo.gl/yAOa03.
⁴ D. A. Marchand and J. Peppard, "Why IT fumbles analytics", Harvard Business Review, Jan - Feb 2013, https://goo.gl/fbfPHe.
⁵ J. W. Ross, P. Weill, and D. C. Robertson, Enterprise Architecture as Strategy, Boston:  Harvard Business School Publishing, 2006, http://goo.gl/YX1cVQ.
⁶ R. Merrifield, J. Calhoun, and D. Stevens, "The next revolution in productivity," Harvard Business Review, June 2008, https://goo.gl/j96ktR.
⁷ T. Davenport, "Keep up with your quants," Harvard Business Review, July - Aug 2013, https://goo.gl/sY5HTo.
⁸ R. Nolan and F. W. McFarlan, "IT and the Board of Directors," Harvard Business Review, October 2005, https://goo.gl/AKbG7P.
⁹ G. E. P. Box, "Science and statistics," Journal of the American Statistical Association, December 1976, pp. 791 - 798, http://goo.gl/x6Fe6B.
¹⁰ "Title 40/Clinger-Cohen Act (CCA) Compliance Table", Defense Acquisition Guidebook, https://goo.gl/5VUQ2V.
¹¹ U. Birchler and M. Bütler, Information Economics, New York: Routledge (Taylor & Francis), 2007, http://goo.gl/5hB9Ra.
¹² N. Carr, "IT doesn't matter", Harvard Business Review, May 2003, pp. 41 - 49, https://goo.gl/7gCYJX.
¹³ J. King, "How analytics helped Ford turn its fortunes," Computerworld, December 2, 2013, http://goo.gl/C8h7tS.
¹⁴ C. Boulton, "Navistar CIO looks to big data analytics to fuel turnaround", November 30, 2015, http://goo.gl/6Sbpo8.
¹⁵ D. Henschen, "10 Cloud Analytics & BI Platforms For Business", InformationWeek, January 22, 2015, http://goo.gl/IX7av9.
¹⁶ S. Thomke and J. Manzi, "The discipline of business experimentation," Harvard Business Review, December 2014, https://goo.gl/OLRNsF.
¹⁷ M. Seifert, et al, "Effective judgmental forecasting in the context of fashion products," Journal of Operations Management, Elsevier, http://goo.gl/bRt88Y.
¹⁸ T. Davenport, Analytics at Work, Boston: Harvard Business School Publishing, 2010, http://goo.gl/nfYtlo.

Sunday, May 24, 2015

Ivy-league payoffs: An illustration of causality and determinism

We assume that Ivy-League graduates have it made. Admission to an Ivy leaves one set for life. Not getting into an Ivy consigns a young person to a life of mediocrity. One's alma mater is one's fate, after all. 

But is this really true?  Do the data support this assertion? Research into questions like these lands us smack dab into the middle of two conundrums central to Big-Data Analytics. First, how do we tell what causes what? And second, which factors are the strongest determinants of an outcome?


Figure 1 — A Harvard diploma represents the ultimate capstone for a successful youth. (Source: World Press.) 

This time of year provides us an opportune case study to consider these Big-Data analytics challenges.¹ High school seniors last month were informed of their destinies. April is college-admissions season. Acceptance and rejection letters fulfilled dreams and shattered hopes for many an aspiring collegian.

Meanwhile, college seniors graduate this month. They are in the midst of realizing their destinies. Corporate recruiters, networking, and the formidable quest for employment now beckon. Those fortunate enough to stroll across an Ivy commencement stage are certainly predestined for greatness. Are they not?

Scientific research fortunately allows us to examine these widely held presuppositions. We consider them here from the perspective of our two conundrums. The first is causality. To what extent does the name of the university on one's diploma really determine future earnings? Is the relationship causal? Does one thing inevitably lead to the other?

Secondly, we care about determinism. Even if alma mater causally determines future earnings, are other factors more important? Do other factors have a bigger say?

Causality:  Is it the school or is it the applicant?

Researchers Stacy Berg Dale and Alan B. Krueger looked into this exact question of alma mater versus earnings.² They took an innovative tack on the question however. They mined a data set for both cognitive and non-cognitive indicators of students' competitiveness for elite colleges. The cognitive factors included the usual suspects: Grade Point Averages and SAT scores.

The non-cognitive characteristics included the "soft" skills. These include indications of motivation, ambition, and maturity. These attributes are evidenced in interviews, essays, and letters of recommendation. They are obviously more difficult to measure. 

Scientific evidence showing that "soft" skills are stronger predictors of life outcomes than book knowledge is now a couple of decades old.³ Education research⁴ demonstrates their unsurprising importance to learning.

Figure 2 highlights Dale's and Krueger's research. Figure 2(a) represents the "alma-mater-is-destiny" hypothesis we commonly assume. Figure 2(b) shows the model our researchers applied to their data. They used regression analysis. Figure 2(a) presumes causality. The researchers' method — regression analysis — only uses correlation.


Figure 2 — Dale and Krueger² used a predictive-modeling technique based solely on correlations to try to understand the causal relationship between alma mater and earnings.

Figure 2(c) shows an alternative view. Dale and Krueger hypothesize that this model might explain the data better than Figure 2(a). It's the applicant, not the university! The traits that make one competitive for admission to an elite university tend to translate into higher earnings after college. Alma-mater affiliation only marginally contributes to increased earnings.

The alma-mater-is-destiny presupposition reverses the actual causalities underlying students' prospects. It's not the school. It's the competitive traits of the applicant. We suspect that being immersed in an environment replete with intelligent, ambitious peers probably confers some advantages. The tenacity and talent required to become competitive — qualified but not selected — to join that peer group may largely offset many advantages of actually joining the group.

Note that the researchers do not explicitly demonstrate the causality in Figure 2(c). They use correlation — the model in Figure 2(b) — to cast serious doubt on the presumption in Figure 2(a). They then offer an alternative hypothesis. The alternative explanation in Figure 2(c) remains a hypothesis! This is an unavoidable limitation of their regression-based approach.
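To make the distinction concrete, consider a small simulation in the spirit of Figure 2. Everything below is my own invention, a minimal Python sketch rather than the researchers' data or code: earnings are driven almost entirely by applicant traits, and admission to an elite school is driven by those same traits.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Applicant traits (ambition, ability) drive BOTH admission and earnings
    traits = rng.normal(size=n)
    elite = traits + rng.normal(size=n) > 1.0          # admission tracks traits
    earnings = 3.0 * traits + 0.1 * elite + rng.normal(size=n)

    # Naive model, Figure 2(a): earnings regressed on alma mater alone
    X_naive = np.column_stack([np.ones(n), elite])
    b_naive = np.linalg.lstsq(X_naive, earnings, rcond=None)[0]

    # Controlled model, Figure 2(b): earnings on alma mater AND traits
    X_ctrl = np.column_stack([np.ones(n), elite, traits])
    b_ctrl = np.linalg.lstsq(X_ctrl, earnings, rcond=None)[0]

    print(f"alma-mater effect, naive:      {b_naive[1]:.2f}")  # large, misleading
    print(f"alma-mater effect, controlled: {b_ctrl[1]:.2f}")   # near the true 0.1

The naive regression attributes to alma mater an effect that belongs mostly to the traits. Controlling for the traits shrinks the alma-mater coefficient toward its small true value. That is the essence of the selection-on-observables argument.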


Determinism: Just how much does the school really matter?

Determinism — just how much each factor contributes — is equally important. Dale and Krueger found that alma mater makes a weak relative contribution to post-college earnings. I use another closely related study to illustrate determinism.

Gaertner and McClarty⁵ studied factors that predict college readiness. They wanted to diagnose the traits that lead to college success. Educators seek to anticipate students' success and challenges starting as early as the eighth grade.

Figure 3 summarizes results from Gaertner's and McClarty's research. Like Dale and Krueger, they used regression analysis. They looked at four indicators of college readiness. They also considered four factors believed to determine college readiness. "Middle-school indicators" in Figure 3 refers to a set of non-cognitive factors, like those in Figure 2.


Figure 3 — We explain college readiness in terms of multiple factors, each of which explains a fraction of the overall readiness. (After Gaertner and McClarty.⁵)

Quants evaluate their regression models using a quantity called the "coefficient of determination", or "R²". Figure 3 shows this quantity from Gaertner's and McClarty's analysis. R² gives the proportion of total variation in the outcome explained by each factor in the model. We see in Figure 3 that SAT and High-School GPA (HSGPA) contributed most to explaining students' college readiness. The "Middle-school indicators" made the second-largest contribution.

There is a problem with R² however. Specifically, the R²'s for all the factors must sum to a total less than one. So the more factors you mix into your model, the smaller the relative contribution each may make.

This is particularly true if the factors are correlated with each other. Correlated factors "divide the credit" in regression analysis. It is hard to tell which really contributes the most. Regression analysis cannot by itself tell you that one factor causes another! We probably want to assign a higher weight to a root cause than to some intermediate effect. Regression cannot do this.
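A minimal simulation illustrates the credit-splitting. The factor names, effect sizes, and Python code below are invented for illustration; they are not Gaertner's and McClarty's data. Two correlated indicators, stand-ins for HSGPA and SAT, both reflect a single underlying trait:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5_000

    ability = rng.normal(size=n)                  # hidden underlying trait
    hsgpa = ability + 0.5 * rng.normal(size=n)    # two correlated indicators
    sat = ability + 0.5 * rng.normal(size=n)
    readiness = ability + rng.normal(size=n)

    def r_squared(X, y):
        # R-squared of an ordinary least-squares fit with an intercept
        X = np.column_stack([np.ones(len(y)), X])
        resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return 1.0 - resid.var() / y.var()

    print(f"R2, HSGPA alone:  {r_squared(hsgpa, readiness):.2f}")
    print(f"R2, SAT alone:    {r_squared(sat, readiness):.2f}")
    both = np.column_stack([hsgpa, sat])
    print(f"R2, both factors: {r_squared(both, readiness):.2f}")

Each factor alone explains roughly the same share of the variation. Together they explain only slightly more than either does alone: the individual R²'s overlap rather than add.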


The limits of regression analysis.

Regression analysis is the most widely used technique in the Big-Data Analytics toolbox. We see here two indications of its limitations. First, it does not account for — or indicate — causality. It simply uses knowledge about how things are correlated with each other.

This can be a very big deal! Root causes are vitally important to many business decisions. Analytics results based on regression analysis alone have limited value for root-cause analysis. By themselves, they tell us very little about cause and effect.

Figure 4 illustrates this. Figure 4(a) shows a notional scenario comprised of a chain of cause-and-effect relationships. We are interested in the "Effect of Primary Interest" — or primary effect. This effect mostly occurs because of a single root cause. The root cause does not directly lead to the primary effect. It acts through a pair of intermediate effects.

Figure 4(b) illustrates how regression analysis treats this scenario. Our primary effect is directly correlated with the root cause and with the two intermediate effects. Regression analysis cannot distinguish between the three. There is no guarantee that the regression model will even assign the highest weight to the root cause!


Figure 4 — Analytics methods based on regression analysis cannot detect root causes in chains of causal events. 
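A synthetic rendering of Figure 4 makes the point concrete. The chain below, one root cause acting through two intermediate effects, is invented for illustration; the sketch simply asks regression to apportion weight across all three:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 20_000

    root = rng.normal(size=n)                        # the true root cause
    inter1 = root + 0.3 * rng.normal(size=n)         # intermediate effect 1
    inter2 = root + 0.3 * rng.normal(size=n)         # intermediate effect 2
    primary = inter1 + inter2 + rng.normal(size=n)   # effect of primary interest

    # Regress the primary effect on all three upstream variables at once
    X = np.column_stack([np.ones(n), root, inter1, inter2])
    coefs = np.linalg.lstsq(X, primary, rcond=None)[0]

    print(f"weight on root cause:     {coefs[1]:.2f}")   # near zero!
    print(f"weight on intermediate 1: {coefs[2]:.2f}")
    print(f"weight on intermediate 2: {coefs[3]:.2f}")

Given the intermediates, the root cause adds no predictive value, so the regression assigns it almost no weight even though it drives everything. This is exactly the failure Figure 4(b) describes.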

Many applicants to elite universities suffer acutely from not understanding this kind of causality. They experience enormous stress during college-admissions season. Lots of them succumb to a winner-take-all view of university selections. 

This is a problem for at least two reasons. First, an ambitious, disciplined, talented young person tends to come out okay irrespective of where she goes to school. Second, seventeen-year-olds simply don't know enough about themselves — or about life — to make rest-of-their-life decisions anyway. The stakes are simply not that high!

Determinism represents the second issue with regression analysis. Regression analysis uses weighted sums. We're simply adding things up. These sums can only be so large. A regression model may not even assign the highest weight to the root cause in a causal chain of events! This oversimplifies the world. It also provides an incomplete view.

The real world consists of complex, causal interrelationships. Life is chock-full of vicious and virtuous cycles! Correlation alone doesn't really explain these very well. It only tells us that some things tend to go together — that they are "co-related."

Leaders using analytics to make decisions must understand how their analysts formulate recommendations.⁶ Big-data analytics contains no alchemy. It offers neither sorcery nor black magic. Decision makers must apply extreme caution to analytics results that cannot be explained apart from, "This is what the tool says."



References

¹ D. Thompson, "'It Doesn't Matter Where You Go to College': Inspirational, but Wrong," The Atlantic, April 2, 2015, http://www.theatlantic.com/business/archive/2015/04/the-3-percent-crisis/389396/.
² S. B. Dale and A. B. Krueger, "Estimating the payoff to attending a more selective college: An application of selection on observables and unobservables", Quarterly Journal of Economics, Oxford Journals, December 2002, http://qje.oxfordjournals.org/content/117/4/1491.short.
³ D. Goleman, Emotional Intelligence, Random House, 2002, http://goo.gl/PD5sJf.
⁴ C. A. Farrington, et al, "Teaching Adolescents To Become Learners The Role of Noncognitive Factors in Shaping School Performance: A Critical Literature Review", University of Chicago Consortium of Chicago School Research (CCSR), June 2012, https://goo.gl/TJpXGl.
⁵ M. N. Gaertner and K. L. McClarty, "Performance, Perseverance, and the Full Picture of College Readiness", Educational Measurement: Issues and Practice, National Council on Measurements in Education, April 2015, http://onlinelibrary.wiley.com/doi/10.1111/emip.12066/abstract.
⁶ T. H. Davenport, "Keep up with your quants," Harvard Business Review, July-Aug 2013, http://goo.gl/rtkq8G.


Saturday, March 21, 2015

Sports analytics illustrates cultural and practical factors for analytics adoption.

Sports analytics is really sexy. Michael Lewis' 2004 best-seller Moneyball¹ put it on the map. A subsequent movie starring Brad Pitt didn't hurt things much, either. The MIT Sloan School of Management even holds an annual conference on the topic.

We might expect — given all the hype — that analytics' reign in sports is unchallenged. The Moneyball movie after all places words to that effect in the mouth of Boston Red Sox owner John Henry. The Henry character suggests that any organization not reinventing itself after the Billy Beane model would become a dinosaur.

Analytics does not in fact rule sports. The website of sports network ESPN recently published a survey ranking major sports teams by the extent of their analytics adoption.² Adoption was very non-uniform both within and between sports.



Figure 1 — Sports website espn.com recently published a survey² on the adoption of analytics in sports. The extent of adoption varied not only between sports, but also within them.

What explains this disparity? At least two factors come into play. Much has been written about the role of culture in analytics adoption. Some businesses just lend themselves more to analytics. I briefly explore both factors here.


Cultures of data.

ESPN's survey shows that analytics plays a bigger role in baseball than other major U.S. sports. ESPN characterized each team in each of the four major U.S. professional sports leagues — Baseball, Basketball, Football, Hockey — in terms of five tiers of adoption:
  • All-in;
  • Believers; 
  • One foot in;
  • Skeptics; and
  • Non-believers.
Sixteen of 30 professional-baseball teams fell into one of the top two categories. Basketball came in second, with twelve of 30 major franchises in the top two categories. 

Lavalle³ and Davenport⁴ consider extensively the role of culture in the adoption of analytics. Lavalle writes, "The adoption barriers that organizations face are managerial and cultural, rather than related to data and technology." Davenport's Analytics at Work devotes an entire chapter to "Building an analytics culture."


Baseball has a long, rich history of statistics. A college dormitory roommate comes to mind as an example. This young man often had three things going on at once as I entered the room. The television was on, with the volume turned all the way down. The radio was playing music. And he had a six-inch-thick encyclopedia of baseball statistics open on his lap. I saw no evidence that he participated in athletics himself. But he devoured baseball statistics.

A casual search of Amazon.com brings up volumes on baseball statistics. The legendary Bill James baseball abstract⁵ contains more than 1,000 pages of baseball history, rich with statistics. An annual update⁶ contains nearly 600 pages describing analytical methodologies. Baseball America publishes an annual edition to a baseball almanac⁷ presenting "a comprehensive statistical review from the majors all the way through to youth baseball." For the quant each of us knows and loves, there is even a book⁸ on using the R programming language to analyze baseball statistics.

Theories underlying basketball statistics have arrived more recently. Amazon yields fewer hits overall. A math professor and a software engineer collaborated on a recent book⁹ describing statistical methods for basketball. That this work was not picked up by a major commercial publisher might suggest something about the market for basketball statistics.


Susceptibility to mathematical analysis.

Some activities just lend themselves better to mathematical analysis. The business literature recognizes that some business activities should remain more flexible and less structured.¹⁰ Human endeavors can be characterized by their locations along a continuum spanning from "pure art" to "pure science."

Activities that are more scientific easily lend themselves to mathematical specification. Hypothesis testing lies at the heart of the scientific method. I must specify an activity with mathematical precision before I apply analytics to it. Robin Williams' character in the 1989 movie Dead Poets Society gives us an illustrative mockery of the problems with applying scientific methods to artistic endeavors.

Among major U.S. professional sports, baseball happens to lend itself to mathematical analysis. The game is highly structured. The flow of a baseball game is characterized by discrete states. These states are characterized by innings and outs, balls and strikes, runners on base, and runs scored.

Each play — commencing with the pitcher's release of the ball towards the batter — creates the opportunity to move from one state to the next. Individual players' abilities — hitting, running, throwing, catching — determine how well they can contribute to a transition to a state more favorable to their team than the previous one. Player statistics are largely based on how well a player's play moves the game to a more advantageous state.

Baseball enjoys an underlying structure that is inherently mathematical. A set of mathematical methods exists focused on characterizing transitions from one discrete state to another. This set of methods is referred to as Markov chains.¹¹ Baseball-statistics aficionados may not think about Markov chains. But they are there.
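A toy example conveys the flavor. Real base-out models track runners and use twenty-four or more states; the sketch below collapses an inning to its out count alone, and the 30% on-base probability is an invented figure, not a league statistic.

    import numpy as np

    # States: 0, 1, 2, or 3 outs; 3 outs is absorbing (inning over).
    # From each live state the batter reaches base with probability 0.3
    # (out count unchanged) or makes an out with probability 0.7.
    P = np.array([
        [0.3, 0.7, 0.0, 0.0],
        [0.0, 0.3, 0.7, 0.0],
        [0.0, 0.0, 0.3, 0.7],
        [0.0, 0.0, 0.0, 1.0],
    ])

    Q = P[:3, :3]                          # transitions among live states
    N = np.linalg.inv(np.eye(3) - Q)       # fundamental matrix
    print(f"expected plate appearances per inning: {N[0].sum():.2f}")

From the fundamental matrix we can read off quantities such as expected plate appearances per inning. Richer state spaces yield expected runs, the currency of most modern baseball metrics.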

Tennis — not one of the "Big Four" U.S. professional sports — enjoys a similarly mathematical structure. The flows of tennis games progress through a discrete set of states. A statistical researcher recently described a model for tennis matches.¹²

The remaining three of the "Big Four" U.S. professional sports lack the underlying mathematical structure of baseball and tennis. Football is arguably the most structured. Beyond ball possession and points scored, scoring drives progress through discrete states characterized by down, yards to go, and location of the line of scrimmage.

Two of these state variables are continuous. The line of scrimmage and yards to go are represented by continuous numbers. Transitions from one state to the next are also continuous in nature. These factors lend themselves to statistical analysis less easily than the discrete states and transitions of baseball.

Basketball and hockey are even less structured. Game flows result from continuous, random interactions between players. ESPN's survey finds that analytics adoption in basketball is more advanced than in hockey. This may be because basketball teams are smaller. Contributions by individual players have a greater bearing on the flow of play.


The path to analytics adoption.

Becoming an analytics organization involves both cultural and practical aspects of change. Analytics-driven cultures enjoy the distinct quality of thinking quantitatively. Members of such organizations habitually measure aspects of their work. They benchmark measurements of the most important aspects of their activities. They also measure performance against those benchmarks.

Baseball, among major U.S. professional sports, enjoys a culture that is substantially analytical. Baseball statistics occupy a prominent place in the sport's fandom. Statistics are sufficiently important to the business of baseball for major commercial publishers to release large volumes.

Explaining the prominence of statistics in baseball in particular may be a "chicken-or-the-egg" question. That the structure of the game is fundamentally mathematical in nature certainly does not introduce any obstacles to a popular baseball-statistics sub-culture. 

Organizational leaders seeking to inculcate analytics more deeply into their organizations must manage cultural change, first and foremost. Cultural change is perhaps the most formidable of management challenges. Management guru Peter Drucker's dictum "Culture eats strategy for breakfast" states the challenge well.

Leaders must also recognize the limitations of analytics. In order to apply analytics to a problem, we first must describe it with mathematical precision. Not everything does — or necessarily should — lend itself to this degree of precision.¹⁰

Assertions by exuberant advocates suggest that anything can be measured.¹³ Many such improvisational measurements require selection of proxies — substitutes for quantities that cannot be directly observed. Such proxies may lack the precision — or a sufficiently direct correlation to the desired quantity — to usefully reduce uncertainty.

Methods from the economics of information give us a rigorous approach to quantifying the value of marginal uncertainty. Measurements and reporting based on proxies may not yield information-economic returns worthy of the required investment. Sports analytics gives us practical illustrations.
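One such method is the expected value of perfect information (EVPI). The sketch below works through a toy roster decision; the probabilities and payoffs are invented, and the point is only the mechanics of the calculation.

    # Toy roster decision: play an uncertain prospect or a known veteran
    p_good = 0.4                 # probability the prospect performs
    win, bust = 10.0, -4.0       # prospect payoffs if he performs / busts
    veteran = 2.0                # known payoff of the safe choice

    # Best expected value acting under uncertainty
    ev_prospect = p_good * win + (1 - p_good) * bust
    ev_no_info = max(ev_prospect, veteran)

    # With perfect information we choose after learning the outcome
    ev_perfect = p_good * max(win, veteran) + (1 - p_good) * max(bust, veteran)
    evpi = ev_perfect - ev_no_info

    print(f"EV, deciding now:             {ev_no_info:.2f}")
    print(f"EV with perfect information:  {ev_perfect:.2f}")
    print(f"value of perfect information: {evpi:.2f}")

A proxy-based measurement that resolves only a sliver of this uncertainty is worth only a fraction of the EVPI, which may well be less than the measurement program costs.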




References

¹ M. Lewis, Moneyball — The Art of Winning an Unfair Game, Norton, 2004, http://goo.gl/f7N2Yu.
² "The great analytics rankings," espn.com, February 25, 2015, http://espn.go.com/espn/feature/story/_/id/12331388/the-great-analytics-rankings.
³ S. Lavalle, et al, "Big data, analytics and the path from insights to value," MIT Sloan Management Review, Winter 2011, http://goo.gl/8RSn5H.
⁴ T. H. Davenport, Analytics at Work, Boston: Harvard Business Review Press, 2010, http://goo.gl/olZkKm.
⁵ B. James, The New Bill James Historical Baseball Abstract, New York: Free Press, June 13, 2003, http://goo.gl/Q7a0iA.
⁶ B. James, The Bill James Handbook 2015, Chicago: ACTA Publications, October 31, 2014, http://goo.gl/OEHinJ.
⁷ Baseball America 2015 Almanac: A Comprehensive Review of the 2014 Season, Baseball America, January 6, 2015, http://goo.gl/xCenv9.
⁸ M. Marchi and J. Albert, Analyzing Baseball Data with R, Boca Raton, FL: CRC Press, October 29, 2013, http://goo.gl/gx4J7r.
⁹ S. M. Shea  and C. E. Baker, Basketball Analytics: Objective and Efficient Strategies for Understanding How Teams Win, CreateSpace Independent Publishing Platform, November 5, 2013, http://goo.gl/rE36pI.
¹⁰ J. M. Hall and M. E. Johnson, "When should a process be art, not science?" Harvard Business Review, March 2009, http://goo.gl/q6rGKB.
¹¹ J. R. Norris, Markov Chains, Cambridge, UK: Cambridge University Press, July 28, 1998, http://goo.gl/J0nV8S.
¹² C. Gray, "Game, set and stats," Significance, Royal Statistical Society, February 3, 2015, http://goo.gl/8wdgH7.
¹³ D. W. Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business, New York: Wiley, 2014, http://goo.gl/cFuFTb.

Thursday, February 19, 2015

The theory underlying "The End of Theory" in Big Data. (updated)

Periods of social and economic change ignite our imaginations. New technologies and socioeconomic phenomena elicit both promise and dread.¹ We can, at the same time, see promises of utopia and specters of dystopian "Brave New Worlds."

Reality usually falls somewhere in between these extremes. Many anticipated that the Internet would "kill" distance and make the world flat.² Our planet nonetheless remains round³ — both geographically and socioeconomically. Modern transportation networks may reduce distance's relative importance. Other dimensions of culture, polity, and economics still matter as much as ever.⁴

The Internet was to lead to a "new economy." Website traffic rates supplanted net free cash flow as the principal basis for valuing many firms. At its peak, America Online's market capitalization exceeded that of Boeing. Corrections in the end returned familiar approaches to business valuation to a more-conventional place.⁵,⁶


Genomics promised to revolutionize medicine. We now face however the reality of that science's limitations.  A recent study "confirms that genetics are a poor, if not purposeless, prognostic of the chance of getting a disease."⁷ Combined effects of nature and nurture prove difficult to unwind.

Big-Data analytics is following a similar pattern. Industries — particularly those based on technology — tend to follow lifecycle trajectories.⁸ Gartner uses a "hype-cycle" method to track these trajectories. The consultancy's most-recent analysis — in Figure 1 — shows gravity's effect on the Big-Data movement.¹⁰,¹¹ The mainstream press even recognizes this trend.¹²




Figure 1 — Gartner's 2014 "Hype cycle for emerging technologies" report shows Big Data sliding down from the "Peak of Inflated Expectations" towards the "Trough of Disillusionment."¹⁰

The inaugural installment of Quant's Prism showed Big Data at the "peak of inflated expectations" on Gartner's curve. This blog's narrative consistently emphasizes the importance of scientific discipline to business analytics. Value is derived through data science. Big Data's slide may be attributable in part to occasional lack of scientific rigor in its application.


Naïveté underlies the irrational exuberance accompanying highly-hyped, "cool" new technologies. This naïveté often involves discounting fundamental principles. Mother Nature nonetheless gets to vote — and her vote counts. Mathematics — particularly statistics — is the science to which Big Data is subject.

Financial Times columnist Tim Harford — the "Undercover Economist" — describes four fundamental precepts of statistics that Big Data enthusiasts sometimes set aside.¹³ This installment of Quant's Prism summarizes Harford's observations. I provide my own illustrations.



Setting science aside.

Harford summarizes four points of view comprising an integrated thought system. They appear in different places. Each idea tends however to be interconnected with the others. 

Theory doesn't matter.

Exuberant claims about applying machine-learning to large, complex data sets largely originate from the tech industry. Tech-industry evangelist Chris Anderson first suggested "The end of theory" hypothesis in Wired.¹⁴ Anderson grounds his rationale on statistician George Box' observation, "All models are wrong, but some are useful."

Google's experiment¹⁶ in the use of its search engine to track the 2008 flu season is offered as the demonstration that proves the "end-of-theory" assertion. This anecdote regrettably failed the repeatability criterion. Scientific conclusions are only validated if they can be independently verified. Google's search-engine experiment failed to deliver the same success during subsequent years.¹⁷ Its results were neither repeatable nor independently verifiable.

The absence of an underlying theory likely explains the absence of repeatability with Google's flu-tracking experiment. Google assumed relationships between users' medical conditions and their Internet searches. These relationships might change over time. Any such tracking model requires validation of the relationships and precise tracking of their change over time. In short, it takes a theory!
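A small simulation suggests why. Nothing below is Google's data or method; the coefficients are invented. It shows how a correlation-only model silently degrades when the relationship it learned drifts:

    import numpy as np

    rng = np.random.default_rng(5)

    def season(coef, n=200):
        # Synthetic season: flu activity tied to search volume by 'coef'
        searches = rng.normal(size=n)
        flu = coef * searches + 0.3 * rng.normal(size=n)
        return searches, flu

    # Fit a correlation-only model on a 2008-like season...
    x0, y0 = season(coef=1.0)
    slope = (x0 * y0).mean() / (x0 * x0).mean()

    # ...then apply it, unchanged, to later seasons where behavior shifts
    for year, coef in [(2009, 0.8), (2011, 0.4), (2013, 0.1)]:
        x, y = season(coef)
        err = np.abs(slope * x - y).mean()
        print(f"{year}: mean absolute error {err:.2f}")

The errors grow as the search-to-flu relationship drifts. Without a theory of the relationship, the model has no way to notice the drift, let alone adapt to it.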

The sample can include the whole population (N → all).

"N →all" assertions largely occur related to Internet and social-media data. They use what Harford calls data "exhaust." Data exhaust includes things observed by tools that track website and smartphone use. These are the footprints connected users leave wherever they go in cyberspace. 

Some "N →all" enthusiasts conveniently assume that observations about online-exhaust data serve as a proxy for the population as a whole. These data in fact provide a biased view of the population. Adoption of exhaust-producing online technologies is far from uniform. 

Figure 2¹⁸ shows the unevenness of home broadband-access penetration within the U.S. Penetration by smartphones — another source of online-exhaust data — similarly varies by age²⁰ and ethnicity.¹⁹ Analyzing exhaust data only tells us about users of particular exhaust-producing technologies. Exhaust data tell us nothing about non-users. This is not equivalent to "N → all."

Figure 2 — Large samples of data derived from exhaust from home broadband internet will be biased according to the proportion of households with the service. (From Financial Times.¹⁸)
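The arithmetic of the bias is easy to sketch. The sub-populations, adoption rates, and outcome values below are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000

    # Two sub-populations with different outcomes AND different adoption
    young = rng.random(n) < 0.5
    outcome = np.where(young, rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n))
    adopts = rng.random(n) < np.where(young, 0.9, 0.4)   # young adopt more

    print(f"true population mean:         {outcome.mean():+.2f}")
    print(f"mean among exhaust producers: {outcome[adopts].mean():+.2f}")

Even with enormous N, the exhaust sample stays biased. A large sample does not repair non-random sampling.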


Analyzing the data alone can explain it all.

Leading statistician David Spiegelhalter observes, “There are a lot of small data problems that occur in big data. They don’t disappear because you’ve got lots of the stuff. They get worse.”¹³ This blog has already briefly explored the issue of dimensionality in large data sets. Analytics practitioners frequently battle "the curse of dimensionality."²¹

Simply throwing a pile of data at a canned algorithm rarely produces valuable insight. Systematic methods are required. Understanding and preparing the data typically requires 75% to 80% of the effort in a modeling project. Preparation is always guided by analysis of the business, stated as hypotheses that test a theory about what the data mean.²² These hypotheses systematically link possible explanations of the business problem to a mathematical description.²³

Correlation is enough, and causality doesn't matter.

The "causality doesn't matter" assertion appears in the May 2013 article of Foreign Affairs.¹⁵ Authors Cukier and Mayer-Schoenberger similarly base their arguments in the culture of the Internet. Anderson's, Cukier's, and Mayer-Schoenberger's common link to British news weekly Economist is interesting. This correlation, incidentally, does not predict that all Economist staffers hold this view.

Ars Technica Senior Science Editor John Timmer puts correlations in their place: "Correlations are a way of catching a scientist's attention, but the models and mechanisms that explain them are how we make the predictions that not only advance science, but generate practical applications."²⁴ Correlations provide opportunities for further exploration involving hypothesis testing to infer underlying causalities.


"False-positive" correlations frequently occur, particularly in high-dimensional data sets.²¹ An unexplained correlation at best provides a tenuous basis for action. This is especially true when substantial value is at risk. Basing big decisions on causal understandings is the safest path.

The health sciences offer a current example of the shortcomings of correlation-only decision-making. The federal government's nutritional policies have been based largely on epidemiological studies, which are "observational" in nature.²⁶ That is, they attempt to draw conclusions solely from correlations. The continually shifting nutritional guidelines illustrate the challenge of Big-Data analysis that doesn't explain causality.

Approaching Big Data scientifically.

The bloom may be falling off the Big-Data rose. This is a completely normal — and healthy — phase of Big-Data evolution. Gartner's "Hype-Cycle" methodology anticipates this. A "trough of disillusionment" following a "peak of inflated expectations" corresponds to an "awkward adolescent" stage of life.

Geoffrey Moore's technology-adoption lifecycle⁸ (this Moore is not to be confused with Gordon E. Moore of "Moore's Law" fame) uses the terms "chasm," "tornado," and "bowling alley" to characterize the volatility of the early stages of technology life-cycles. Would-be gold-rushers exuberantly rush in, seeking alchemic payoffs. Most are disappointed.

Business leaders looking for competitive differentiation from Big Data must remain patient. Big Data will reach a "plateau of productivity." Proven, repeatable practices will emerge. These practices will be based — like all mature business practices — on systematic, scientific approaches. Early adopters must keep their heads. Sketches of pragmatic, deliberate approaches — exemplified by Davenport²² and Lavalle²⁵ — have already been in the literature for years.



References

¹ D. Bollier, The Promise and Peril of Big Data, The Aspen Institute, January 1, 2010, http://www.aspeninstitute.org/publications/promise-peril-big-data.
² T. L. Friedman, The World Is Flat, New York: Macmillan, 2007, http://goo.gl/q2UdPd
³ L. Prusak, "The world is round," Harvard Business Review, April 2006, https://hbr.org/2006/04/the-world-is-round. 
⁴ P. Ghemawat, World 3.0: Global Prosperity and How to Achieve It, Boston: Harvard Business Press, 2011, http://goo.gl/QLkVOK.
⁵ M. Porter, "Strategy and the Internet," Harvard Business Review, March 2001, https://hbr.org/2001/03/strategy-and-the-internet.
⁶ C. M. Reinhart and K. Rogoff, This Time Is Different: Eight Centuries of Financial Folly, Princeton, NJ: Princeton University Press, 2009, http://goo.gl/C5Zfnc.
⁷ D. Shenk, "The Limits of Genetic Testing," The Atlantic, April 3, 2013, http://www.theatlantic.com/health/archive/2012/04/the-limits-of-genetic-testing/255416/
⁸ G. Moore, "Darwin and the demon: Innovating within established enterprises," Harvard Business Review, July 2004, https://hbr.org/2004/07/darwin-and-the-demon-innovating-within-established-enterprises.
¹⁰ J. Rivera, "Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business," The Gartner Group, http://www.gartner.com/newsroom/id/2819918.
¹¹ M. Wheatley, "Gartner’s Hype Cycle: Big Data’s on the slippery slope," SiliconANGLE, August 19, 2014, http://siliconangle.com/blog/2014/08/19/gartners-hype-cycle-big-datas-on-the-slippery-slope/.
¹² G. Langer, "Growing Doubts About Big Data," ABC News, April 8, 2014, http://abcnews.go.com/blogs/politics/2014/04/growing-doubts-about-big-data/.
¹³ T. Harford, "Big data: are we making a big mistake?" Significance, The Royal Statistical Society, December 2014, pp. 14 - 19, http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2014.00778.x/abstract; originally published in Financial Times, March 28, 2014, http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html.
¹⁴ C. Anderson, "The end of theory: The data deluge makes the scientific method obsolete," Wired, June 23, 2008, http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory.
¹⁵ K. N. Cukier and V. Mayer-Schoenberger, "The rise of big data," Foreign Affairs, May/June 2013, pp. 28-36, http://goo.gl/3tGMZc.
¹⁶ J. Ginsberg, et al, "Detecting influenza epidemics using search engine query data," Nature, November 19, 2008, http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html (republished by Google at http://goo.gl/Ly0bwR).
¹⁷ H. Hodson, "Google Flu Trends gets it wrong three years running," NewScientist — Health, March 13, 2014, http://www.newscientist.com/article/dn25217-google-flu-trends-gets-it-wrong-three-years-running.html#.VOQZq7DF-oU.
¹⁸ D. Crow, "Digital divide exacerbates US inequality," Financial Times, October 28, 2014, http://www.ft.com/intl/cms/s/2/b75d095a-5d76-11e4-9753-00144feabdc0.html#axzz3J94dMqob.
¹⁹ "US Smartphone Penetration by Ethnicity, Q12012," Neilson, June 1, 2012, accessed from Beehive Group, http://beehive.me/2012/06/us-smartphone-penetration-by-ethnicity/
²⁰ "U.S. smartphone penetration by age group, 2Q2014," MarketingCharts, http://goo.gl/VCvjDt.
²¹ T. Hastie, et al, The Elements of Statistical Learning, second edition, Springer, 2009, http://goo.gl/ipNrMU, http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf.
²² T. H. Davenport, "Keep up with your quants," Harvard Business Review, July 2013, http://goo.gl/BrWpD1.
²³ D. R. Cox and D. V. Hinkley, Theoretical Statistics, Boca Raton, FL: CRC Press, 1996, http://goo.gl/1zw8H0.
²⁴ J. Timmer, "Why the cloud cannot obscure the scientific method," Ars Technica, June 25, 2008, http://arstechnica.com/uncategorized/2008/06/why-the-cloud-cannot-obscure-the-scientific-method/.
²⁵ S. Lavalle, et al, "Big data, analytics and the path from insights to value," MIT Sloan Management Review, Winter 2011, http://goo.gl/8RSn5H.
²⁶ N. Teicholz, "The government’s bad diet advice," Washington Post, February 20, 2015, http://goo.gl/aeyNwC.







© The Quant's Prism, 2015