Saturday, December 20, 2014

Being scientific about the business requires more than treating data scientifically.

Extracting value from big-data analytics is about changing the way in which business is done.¹ Intuition and tradition must give way to evidence and rigor. Kotter (Leading Change²) may actually tell us at least as much about how to achieve this as Bartlett (Practitioner's Guide to Business Analytics³).

Some analysts assert that "Big data business benefits are hampered by 'culture clash'."⁴ Allegedly, traditional "requirements-driven" approaches to managing enterprise IT systems impede "opportunistic analytics and exploring answers to ill-formed or nonexistent questions." Traditional IT bureaucracy purportedly suffocates entrepreneurial creativity of a "new breed" of big-data specialists.


I previously wrote that extracting value from big-data analytics requires us to simultaneously apply scientific methods to both our data and the business questions we seek to answer. I observed that data-science methods — by which big-data analytics yield information bearing economic value — can be classified either as data-centric or business-centric. The business-centric methods emphasize scientific methods for business questions. Data-centric methods narrowly focus on the data.

Guiding an organization's ubiquitous adoption of data-driven practices leads us beyond the scientific method for business questions. We must be systematic and methodical about the business itself. We consider the entire span of business attributes over which we can reasonably assert control. The following discussion summarizes a pair of frameworks by which we guide the application of opportunistic analytics by entrepreneurial big-data specialists.

I first describe a framework originally formulated to guide the adoption of Service-Oriented Architecture (SOA).⁵ We decompose the business into separable components. I illustrate using Michael Porter's Value-Chain Framework⁶ as a starting point. A second framework — characterizing business functions' operational criticality — focuses our attention on the risk of insufficiently-managed change. 

These two methods establish boundaries and conditions under which experiments in opportunistic analytics may be applied to core-business functions. They give decision makers control points by which to manage the pursuit of big-data opportunities. 

Business-capabilities analysis — identifying targets for big-data experimentation.

The scientific method makes extensive use of models. Developing a model for the organization often represents the first step in identifying and prioritizing opportunities for big-data analytics. Academia and the consulting industry have produced a variety of useful approaches to the modeling of organizations. Examples include:
  • Enterprise architecture methods (e.g., The Open Group Architecture Framework (TOGAF)⁸) characterize organizations in technology-centric terms;
  • Business-Process Modeling takes a functional view of organizations;
  • Lean/Six-Sigma emphasizes quality and consistency of business outputs; and
  • Quantitative business models characterize financial and economic attributes of business functions.¹⁰
Figure 1 illustrates a functional model for a manufacturing business. It decomposes a value-chain⁶ view into increasingly granular functions. This approach was developed to identify and prioritize opportunities for technology investments leading to improved business performance. Consulting organizations, exemplified by IBM,⁹ have more formally developed similar methods.


Figure 1 — A Business-Capabilities Analysis (BCA) prioritizes and establishes the scope for opportunistic experimentation with big-data analytics. (From Merrifield.⁷)

The business' functional breakdown in Figure 1 is annotated with two pieces of information. A label at the top of each box indicates the importance of the function to the business. This rating might be assigned based on contribution to competitive advantage, essentiality to business operations, or other criteria.

The boxes themselves are color-coded. The coloring indicates how well the function is performing. This performance characterization could compare the individual business against industry benchmarks. It might also indicate disproportionate consumption of management attention or other resources.

Priorities for experimentation with big-data analytics are assigned based on these two criteria. Big-data analytics resources — of which scarce, high-priced talent is often the principal component — are apportioned judiciously. High-value business functions requiring attention are usually assigned the highest priorities.
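
To make the prioritization concrete, the sketch below shows, in R (the tool used elsewhere in this blog), one hypothetical way to combine the two annotations into a priority score. The rating scales and the scoring rule are illustrative assumptions, not Merrifield's method.

    # Hypothetical scoring sketch: combine the two annotations from Figure 1
    # -- business value and current performance -- into a priority for
    # big-data experimentation. Scales and example functions are invented.
    capabilities <- data.frame(
      func        = c("Demand forecasting", "Order fulfillment", "Payroll"),
      value       = c(3, 3, 1),   # 3 = competitive advantage ... 1 = commodity
      performance = c(1, 3, 2)    # 1 = underperforming ... 3 = at benchmark
    )
    # Highest priority: high-value functions that are underperforming.
    capabilities$priority <- capabilities$value * (4 - capabilities$performance)
    capabilities[order(-capabilities$priority), ]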


Characterizing the risk of change associated with new applications of big-data analytics.

Change to an organization is often difficult, frequently even turbulent. Introducing new big-data analytics business functions requires organizations' members to change how they think about the business. The maxim "If it ain't broke, don't fix it" usually represents good advice.

Figure 2 contains a framework for thinking about how business functions contribute to an organization's mission. This framework was originally developed to characterize IT's contribution to a business as a whole. It is equally valid for thinking about business analytics.

Figure 2 — A strategic-impact grid gives us a rigorous framework for characterizing the risk of experimentation and change associated with introducing new big-data analytics to a business function. (From Nolan.¹¹)

Information quality provides the big-data-analytics equivalent of the IT-reliability factor. Figure 2 emphasizes the availability of IT services in general; factory- and strategic-mode business functions likewise demand high-confidence, relevant data. Operational capabilities built on outputs from big-data analytics must deliver high-confidence, actionable information.

I provided another 2×2 grid for characterizing data quality in Figure 2 of my previous installment on treating business questions scientifically. Factory- and strategic-mode functions demand outputs from big-data analytics that fall in the "Differentiating Information" quadrant. Even if the service based on big-data analytics is continuously available, lapses in information quality may lead to catastrophically flawed decisions.

Characterizations based on Figure 2 in this installment might also inform the characterization of business functions within the framework of Figure 1. The vertical axis in Figure 2 might inform our business-value assignment. The need for new business analytics — the horizontal axis of Figure 2 — might inform the performance-spectrum assignment.


Experimentation with big-data analytics as the launchpad for business innovation.

We began this discussion with the observation of an alleged conflict between traditional enterprise-information management — data warehouses and business intelligence — and the "new breed" of opportunistic big-data specialists. In reality, both are required. Innovation based on big-data analytics requires a disciplined, managed approach to change.

Changes to factory- and strategic-mode business functions from Figure 2 must be made cautiously. The business depends upon their reliability. Thomke and Manzi¹² describe a disciplined approach to business experimentation. Their examples extend beyond opportunistic big-data analytics. Nonetheless their approach also applies here. 

We perform big-data analytics experiments through controlled, non-disruptive pilots. Experimenters are free to explore new opportunities using business-representative data. Their activities are isolated from core business operations, particularly those that are factory- or strategic-mode in nature.

But what about scenarios in which big-data analytics identify opportunities for disruptive innovation? 

This question addresses scenarios leading to extreme change. The innovation opportunity may require a business to begin cannibalizing existing revenue streams before competitors exploit the opportunity first.¹³ 

Two principles guide the decision to embark on such a radical course. First, we previously observed that the information from big-data analytics assumes value commensurate with the economic returns from its exploitation. This is a foundational tenet of Business Analytics. 

Capturing this value may require re-orchestration of activities by the entire organization. Operationalizing new big-data analytics capabilities may lead to reconfiguring the business model exemplified in Figure 1. "New-breed" big-data specialists cannot do this by themselves. Taking action to capture value resides beyond the scope of most analytics specialists' activities.

Second, big-data specialists do not by themselves make the decision to embark on major organizational change. This is particularly the case for decisions about whether to divest or unwind revenue streams that are highly vulnerable to market disruption. These decisions are made by the organization's leadership.  

"New-breed" big-data specialists aid in identifying these scenarios. They serve as trusted advisors to the organization's leaders about how best to exploit the opportunities. They also assist in working through practical details of operationalizing the opportunities they discover. The results of their work — particularly when applied to factory- and strategic-mode business functions — ultimately however become subject to the disciplined governance of "traditional" data-warehouse and business-intelligence functions.

References

¹ S. LaValle, et al, "Big Data, analytics, and the path from insights to value," MIT Sloan Management Review, Winter 2011, pp. 21 - 31, http://goo.gl/MY5VK6
² J. P. Kotter, Leading Change, Boston: Harvard Business Review Press, 1996, http://goo.gl/EqglOJ
³ R. Bartlett, A practitioner's guide to business analytics, New York: McGraw-Hill, 2013, http://goo.gl/o6dTOS.
⁴ F. Buytendijk and D. Laney, "Big data business benefits are hampered by 'culture clash'," The Gartner Group, Report G00252895, September 12, 2013, http://goo.gl/8GLRgk.
⁵ "New to SOA and web services," IBM DeveloperWorks, https://www.ibm.com/developerworks/webservices/newto/.
⁶ M. Porter, Competitive Advantage of Nations, New York: The Free Press, 1990, Figure 2-3, p. 41, http://goo.gl/Qn1xGn.
⁷ R. Merrifield, et al, "The next productivity revolution," Harvard Business Review, June 2008, https://hbr.org/2008/06/the-next-revolution-in-productivity.
⁸ "The Open Group Architecture Framework," version 9.1, The Open Group, http://www.opengroup.org/togaf/
⁹ "Component business models: Making specialization real," IBM Institute of Business Value, http://goo.gl/QhucjH.
¹⁰ J. Tennant and G. Friend, Guide to Business Modelling, third edition, London: The Economist Newspaper, 2011, http://goo.gl/b36blO.
¹¹ R. Nolan and F. W. McFarlan, "Information technology and the board of directors," Harvard Business Review, October 2005, http://goo.gl/AqCdLK.
¹² S. Thomke and J. Manzi, "The discipline of business experimentation," Harvard Business Review, December 2014, pp. 70 - 79, https://hbr.org/2014/12/the-discipline-of-business-experimentation.
¹³ C. M. Christensen, The Innovator's Dilemma, Boston: Harvard Business School Press, 1997, http://goo.gl/kg2u9Y.

© The Quant's Prism, 2014

Saturday, October 4, 2014

Getting value from Big Data requires the same discipline as any other business investment.

On the surface, reports might lead us to believe that the Big Data "revolution" brings alchemy in our time. A look below the surface suggests, however, that the path to Big Data value is not without challenges.  For example:

  • Information Week recently reported that "the average company right now is getting a return of about 55 cents on the dollar" invested in Big Data analytics.  
  • A related survey by Wikibon found that "46% of Big Data practitioners report that they have only realized partial value from their Big Data deployments," while "an unfortunate 2% declared their Big Data deployments total failures..."
  • Information Week also summarized a survey by IDG and Kapow Software revealing that, "...just 23% of [business] leaders see big data projects as 'successes' thus far."
Moreover, organizations find that even measuring Big Data return on investment (RoI) remains challenging.

Gartner's "Emerging Technologies" Hype Cycle Report places Big Data at the Peak of Inflated Expectations. The consultancy also predicts that five to ten years remain before Big Data reaches the "Plateau of Productivity" stage. This positioning is telling.


What then are the obstacles to big-data value? Analysts suggest that talent scarcity, technology maturity, and less-than-perfect alignment between Big Data initiatives and business priorities lie among the culprits.  I focus here on three other barriers:
  • Failure to treat business issues scientifically;
  • Limitations in data itself; and
  • Cultural disinclinations to fact-based decision making.

Failure to treat business issues scientifically.

Big data simply represents the latest wave of business technology. Authors of the IT Infrastructure Library™² observed the influence of organizations' technology-management style on their ability to derive value. Organizations that focus on technology's strategic contribution get more out of it. Ross, et al,³ made similar observations.  Technology can either "fossilize" an organization, or it can provide a platform for "dynamic capabilities."⁴

The same is true for analytics.  Lavalle, et al,⁵ published what remains one of the more useful roadmaps to Big Data value.  Davenport, et al,⁶ echo similar concepts. Three of Lavalle's five central recommendations apply here:
  • First, think biggest — Focus on the highest-value opportunities;
  • Start in the middle — Within each opportunity, start with questions, not data; and
  • Build the parts, plan the whole — Define an information agenda to plan for the future.
Lavalle and Davenport separately portray analytics — in not so many words — as a scientific approach to addressing business challenges.

What is the state of the practice of Big Data analytics today?  A twenty-year-old issue of Scientific American⁷  gives us a thought model. Technology professions proceed through stages of maturity from craft, to commercialization, to professionalism.  The emergence of standard, repeatable practices represents a key milestone.

Sets of best practices for Business Analytics have begun to emerge. Figure 1 summarizes four of them.  Three are data-centric methods: patterns for applying the scientific method to data. Two of these were developed by vendors to aid in the use of their analytics software;⁸,⁹ the third originated in academia.¹⁰  Davenport¹¹ describes a fourth, business-centric method.

Figure 1 — A series of best-practice methods for data analytics has emerged.  The data-centric methods at best address business problems only obliquely.

What does this say about the state of the Business Analytics profession? Many of the leading practices focus analysts on business problems only incidentally, at best. Would-be adopters of Big Data analytics apply scientific approaches to data, but not necessarily to the business questions they are asked to answer.

How does lack of scientific approaches to business impede RoI from Big Data investments? Figure 2 illustrates. Haphazard treatment of business priorities by analytics practitioners leaves organizations in the "Sandbox" or "Science Project" modes.


Figure 2 — RoI from Big Data analytics investments depends on scientific treatment of business as well as the data.

Getting value from Big Data depends on scientific treatment of the business as well as the data. Organizations' due diligence in their Big Data investments demands the same scrutiny and discipline as any other major business investment. No organization would launch a marketing strategy without first considering how it intends to differentiate itself. Investing in Big Data demands the same focus.

Limitations in the data themselves.

A recent contributor to Harvard Business Review postulates that "...advanced algorithms can take a near-unlimited number of factors into account...."¹² This perception about Big Data walks a fine line.  The temptation to attribute alchemic abilities to Big Data analytics sometimes appears pervasive. Fundamental limitations underlie data science — on which Big Data analytics is based — in the same way that weight and drag constrain aerodynamics.

But what about the Big Data "revolution"? Experiences with and perceptions of consumer data mining might produce misleading intuition. Consumer analytics is not a haphazard "search for serendipity" in a random "stack" of Big Data. Our consumer devices and applications are carefully instrumented to collect precisely the information that predicts our spending behaviors.

What kind of data limitations can impede analytics RoI?  This blog has already considered two. I illustrated how simply adding more data does not increase the information analytics can produce. Adding more factors to a model only helps if they contain useful information that the previous ones lacked.

Poor data quality also degrades analytics outputs.  Analytics models are subject to the "Garbage-in/Garbage-out" syndrome.  I looked at five data-quality attributes that influence the usefulness of analytics outputs. Neglecting data quality potentially lands organizations in the "Lies, damn lies, and statistics" quadrant of Figure 2.

Finally, analytics may at best yield information that is already known. Organizations may not need Big Data to improve operational optimization, for example. Responding to competitive market forces may have already driven them to optimize their operations. Analytics would — under such a scenario — simply validate that the organization has already achieved optimality.

What then do we do about our data?  We adopt an approach that applies the scientific method jointly to our business and our data. A disciplined, scientific method leads us to the truth about the value of opportunities hidden in our data. It also reveals new information-collection needs.

Cultural disinclinations to fact-based decision making.

Lavalle⁵ observes that "the adoption barriers that face most organizations are managerial and cultural rather than related to data and technology." Deploying technology is the investment side of the Big Data RoI equation. Driving return often involves changing the business.

The Data Warehouse Institute — an organization of technical professionals dedicated to Enterprise Information Management — surveyed its membership about the closely related imperative of aligning data management to strategic business priorities.¹³ The first technology-related issue did not appear until fifth place among their responses.  The four highest-ranking responses were:
  • Data ownership and other politics;
  • Lack of governance or stewardship;
  • Lack of business sponsorship; and
  • Unclear business goals for data.
Effectively executed Big Data analytics — through joint scientific management of the business as well as the data — can provide the "What" part of answers to pressing business questions. I illustrated by showing how customer analytics might provide the basis for a targeted marketing campaign.

The "So What" part of business questions' answers addresses how the organization responds. Information's economic value is based on the value of the opportunity arising from its optimum application. Optimally applying information often requires the people in an organization to change how they operate. They must change what they measure about their business, and what they do with those measurements. These changes are scary and hard!

Kotter¹⁴ may in the end tell us as much about driving RoI from Big Data as Hastie¹⁵ or any other technical text. Harvesting analytics' value depends as much on changing an organization's behavior as it does on tools and technology. 



References

¹ Gartner Group's 2013 installment of "Hype Cycle for Emerging Technologies,"  http://goo.gl/a4xlEY.
² ITIL™ Service Operation, U.K. Office of Government Commerce, 2007, Figure 5.1, p. 81.
³ J. W. Ross, Peter Weill, D. C. Robertson, Enterprise architecture as strategy, Boston:  HBR Press, 2006, http://goo.gl/B7J5P8
⁴ C. E. Helfat, et al, Dynamic capabilities, Wiley, 2007, http://goo.gl/Gn6cjC.
⁵ S. Lavalle, et al, "Big data, analytics, and the path from insights to value," MIT Sloan Management Review, Winter 2011, http://goo.gl/8RSn5H.
⁶ T. H. Davenport, J. G. Harris, and R. Morison, Analytics at work, Boston: HBR Press, 2010, http://goo.gl/olZkKm.
⁷ W. W. Gibbs, "Software's chronic crisis," Scientific American, September 1994, pp. 86 - 95.
⁸ IBM SPSS Modeler CRISP-DM Guide, IBM Corporation, http://goo.gl/4Gg7Pa.
⁹ Enterprise Miner™ SEMMA Method, SAS Corporation, http://goo.gl/8ig4RX.
¹⁰ "The KDD Process for Extracting Useful Knowledge from Volumes of Data," Communications of the ACM, November 1996, http://goo.gl/s4gvDd.
¹¹ T. H. Davenport, "Keeping up with your quants," Harvard Business Review, July-August 2013, http://goo.gl/BrWpD1.
¹² T. C. Redman, "Algorithms make better predictions — Except when they don't," HBR Blog Network, September 17, 2014, http://goo.gl/n0kPJd.
¹³ "TDWI Technology Survey: Enterprise Data Strategies," Business Intelligence Journal, Vol 18, No. 2, March 2013.
¹⁴ J. P. Kotter, Leading change, Boston: HBR Press, 1996, http://goo.gl/EqglOJ.
¹⁵ T. Hastie, R. Tibshirani, and J. Friedman, Elements of statistical learning, New York: Springer, 2009, http://goo.gl/23tclz.

© The Quant's Prism, 2014

Sunday, September 21, 2014

Extraction of business value from your data depends critically on data quality.

We assign economic value to information commensurate with how much better we can do with it than we can without it. This summarizes the point of my previous discussion on Information Economics. Information economics represents the foundation of business analytics.

My simple demonstration in that installment featured an element of uncertainty. I stated in my introduction to data science that data science — and business analytics, by extension — can never completely remove uncertainty from business decision-making. Separating uncertainty into resolvable and non-resolvable components is the best it can do.

We can do things about resolvable components of uncertainty underlying our decision-making. We can resolve uncertainty by:
  • Judiciously selecting the best information on which to base our decisions; and
  • Maximizing the quality of the information actually used.
I demonstrated in my previous installment that more data is not always better. Bringing more data into our decision-making only helps if the new data gives us information that we do not already have.  I call attention here to the quality of data.

I begin this installment by offering a definition of data quality. I then demonstrate the assignment of economic value to quality.  I use an "elementary game" illustration along the lines of Information Economics.  I conclude with a foreshadowing of how data quality will influence subsequent installments in this series.


What do we mean by data "quality"?

Quality is one of those words that we use extensively without defining it. This blog's scientific approach requires that we define things precisely. I cannot formulate and scientifically test hypotheses about things that are not precisely defined.

An abstract thinker, I am fond of Robert Pirsig's¹ definition of quality. Pirsig fused the idea of esthetics — the study and characterization of beauty — with technical conformity. The esthetic aspect makes applying the scientific method more challenging.  (Pirsig might be challenged by my imposition of this constraint on my line of inquiry.)

Quality management occupies a prominent role in the Project Management profession. The Guide to the Project Management Body of Knowledge (PMBOK)² contains extensive sections devoted to quality management. The PMBOK itself lacks an obvious, explicit definition of quality.

A leading study guide for the Project Management Institute's (PMI) Project Management Professional (PMP) exam offers a passing definition: "Quality is defined as the degree to which the project fulfills requirements."³ This then leads us to the concept of quality as conformity, or compliance. This gets us closer to a useful definition of quality.

So, what constitutes quality in data? GS1 identifies five key attributes of data quality. GS1 is a non-profit organization dedicated to the design and implementation of global standards and solutions that improve the efficiency and visibility of supply and demand chains across sectors. Their mission — focused on facilitating information sharing — positions them well to speak with authority.

What are these five attributes?  I summarize them here.  I also offer both business and technical examples.  I use slightly different terminology.  For ease of memory, these are the "Five Cs of data quality."
  • Completeness.  A quality data set contains all of the expected information on which a business decision is based. A database administrator (DBA) would focus technically on whether all of the required fields in each record are populated.
  • Conformity.⁴ A data set must, in its entirety, conform to a well-defined semantic and logical framework. A DBA might talk about a "data schema" here. The schema captures the logical part. Business decision-making takes us beyond the DBA. Data derived from multiple sources must allow "apples versus apples" comparisons. We sometimes refer to this as "semantic" conformity.
  • Correctness.  Data is used to represent facts, or states of nature. Correct data represents facts accurately. A DBA will focus on measurement or data entry errors. Correctness is usually an attribute derived from the data source. It is, consequently, difficult to define data systems capable of internally checking data correctness.
  • Currency. We desire to make decisions based on data representing the actual state of nature. A designer of a data management system will assign version numbers and time stamps. Business decision makers may also be concerned about whether observations are made frequently enough to capture rapidly changing circumstances.
  • Confidence.⁵  Data confidence is related to "veracity", one of the key characteristics IBM assigns to "Big Data." A business decision maker considers whether the data source possesses the authority to make the assertion represented by the data. A designer of a data management system will handle this through seeking policies that designate data sources as "authoritative."
These five attributes — our "Five Cs of Data Quality" — provide us with a holistic framework for thinking about and managing data quality. The sketch below shows what checks for two of them might look like in practice.
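
A minimal sketch in R, under assumed data: programmatic checks for completeness and currency applied to a hypothetical customer table.

    # Hypothetical customer table for illustration.
    customers <- data.frame(
      id      = c(1, 2, 3),
      income  = c(95000, NA, 120000),
      updated = as.Date(c("2014-09-01", "2014-01-15", "2014-08-30"))
    )
    # Completeness: share of required fields populated in each record.
    completeness <- rowMeans(!is.na(customers))
    # Currency: flag records not refreshed within the last 180 days.
    stale <- as.numeric(Sys.Date() - customers$updated) > 180
    data.frame(id = customers$id, completeness, stale)

Correctness and confidence, by contrast, resist purely internal checks; they depend on the data's source. I now illustrate the Information Economics of data quality.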

Illustration of how data quality affects the value of information.

Market analytics gives obvious examples to illustrate opportunity costs resulting from poor data quality. Let's extend the BMW dealership illustration used a couple of installments ago. I first used it to show the economic value of information. I'll show here how poor data quality dilutes that value.

Let's say that my market size is about 100,000 potential customers. I define my market as households with net income in excess of $85,000 within a specific geographic radius. Market research gives me the following features about households in my market:

  • Household incomes, conforming to a power-law statistical distribution;⁶
  • Age of oldest current automobile, exponentially distributed;⁷ and
  • Automobile brand preferences, uniformly distributed.
Past history gives me a model I use to relate these features to the probability that I can sell them a new BMW. Figure 1 shows the results of the model.  It ranks households according to my estimate of the likelihood of sale.  It contains two curves, an "Actual", and an "Underestimated."  I'll turn to the "underestimated" curve shortly.

You see in Figure 1 that the top decile — ten percent — contains most of the effective market. These are the households most likely to buy from me. So, my effective market is actually only about 10,000 households. If the average sale generates $50,000, this market is valued at $500 million.
Figure 1 — Predictive analytics tells me which households are most likely to buy, based on income, age of current vehicle, and brand preference.

Let's say that I previously used targeted marketing to improve the sales likelihood by an average of ten percent for households in the 65% to 85% range of probability of purchase. These households are the best candidates for a targeted marketing campaign. There are 1,669 households in this group. Targeted marketing increases the size of my effective market by approximately 1,660 households.

Let's now turn to data quality. Unbeknownst to me, about a quarter of my customer preference ratings are underestimated by 20%. This is a data correctness issue.  My perception of the market is, as a result, represented by the "Underestimated" curve in Figure 1.

What does this do to my business?  The curve in Figure 1 only appears to shift by a small amount.  But the real effect is to reduce the size of my effective market by 230 households. This amounts to around $11.5 million in reduced revenue opportunity.
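
The simulation sketched below reproduces the flavor of this illustration in R. The population size and feature distributions follow the text above; the logistic scoring model and its coefficients are assumptions standing in for the dealer's historical model, so the exact household counts will differ from Figure 1.

    set.seed(1)
    n      <- 100000
    income <- 85000 * runif(n)^(-1/2.5)  # power-law (Pareto) incomes, $85K floor
    age    <- rexp(n, rate = 1/6)        # age of oldest automobile, mean 6 years
    pref   <- runif(n)                   # brand-preference score, uniform
    # Hypothetical model relating features to probability of a sale.
    p_sale <- function(pref) plogis(-7 + income/2e5 + 0.25*age + 4*pref)
    # Data-correctness defect: a quarter of preference ratings understated 20%.
    pref_bad      <- pref
    hit           <- sample(n, n/4)
    pref_bad[hit] <- 0.8 * pref_bad[hit]
    # Effective market: expected number of buyers under each view of the data.
    cat("actual:", round(sum(p_sale(pref))),
        "underestimated:", round(sum(p_sale(pref_bad))), "\n")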

Some of this loss arises from misidentifying the best potential customers for my targeted marketing. On one hand, I'm not applying targeted marketing to the customers who might benefit from it the most.  On the other, I expend resources on others who are less likely to be swayed.

Poor data quality leads to misdirection of my targeted marketing efforts. This is an information economics issue. The utility of my information — how much better I can do with it than without it — is directly related to data quality.


What, then, does data quality mean for enterprise business analytics?

We see here that quality influences the value of information. Data quality — characterized by the "Five Cs" — does not necessarily occur automatically. That quality is accompanied by cost is unsurprising. 

The PMBOK briefly addresses the cost of quality.⁸ These concepts extend to data quality management. Two data-quality costs arise:
  • Costs of conformance are accrued through expenditure of effort to explicitly address data quality objectives; and
  • Costs of nonconformance arise from adverse results due to shortfalls in data quality.
My illustration above highlights an example of the second. Data quality nonconformance dilutes the economic value of information.

Enterprises should strive to balance these two costs. In other words, they should expend efforts towards data quality commensurate with the information value requirements of a specific business context. Information economics provides the basis for business case analysis about data quality.
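
A toy business-case comparison, with assumed numbers, shows what balancing the two costs can look like. The $11.5 million figure echoes the illustration above; the remaining numbers are invented for the sketch.

    conformance_cost <- 2.0e6                  # e.g., cleansing plus stewardship
    p_defect         <- c(without = 0.25, with = 0.05)  # chance of a costly defect
    loss_if_defect   <- 11.5e6                 # revenue opportunity at risk
    expected_loss    <- p_defect * loss_if_defect
    # Positive net => the data-quality investment pays for itself.
    unname(expected_loss["without"] - (expected_loss["with"] + conformance_cost))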

Business-critical information supports decisions about issues for which considerable value is at stake. Data for these applications demand high quality.  For example:

  • Decisions about major investments often place substantial amounts of capital at risk; and
  • Financial reporting must be governed by financial controls regarding which legal requirements for attestation apply.⁹
An entire industry has developed to provide specialized data-management systems that address auditing and financial-reporting requirements. These systems are Enterprise Resource Planning (ERP) systems.

At the opposite end of the spectrum, inserting a single targeted advertisement into an individual user's web-browsing session costs very little. Netflix movie recommendations and Amazon book recommendations similarly cost very little apiece. Targeted ad placements and consumer-preference recommendations need only be accurate on average, collectively.


References.

¹ R. M. Pirsig, Zen and the art of motorcycle maintenance, New York:  Harper Collins, 1974, Kindle Edition, p. 204, http://goo.gl/si1ayP.
² Guide to the Project Management Body of Knowledge (PMBOK), fourth edition, Newtown Square, PA:  Project Management Institute, Chapter 8, http://goo.gl/anjxhd.
³ Rita Mucahy's™ PMP® Exam Prep, seventh edition, RMC Publications, 2011, p. 263, http://goo.gl/k0ml4D.
⁴ GS1 talks here about "consistency". I take the liberty combining two factors, consistency and "standards-based."
⁵ GS1 does not address data confidence.
⁶ N. N. Taleb, The black swan, New York: Random, http://goo.gl/Ermy9s. 
⁷ L. Kleinrock, Queuing systems, Volume I: Theory, New York: Wiley, 1975, pp. 65-71, http://goo.gl/Dz8mVE.
⁸ Guide to the Project Management Body of Knowledge (PMBOK), fourth edition, Newtown Square, PA:  Project Management Institute, §8.1.2, http://goo.gl/anjxhd.
e.g., "Internal control over financial reporting in exchange act periodic reports of non-accelerated filers," Security and Exchange Commission, 17 CFR Parts 210, 229, and 249, http://www.sec.gov/rules/final/2010/33-9142.pdf.

© The Quant's Prism, 2014

Monday, September 8, 2014

Analytics demonstration: How much information does your data contain?

Our current big data buzz is accompanied by numerous points of excitement and controversy. The "more data" versus "better algorithms" debate is prominent among these. This debate sometimes resembles an old Miller Lite beer commercial.

Multiple business analytics luminaries have weighed into the conversation.  These include:
  • Omar Tawakol, CEO of Oracle's data marketing operation BlueKai, concludes in Wall Street Journal's AllThingsD that "If you have to choose, having more data does indeed trump a better algorithm;"
  • Gartner's Andrew Frank emphasizes the nuance by asserting that "if you have to choose, maybe you’re on the wrong track;"
  • Information Week's Larry Stofko leads the conversation away from the dichotomy, towards "first we must make small data seamless, simple, and inexpensive;" while
  • Jeanne Ross, et al, assert in Harvard Business Review that the question is irrelevant: That "Until a company learns how to use data and analysis to support its operating decisions, it will not be in a position to benefit from big data."
This obviously is a very complicated issue.  It defies the "Tastes Great-Less Filling" dichotomy.  Very few initiatives provide simple, automatic Return on Investment (RoI). Big data analytics is no exception. 

More data only delivers marginal value if that data brings new information. I provide a technical demonstration in this installment. I try to show this in terms that don't require a mathematical PhD. 

I demonstrate this concept using a data set from the U.S. Bureau of Labor Statistics (BLS). This is the first of a series of quantitative demonstrations. The BLS data offers numerous opportunities for interesting data science demonstrations. I plan to spend much of the Autumn tinkering with it.


My data source:  Bureau of Labor Statistics' (BLS) O*Net.

O*Net is a career counseling service provided by the BLS. It is based on Holland's Occupational Themes. Psychologist John L. Holland theorized that our personality preferences dictate the types of occupations we are likely to enjoy the most. The Myers-Briggs Type Indicator (MBTI) and Herrmann Brain Dominance Instrument (HBDI) are based on similar theories.

O*Net users complete a survey of interests and lifestyle preferences.  The O*Net application applies analytics to identify occupations that are likely to be attractive to someone with a specific set of interests and preferences.

O*Net is based on a database that contains tables describing 923 different occupations. They are characterized in terms of:
  • Interests (9 attributes);
  • Knowledge (33 attributes);
  • Abilities (52 attributes); and
  • Skills (35 attributes).
Each of these 129 attributes contains two numerical values: one for importance and the other for level. So, the model characterizes each of the 923 occupations in terms of 258 attributes — or features.

What makes this data set interesting?  Four factors attract me:
  1. It is a well-structured data set, requiring limited preparation;
  2. It's open, well-documented, and easily accessible;
  3. Some degree of authoritative research underlies the data set; and
  4. It seems to contain an awful lot of attributes by which to distinguish between 923 occupations.
This last factor is the point of today's discussion.  How many of these 258 features are really needed to describe 923 occupations?

I demonstrate below that most of the information* about the 923 occupations is contained in a small subset of the 258 features. I do not argue that 258 features is too many to guide job seekers or career changers. I make my point from an information theory point of view: More data attributes do not necessarily yield more information.

How much information do 258 features really contain?

It seems that introducing some geek-speak is unavoidable at this point. My apologies to the less mathematically interested readers. I will return — after a brief plunge into the weeds — to an explanation that is more plain-English.

Principal Component Analysis (PCA) tells us how information* is distributed between the features in a data set. PCA "reshuffles" the features using a mathematical method called Singular Value Decomposition (SVD). The features in our data set are often interrelated in complicated ways. One factor might depend on several others. They contain lots of redundancy!

SVD unwinds all of the features into an "abstract" feature set. The members of this "abstract" set are unrelated to each other.  SVD also tells us how much information each "abstract" feature bears.

This is the point: Our data sets all contain a mixture of true, value-added information and noise. The actual information is usually spread across lots of different features. So is the noise. Data scientists — when they build analytics models — struggle to separate the real information from the noise. When the real information is spread across too many features, it's easier for it to become buried in the noise. The more features the information is spread across, the more diluted it is.

Figure 1 illustrates.  (I used the R open source statistics tool to produce Figure 1.) It shows how the information* is distributed across the 70 "abstract" features. Two of those features contain 95% of the information about how the 923 occupations are related to their associated skills. The remaining 68 features — containing only 5% of the information — are mostly noise.


Figure 1 — Two of 70 "abstract" features from the O*Net occupational skills table contain 95% of the information about how skills are related to occupations.
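
For the quantitatively inclined, a minimal sketch of the computation behind Figure 1 appears below. It assumes the O*Net skills table has already been flattened into a matrix with one row per occupation and one column per feature (923 × 70); "skills.csv" is a hypothetical file name.

    skills <- read.csv("skills.csv", row.names = 1)
    X <- scale(as.matrix(skills))  # center and scale the features
    s <- svd(X)                    # singular value decomposition
    # Each squared singular value measures how much variance -- the loose
    # proxy for "information" used here -- its abstract feature carries.
    info_share <- s$d^2 / sum(s$d^2)
    cum_share  <- cumsum(info_share)
    which(cum_share >= 0.95)[1]    # abstract features holding 95% of variance
    # A scree-style plot along the lines of Figure 1.
    plot(cum_share, type = "b", xlab = "Abstract feature",
         ylab = "Cumulative share of variance")
    abline(h = 0.95, lty = 2)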

What does this mean? Most of the 70 "original" features in the skills data set are highly redundant. BLS provides a summary of the Holland model's history. They recently issued a newer version than the one I studied here. This release contains 126 new occupations.

Are they adding new features to the occupations? I have not done that analysis. New features may be attractive from a career-consulting point of view. Figure 1, however, suggests that the addition of new features should be approached with caution. New features should primarily add new information that is not duplicative of the 258 already in the data set.

But, Figure 1 just shows us the Skills features that BLS uses to characterize occupations. What about the rest?  What happens when we put them all together? Figure 2 gives us the answer.

We saw in Figure 1 that two abstract features contained 95% of the skills information. We add Abilities, Knowledge, and Interests in Figure 2. We might expect that four groups of factors would lead to eight information-bearing abstract features. This is not the case!

Figure 2 — When we combine Skills, Abilities, Knowledge, and Interests, 95% of the combined information is contained in just four of 258 "abstract" features. 

The marginal value of information added by increasing our occupations data set from 70 factors in Figure 1 to 258 factors in Figure 2 is just two additional information-bearing abstract features. The Abilities, Knowledge, and Interests feature sets don't tell us much more than we know from Skills alone.

We increase our feature set by a factor of about 3.7. We only double the number of information-bearing features. This suggests we have reached some form of diminishing returns.

Leave the kitchen sink out of your data strategy for analytics.

In last week's installment I introduced some rudimentary concepts from Information Economics. The value businesses get from information is related to the opportunity that information gives to take more profitable actions. 

By extension, does data bring value? I illustrate above that adding data only increases value if the new data tells us something we don't already know. From an information point of view, Abilities, Knowledge, and Interests don't tell us much more about the 923 occupations in BLS' O*Net database than the Skills features alone. Adding those three data sets increased our feature set from 70 to 258, but only gave us two additional information-bearing features.

But handling data is accompanied by cost! Technology vendors call our attention to the plummeting price of storage. Storing information is becoming increasingly cheap. Extracting value from data?  Not necessarily so.

The big data buzz might tempt us to expect that cost-free opportunities exist to extract value from data. We routinely are regaled with stories about Internet companies' use of data mining to illuminate many of the most intimate details of our lives. Google, Facebook, Yahoo!, and others accomplish this through strategic collection of data elements that introduce new information. They appear to have borrowed knowledge from the forensic computer security business to disinter our secrets.

What's the bottom line? Organizations seeking to extract value from data need to focus on value in their use of data. Begin with an Information Economics philosophy. Ask questions like:
  • What information will give me improved economic value?
  • How much value-bearing information is contained in the data that I already have? and
  • How much will it cost to handle the data that yields the new value-bearing information I need?
Next installment:  Elements of the cost to extract information from data.



* I use the term "information" here imprecisely. "Singular values" produced by SVD are not information, technically speaking. They are related to information content. I intentionally commit this error in terminology for convenience of discussion.



© The Quant's Prism, 2014

Thursday, August 28, 2014

Information Economics: The Foundation of Business Analytics.

The technology industry often produces multiple, not-altogether-consistent definitions of the latest "hot" thing. Business analytics is no exception.  Confusion can be the result.

Hence, the intensity of my focus on definitions. Having previously defined "data science", I now drag readers through an exercise in defining business analytics.  At the risk of appearing obsessively compulsive, I repeatedly emphasize the business context.

The charter statement for this blog emphasizes the study of the application of data science to business problems.  I seek to apply the scientific method to its practice.

"Data and statistical methods" have become inseparably associated with the "how" of business analytics.  I want to dig deeper.

Science is about answering "why." Since business is the domain of interest for business analytics, we should look to economics as a candidate foundational science. In the following, I:

  • Make the case for a foundation for business analytics in economics;
  • Introduce a specialized domain of economics on which business analytics is based; and
  • Provide a simplified illustration of its use.
Why, reader, should you care about these things? Big data and business analytics are subjects of many bold claims regarding their transformational abilities.  Some of these claims are valid, and some are not.  I seek here to help you separate the science from the alchemy.

What is business analytics?

Service-oriented architecture (SOA) — a source of significant tech-industry buzz during the last decade — provides a case study in definitions.  Distinct definitions appeared to arise for each stakeholder class. The Open Group — a non-profit organization promoting open technology standards — offers two definitions of SOA.  Software vendors tend to emphasize the key technology components.  For SOA, those are an Enterprise Service Bus (ESB) and a services registry.

Merrifield, et al,¹ identified the business payoff for a SOA approach to strategic technology management. SOA promises a cost-effective approach to mass customization of information technology. They describe a SOA planning method. But they did not explicitly define the practice.

Enterprise architecture (EA) provides another example.  I like Gartner's definition for two reasons:
  • The authority of its source (i.e., Gartner said it); and
  • Its focus on "enterprise" in the business sense, independent of the technology.
The technology community captured the EA term to connote the architecture of the IT infrastructure — for either an organizational enterprise or for an individual system. Technology consulting group Forrester Research invented a new term, Business Architecture, apparently in response. Ross and Weill² established the need for a business-centric definition.

Why this circuitous path?  I want to make the case for an economics foundation for business analytics. Getting definitions right is important to my case. Statistical (and deterministic) models, modeling tools, and enterprise information management technologies are "how" business analytics is done.  Science seeks to answer the question, "Why?"

The important point here is that we focus business analytics on answering questions that lead to measurable, net-positive business outcomes.  We seek a scientific underpinning from which to achieve this objective.

What, then, does this mean for the discipline of business analytics? I continue with the pattern of a business-centric perspective. I also want to define it as precisely as possible. Wikipedia, the reflexive "go to" source, offers a pretty good definition:
Business analytics (BA) refers to the skills, technologies, practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning. Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods.
This definition borrows from Bartlett³ (whom I have yet to read, but have added to my Kindle wish list).  Davenport⁴ defines the payoff — "improving performance in key business domains" — without explicitly providing a definition.

So, business analytics is about quantitative characterization of business performance. What then is business about?  Those who have sat through an MBA program might observe that the majority of the curriculum is derived from economics and its applications. An economic foundation for business analytics therefore seems reasonable.

Hence, an economics grounding for business analytics.  I preserve here the distinction between business analytics and econometrics.  The two disciplines use many of the same tools. Business analytics, however, focuses on an individual organization.  It arguably constitutes a subset of econometrics.


The foundation of business analytics

Information economics is business analytics in its most fundamental form. Information economics is the science of assigning economic value to information.  It combines principles from the following disciplines:
  • Game theory, by which economic transactions are defined and modeled;
  • Information theory, with bases both in engineering and psychology disciplines; and
  • Microeconomics.
Figure 1 illustrates. 

Figure 1 — Information Economics resides at the intersection of four more familiar disciplines.

Practitioners of Info Economics employ a clearly defined toolset:
  • Information modeling,⁵ borrowed from information theory, precisely represents the distribution of elements of information among participants in an interaction;
  • Game theory contributes patterns for archetypical transactions between participants in an information exchange; and
  • Microeconomics provides bases for economic valuation of elements of information involved in a transaction between counterparties.
The presence of uncertainty in economic transactions introduces probability theory as well.

Information economics provides the foundation for many well known theories about the operation of financial markets.⁶  The interplay between bid-ask prices in a financial exchange, for example, telegraphs considerable information about counterparties' intentions and abilities without explicitly "showing their hands."  The Efficient Market Hypothesis finds partial justification in Info Economics.

Marketing economics is replete with examples.  Applications occur of course in other business disciplines.  For example, information economics can inform investment decision making.  Its principles can also guide what aspects of operational cost and efficiency are most worthy of measuring.

As an aside, the similarities between information economics and real options⁷ are striking. Real options theory assigns economic value to flexibility in making investment decisions.  I leave that discussion for a future installment.


A simple, contrived illustration of assigning economic value to information†

My illustration here is based on the "Elementary Game,"⁸ one of the simplest models from game theory. It resembles the Binary Symmetric Channel (BSC). I first saw the BSC in communications courses while studying electrical engineering.  Info economics and communications theory (the engineering variety) share roots in information theory. That their toolsets resemble each other does not surprise me.

Figure 2 — a BSC illustration — gives us a passable representation of the "Elementary Game." (Note:  Texts in communications theory⁹ and info economics share the "Alice" and "Bob" notation.) The a priori events appear on the left-hand side.  There is some probability of either of two events occurring.  "Alice" initiates one event or the other.  

"Bob" receives a "signal" indicating — with a probability p that it is correct — which event "Alice" effected.  He therefore views the event from an a posteriori perspective.  Based on knowledge of the a priori probability of what "Alice" did, the probability that the signal is correct, and the cost/benefit of either of two resulting courses of actions, "Bob" must decide what his optimum next step is.


Figure 2 — The "Elementary Game" from game theory resembles the Binary Symmetric Chanel (BSC), a basic building block of communication theory.  (Source: Wikipedia, http://en.wikipedia.org/wiki/Binary_symmetric_channel)

Let's see the "Elementary Game" in action. Say that I'm a BMW dealer. I operate in a market that generates 10,000 sales annually.  I capture an average of 1,000 of those sales — or a 10% market share. Sales produce an average of $50,000 in revenue.  I have traditionally used mass media — broadcast and newspapers — for my advertising.


Let's now say that I can identify decision factors — information — that influence buyers' decisions about whether and from whom to make a purchase of a new car in my market segment.  These factors might include:
  • Capacity to make the purchase (e.g., disposable income);
  • Brand preferences; and
  • Age of current vehicle;
among other factors. I can use this information as the basis for a targeted advertising campaign that — with probability of 25% — increases my market share to 15%. Assume that (in order to keep this example simple) the targeted ad campaign costs the same as the mass media campaign.

How much is this information worth?  Information economics defines the value of information as:  "...the increase in utility from receiving the information and from optimally reacting to it."⁸  So, without the information I can take a course of action leading to one outcome — a specific revenue level in our case.  Given the information, I can make a decision to pursue an alternative course of action.  This alternative leads to a different outcome. I characterize my two alternatives using the same measure.

The increase in utility in our example is the change in revenue realized from a targeted ad campaign. From elementary probability theory,

ΔRevenue = Pr{ΔSales} × ΔSales × Average revenue/sale
         = 25% × 500 × $50,000
         ≈ $6,250,000.

Information that can — with probability 25% — increase my market share from 10% to 15% is worth about $6 million to me!  This trivially simple illustration demonstrates the power of Google's business model.
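
The same expected-value arithmetic, written out in R as a check:

    p_lift       <- 0.25          # probability the targeted campaign succeeds
    extra_sales  <- 0.05 * 10000  # a five-point share gain on 10,000 annual sales
    rev_per_sale <- 50000
    p_lift * extra_sales * rev_per_sale  # 6,250,000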

Information economics at work

I illustrated a scientific approach — based on Information Economics — to assigning value to two specific elements of information in a specific business context. These essential elements of information are:

  • What change in market share might I be able to effect with a targeted ad campaign; and
  • What is the probability that my targeted ad campaign will produce that result.
This gives me an economic value of those two information elements.  I apply business analytics — data and statistical methods, tools, data scientists — to obtain the answers to these specific questions.  If the cost of getting this information is less than its economic value, then applying business analytics here yields a net-positive economic benefit.

This illustrative example admittedly oversimplifies things. Business decision makers should base their decisions on a range of probabilities. Few business questions lead to discrete, binary answers.

So, what does all this mean?  First, information economics provides a scientific approach to business case analyses for business analytics initiatives.  The value of business analytics is measured by its economic returns.  We now have a rigorous approach to determining the "goodness" of business analytics initiatives.

Second, this leads to criteria for a strategy for adoption of business analytics by organizations.  Lavalle, et al, advise would-be data-driven organizations to, "Start with questions, not data!"¹⁰  Successful adopters of business analytics as a foundation for decision making keep a laser-like focus on:

  • Business outcomes; and 
  • The questions that lead to them. 
This also implies a gradual, evolutionary approach to adoption.  But more on that, later.

Next installment:  Is more data always better?


Note:  Missed my cadence last week.  A short-notice proposal turned a slack week into a frenetic one.  But back in the saddle again, this week.


¹ R. Merrifield, J. Calhoun, and D. Stevens, "The next revolution in productivity," Harvard Business Review, June 2008,  http://goo.gl/Y58xqm.
² J. W. Ross and P. Weill, Enterprise architecture as strategy, Boston:  HBR Press, 2006, http://goo.gl/B7J5P8.
³ R. Bartlett, A practitioner's guide to business analytics, New York: McGraw-Hill, 2013, http://goo.gl/o6dTOS.
⁴ T. H. Davenport, J. G. Harris, and R. Morison, Analytics at work, Boston:  HBR Press, 2010, Location 112, Kindle Edition, http://goo.gl/olZkKm.
⁵ L. Samuelson, "Modeling of knowledge in economic analysis," Journal of Economic Literature, June 2004, pp. 367-402.
⁶ M. K. Brunnermeier, Asset pricing under asymmetric information, London:  Oxford, 2001, http://goo.gl/7IMFDv.
⁷ See, e.g., M. Amram and N. Kulatilaka, Real options, Boston: HBR Press, 1999, http://goo.gl/6Bswjk.
⁸ M. Bütler, Information Economics, New York:  Routledge, 2007, p. 42, Kindle Edition, http://goo.gl/1zZKQ1.
⁹ See, e.g., B. Schneier, Applied cryptography, 2nd ed., New York:  Wiley, 2001.
¹⁰ S. Lavalle, et al, "Big data, analytics, and the path from insights to value," MIT Sloan Management Review, Winter 2011, pp. 21 - 31, http://goo.gl/8RSn5H.
† This example is purely fictional.  Any resemblance to experiences by actual BMW dealerships is purely coincidental.


© The Quant's Prism, 2014