Sunday, September 21, 2014

Extraction of business value from your data depends critically on data quality.

We assign economic value to information commensurate with how much better we can do with it than we can without it. This summarizes the point of my previous discussion on Information Economics. Information economics represents the foundation of business analytics.

My simple demonstration in that installment featured an element of uncertainty. I stated in my introduction to data science that data science — and business analytics, by extension — can never completely remove uncertainty from business decision-making. Separating uncertainty into resolvable and non-resolvable components is the best it can do.

We can, however, act on the resolvable components of the uncertainty underlying our decision-making. We resolve that uncertainty by:
  • Judiciously selecting the best information on which to base our decisions; and
  • Maximizing the quality of the information actually used.
I demonstrated in my previous installment that more data is not always better. Bringing more data into our decision-making only helps if the new data gives us information that we do not already have. Here I call attention to the second consideration: the quality of that data.

I begin this installment by offering a definition of data quality. I then demonstrate the assignment of economic value to quality.  I use an "elementary game" illustration along the lines of Information Economics.  I conclude with a foreshadowing of how data quality will influence subsequent installments in this series.


What do we mean by data "quality"?

Quality is one of those words that we use extensively without defining it. This blog's scientific approach requires that we define things precisely. I cannot formulate and scientifically test hypotheses about things that are not precisely defined.

An abstract thinker, I am fond of Robert Pirsig's¹ definition of quality. Pirsig fused the idea of esthetics — the study and characterization of beauty — with technical conformity. The esthetic aspect makes applying the scientific method more challenging.  (Pirsig might be challenged by my imposition of this constraint on my line of inquiry.)

Quality management occupies a prominent role in the project management profession. The Project Management Body of Knowledge (PMBOK)² contains extensive sections devoted to quality management. The PMBOK itself, however, lacks an obvious, explicit definition of quality.

A leading study guide for the Project Management Institute's (PMI) Project Management Professional (PMP) exam offers a passing definition: "Quality is defined as the degree to which the project fulfills requirements."³ This leads us to the concept of quality as conformity, or compliance, which gets us closer to a useful definition.

So, what constitutes quality in data? GS1 identifies five key attributes of data quality. GS1 is a non-profit organization dedicated to the design and implementation of global standards and solutions that improve the efficiency and visibility of supply and demand chains across sectors. Its mission — focused on facilitating information sharing — positions it well to speak with authority.

What are these five attributes? I summarize them here, offering both business and technical examples, and I use slightly different terminology from GS1's. For ease of memory, these are the "Five Cs of data quality":
  • Completeness.  A quality data set contains all of the expected information on which a business decision is based. A database administrator (DBA) would focus technically on whether all of the required fields in each record are populated.
  • Conformity.⁴ A data set must, in its entirety, conform to a well-defined semantic and logical framework. A DBA might talk about a "data schema" here; the schema captures the logical part. Business decision-making takes us beyond the DBA: data derived from multiple sources must allow "apples versus apples" comparisons. We sometimes refer to this as "semantic" conformity.
  • Correctness.  Data is used to represent facts, or states of nature. Correct data represents facts accurately. A DBA will focus on measurement or data entry errors. Correctness is usually an attribute derived from the data source. It is, consequently, difficult to define data systems capable of internally checking data correctness.
  • Currency. We desire to make decisions based on data representing the actual, current state of nature. A designer of a data management system will assign version numbers and time stamps. Business decision makers may also be concerned about whether observations are made frequently enough to capture rapidly changing circumstances.
  • Confidence.⁵ Data confidence is related to "veracity", one of the key characteristics IBM assigns to "Big Data." A business decision maker considers whether the data source possesses the authority to make the assertion represented by the data. A designer of a data management system will handle this by seeking policies that designate data sources as "authoritative."
These five attributes — our "Five Cs of Data Quality" — provide us with a holistic framework for thinking about and managing data quality.  I now illustrate the Information Economics of data quality.
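
Before turning to that illustration, here is a minimal sketch, in Python with pandas, of how the Five Cs might be checked automatically against a customer table. The column names (household_id, income, brand_preference, updated_at, source), the thresholds, and the list of authoritative sources are illustrative assumptions of mine, not part of GS1's framework; correctness and confidence, in particular, can only be approximated by internal checks.

  import pandas as pd

  # Hypothetical policy list: sources the enterprise designates as authoritative.
  AUTHORITATIVE_SOURCES = {"crm", "credit_bureau"}

  def data_quality_report(df: pd.DataFrame) -> dict:
      report = {}

      # Completeness: are all required fields populated in every record?
      required = ["household_id", "income", "brand_preference", "updated_at", "source"]
      report["completeness"] = 1.0 - df[required].isna().any(axis=1).mean()

      # Conformity: do values respect the declared schema and semantics
      # (numeric, non-negative income; preference scores on a 0-to-1 scale)?
      income_ok = pd.to_numeric(df["income"], errors="coerce").ge(0)
      pref_ok = df["brand_preference"].between(0.0, 1.0)
      report["conformity"] = (income_ok & pref_ok).mean()

      # Correctness: hard to verify without returning to the source;
      # flag implausible outliers as a rough proxy.
      report["suspect_correctness"] = (df["income"] > 5_000_000).mean()

      # Currency: how stale are the observations?
      age = pd.Timestamp.now() - pd.to_datetime(df["updated_at"])
      report["currency_median_age_days"] = float(age.dt.days.median())

      # Confidence: fraction of records drawn from authoritative sources.
      report["confidence"] = df["source"].isin(AUTHORITATIVE_SOURCES).mean()

      return report

A report along these lines gives a repeatable reading of each attribute before the data is put in front of a decision maker.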

Illustration of how data quality affects the value of information.

Market analytics gives obvious examples to illustrate opportunity costs resulting from poor data quality. Let's extend the BMW dealership illustration used a couple of installments ago. I first used it to show the economic value of information. I'll show here how poor data quality dilutes that value.

Let's say that my market size is about 100,000 potential customers. I define my market as households with net income in excess of $85,000 within a specific geographic radius. Market research gives me the following features of the households in my market:

  • Household incomes, conforming to a power-law statistical distribution;⁶
  • Age of oldest current automobile, exponentially distributed;⁷ and
  • Automobile brand preferences, uniformly distributed.
Past history gives me a model that relates these features to the probability that I can sell each household a new BMW. Figure 1 shows the results of the model. It ranks households according to my estimate of the likelihood of sale. It contains two curves, "Actual" and "Underestimated." I'll turn to the "Underestimated" curve shortly.

You see in Figure 1 that the top decile — ten percent — contains most of the effective market. These are the households that are really likely to buy from me. So, my effective market is actually only about 10,000 households. If the average sale generates $50,000, this market is valued at $500 million.
Figure 1 — Predictive analytics tells me which households are most likely to buy, based on income, age of current vehicle, and brand preference.
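
For readers who want to reproduce the flavor of Figure 1, the sketch below (continuing in Python, with NumPy) generates a synthetic market using the three feature distributions listed above and ranks households with an assumed logistic scoring model. The Pareto exponent, the exponential mean, and the model coefficients are illustrative assumptions of mine; the actual model behind Figure 1 is not reproduced here.

  import numpy as np

  rng = np.random.default_rng(0)
  N = 100_000                                  # market size from the text

  # Feature distributions described above (parameters are assumed).
  income = 85_000 * (1.0 + rng.pareto(a=2.5, size=N))  # power-law incomes above $85k
  vehicle_age = rng.exponential(scale=6.0, size=N)     # age of oldest automobile, in years
  preference = rng.uniform(0.0, 1.0, size=N)           # brand preference score

  # Assumed scoring model: likelihood of sale rises with income,
  # vehicle age, and brand preference.
  z = (0.6 * np.log(income / 85_000)
       + 0.3 * vehicle_age / 6.0
       + 2.0 * preference
       - 3.0)
  p_sale = 1.0 / (1.0 + np.exp(-z))

  ranked = np.sort(p_sale)[::-1]        # analogue of the "Actual" curve in Figure 1
  top_decile = ranked[: N // 10]        # roughly 10,000 households
  print(f"revenue opportunity in the top decile = ${50_000 * top_decile.size:,}")  # about $500 million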

Let's say that I previously used targeted marketing to improve the sales likelihood by an average of ten percent for households in the 65% to 85% range of probability of purchase. These households are the best candidates for a targeted marketing campaign. There are 1,669 households in this group. Targeted marketing increases the size of the effective market by approximately 1,660 households.

Let's now turn to data quality. Unbeknownst to me, about a quarter of my customer preference ratings are underestimated by 20%. This is a data correctness issue.  My perception of the market is, as a result, represented by the "Underestimated" curve in Figure 1.

What does this do to my business? The curve in Figure 1 only appears to shift a small amount. But the real effect is to reduce the size of my effective market by 230 households. This amounts to around $11.5 million in reduced revenue opportunity.
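
The mechanics of that loss can be imitated by continuing the sketch above (it reuses income, vehicle_age, preference, p_sale, rng, and N from that block): scale a random quarter of the preference ratings down by 20%, rescore, and compare the perceived top decile with the true one. The counts it produces are only illustrative; the 230-household and $11.5 million figures come from the model actually behind Figure 1.

  # Correctness defect: about a quarter of brand preference ratings
  # are underestimated by 20%.
  corrupted = preference.copy()
  bad = rng.random(N) < 0.25
  corrupted[bad] *= 0.8

  # Rescore with the same assumed model, now fed the corrupted ratings.
  z_bad = (0.6 * np.log(income / 85_000)
           + 0.3 * vehicle_age / 6.0
           + 2.0 * corrupted
           - 3.0)
  p_perceived = 1.0 / (1.0 + np.exp(-z_bad))

  true_top = set(np.argsort(p_sale)[::-1][: N // 10])
  perceived_top = set(np.argsort(p_perceived)[::-1][: N // 10])

  missed = true_top - perceived_top     # strong prospects I no longer recognize
  print(f"high-likelihood households misranked out of the top decile: {len(missed):,}")
  print(f"revenue opportunity at risk = ${50_000 * len(missed):,}")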

Some of this loss arises from misidentifying the best potential customers for my targeted marketing. On one hand, I'm not applying targeted marketing to the customers who might benefit from it the most. On the other, I expend resources on others who are less likely to be swayed.

Poor data quality leads to misdirection of my targeted marketing efforts. This is an information economics issue. The utility of my information — how much better I can do with it than without it — is directly related to data quality.


What, then, does data quality mean for enterprise business analytics?

We see here that quality influences the value of information. Data quality — characterized by the "Five Cs" — does not necessarily occur automatically. It is unsurprising, then, that quality comes at a cost.

The PMBOK briefly addresses the cost of quality.⁸ These concepts extend to data quality management. Two data quality costs arise:
  • Costs of conformance are accrued through expenditure of effort to explicitly address data quality objectives; and
  • Costs of nonconformance arise from adverse results due to shortfalls in data quality.
My illustration above highlights an example of the second. Data quality nonconformance dilutes the economic value of information.

Enterprises should strive to balance these two costs. In other words, they should expend efforts towards data quality commensurate with the information value requirements of a specific business context. Information economics provides the basis for business case analysis about data quality.
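
As a back-of-the-envelope illustration of that balance, the business case reduces to comparing the cost of conformance against the expected cost of nonconformance. All of the figures in the snippet below are hypothetical assumptions, not recommendations:

  # Hypothetical comparison of the two costs of data quality.
  cost_of_conformance = 2_000_000    # assumed annual spend on cleansing, validation, stewardship

  p_defect = 0.25                    # assumed chance a material quality defect goes uncaught
  loss_if_defect = 11_500_000        # diluted revenue opportunity, as in the dealership illustration
  expected_nonconformance = p_defect * loss_if_defect

  print(f"expected cost of nonconformance = ${expected_nonconformance:,.0f}")
  print(f"quality program pays for itself: {cost_of_conformance < expected_nonconformance}")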

Business-critical information supports decisions for which considerable value is at stake. Data for these applications demands high quality. For example:

  • Decisions about major investments often place substantial amounts of capital at risk; and
  • Financial reporting must be governed by financial controls to which legal requirements for attestation apply.⁹
An entire industry has developed to provide specialized data management systems, known as Enterprise Resource Planning (ERP) systems, that address auditing and financial reporting requirements.

At the opposite end of the spectrum, inserting a single targeted advertisement into an individual user's web browsing session costs very little. A Netflix movie recommendation or an Amazon book recommendation similarly costs very little. What matters is that these ad placements and preference recommendations collectively achieve sufficient accuracy on average; any single miss costs little.


References.

¹ R. M. Pirsig, Zen and the art of motorcycle maintenance, New York:  Harper Collins, 1974, Kindle Edition, p. 204, http://goo.gl/si1ayP.
² A guide to the project management body of knowledge (PMBOK), fourth edition, Newtown Square, PA: Project Management Institute, Chapter 8, http://goo.gl/anjxhd.
³ Rita Mulcahy's™ PMP® Exam Prep, seventh edition, RMC Publications, 2011, p. 263, http://goo.gl/k0ml4D.
⁴ GS1 talks here about "consistency". I take the liberty of combining two factors, consistency and "standards-based."
⁵ GS1 does not address data confidence.
⁶ N. N. Taleb, The black swan, New York: Random House, http://goo.gl/Ermy9s.
⁷ L. Kleinrock, Queueing systems, Volume I: Theory, New York: Wiley, 1975, pp. 65-71, http://goo.gl/Dz8mVE.
⁸ A guide to the project management body of knowledge (PMBOK), fourth edition, Newtown Square, PA: Project Management Institute, §8.1.2, http://goo.gl/anjxhd.
e.g., "Internal control over financial reporting in exchange act periodic reports of non-accelerated filers," Security and Exchange Commission, 17 CFR Parts 210, 229, and 249, http://www.sec.gov/rules/final/2010/33-9142.pdf.

© The Quant's Prism, 2014
