The Quant's Prism: September 2014

Sunday, September 21, 2014

Extraction of business value from your data depends critically on data quality.

We assign economic value to information commensurate with how much better we can do with it than we can without it. This summarizes the point of my previous discussion on Information Economics. Information economics represents the foundation of business analytics.

My simple demonstration in that installment featured an element of uncertainty. I stated in my introduction to data science that data science — and business analytics, by extension — can never completely remove uncertainty from business decision-making. Separating uncertainty into resolvable and non-resolvable components is the best it can do.

We can do things about resolvable components of uncertainty underlying our decision-making. We can resolve uncertainty by:

Judiciously selecting the best information on which to base our decisions; and
Maximizing the quality of the information actually used.

I demonstrated in my previous installment that more data is not always better. Bringing more data into our decision-making only helps if the new data gives us information that we do not already have. I call attention here to the quality of data.

I begin this installment by offering a definition of data quality. I then demonstrate the assignment of economic value to quality. I use an "elementary game" illustration along the lines of Information Economics. I conclude with a foreshadowing of how data quality will influence subsequent installments in this series.

What do we mean by data "quality"?

Quality is one of those words that we use extensively without defining it. This blog's scientific approach requires that we define things precisely. I cannot formulate and scientifically test hypotheses about things that are not precisely defined.

An abstract thinker, I am fond of Robert Pirsig's¹ definition of quality. Pirsig fused the idea of esthetics — the study and characterization of beauty — with technical conformity. The esthetic aspect makes applying the scientific method more challenging. (Pirsig might be challenged by my imposition of this constraint on my line of inquiry.)

Quality management occupies a prominent role in the project management profession. The Project Management Body of Knowledge (PMBOK)² contains extensive sections devoted to quality management. The PMBOK itself lacks an obvious, explicit definition of quality.

A leading study guide for the Project Management Institute's (PMI) Project Management Professional (PMP) exam offers a passing definition: "Quality is defined as the degree to which the project fulfills requirements."³ This then leads us to the concept of quality as conformity, or compliance. This gets us closer to a useful definition of quality.

So, what constitutes quality in data? GS1 identifies five key attributes of data quality. GS1 is a non-profit organization dedicated to the design and implementation of global standards and solutions to improve the efficiency and visibility of supply and demand chains globally and across sectors. Their mission, focused on facilitating information sharing, positions them well to speak with authority.

What are these five attributes? I summarize them here. I also offer both business and technical examples. I use slightly different terminology than GS1 does. For ease of memory, these are the "Five Cs of data quality."

Completeness. A quality data set contains all of the expected information on which a business decision is based. A database administrator (DBA) would focus technically on whether all of the required fields in each record are populated.
Conformity.⁴ A data set must, in its entirety, conform to a well-defined semantic and logical framework. A DBA might talk about a "data schema" here. The schema captures the logical part. Business decision-making takes us beyond the DBA. Data derived from multiple sources must allow "apples versus apples" comparisons. We sometimes refer to this as "semantic" conformity.
Correctness. Data is used to represent facts, or states of nature. Correct data represents facts accurately. A DBA will focus on measurement or data entry errors. Correctness is usually an attribute derived from the data source. It is, consequently, difficult to define data systems capable of internally checking data correctness.
Currency. We desire to make decisions based on data representing the actual state of nature. A designer of a data management system will assign version numbers and time stamps. Business decision makers may also be concerned about whether observations are made frequently enough to capture rapidly changing circumstances.
Confidence.⁵ Data confidence is related to "veracity", one of the key characteristics IBM assigns to "Big Data." A business decision maker considers whether the data source possesses the authority to make the assertion represented by the data. A designer of a data management system will handle this through seeking policies that designate data sources as "authoritative."
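
Four of the Five Cs lend themselves to simple automated checks. The following is a minimal sketch, not a definitive implementation. The field names, the list of authoritative sources, and the one-year currency threshold are all hypothetical choices made for illustration. Correctness is deliberately left out because, as noted above, a data system usually cannot verify it internally.

```python
from datetime import datetime, timedelta

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {
    "household_id": int,
    "income": float,
    "vehicle_age_years": float,
    "source": str,
    "observed_at": datetime,
}
AUTHORITATIVE_SOURCES = {"dealer_crm", "market_survey"}  # assumed policy list
MAX_AGE = timedelta(days=365)  # assumed currency threshold


def quality_flags(record, now):
    """Report which of four of the Five Cs a single record satisfies."""
    # Completeness: every required field is populated.
    complete = all(record.get(field) is not None for field in SCHEMA)
    # Conformity: every populated field matches the schema's type.
    conforming = complete and all(
        isinstance(record[field], expected) for field, expected in SCHEMA.items()
    )
    # Currency: the observation is recent enough.
    observed_at = record.get("observed_at")
    current = isinstance(observed_at, datetime) and now - observed_at <= MAX_AGE
    # Confidence: the source is designated as authoritative.
    confident = record.get("source") in AUTHORITATIVE_SOURCES
    return {
        "completeness": complete,
        "conformity": conforming,
        "currency": current,
        "confidence": confident,
    }


NOW = datetime(2014, 9, 21)
good_record = {
    "household_id": 1, "income": 120000.0, "vehicle_age_years": 6.5,
    "source": "dealer_crm", "observed_at": datetime(2014, 6, 1),
}
bad_record = {
    "household_id": 2, "income": None, "vehicle_age_years": 3.0,
    "source": "web_scrape", "observed_at": datetime(2012, 1, 1),
}

print(quality_flags(good_record, NOW))
print(quality_flags(bad_record, NOW))
```

The second record fails all four checks: its income is missing, so it is neither complete nor conforming, its observation is more than a year old, and its source is not on the assumed authoritative list.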

These five attributes — our "Five Cs of Data Quality" — provide us with a holistic framework for thinking about and managing data quality. I now illustrate the Information Economics of data quality.

Illustration of how data quality affects the value of information.

Market analytics gives obvious examples to illustrate opportunity costs resulting from poor data quality. Let's extend the BMW dealership illustration used a couple of installments ago. I first used it to show the economic value of information. I'll show here how poor data quality dilutes that value.

Let's say that my market size is about 100,000 potential customers. I define my market as households with net income in excess of $85,000 within a specific geographic radius. Market research gives me the following features about households in my market:

Household incomes, conforming to a power-law statistical distribution;⁶
Age of oldest current automobile, exponentially distributed;⁷ and
Automobile brand preferences, uniformly distributed.

Past history gives me a model I use to relate these features to the probability that I can sell each household a new BMW. Figure 1 shows the results of the model. It ranks households according to my estimate of the likelihood of sale. It contains two curves, an "Actual" and an "Underestimated." I'll turn to the "Underestimated" curve shortly.

You see in Figure 1 that the top decile (ten percent) contains most of the effective market. These are the households that are really likely to buy from me. So, my effective market is actually only about 10,000 households. If the average sale generates $50,000, this market is valued at $500 million.

Figure 1 — Predictive analytics tells me which households are most likely to buy, based on income, age of current vehicle, and brand preference.

Let's say that I previously used targeted marketing to improve the sales likelihood by an average of ten percent for households in the 65% to 85% range of probability of purchase. These households are the best candidates for a targeted marketing campaign. There are 1,669 households in this group. Targeted marketing increases the size of the effective market by approximately 1,660 households.

Let's now turn to data quality. Unbeknownst to me, about a quarter of my customer preference ratings are underestimated by 20%. This is a data correctness issue. My perception of the market is, as a result, represented by the "Underestimated" curve in Figure 1.

What does this do to my business? The curve in Figure 1 appears to shift only a small amount. But the real effect is to reduce the size of my effective market by 230 households. At an average of $50,000 per sale, this amounts to around $11.5 million in reduced revenue opportunity.
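
The dollar figures in this illustration follow from simple arithmetic. A short sketch makes the steps explicit. All inputs are taken from the text above, and the variable names are mine.

```python
MARKET_SIZE = 100_000       # potential customers in my market
AVERAGE_SALE_USD = 50_000   # average revenue per sale
HOUSEHOLDS_LOST = 230       # reduction caused by the underestimated ratings

# The top decile holds most of the effective market.
effective_market = MARKET_SIZE // 10
market_value = effective_market * AVERAGE_SALE_USD
lost_revenue = HOUSEHOLDS_LOST * AVERAGE_SALE_USD
lost_share = HOUSEHOLDS_LOST / effective_market

print(f"Effective market: {effective_market:,} households")
print(f"Market value:     ${market_value:,}")
print(f"Lost opportunity: ${lost_revenue:,} ({lost_share:.1%} of the effective market)")
```

A data correctness problem affecting a quarter of the ratings costs about 2.3% of the effective market in this illustration.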

Some of this loss arises from misidentifying the best potential customers for my targeted marketing. On one hand, I'm not applying targeted marketing to customers who might benefit from it the most. On the other, I expend resources on others who are less likely to be swayed.

Poor data quality leads to misdirection of my targeted marketing efforts. This is an information economics issue. The utility of my information — how much better I can do with it than without it — is directly related to data quality.
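
The misdirection can be sketched with a small simulation. This is not the model behind Figure 1, which I have not reproduced here. As a loudly labeled assumption, true purchase probabilities are drawn from a Beta(1, 3) distribution as a hypothetical stand-in. The 65% to 85% band, the one-quarter corruption rate, and the 20% underestimate come from the text.

```python
import random

rng = random.Random(2014)  # fixed seed so the sketch is repeatable

N_HOUSEHOLDS = 100_000
BAND_LOW, BAND_HIGH = 0.65, 0.85  # targeted-marketing band from the text
CORRUPTED_SHARE = 0.25            # "about a quarter" of ratings
UNDERESTIMATE = 0.20              # underestimated by 20%


def in_band(probability):
    """True if a household falls in the targeted-marketing band."""
    return BAND_LOW <= probability < BAND_HIGH


# ASSUMPTION: Beta(1, 3) is a hypothetical stand-in for the real model output.
true_p = [rng.betavariate(1.0, 3.0) for _ in range(N_HOUSEHOLDS)]

# Corrupt roughly a quarter of the ratings by scaling them down 20%.
observed_p = [
    p * (1 - UNDERESTIMATE) if rng.random() < CORRUPTED_SHARE else p
    for p in true_p
]

pairs = list(zip(true_p, observed_p))
# Households that belong in the band but are overlooked.
missed = [t for t, o in pairs if in_band(t) and not in_band(o)]
# Households that are targeted even though they do not belong in the band.
wasted = [t for t, o in pairs if in_band(o) and not in_band(t)]
true_band = sum(in_band(t) for t in true_p)

print(f"Households truly in the band: {true_band:,}")
print(f"Missed by the campaign:       {len(missed):,}")
print(f"Targeted unnecessarily:       {len(wasted):,}")
```

Because the error only lowers ratings, every unnecessarily targeted household in this sketch has a true probability of 85% or more. The campaign spends on households that were already likely to buy while overlooking some that it could have swayed.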

What, then, does data quality mean for enterprise business analytics?

We see here that quality influences the value of information. Data quality — characterized by the "Five Cs" — does not necessarily occur automatically. That quality is accompanied by cost is unsurprising.

The PMBOK briefly addresses the cost of quality.⁸ These concepts extend to data quality management. Two data quality costs arise:

Costs of conformance are accrued through expenditure of effort to explicitly address data quality objectives; and

Costs of nonconformance arise from adverse results due to shortfalls in data quality.

My illustration above highlights an example of the second. Data quality nonconformance dilutes the economic value of information.

Enterprises should strive to balance these two costs. In other words, they should expend efforts towards data quality commensurate with the information value requirements of a specific business context. Information economics provides the basis for business case analysis about data quality.
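
The balancing act can be sketched as a search for the quality level that minimizes total cost. The two cost curves below are hypothetical shapes chosen only for illustration: conformance cost climbs steeply as quality approaches perfection, and nonconformance cost is proportional to the quality shortfall. Neither curve comes from the PMBOK or from my illustration above.

```python
def conformance_cost(quality, effort_scale):
    """Hypothetical cost of reaching a quality level in [0, 1)."""
    return effort_scale * quality / (1.0 - quality)


def nonconformance_cost(quality, value_at_stake):
    """Hypothetical expected loss, proportional to the quality shortfall."""
    return value_at_stake * (1.0 - quality)


def best_quality(value_at_stake, effort_scale=100_000):
    """Grid-search quality levels from 0.50 to 0.99 for the lowest total cost."""
    levels = [pct / 100 for pct in range(50, 100)]
    return min(
        levels,
        key=lambda q: conformance_cost(q, effort_scale)
        + nonconformance_cost(q, value_at_stake),
    )


high_stakes = best_quality(value_at_stake=10_000_000)
low_stakes = best_quality(value_at_stake=1_000_000)

print(f"Quality target with $10M at stake: {high_stakes:.2f}")
print(f"Quality target with $1M at stake:  {low_stakes:.2f}")
```

Under these assumed curves the high-stakes decision justifies a quality target of about 0.90, while the low-stakes decision settles near 0.68. The specific numbers mean little. The point is that the right level of data quality effort depends on the value at stake.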

Business critical information supports decisions about issues for which considerable value is at stake. Data for these applications demand high quality. For example:

Decisions about major investments, which often place substantial amounts of capital at risk; and

Financial reporting, which must be governed by financial controls to which legal requirements for attestation apply.⁹

An entire industry developed to provide specialized data management systems to address auditing and financial reporting requirements. These systems are Enterprise Resource Planning (ERP) systems.

At the opposite end of the spectrum, inserting a single targeted advertisement into an individual user's web browsing session costs very little. Netflix movie recommendations or Amazon book recommendations each similarly cost very little. All targeted ad placements or consumer preference recommendations must collectively achieve an average accuracy.

References.

¹ R. M. Pirsig, Zen and the art of motorcycle maintenance, New York: Harper Collins, 1974, Kindle Edition, p. 204, http://goo.gl/si1ayP.

² Guide to the project management body of knowledge (PMBOK), fourth edition, Newtown Square, PA: Project Management Institute, Chapter 8, http://goo.gl/anjxhd.

³ Rita Mulcahy's™ PMP® Exam Prep, seventh edition, RMC Publications, 2011, p. 263, http://goo.gl/k0ml4D.

⁴ GS1 talks here about "consistency". I take the liberty of combining two factors, consistency and "standards-based."

⁵ GS1 does not address data confidence.

⁶ N. N. Taleb, The black swan, New York: Random House, http://goo.gl/Ermy9s.

⁷ L. Kleinrock, Queueing systems, Volume I: Theory, New York: Wiley, 1975, pp. 65-71, http://goo.gl/Dz8mVE.

⁸ Guide to the project management body of knowledge (PMBOK), fourth edition, Newtown Square, PA: Project Management Institute, §8.1.2, http://goo.gl/anjxhd.

⁹ e.g., "Internal control over financial reporting in exchange act periodic reports of non-accelerated filers," Securities and Exchange Commission, 17 CFR Parts 210, 229, and 249, http://www.sec.gov/rules/final/2010/33-9142.pdf.

Monday, September 8, 2014

Analytics demonstration: How much information does your data contain?

Our current big data buzz is accompanied by numerous points of excitement and controversy. The "more data" versus "better algorithms" debate is prominent among these. This debate sometimes resembles an old Miller Lite beer commercial.

Multiple business analytics luminaries have weighed into the conversation. These include:

Omar Tawakol, CEO of Oracle's data marketing operation BlueKai, concludes in Wall Street Journal's AllThingsD that "If you have to choose, having more data does indeed trump a better algorithm;"
Gartner's Andrew Frank emphasizes the nuance by asserting that "if you have to choose, maybe you’re on the wrong track;"
Information Week's Larry Stofko leads the conversation away from the dichotomy, towards "first we must make small data seamless, simple, and inexpensive;" while
Jeanne Ross et al. assert in Harvard Business Review that the question is irrelevant: that "Until a company learns how to use data and analysis to support its operating decisions, it will not be in a position to benefit from big data."

This obviously is a very complicated issue. It defies the "Tastes Great-Less Filling" dichotomy. Very few initiatives provide simple, automatic Return on Investment (RoI). Big data analytics is no exception.

More data only delivers marginal value if that data brings new information. I provide a technical demonstration in this installment. I try to show this in terms that don't require a mathematical PhD.

I demonstrate this concept using a data set from the U.S. Bureau of Labor Statistics (BLS). This is the first of a series of quantitative demonstrations. The BLS data offers numerous opportunities for interesting data science demonstrations. I plan to spend much of the Autumn tinkering with it.

My data source: Bureau of Labor Statistics' (BLS) O*Net.

O*Net is a career counseling service provided by the BLS. It is based on Holland's Occupational Themes. Psychologist John L. Holland theorized that our personality preferences dictate the types of occupations we are likely to enjoy the most. The Myers-Briggs Type Indicator (MBTI) and Herrmann Brain Dominance Instrument (HBDI) are based on similar theories.

O*Net users complete a survey of interests and lifestyle preferences. The O*Net application applies analytics to identify occupations that are likely to be attractive to someone with specific sets of interests and preferences.

O*Net is based on a database that contains tables describing 923 different occupations. They are characterized in terms of:

Interests (9 attributes);
Knowledge (33 attributes);
Abilities (52 attributes); and
Skills (35 attributes).

Each attribute contains two numerical values: One for importance and the other for level. So, the model characterizes each of the 923 occupations in terms of 258 attributes — or features.
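
The feature count follows directly from the attribute counts listed above. A short check, using only numbers from the text:

```python
# Attribute counts per O*Net table, as listed in the text.
attributes = {"Interests": 9, "Knowledge": 33, "Abilities": 52, "Skills": 35}
VALUES_PER_ATTRIBUTE = 2  # one value for importance, one for level

total_attributes = sum(attributes.values())
total_features = total_attributes * VALUES_PER_ATTRIBUTE
skills_features = attributes["Skills"] * VALUES_PER_ATTRIBUTE

print(f"Attributes: {total_attributes}, features: {total_features}")
print(f"Skills features alone: {skills_features}")
```

The four tables hold 129 attributes, which yields 258 features. The Skills table alone contributes 70 of them.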

What makes this data set interesting? Four factors attract me:

It is a well-structured data set, requiring limited preparation;
It's open, well-documented, and easily accessible;
Some degree of authoritative research underlies the data set; and
It seems to contain an awful lot of attributes by which to distinguish between 923 occupations.

This last factor is the point of the discussion here, today. How many of these 258 features are really needed to describe 923 occupations?

I demonstrate below that most of the information* about the 923 occupations is contained in a small subset of the 258 features. I do not argue that 258 features is too many to guide job seekers or career changers. I make my point from an information theory point of view: More data attributes do not necessarily yield more information.

How much information do 258 features really contain?

It seems that introducing some geek-speak is unavoidable at this point. My apologies to the less mathematically interested readers. I will return — after a brief plunge into the weeds — to an explanation that is more plain-English.

Principal Component Analysis (PCA) tells us how information* is distributed between the features in a data set. PCA "reshuffles" the features using a mathematical method called Singular Value Decomposition (SVD). The features in our data set are often related to each other in complicated ways. One feature might depend on several others. They contain lots of redundancy!

SVD unwinds all of the features into an "abstract" feature set. The members of this "abstract" set are unrelated to each other. SVD also tells us how much information each "abstract" feature bears.

This is the point: Our data sets all contain a mixture of true, value-added information and noise. The actual information is usually spread across lots of different features. So is the noise. Data scientists — when they build analytics models — struggle to separate the real information from the noise. When the real information is spread across too many features, it's easier for it to become buried in the noise. The more features the information is spread across, the more diluted it is.

Figure 1 illustrates. (I used the R open source statistics tool to produce Figure 1.) It shows how the information* is distributed across the 70 "abstract" features. Two of those features contain 95% of the information about how the 923 occupations are related to their associated skills. The remaining 68 features — containing only 5% of the information — are mostly noise.

Figure 1 — Two of 70 "abstract" features from the O*Net occupational skills table contain 95% of the information about how skills are related to occupations.
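
The same kind of calculation can be sketched in Python with NumPy. This is not the R analysis behind Figure 1, and it does not use the O*Net data. As a labeled assumption, it builds a synthetic table of 923 rows and 70 features that are noisy mixtures of just two underlying factors, then measures each "abstract" feature's share using squared singular values (the usual PCA variance share).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Sizes borrowed from the text; the data itself is synthetic.
N_ROWS, N_FEATURES, N_LATENT = 923, 70, 2

# ASSUMPTION: 70 observed features are noisy mixtures of two hidden factors.
latent = rng.normal(size=(N_ROWS, N_LATENT))
mixing = rng.normal(size=(N_LATENT, N_FEATURES))
noise = 0.2 * rng.normal(size=(N_ROWS, N_FEATURES))
table = latent @ mixing + noise

# PCA via SVD: center each column, then take the singular values.
centered = table - table.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Share of total variance carried by each "abstract" feature.
variance_share = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(variance_share)

# Smallest number of abstract features whose shares reach 95%.
components_for_95 = int(np.searchsorted(cumulative, 0.95)) + 1

print(f"Abstract features needed for 95%: {components_for_95} of {N_FEATURES}")
print("Top five shares:", np.round(variance_share[:5], 3))
```

In this synthetic setup the two hidden factors dominate, so two abstract features should account for nearly all of the variance and the remaining 68 carry mostly noise. Swapping in a real table only requires replacing the synthetic `table` array.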

What does this mean? Most of the 70 "original" features in the skills data set are highly redundant. BLS provides a summary of the Holland model's history. They recently issued a newer version than the one I studied here. This release contains 126 new occupations.

Are they adding new features to the occupations? I have not done that analysis. New features may be attractive from a career consulting point of view. Figure 1, however, suggests that the addition of new features should be guided by caution. New features should primarily add new information that does not duplicate the 258 that are already in the data set.

But, Figure 1 just shows us the Skills features that BLS uses to characterize occupations. What about the rest? What happens when we put them all together? Figure 2 gives us the answer.

We saw in Figure 1 that two abstract features contained 95% of the skills information. We add Abilities, Knowledge, and Interest in Figure 2. We might expect that four groups of factors would lead to eight information-bearing abstract features. This is not the case!

Figure 2 — When we combine Skills, Abilities, Knowledge, and Interests, 95% of the combined information is contained in just four of 258 "abstract" features.

The marginal value of information added by increasing our occupations characteristic data set from 70 factors in Figure 1 to 258 factors in Figure 2 is just two additional information-bearing abstract features. The Abilities, Knowledge, and Interests feature sets don't tell us much more than we know from Skills alone.

We increase our feature set by a factor of about 3.7. We only double the number of information-bearing features. This suggests we have reached some form of diminishing returns.
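
The diminishing-returns comparison is a pair of ratios. A quick check with the counts reported above:

```python
skills_features, all_features = 70, 258    # feature counts in Figures 1 and 2
skills_components, all_components = 2, 4   # abstract features holding 95%

feature_growth = all_features / skills_features
component_growth = all_components / skills_components

print(f"Feature set grew by a factor of {feature_growth:.1f}")
print(f"Information-bearing features grew by a factor of {component_growth:.1f}")
```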

Leave the kitchen sink out of your data strategy for analytics.

In last week's installment I introduced some rudimentary concepts from Information Economics. The value businesses get from information is related to the opportunity that information gives to take more profitable actions.

By extension, does data bring value? I illustrate above that adding data only increases value if the new data tells us something we don't already know. From an information point of view, Abilities, Knowledge, and Interests don't tell us much more about the 923 occupations in BLS' O*Net database than the Skills features alone. Adding those three data sets increased our feature set from 70 to 258, but only gave us two additional information-bearing features.

But handling data is accompanied by cost! Technology vendors call our attention to the plummeting price of storage. Storing information is becoming increasingly cheap. Extracting value from data? Not necessarily so.

The big data buzz might tempt us to expect that cost-free opportunities exist to extract value from data. We routinely are regaled with stories about Internet companies' use of data mining to illuminate many of the most intimate details of our lives. Google, Facebook, Yahoo!, and others accomplish this through strategic collection of data elements that introduce new information. They appear to have borrowed knowledge from the forensic computer security business to disinter our secrets.

What's the bottom line? Organizations seeking to extract value from data need to focus on value in their use of data. Begin with an Information Economics philosophy. Ask questions like:

What information will give me improved economic value?
How much value-bearing information is contained in the data that I already have? and
How much will it cost to handle the data that contains the new value-bearing information that I need?

Next installment: Elements of the cost to extract information from data.

* I use the term "information" here imprecisely. "Singular values" produced by SVD are not information, technically speaking. They are related to information content. I intentionally commit this error in terminology for convenience of discussion.