Monday, September 8, 2014

Analytics demonstration: How much information does your data contain?

Our current big data buzz is accompanied by numerous points of excitement and controversy. The "more data" versus "better algorithms" debate is prominent among these. It sometimes resembles an old Miller Lite beer commercial.

Multiple business analytics luminaries have weighed in on the conversation. These include:
  • Omar Tawakol, CEO of Oracle's data marketing operation BlueKai, concludes in Wall Street Journal's AllThingsD that "If you have to choose, having more data does indeed trump a better algorithm;"
  • Gartner's Andrew Frank emphasizes the nuance by asserting that "if you have to choose, maybe you’re on the wrong track;"
  • Information Week's Larry Stofko leads the conversation away from the dichotomy, towards "first we must make small data seamless, simple, and inexpensive;" while
  • Jeanne Ross et al. assert in Harvard Business Review that the question is irrelevant: "Until a company learns how to use data and analysis to support its operating decisions, it will not be in a position to benefit from big data."
This is obviously a complicated issue. It defies the "Tastes Great-Less Filling" dichotomy. Very few initiatives provide simple, automatic Return on Investment (RoI), and big data analytics is no exception.

More data delivers marginal value only if that data brings new information. In this installment I provide a technical demonstration of this point, in terms that don't require a PhD in mathematics.

I demonstrate this concept using a data set from the U.S. Bureau of Labor Statistics (BLS). This is the first of a series of quantitative demonstrations. The BLS data offers numerous opportunities for interesting data science demonstrations. I plan to spend much of the Autumn tinkering with it.


My data source:  Bureau of Labor Statistics' (BLS) O*Net.

O*Net is a career counseling service provided by the BLS. It is based on Holland's Occupational Themes. Psychologist John L. Holland theorized that our personality preferences dictate the types of occupations we are likely to enjoy the most. The Myers-Briggs Type Indicator (MBTI) and Herrmann Brain Dominance Instrument (HBDI) are based on similar theories.

O*Net users complete a survey of interests and lifestyle preferences. The O*Net application applies analytics to identify occupations that are likely to be attractive to someone with a specific set of interests and preferences.

O*Net is based on a database that contains tables describing 923 different occupations. Each occupation is characterized in terms of:
  • Interests (9 attributes);
  • Knowledge (33 attributes);
  • Abilities (52 attributes); and
  • Skills (35 attributes).
Each attribute carries two numerical values: one for importance and the other for level. So the model characterizes each of the 923 occupations in terms of 258 attributes, or features.
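
To check the arithmetic: 9 + 33 + 52 + 35 = 129 attributes, and 129 attributes × 2 values each = 258 features.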

What makes this data set interesting?  Four factors attract me:
  1. It is a well-structured data set, requiring limited preparation;
  2. It's open, well-documented, and easily accessible;
  3. Some degree of authoritative research underlies the data set; and
  4. It seems to contain an awful lot of attributes by which to distinguish between 923 occupations.
This last factor is the point of today's discussion: How many of these 258 features are really needed to describe 923 occupations?

I demonstrate below that most of the information* about the 923 occupations is contained in a small subset of the 258 features. I do not argue that 258 features are too many to guide job seekers or career changers. My point is an information-theoretic one: more data attributes do not necessarily yield more information.

How much information do 258 features really contain?

It seems unavoidable to introduce some geek-speak at this point. My apologies to the less mathematically inclined readers. After a brief plunge into the weeds, I will return to a more plain-English explanation.

Principal Component Analysis (PCA) tells us how information* is distributed among the features in a data set. PCA "reshuffles" the features using a mathematical method called Singular Value Decomposition (SVD). The features in our data sets are often interrelated in complicated ways; one factor might depend on several others. They contain lots of redundancy!

SVD unwinds all of the features into an "abstract" feature set. The members of this "abstract" set are unrelated to each other.  SVD also tells us how much information each "abstract" feature bears.
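
Here is a minimal sketch of the idea in R (the tool used to produce Figure 1 below). The matrix name skills and the random stand-in data are mine, not O*Net's; with the real occupations-by-features table in place of the stand-in, the same steps apply.

    # Decompose an occupations-by-features matrix with SVD and measure
    # how much "information" (variance) each abstract feature carries.
    skills <- matrix(rnorm(923 * 70), nrow = 923)  # stand-in for the real table

    X <- scale(skills)  # center and standardize each feature
    s <- svd(X)         # X = U %*% diag(d) %*% t(V)

    # Each squared singular value's share of the total approximates the
    # share of information its abstract feature bears.
    info_share <- s$d^2 / sum(s$d^2)
    round(cumsum(info_share)[1:5], 3)  # cumulative share of the first five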

This is the point: our data sets all contain a mixture of true, value-added information and noise. The actual information is usually spread across lots of different features, and so is the noise. When they build analytics models, data scientists struggle to separate the real information from the noise. The more features the information is spread across, the more diluted it is, and the easier it becomes for it to be buried in the noise.

Figure 1 illustrates this. (I used the R open source statistics tool to produce it.) The figure shows how the information* is distributed across the 70 "abstract" features derived from the skills table. Two of those features contain 95% of the information about how the 923 occupations are related to their associated skills. The remaining 68 features, containing only 5% of the information, are mostly noise.
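
For the curious, here is a sketch of the kind of computation behind a figure like Figure 1, using R's built-in PCA. This is my reconstruction, not the script that actually produced the figure, and with the random stand-in matrix above the numbers will not match the real O*Net result.

    # Cumulative information share per abstract feature.
    pca   <- prcomp(skills, center = TRUE, scale. = TRUE)
    share <- pca$sdev^2 / sum(pca$sdev^2)
    which(cumsum(share) >= 0.95)[1]  # abstract features needed for 95%

    plot(cumsum(share), type = "b",
         xlab = "Abstract feature", ylab = "Cumulative information share")
    abline(h = 0.95, lty = 2)  # the 95% threshold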


Figure 1 — Two of 70 "abstract" features from the O*Net occupational skills table contain 95% of the information about how skills are related to occupations.

What does this mean? Most of the 70 "original" features in the skills data set are highly redundant. BLS provides a summary of the Holland model's history. It recently issued a newer version of the database than the one I studied here; that release contains 126 new occupations.

Are they adding new features to the occupations? I have not done that analysis. New features may be attractive from a career counseling point of view. Figure 1, however, suggests that new features should be added with caution: they should primarily bring new information that does not duplicate the 258 features already in the data set.

But, Figure 1 just shows us the Skills features that BLS uses to characterize occupations. What about the rest?  What happens when we put them all together? Figure 2 gives us the answer.

We saw in Figure 1 that two abstract features contained 95% of the skills information. In Figure 2 we add Abilities, Knowledge, and Interests. If each group of factors contributed as much as Skills does, we might expect four groups to yield eight information-bearing abstract features. This is not the case!
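
A sketch of the combined analysis, again with illustrative stand-in matrices in place of the real O*Net tables (the dimensions follow the attribute counts listed earlier; skills is the matrix from the sketch above):

    # Stand-ins for the other three tables: attributes x 2 values each.
    interests <- matrix(rnorm(923 * 18),  ncol = 18)   # 9 attributes
    knowledge <- matrix(rnorm(923 * 66),  ncol = 66)   # 33 attributes
    abilities <- matrix(rnorm(923 * 104), ncol = 104)  # 52 attributes

    combined  <- cbind(skills, abilities, knowledge, interests)  # 923 x 258
    pca_all   <- prcomp(combined, center = TRUE, scale. = TRUE)
    share_all <- pca_all$sdev^2 / sum(pca_all$sdev^2)
    which(cumsum(share_all) >= 0.95)[1]  # Figure 2 reports 4 for the real data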

Figure 2 — When we combine Skills, Abilities, Knowledge, and Interests, 95% of the combined information is contained in just four of 258 "abstract" features. 

The marginal value of growing our occupation-characteristics data set from 70 factors (Figure 1) to 258 factors (Figure 2) is just two additional information-bearing abstract features. The Abilities, Knowledge, and Interests feature sets don't tell us much more than we already know from Skills alone.

We increase our feature set by a factor of about 3.7 (from 70 features to 258), but we only double the number of information-bearing features (from two to four). This suggests we have reached some form of diminishing returns.

Leave the kitchen sink out of your data strategy for analytics.

In last week's installment I introduced some rudimentary concepts from Information Economics. The value businesses get from information is related to the opportunity that information gives to take more profitable actions. 

By extension, does more data bring more value? I illustrate above that adding data only increases value if the new data tells us something we don't already know. From an information point of view, Abilities, Knowledge, and Interests don't tell us much more about the 923 occupations in BLS' O*Net database than the Skills features alone. Adding those three data sets increased our feature set from 70 to 258, but only gave us two additional information-bearing features.

But handling data comes at a cost! Technology vendors call our attention to the plummeting price of storage. Storing information is becoming increasingly cheap. Extracting value from data? Not necessarily so.

The big data buzz might tempt us to expect cost-free opportunities to extract value from data. We are routinely regaled with stories about Internet companies' use of data mining to illuminate the most intimate details of our lives. Google, Facebook, Yahoo!, and others accomplish this through strategic collection of data elements that introduce new information. They appear to have borrowed knowledge from the forensic computer security business to disinter our secrets.

What's the bottom line? Organizations seeking to extract value from data need to focus on value in their use of data. Begin with an Information Economics philosophy. Ask questions like:
  • What information will give me improved economic value?
  • How much value-bearing information is contained in the data that I already have? and
  • How much will it cost to handle the data that yields the new value-bearing information I need?
Next installment:  Elements of the cost to extract information from data.



* I use the term "information" here imprecisely. "Singular values" produced by SVD are not information, technically speaking. They are related to information content. I intentionally commit this error in terminology for convenience of discussion.



© The Quant's Prism, 2014
