Thursday, August 14, 2014

The "Science" of Data Science.

The inaugural installment of this journal described a broad vision related to the study of "data science." I attempt here to focus that ambition somewhat. Defining the term "data science" is the objective of this discussion.

Most importantly, I want to assert a point of view that emphasizes science. I also approach data science from the point of view of a business professional. I am interested in applications of data science to deepening insight into strategies and operations of businesses and other organizations.

As an aside, I found this task more difficult than expected. A recent NY Times op-ed piece asserts that "Writing Is a Risky, Humiliating Endeavor." My experience here may corroborate that author's point of view.

The challenge with defining "data science" is that it is intertwined with a number of related IT-industry terms including, but not limited to:
  • Business intelligence;
  • Analytics and business analytics; and
  • Big data.
A number of Internet tools attempt to measure the intensity of attention given to topics.  Google Trends reports relative rates of search-engine queries.  The graphic below reports the relative frequencies for our four terms of interest.  We see that the term analytics attracts the lion's share of the interest.

"Data science" is practically lost in the noise. The growth rate for "data-science" queries appears relatively steady. Seeing total-volume statistics would be interesting.


A number of definitions of "data science" have been attempted:
  • Strata's Mike Loukides published "What is Data Science?" through O'Reilly.  This meandering narrative covers technologies and anecdotal instances of their use. It does not present a concise definition of data science.
  • Gartner performed a text-analytics study identifying the core competencies of a data scientist as data management, analytics modeling, and business analysis. Gartner extends this list to "soft" consultant-related skills.  We find here an indirect definition of data science — in terms of its primary practitioner.  
  • Wikipedia provides a definition represented as an intersection of computer science, applied mathematics, and subject-domain expertise for a specific field.
  • Forbes contributor Gil Press, in "A very short history of data science," observes 52 years of history of the use of the term.
Regrettably, no definition suitable to our purposes conspicuously presents itself.  I therefore presumptuously undertake to proffer my own. 


What is Data Science?

We seek a business-focused definition of data science.  We also emphasize the scientific aspect. Business analytics captures one aspect of data science.  Data science can moreover be applied to "big data."

My approach is inspired by a 2008 Gartner report, "Gartner Clarifies the Definition of the Term Enterprise Architecture."  I address each of the points in the outline used for Gartner's definition.

What it is:  Data science is the application of the scientific method to extract actionable insights from diverse sets of business information.

What the scope is:  Data science applies systematic, reproducible methods to the entire information lifecycle, from information source to information consumer.  This includes mathematics, data visualization, information management, and business analysis.   The breadth of data science spans the business domains of strategy, operations, finance, and logistics.

What the result is:  The application of data science to business information leads to the best achievable information on which to base a specific business decision or action. Its results are organized and presented in a manner specifically designed for the information consumer.  They also indicate the degree of confidence that the consumer should assign to them.

What the benefit is:  Consumers of information produced by data science receive the best achievable, actionable information specific to high-priority business questions. These results are provided with the minimum expenditure of resources compared to less-scientific approaches. The scientific method leads to application of the most-appropriate mathematical and information-management methods to answer specific business questions given the available data.


How is Data Science related to the Scientific Method?

My preference for the term "data science" over "analytics" or "big data" is grounded in the prominence of science.  "Applying classic scientific methods to the practice of management¹" is one of the key promises offered by the movement encompassing data science, business analytics, and big data. "...The ultimate goal of data science is improving decision making, as this generally is of paramount interest to business.²" Improved business decision-making leads to:

  • Improved predictability of decision activities;
  • Reproducibility and transparency in the decision-making process; and
  • Precise separation of uncertainty into aspects that can be mitigated and those that cannot.
Data science — including big data, business analytics, and business intelligence — can never completely remove uncertainty from making decisions. It does separate resolvable uncertainties from those that remain — to varying degrees — "known unknowns."  

So how do business-focused data scientists apply the scientific method to business analytics?  Business analytics thought leaders describe high-level approaches in leading business journals.  

This list is by no means exhaustive.  These approaches are consistent with the analysis methods in which data scientists are trained during their educations.

I turn to Pirsig⁵ — an admittedly quirky source — for an accessible summary here.  Pirsig's account of the scientific method can be summarized in five steps:

  1. State the problem in terms that claim no more than you are positive you know;
  2. Formulate hypotheses about candidate causes for the problem;
  3. Design experiments to test each hypothesis in isolation;
  4. Interpret the experiment results in terms of whether the hypothesis is proven or refuted; and
  5. Update the candidate hypotheses and return to step 3.
This is similar to Davenport's six-step "procedure."
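
As a rough illustration of steps 2 through 4 above, the sketch below tests each candidate hypothesis in isolation against its own control group. It is only a sketch under stated assumptions: the permutation test is my choice of statistical technique and is not prescribed by Pirsig or Davenport; the function names, the 0.05 threshold, the hypothesis labels, and all of the numbers are hypothetical and invented for illustration. The labels "supported" and "not supported" are deliberately softer than "proven" and "refuted," since a statistical test only quantifies residual uncertainty.

```python
import random
from statistics import mean


def permutation_p_value(control, treatment, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling of the pooled observations
    produces a difference in means at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            extreme += 1
    # The +1 terms count the observed labeling itself as one permutation.
    return (extreme + 1) / (n_permutations + 1)


def evaluate_hypotheses(experiments, alpha=0.05):
    """Step 4: interpret each experiment on its own (alpha is an assumption)."""
    results = {}
    for name, (control, treatment) in experiments.items():
        p = permutation_p_value(control, treatment)
        results[name] = "supported" if p < alpha else "not supported"
    return results


# Hypothetical data: one experiment per candidate hypothesis (step 3).
experiments = {
    "new_layout_raises_order_value": (
        [10, 11, 9, 10, 12, 10, 11, 9],     # control group
        [15, 16, 14, 15, 17, 15, 16, 14],   # treatment group
    ),
    "email_reminder_raises_order_value": (
        [10, 11, 9, 10],                    # control group
        [10, 11, 9, 10],                    # treatment group, no change
    ),
}

if __name__ == "__main__":
    for name, verdict in evaluate_hypotheses(experiments).items():
        print(name, "->", verdict)
```

Step 5 would then drop or refine the hypotheses that are not supported and design a new round of experiments for the ones that remain.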

This rigorous, systematic approach is necessary and sufficient to achieve the fundamental objective of data science in business decision-making:  Isolating uncertainty factors into those that can be resolved and those that remain uncertain.  Cutting corners leaves residual uncertainty.

Practicing analysts who may have read this far may ask, "But, what about unsupervised learning?"  I will address unsupervised learning in depth in a later installment dedicated to the topic.  Suffice it to say, for now, that unsupervised learning provides a source of candidate hypotheses.  Each resultant candidate hypothesis must be scientifically tested as described above.

Next Installment:  The economics of information.


¹ "Big data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute, McKinsey & Company, 2011, p. 98,  http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx.
² F. Provost, T. Fawcett, "Data science and its relationship to big data and data-driven decision making," Big Data, Mary Ann Liebert, Inc., February 13, 2013, p. BD 53, http://online.liebertpub.com/doi/pdfplus/10.1089/big.2013.1508.
³ LaValle, Steve, et al, Big Data, Analytics and the Path From Insights to Value, MITSloan Management Review, Winter 2011, http://sloanreview.mit.edu/article/big-data-analytics-and-the-path-from-insights-to-value/
⁴ T. Davenport, "Keeping up with your quants," Harvard Business Review, July-August 2013, http://hbr.org/2013/07/keep-up-with-your-quants/ar/1
⁵ Pirsig, R. M., Zen and the Art of Motorcycle Maintenance, Harper-Collins, 1974, (Kindle edition 2009), http://goo.gl/si1ayP


© The Quant's Prism, 2014
