The Quant's Prism: May 2015

We assume that Ivy-League graduates have it made. Admission to an Ivy leaves one set for life. Not getting into an Ivy consigns a young person to a life of mediocrity. One's alma mater is one's fate, after all.

But is this really true? Do the data support this assertion? Research into questions like these lands us smack dab in the middle of two conundrums central to Big-Data Analytics. First, how do we tell what causes what? And second, which factors are the strongest determinants of an outcome?

Figure 1 — A Harvard diploma represents the ultimate capstone for a successful youth. (Source: World Press.)

This time of year provides us an opportune case study to consider these Big-Data analytics challenges.¹ High school seniors last month were informed of their destinies. April is college-admissions season. Acceptance and rejection letters fulfilled dreams and shattered hopes for many an aspiring collegian.

Meanwhile, college seniors graduate this month. They are in the midst of realizing their destinies. Corporate recruiters, networking, and the formidable quest for employment now beckon. Those fortunate enough to stroll across an Ivy commencement stage are certainly predestined for greatness. Are they not?

Scientific research fortunately allows us to examine these widely held presuppositions. We consider them here from the perspective of our two conundrums. The first is causality. To what extent does the name of the university on one's diploma really determine future earnings? Is the relationship causal? Does one thing inevitably lead to the other?

Secondly, we care about determinism. Even if alma mater causally determines future earnings, are other factors more important? Do other factors have a bigger say?

Causality: Is it the school or is it the applicant?

Researchers Stacy Berg Dale and Alan B. Krueger looked into the exact question of alma mater versus earnings.² They took an innovative tack on the question, however. They mined a data set for both cognitive and non-cognitive indicators of students' competitiveness for elite colleges. The cognitive factors included the usual suspects: grade point averages and SAT scores.

The non-cognitive characteristics included the "soft" skills. These include indications of motivation, ambition, and maturity. These attributes are evidenced in interviews, essays, and letters of recommendation. They are obviously more difficult to measure.

Scientific evidence showing that "soft" skills are stronger predictors of life outcomes than book knowledge is now a couple of decades old.³ Education research⁴ demonstrates their unsurprising importance to learning.

Figure 2 highlights Dale and Krueger's research. Figure 2(a) represents the "alma-mater-is-destiny" hypothesis we commonly assume. Figure 2(b) shows the model our researchers applied to their data. They used regression analysis. Figure 2(a) presumes causality. The researchers' method, regression analysis, uses only correlation.

Figure 2: Dale and Krueger² used a predictive-modeling technique based solely on correlations to try to understand the causal relationship between alma mater and earnings.

Figure 2(c) shows an alternative view. Dale and Krueger hypothesize that this model might explain the data better than Figure 2(a). It's the applicant, not the university! The traits that make one competitive for admission to an elite university tend to translate into higher earnings after college. Alma-mater affiliation only marginally contributes to increased earnings.

The alma-mater-is-destiny presupposition appears to misattribute the causes underlying students' prospects. It's not the school. It's the competitive traits of the applicant. We suspect that being immersed in an environment replete with intelligent, ambitious peers probably confers some advantages. Still, the tenacity and talent required to become competitive for that peer group (qualified, even if not selected) may largely offset many advantages of actually joining it.

Note that the researchers do not explicitly demonstrate the causality in Figure 2(c). They use correlation — the model in Figure 2(b) — to cast serious doubt on the presumption in Figure 2(a). They then offer an alternative hypothesis. The alternative explanation in Figure 2(c) remains a hypothesis! This is an unavoidable limitation of their regression-based approach.
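A small simulation can make the argument of Figures 2(a) through 2(c) concrete. The sketch below is a toy illustration, not Dale and Krueger's data or their actual model. Every name and number in it (`trait`, `elite`, `earnings`, the coefficients) is an assumption made up for this example. We build a world in which the school adds nothing to earnings, then see what a naive comparison and an applicant-adjusted comparison each report.

```python
import random
from statistics import mean

def slope(x, y):
    """Least-squares slope from regressing y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """The part of y that a straight-line fit on x leaves unexplained."""
    b, mx, my = slope(x, y), mean(x), mean(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

rng = random.Random(0)
trait, elite, earnings = [], [], []
for _ in range(20000):
    t = rng.gauss(0, 1)                          # hypothetical competitiveness score
    e = 1.0 if t + rng.gauss(0, 1) > 1 else 0.0  # stronger applicants get in more often
    y = 50 + 10 * t + 0 * e + rng.gauss(0, 5)    # ASSUMPTION: the school adds nothing
    trait.append(t)
    elite.append(e)
    earnings.append(y)

# Naive view: the earnings gap between elite and non-elite graduates.
naive_gap = slope(elite, earnings)

# Adjusted view: strip out what the applicant trait explains from both
# variables, then regress what is left. This equals the "elite" coefficient
# in a regression of earnings on both elite and trait.
adjusted_gap = slope(residuals(trait, elite), residuals(trait, earnings))

print(f"naive gap:    {naive_gap:6.2f}")
print(f"adjusted gap: {adjusted_gap:6.2f}")
```

The naive gap comes out large and positive even though the school contributes nothing by construction. The adjusted gap lands near zero. We only know which answer is right because we wrote the simulation ourselves. With real data, the adjusted result casts doubt on Figure 2(a) but does not prove Figure 2(c).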

Determinism: Just how much does the school really matter?

Determinism, meaning just how much each factor contributes, is equally important. Dale and Krueger found that alma mater makes a relatively weak contribution to post-college earnings. I use another closely related study to illustrate determinism.

Gaertner and McClarty⁵ studied factors that predict college readiness. They wanted to diagnose the traits that lead to college success. Educators seek to anticipate students' success and challenges starting as early as the eighth grade.

Figure 3 summarizes results from Gaertner and McClarty's research. Like Dale and Krueger, they used regression analysis. They looked at four indicators of college readiness. They also considered four factors believed to determine college readiness. "Middle-school indicators" in Figure 3 refers to a set of non-cognitive factors, like those in Figure 2.

Figure 3 — We explain college readiness in terms of multiple factors, each of which explains a fraction of the overall readiness. (After Gaertner and McClarty.⁵)

Quants evaluate their regression models using a quantity called the "coefficient of determination", or "R²". Figure 3 shows this quantity from Gaertner and McClarty's analysis. R² measures the proportion of the total variation in an outcome that the model explains, and Figure 3 breaks that proportion down by factor. We see in Figure 3 that SAT and High-School GPA (HSGPA) contributed most to explaining students' college readiness. The "Middle-school indicators" made the second-largest contribution.

There is a problem with R², however. Specifically, the R² contributions of all the factors must sum to a total no greater than one. So the more factors you mix into your model, the smaller the relative contribution each may make.

This is particularly true if the factors are correlated with each other. Correlated factors "divide the credit" in regression analysis. It is hard to tell which really contributes the most. Regression analysis cannot by itself tell you that one factor causes another! We probably want to assign a higher weight to a root cause than to some intermediate effect. Regression cannot do this.
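Here is a minimal sketch of how correlated factors "divide the credit." It uses synthetic data, not Gaertner and McClarty's. The two factors `x1` and `x2` are hypothetical stand-ins (think of a test score and a GPA) that share a common driver, and all the coefficients are assumptions chosen for illustration. The two-factor R² is computed from the standard identity for a regression with two predictors.

```python
import random
from statistics import mean

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def r2_two_factors(x1, x2, y):
    """R-squared from regressing y on both x1 and x2."""
    r1, r2, r12 = corr(x1, y), corr(x2, y), corr(x1, x2)
    return (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)

rng = random.Random(1)
n = 20000
shared = [rng.gauss(0, 1) for _ in range(n)]       # common driver of both factors
x1 = [s + 0.5 * rng.gauss(0, 1) for s in shared]   # hypothetical factor 1
x2 = [s + 0.5 * rng.gauss(0, 1) for s in shared]   # hypothetical factor 2
y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]  # outcome uses both equally

r2_x1 = corr(x1, y) ** 2              # R-squared with x1 alone
r2_x2 = corr(x2, y) ** 2              # R-squared with x2 alone
r2_both = r2_two_factors(x1, x2, y)   # R-squared with both factors

print(f"x1 alone: {r2_x1:.3f}   x2 alone: {r2_x2:.3f}   both: {r2_both:.3f}")
print(f"credit left for x2 once x1 is in the model: {r2_both - r2_x1:.3f}")
print(f"credit left for x1 once x2 is in the model: {r2_both - r2_x2:.3f}")
```

Each factor looks strong on its own, yet whichever one enters the model second appears to add very little. Both factors matter equally by construction. The split of credit reflects the order of entry and the overlap between the factors, not their true importance.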

The limits of regression analysis.

Regression analysis is the most widely used technique in the Big-Data Analytics toolbox. We see here two indications of its limitations. First, it neither accounts for nor indicates causality. It simply uses knowledge about how things are correlated with each other.

This can be a very big deal! Root causes are vitally important to many business decisions. Analytics results based on regression analysis alone have limited value for root-cause analysis. By themselves, they tell us very little about cause and effect.

Figure 4 illustrates this. Figure 4(a) shows a notional scenario composed of a chain of cause-and-effect relationships. We are interested in the "Effect of Primary Interest", or primary effect. This effect mostly occurs because of a single root cause. The root cause does not directly lead to the primary effect. It acts through a pair of intermediate effects.

Figure 4(b) illustrates how regression analysis treats this scenario. Our primary effect is directly correlated with the root cause and with the two intermediate effects. Regression analysis cannot distinguish among the three. There is no guarantee that the regression model will even assign the highest weight to the root cause!

Figure 4 — Analytics methods based on regression analysis cannot detect root causes in chains of causal events.
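The scenario in Figure 4 is easy to reproduce with made-up numbers. The sketch below is a toy under stated assumptions: a root cause drives a first intermediate effect, which drives a second, which drives the primary effect. The variable names (`root`, `mid1`, `mid2`, `effect`) and the noise levels are hypothetical. The small `ols` helper fits an ordinary least-squares regression by solving the normal equations.

```python
import random

def ols(columns, y):
    """Coefficients from regressing y on an intercept plus the given columns."""
    rows = [[1.0] + list(r) for r in zip(*columns)]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(2)
n = 5000
root = [rng.gauss(0, 1) for _ in range(n)]           # the root cause
mid1 = [v + 0.5 * rng.gauss(0, 1) for v in root]     # first intermediate effect
mid2 = [v + 0.5 * rng.gauss(0, 1) for v in mid1]     # second intermediate effect
effect = [v + 0.5 * rng.gauss(0, 1) for v in mid2]   # primary effect

# Regression on everything at once, as in Figure 4(b).
_, w_root, w_mid1, w_mid2 = ols([root, mid1, mid2], effect)

# Regression on the root cause alone, for comparison.
_, w_root_alone = ols([root], effect)

print(f"joint model: root={w_root:.2f}  mid1={w_mid1:.2f}  mid2={w_mid2:.2f}")
print(f"root alone:  root={w_root_alone:.2f}")
```

In the joint model nearly all the weight goes to the last link in the chain, and the root cause's weight is close to zero. Regressed alone, the same root cause shows a weight near one. Nothing in the joint model's output tells you that everything started with the root cause.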

Many applicants to elite universities suffer acutely from not understanding this kind of causality. They experience enormous stress during college-admissions season. Lots of them succumb to a winner-take-all view of university selections.

This is a problem for at least two reasons. First, an ambitious, disciplined, talented young person tends to come out okay irrespective of where she goes to school. Second, seventeen-year-olds simply don't know enough about themselves, or about life, to make rest-of-their-life decisions anyway. The stakes are simply not that high!

Determinism represents the second issue with regression analysis. Regression analysis uses weighted sums. We're simply adding things up. These sums can only be so large. A regression model may not even assign the highest weight to the root cause in a causal chain of events! This oversimplifies the world. It also provides an incomplete view.

The real world consists of complex, causal interrelationships. Life is chock-full of vicious and virtuous cycles! Correlation alone doesn't really explain these very well. It only tells us that some things tend to go together, that they are "co-related."

Leaders using analytics to make decisions must understand how their analysts formulate recommendations.⁶ Big-data analytics contains no alchemy. It offers neither sorcery nor black magic. Decision makers must apply extreme caution to analytics results that cannot be explained apart from, "This is what the tool says."

References

¹ D. Thompson, "'It Doesn't Matter Where You Go to College': Inspirational, but Wrong," The Atlantic, April 2, 2015, http://www.theatlantic.com/business/archive/2015/04/the-3-percent-crisis/389396/.
² S. B. Dale and A. B. Krueger, "Estimating the payoff to attending a more selective college: An application of selection on observables and unobservables", Quarterly Journal of Economics, Oxford Journals, December 2002, http://qje.oxfordjournals.org/content/117/4/1491.short.
³ D. Goleman, Emotional Intelligence, Random House, 2002, http://goo.gl/PD5sJf.
⁴ C. A. Farrington, et al., "Teaching Adolescents To Become Learners: The Role of Noncognitive Factors in Shaping School Performance: A Critical Literature Review", University of Chicago Consortium on Chicago School Research (CCSR), June 2012, https://goo.gl/TJpXGl.
⁵ M. N. Gaertner and K. L. McClarty, "Performance, Perseverance, and the Full Picture of College Readiness", Educational Measurement: Issues and Practice, National Council on Measurement in Education, April 2015, http://onlinelibrary.wiley.com/doi/10.1111/emip.12066/abstract.
⁶ T. H. Davenport, "Keeping up with your quants," Harvard Business Review, July-Aug 2013, http://goo.gl/rtkq8G.

Sunday, May 24, 2015

Ivy-league payoffs: An illustration of causality and determinism
