The following is an excerpt from When Big Data Was Small: My Life in Baseball Analytics and Drug Design by Richard D. Cramer (May 2019). Cramer started analyzing baseball statistics in the mid-1960s, after graduating from Harvard and MIT, and by 1969 he had discovered (or reinvented) the metric now known as OPS. He is the co-founder of STATS Inc. and has done important work with both SABR and Retrosheet.
The world we live in is mostly uncertain and unpredictable. Yet each and every one of our ancestors, all the way back to the primordial archaea, is one of the few organisms in each generation who survived and reproduced, in part because they better predicted (though far more often lucked into!) what would happen next and took appropriate action. A drive for better prediction has been baked into our genes.
Okay. But where do predictions come from? If you think about it a little, the only possible basis for any prediction is previous experience, by oneself or (especially for humans) as reported by others. Prediction then is a recognition, consciously or unconsciously, that some pattern among past experiences makes some future event more likely to occur. And such patterns are more likely to be recognized whenever experiences have been recorded and somehow organized. For example, recognizing that weather tends to repeat itself in 365-day patterns depended on someone counting sunrises and associating each day’s weather with the pattern of the stars—over many years. Using that prediction to decide when to hunt animals or plant crops worked much better than simply sowing on the next warm moist day, because nice days occur as often in October as in April. And it worked almost as well before it was discovered that the earth actually went around the sun, rather than the other way around.
The term big data vaguely summarizes the immense collections of organized past experiences made possible by the latest information technologies. These collections are foundations for our expectations of personalized medicine or self-driving cars, and already, by empowering Facebook or Google, they quietly but significantly impact our lives. Within big data, searching for predictive patterns requires specialized and increasingly complex statistical methodologies, for which analytics has become something of a buzzword.
Even more of a big data buzzword is “moneyball,” originally the title of an acclaimed book and movie that recounts how the Oakland Athletics baseball team succeeded, despite financial weakness, by embracing novel performance statistics as well as scouts’ judgments when making player decisions.[1] Perhaps because of the dramatic tension between the cultures of statistical analysis and athletic competition, moneyball then became a general label for an emphasis on measurable quantities over subjective opinions when making organizational decisions. At least in baseball, where hundreds of millions of dollars can depend on the performance of one individual, the teams that most rapidly and effectively blended these two disparate cultures have indeed experienced the better records.
The success of moneyball is also my reason for deciding to write this memoir. For, as Moneyball and another noteworthy book, The Numbers Game, relate, I was heavily involved in the birth of baseball analytics, also called sabermetrics.[2] During the 1970s, before fantasy games, personal computers, and Bill James’s incandescent writing triggered an explosion of interest in baseball analytics, almost its entire literature was letters and manuscripts I exchanged with Pete Palmer. One study in particular, on clutch hitting, became something of an enduring classic, among others summarized in John Thorn’s The Hidden Game of Baseball.[3] These experiences primed me for a remarkable opportunity, to create and develop probably the first in-depth, pitch-by-pitch baseball information system while cofounding and then refounding STATS Inc., which today as STATS LLC is the worldwide leading provider of sports statistics, its little red logo appearing in the credits at the ends of many televised sporting events. There are a lot of origin stories yet to be told and a few details to be amplified and clarified.
Yet for me baseball was always a side interest, even if a very intense one for fifteen years. My real fifty-year career was founded on different big data activities, as a chemist who pioneered in the use and development of specialized analytics approaches, intended to guide the discovery of pharmaceutical drugs, collectively known as computer-aided drug discovery (CADD). Any international renown that I’ve enjoyed has resulted from that work, especially the creation of comparative molecular field analysis (CoMFA), whose popularity is attested by many thousands of publications citing its use and, I like to think, must somewhere have contributed to the all-too-rare discovery of a new medicine. And there is a second startup story of a company called Tripos, whose CADD software product, Sybyl, was the worldwide leader for twenty years, but which like most startups eventually stumbled and was absorbed, digested, and finally eliminated by a larger company, Certara. Therefore, with apologies in advance to any of you who are turned off by geeky scientific stuff (as opposed to geeky baseball stuff?), I must tell you something about drug discovery, computer-aided, with a few thoughts about how analytics and big data have become central to baseball and might be made more effective for drug discovery.
Receiving lifetime achievement awards in both baseball research and computational chemistry would seem to establish my credentials in these fields, but are these activities representative of big data? Admittedly, the quantities of their “recorded and organized experiences” (okay, from now on let’s just say “data”) were tiny compared to today’s big data galaxies. But in their day, these were new systems that stretched the limits of the available hardware. Wikipedia endorses my position, declaring that “big data . . . seldom [refers] to a particular size of data set.” Surely the pervasive effect of analytics on baseball is irreversible—just look at how the infielders are constantly repositioning themselves. And some association of big data with drug discovery also seems enduring because novel analytics tools are often thrown at the frustrating mysteries of drug discovery—though so far with only modest and scattered successes. However, it should be noted that it is not analytics but physics-based simulation, in which most CADD scientists have been trained, that underlies most of the popular chirping about “discovering drugs with computers.”
In any case, this book is primarily a memoir of a very lucky man who did not pursue some vision of wealth or fame and is as surprised as anyone at how well things turned out. I have always been much more of a Wozniak than a Jobs, taking life one step at a time, hoping for nothing more than earning a respectable salary for doing things I mostly enjoyed. Yet at several key moments, I just happened to be in the right place at the right time with the right skills and motivations. Along the way, my half century of grappling with the waves of advances in the underlying information technologies may be of some historical or nostalgic interest. I’ve also enjoyed interesting pursuits and adventures having little to do with baseball, drug discovery, or the computer. Finally, I’ve crossed paths with many well-known names, especially in baseball, and worked closely with some should-be luminaries, like the guy who while a high school student was probably the first to discover a recognizable form of the now omnipresent baseball statistic known as OPS and has today taught more than three hundred different courses at Harvard, perhaps more than anyone else in its history, or a fellow Dixieland jazz trombonist who reported creating the world’s first operating system.
1. Lewis, Moneyball.
2. Schwarz, Numbers Game.
3. Thorn and Palmer, Hidden Game of Baseball.