November 2, 2007

From batting averages to the Bard, statistical theories apply


Stanford’s Bradley Efron, Ph.D., left, shares a laugh with Frank Harrell Jr., before giving last week's Discovery Lecture. (photo by Dana Johnson)


How many words did Shakespeare know but not include in his texts?

A certain statistical theory “allows us to peer into Shakespeare's mind and count the words that he knew but didn't use…this is an irresistible temptation,” Bradley Efron, Ph.D., said at last week's Discovery Lecture.

Efron, the Max H. Stein Professor of Humanities and Sciences and professor of statistics at Stanford University, took the audience on a historical tour of statistical theories, using examples like Shakespeare's "missing words," baseball batting averages, and prostate cancer data to illustrate "Bayesian" and "frequentist" approaches, and a modern compromise between the two called "empirical Bayes."

Statistics, Efron said, is the mathematical theory of learning from experience, especially the kinds of experience that arrive a little bit at a time — a baseball player's success or failure in a single at-bat, for example, or an individual cancer patient's success or failure on a new trial drug. Statistical theories make inferences based on those bits of information.
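The batting-average example is a classic illustration of the empirical-Bayes idea Efron described: each player's noisy early-season average is pulled toward the group mean by an amount the data itself chooses. The sketch below uses the James–Stein shrinkage form with made-up numbers; it is an illustration of the general technique, not figures from the lecture.

```python
def james_stein_shrink(averages, sigma2):
    """Empirical-Bayes (James-Stein) shrinkage: pull each individual
    average toward the grand mean by a data-driven amount."""
    k = len(averages)
    grand_mean = sum(averages) / k
    spread = sum((a - grand_mean) ** 2 for a in averages)
    # Shrinkage factor: near 0 when the observed spread is mostly
    # sampling noise (sigma2), near 1 when real differences dominate.
    c = max(0.0, 1 - (k - 3) * sigma2 / spread)
    return [grand_mean + c * (a - grand_mean) for a in averages]

# Made-up early-season averages for 10 players, 45 at-bats each;
# sigma2 ~ p(1 - p) / 45 is the binomial sampling variance near p = 0.3.
early = [0.400, 0.378, 0.356, 0.333, 0.311, 0.289, 0.267, 0.244, 0.222, 0.200]
shrunk = james_stein_shrink(early, sigma2=0.00467)
# With these numbers c comes out near 0.20, so each average is pulled
# about 80% of the way toward the group mean of .300 -- the .400 hitter
# to roughly .320, the .200 hitter to roughly .280.
```

The point of the formula is that the group's spread tells you how much of each individual's record is signal and how much is luck; the shrunken estimates typically predict the rest of the season better than the raw averages do.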

“Information is a very abstract concept, and that's one reason that it's hard to describe what statisticians do,” Efron said. It's about “ferreting out information, which always seems to try to hide from the eyes of scientists.”

Efron described the ongoing philosophical battle over information collection between the Bayesian and frequentist camps, and he noted that both ideas “seem to be getting a little creaky.”

New scientific technologies are producing huge datasets that require thousands of simultaneous inferences. “These new scientific situations change the game of statistics quite a bit,” Efron said. He noted the need for a new modern statistical approach, perhaps along the lines of Empirical Bayes.

After all, the “magic formula” of empirical Bayes allowed Efron and colleagues to put a number on Shakespeare's missing words. They estimated that the great poet and playwright knew more than 35,000 words that he did not include in his writings, in addition to the 31,534 distinct words he did use.

“We claimed we had doubled Shakespeare's vocabulary,” Efron quipped.
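Efron's published work with Ronald Thisted on Shakespeare's vocabulary builds on the Good–Toulmin "missing species" estimator, which predicts how many new words would turn up in a second corpus of equal size from the counts of how often each word appears. The sketch below shows the simplest form of that idea, using hypothetical frequency counts rather than Shakespeare's actual figures.

```python
def good_toulmin_unseen(freq_counts):
    """Good-Toulmin estimate of the number of *new* distinct words
    expected if the corpus doubled in size: the alternating series
    n1 - n2 + n3 - ..., where freq_counts[k-1] is the number of
    distinct words observed exactly k times."""
    return sum((-1) ** (k + 1) * n for k, n in enumerate(freq_counts, start=1))

# Hypothetical frequency-of-frequencies counts (NOT Shakespeare's data):
# 14,000 words seen once, 4,300 seen twice, and so on.
counts = [14000, 4300, 2100, 1300, 900]
print(good_toulmin_unseen(counts))  # 14000 - 4300 + 2100 - 1300 + 900 = 11400
```

The intuition is empirical Bayes in miniature: the many words an author used only once or twice carry information about the words he never used at all.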

For a complete schedule of the Discovery Lecture Series and archived video of previous lectures, go to