The Science of Literature

When William Wordsworth composed “Lines Written in Early Spring” sitting by a brook in 18th century England (green bowers and periwinkle flowers on his mind), he could not have expected that his poems would be intensely studied two centuries later by a passionate PhD student at the University of California, Berkeley.

Josephine Miles’ seminal study of Wordsworth, originally published in 1942.

This student’s work would go on to shape the bedrock for a new way of literary analysis - now commonly known as “distant reading”.

Distant reading is explained by its opposition to “close reading”. Close reading is defined by literary circles as reading a text while emphasizing depth. It is a relentless examination of the text’s linguistic contents and the themes and meanings that constitute it - be it a novel, a poem, or an excerpt from a speech.

I certainly have mixed feelings about the words “distant reading”, which on their own are quite opaque. But understanding that distant reading was proposed as an opposite analogue to close reading helps provide a hint to the meaning of the term.

The Great Unread

If close reading is about reading and studying elements of a text as a closed system, then distant reading is about not reading them - as Franco Moretti, founder of the Stanford Literary Lab, explains. What Moretti is referring to is taking a bird’s-eye view in literary research, by applying data analysis and computational techniques to the understanding of literature.

Many people have read more and better than I have, of course, but still, we are talking of hundreds of languages and literatures here. Reading ‘more’ seems hardly to be the solution. Especially because we’ve just started rediscovering what Margaret Cohen calls the ‘great unread’.

- Conjectures on World Literature, Franco Moretti

Our lives are brief, and we have a limited amount of time on this earth. The aspiring reader contends with the vast amounts of text and human utterances that have been recorded over history. As much of a bookworm as one might be, mortality is the perennial killjoy. The ‘great unread’ seems destined to remain unread.

It is from this premise that many interesting challenges arise. How can we begin to conceive of meaningful connections within the sheer mass of text that has been recorded in human history? Can we draw epistemologically valid insights from the literary canon? How can we even talk about activities such as describing the evolution of book genres over time, when the validity of our conclusions are in perpetual jeopardy?

A frequency analysis of the term “皇帝” (‘Emperor’ in Mandarin Chinese) as it appears in the “二十四史” (24 Chinese Histories), using PhiloLogic. Source: The 24 Chinese Histories Project at the University of Chicago

This is where computational and quantitative techniques, made feasible by modern technology, enter the fray with some gusto. These tools provide us with the weapons with which we do battle with the great unread, with big data. They also enable what researchers propose to be the scientific study of literature, allowing one to assert and falsify theories, make comparisons, and replicate findings.

These qualities are what define distant reading as a distinct method of study - a quantitative way of looking at literature.

What is a "main character"?

Let us imagine that we are tasked with providing a definition for what a “main character” or a protagonist of a story is.

Many people are able to make conjectures about which personas in a novel are “main characters” or protagonists. This ability persists across different novel titles, different novel genres, and novels of different lengths (here, I acknowledge the omission of some types of experimental literature).

Months ago, I asked some folks how they would describe a “main character” in a novel, independent of the story being considered. Common ideas that came up included the centrality of the character to the plot, their relational importance to the other elements of the story, and how well the character represents the “main themes” of the book. A friend of mine provided an explanation in the context of J.R.R. Tolkien’s “The Lord of the Rings: The Return of the King”:

…the King here, is well, clearly Aragorn. But if you remove Gimli and Legolas from the story, the LoTR story still runs - perhaps without elvish intelligence and dwarfish humor.

- Donavan Cheah, Cybersecurity Engineer

Another described a protagonist as being more “central” to the narrative; the story is told from their perspective more often as compared to other characters:

The story revolves around these characters. Events, conversations, and emotions are told from the central perspective of these characters.

- Anonymous, HR Executive

The above responses provide great information, and can themselves prompt a worthy investigation. But perhaps some would disagree that the LoTR plot can do without Gimli and Legolas. We also know that some stories attempt to blur the lines between character perspectives. What does one mean by the “story still runs”? What does one mean by “central”? The million-dollar question here is: How can look for verifiable and consistent patterns across a much larger category of texts and/or readers? Addressing this would require us to achieve a greater level of generalizability.

A verifiable and reproducible study might involve something like a network analysis of the characters. We gather data from a defined source (say a film script), and then analyze the text as discrete chunks of data. We then create a definition - and this piece is key. This is where “the science” comes in: there is always a hypothesis - an idea that we wish to falsify or substantiate. And here is also where we see the limitations to this line of inquiry: as with any scientific investigation, the conclusions are tied to the parameters of the hypothesis, definitions, and the sample data that we collect.

A possible definition of a "main” character could be the persona with the most number of interactions with other characters in the story:

A network visualization of the characters from the 8 seasons of HBO’s Game of Thrones series. Does this picture hold up with your sense of who the main characters are? Source: Network of Thrones

The Network of Thrones project used fan-authored scripts of HBO’s Game of Thrones series as data, and added links between two characters based on five conditions:

They appear in a scene together.
They appear in a stage direction together.
They exchange dialogue.
One mentions the other.
Another character mentions both of them together.

Combining these rules and applying them to the GoT scripts produced the above visualization. Clearly, Tyrion, Jon, and Daenerys sit comfortably at the top when it comes to how central their characters were. (For more details on their analysis of the GoT TV series, definitely check their website out.)

A more basic analysis may just examine the raw counts of how often a character is mentioned in a text. Of course, this does not mean that a persona with a high number of mentions is definitely a main character - though it could certainly be a proposed definition of one. Another type of analysis could perhaps look at the number of “actions” each character takes, perhaps by counting the number of verbs contained within statements that are associated with them.

So, on the whole, being able to define these quantitative measures does four important things for the field of literature:

It provides comparable data. A basic example is comparing the word count frequencies and distributions between two or more texts. This sounds like a simple exercise, but is more useful than most people think. More complicated methods exist to perform tasks such as automated author attribution.
It gets at the possible mechanisms behind how individuals ground their understanding of a concept, as we have seen with the concept of a “main character”. We can then test these possibilities - theories - across different texts, genres, authors, and other dimensions - which a nice segue into the next point.
It allows us to test hypotheses, and falsify them. With sufficient amounts of quantitative data, we can run robust statistical tests.
Last but not least, when combined with other technologies such as Optical Character Recognition (OCR) and high performance computing, it allows us to work with unprecedently vast amounts of textual data.

Distant Reading, Digital Humanities, Content Analysis

I spent many winter afternoons in 2019 in a small classroom at the University of Chicago’s Beecher Hall - attending a class aptly titled “The Science of Literature”, taught by my graduate advisor Hoyt Long. It was in this space that I first learned about Josephine Miles - the doctoral student mentioned in the beginning of this piece.

These days, when we hear about techniques such as content analysis, natural language processing (NLP), or network analysis, what often comes to mind - even for computing veterans - are complex mathematical models or high-tech magic. It may be difficult to imagine that it all started with Miles’ idea to start counting words and mapping out their frequencies. In this sense, I was glad that Hoyt chose to begin the course by looking at Miles’ work - another important reason being the stark underrepresentation of contributions by women to the literary and linguistic academy.

It is worth bearing in mind that early in Miles’ career, before computers were widely accessible, counting words meant flipping through the pages and manually indexing them by hand. No mean feat for a doctoral candidate with limited funding. She eventually went on to become the first woman ever tenured in UC Berkeley’s English Department - yet another oft-overlooked achievement.

In her 1942 piece “Wordsworth and the Vocabulary of Emotion”, Miles examines Wordsworth’s use of named emotions in his poetry in relation to his poetic style, and presents readers with a series of tables that accompany her analysis.

Miles, J. (1942). In Wordsworth and the vocabulary of emotion (p. 171). University of California Press. Source: https://catalog.hathitrust.org/Record/004471361

In Table 4 above, we can see the beginnings of word stemming and lemmatization. Miles takes care to leave a footnote regarding nuances such as the inclusion/exclusion of certain forms of the same word. In Table 3, we observe the beginnings of context recognition, categorization of words (in this case, named emotion), and concordance.

Quantitative Adventures

If all that sounds familiar to you, then you are possibly tech-savvy or have seen these methods in action. Readers who are familiar with NLP, artificial intelligence, and machine learning techniques would be aware of modern technologies that take advantage of some of these word categorization and context recognition processes.

Your typical chatbot or voice assistant - your Siris and Alexas - would not be as helpful if it could not recognize that the utterances “shut it down” and “turn it off” are essentially the same command from the user’s stand point. To recognize this, the machine has to learn that within this specific context, both of these utterances should map to the same result. And that, is at its core a problem of automatic categorization and classification.

Word counting and categorization are also enablers of the beautiful visualizations shown below:

A visualization of “communities” of words and concepts in Genesis 1:1 of the Bible using the *textexture* tool. The different colors represent words that appear next to each other the most often, and the largest nodes represent words that connect multiple different communities! The first image is a map of the whole text. The second and third images are the connections captured by the words “man” and “woman” respectively. From: https://textexture.com

An interactive visualization of semantic word relationships using the *Visuwords™* tool. The graph displays how words and phrases relate to each other semantically. Source: https://visuwords.com/on-the-spot

The Science of Literature and Interdisciplinarity

It is my ardent belief that the quantitative and the qualitative are complementary methods, especially in humanistic investigations. And just like in any healthy relationship, each half would help the other grow and prosper. But my stance of course, is not universally held.

The emerging fields of digital humanities and digital studies have been mired in a battlefield - one maintained and exacerbated by polarized attitudes on both sides of the false humanist-technologist dichotomy. The kind of arguments that underlie the insidious “left-brained versus right-brained” myth in education resurface here with a vengeance. Critics of the quantitative study of literature talk about the limiting nature of computational studies on the types of conclusions being drawn, while some of distant reading’s defenders focus on the insufficiency of traditional reading techniques to approach specific questions - questions that traditional humanities never claimed to hold the key to in the first place. These assertions are not false in the strict sense - but articulating them as criticism overshadows a principal fact: that these hermeneutic gaps are precisely why a healthy marriage between both camps is beneficial.

The idea that both approaches attempt to answer different types of questions is a point that cannot be belabored enough. Put together, the quantitative and the qualitative can greatly empower the pursuit of the humanities, lending strength and nuance to the discipline in a world dominated by data and technology. Together, they can help us to answer new types of questions, or age-old questions in new ways.

I personally enjoy working with and fostering interdisciplinary teams in the workplace for this very same reason, and try to bring the same spirit to my mentoring process.