What is Distant Reading and how does it help us?

Distant reading is a concept first raised by Franco Moretti. It means “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data.”[1] Moretti believes that literary scholars cannot reveal the whole picture of literature by reading only a small number of existing books. Without another method, they would have to split into groups, spend a large amount of time reading different texts, and then pool their analyses to identify patterns in how literature has developed. With distant reading, scholars can explore new ways to identify and understand texts.

Distant reading is not designed to replace current understandings of literary works but to offer different readings that human readers might ignore or disregard. As Hoover states in his Text Analysis article, “Computer-assisted textual analysis is neither a panacea nor substitute for sound literary judgment, but its ability to refine, support and augment that judgement makes it an important analytic method for literary studies in the digital age.”[2] Human brains and computers work in completely different ways. Each uses different methods and focuses on different aspects of a text, so their results can differ considerably. Human reading and distant reading (reading by computers) can complement each other well: the “big data” analysis of literature and the personal “small reading” can inform one another. This is why computer-assisted text analysis has become one of the important branches of digital humanities in recent years.

Distant reading gives us a well-rounded view of larger bodies of literature, even across different backgrounds and genres. Clement finds that “data-mining procedures proved to be productive in initially illuminating complex structural patterns that helped [her] discern those underlying patterns.”[3] Text analysis is based on word frequencies and collocations. People usually don’t pay attention to prepositions, pronouns, and articles, which seem to carry little of a work’s meaning. Even when people do notice such words, they will not remember the patterns of their occurrences, concordances, or frequencies, let alone understand their roles in a work’s diction, syntax, and structure. Analyzing literary syntax or writing style by hand is extremely difficult. As Hoover states, “words have the advantage of being meaningful in themselves and in their significance to larger issues like theme, characterization, plot, gender, race, and ideology.”[2] Distant reading and text analysis reduce subjectivity in the analysis of a work. It “[defamiliarizes] texts, making them unrecognizable in a way (putting them at a distance) that helps scholars identify features they might not otherwise have seen, make hypotheses, generate research questions, and figure out prevalent patterns and how to read them.”[3] This matches what I found when I used Wordle along with Google Ngrams to plot word frequencies in Martin Luther King’s “I Have a Dream” speech and the Declaration of Independence. The two word clouds generated by Wordle give clear and vivid images of each text’s main topics and its author’s intentions. They back up our understanding of the text with direct statistics and also raise cultural and conceptual expectations.
The main topic of King’s speech is a call for freedom, justice, and equality in American society, and the main purpose of the Declaration of Independence is to emphasize the importance of government. Even with a small number of words as input, the program produces a clear demonstration of these themes. Given a much larger input, Wordle might reveal even more striking results. Similarly, Google Ngrams tracks subtle changes in word occurrences over time and produces interesting results that invite reflection and explanation.
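As a rough illustration of the word counting that sits behind a word cloud, here is a minimal Python sketch. It is not how Wordle itself is implemented. The stopword list, the function name `word_frequencies`, and the sample sentence are all made up for this example.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword list; real tools ship much longer ones.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count how often each non-stopword appears in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)

# Invented sample text, not a quotation from either document.
sample = "Freedom and justice for the nation. Freedom for all. The dream of freedom and justice."
freqs = word_frequencies(sample)
print(freqs.most_common(2))  # [('freedom', 3), ('justice', 2)]
```

A word cloud then simply draws the most frequent words in the largest type, which is why dropping articles and prepositions first matters so much for what the picture shows.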

Screen Shot 2016-01-27 at 1.35.42 PM Independence

Screen Shot 2016-01-29 at 1.51.01 PM

Compared to traditional close reading, which focuses on a small body of literature from the same genre or era, distant reading helps scholars identify many more patterns, similarities, and differences. Text analysis makes it possible to answer questions about the history of literature: how to distinguish American literature from British literature, what the most notable stylistic difference between them is, how to tell novels from poems more quickly, how to distinguish the work of male authors from that of female authors, and how to identify the work of authors who are anonymous or unknown because records have been lost. In the readings, Hoover reports differences in diction between male and female authors [2], and Underwood sketches a simple, hypothetical statistical model that distinguishes pages of poetry from pages of prose [4]. These examples suggest that distant reading is powerful enough to detect patterns that seem impossible for scholars to find or that might take them years to discover.

Distant reading makes the humanities dynamic and energetic. According to Underwood [4], it is an interdisciplinary conversation among fields such as the social sciences, the humanities, computer science, sociology, statistics, and literary history. As more fields join the discussion, the findings become more well-rounded and diverse. In the past, only humanists and social scientists worked on how to analyze and understand texts, but now scholars from many different fields gather to focus on one topic and contribute their ideas, as if the world were collaborating toward one goal. I believe that with the intelligence and strength of such diverse collaboration, there will be major changes not only in our understanding of texts but also in cultural equality.

Although distant reading has promising long-term prospects, it also brings a number of challenges and shortcomings. First, digital texts are subject to copyright restrictions, and because OCR technology is limited, text recognition is not perfect. Second, text analysis can help scholars spot problems very quickly, but it cannot supply reasoned explanations or corresponding solutions, which severely limits its application. Besides, humanities scholars and social scientists often focus on individual literary works, and heavy use of big data analysis could obscure the important patterns within each individual work. Additionally, it may restrict the development of critical thinking skills and innovation. There is no doubt that distant reading will continue to be a popular trend in academia, but it will not replace traditional humanities scholarship. It can complement and work alongside traditional methods to bring new ideas and discoveries.

 

[1] Schulz, Kathryn. “What Is Distant Reading?” The New York Times. 2011. Web. 30 Jan. 2016.
[2] Hoover, David L. “Text Analysis.” Literary Studies in the Digital Age. 2013. Web. 30 Jan. 2016.
[3] Clement, Tanya. “Text Analysis, Data Mining, and Visualizations in Literary Scholarship.” Literary Studies in the Digital Age. 2013. Web. 31 Jan. 2016.
[4] Underwood, Ted. “Seven Ways Humanists Are Using Computers to Understand Text.” The Stone and the Shell. 2015. Web. 31 Jan. 2016.

Culturomics

Screen Shot 2016-01-29 at 1.50.34 PM Screen Shot 2016-01-29 at 1.51.01 PM

The advantages of Bookworm:

We can find and explore the original documents that contain the word we searched for, along with their contents. Google Ngrams only lets us search the documents by time interval. With Bookworm, we can also set the X-axis of the graph to publication month, year, or even day. Similarly, the Y-axis can be changed to % of words, % of texts, word count, or text count.
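To make the difference between two of these Y-axis options concrete, here is a small Python sketch. It is only an illustration under my own assumptions, not Bookworm’s actual code: the function names and the three-article mini-corpus are invented.

```python
def percent_of_words(texts, term):
    """Share of all word tokens that are `term`, as a percentage."""
    total = sum(len(tokens) for tokens in texts)
    hits = sum(tokens.count(term) for tokens in texts)
    return 100.0 * hits / total if total else 0.0

def percent_of_texts(texts, term):
    """Share of texts that contain `term` at least once, as a percentage."""
    if not texts:
        return 0.0
    hits = sum(1 for tokens in texts if term in tokens)
    return 100.0 * hits / len(texts)

# Invented mini-corpus: three already-tokenized "articles".
texts = [
    ["freedom", "freedom", "now"],
    ["nation", "now"],
    ["freedom", "nation", "now", "now", "now"],
]
word_share = percent_of_words(texts, "freedom")  # 3 of 10 words
text_share = percent_of_texts(texts, "freedom")  # 2 of 3 texts
print(round(word_share, 1), round(text_share, 1))  # 30.0 66.7
```

The same word can look rare by one measure and common by the other, so it is worth checking which Y-axis a graph uses before interpreting it.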

Screen Shot 2016-01-29 at 9.00.51 PM

The disadvantages of Bookworm:

There is always a spike around 1840 for most of the words that I searched for. My guess is that the corpus contains few sources from around 1840 and that those sources cover only specific topics, so even a few occurrences of a word produce a very high percentage. Besides, Bookworm covers only American newspapers, whereas Google Ngrams covers a wide variety of books.
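The arithmetic behind this guess can be sketched in a few lines of Python. The counts below are hypothetical numbers chosen for illustration, not values taken from Bookworm, and `relative_frequency` is just a helper name I made up.

```python
def relative_frequency(word_count, total_words):
    """Occurrences of a word as a percentage of all words in that year."""
    return 100.0 * word_count / total_words

# Hypothetical year with very few sources: 5 hits in 1,000 words.
sparse_year = relative_frequency(5, 1_000)
# Hypothetical year with many sources: 500 hits in 1,000,000 words.
rich_year = relative_frequency(500, 1_000_000)

print(sparse_year > rich_year)  # True
```

Even though the word appears a hundred times more often in the well-covered year, the sparse year shows the higher percentage, which is the kind of spike a small denominator could produce.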

The advantages of Google Ngrams:

It covers more recent literature. We can search works from 1800 to 2008, while Bookworm only covers the interval from 1840 to 1920. We can change the language of the corpus. We can easily embed the graph in a website. Ngrams also lets us specify whether the word being searched is a noun, a verb, or an adjective. The most impressive part is a feature called Ngram Compositions, which lets users combine the counts of the words they choose in a single graph.


Ngrams Post

King's Speech

  • Which words are dominant? Which are subordinate? What cultural and conceptual expectations does this visualization of King’s speech raise?

Dominant words include freedom, dream, nation, together, justice, etc.

Subordinate words include girls, pass, path, etc.

The cultural and conceptual expectations raised include justice and equality between black people and white people. The main topic of King’s speech is a call for freedom and equality in American society. King tried to bring people together.

  • Enter the dominant terms from King’s speech, language English, period 1800-present day. What does the graphical representation tell us? Look at the time periods underneath and click on the peak periods. What are the source texts? Now, enter other dominant terms. What does the graph show you? Can you think of some explanations for this change?

Screen Shot 2016-01-27 at 7.48.16 PM

Occurrences of “freedom” rise between about 1925 and 1960 and gradually decrease after 1960.

Screen Shot 2016-01-27 at 7.50.35 PM

The word “dream” keeps increasing from 1800. The concept of a dream attracts people more and more in the present.

Screen Shot 2016-01-27 at 7.50.44 PM

Mentions of “nation” decrease considerably from 1800. My explanation is that the concept of nation does not carry as much weight as it did in the past.

Independence

  • What terms are dominant? Now create an Ngram with those two terms.  What does the graph show you? Can you think of some explanations for this change?

Dominant terms include government, powers, happiness, right, mankind, history, etc.

Screen Shot 2016-01-27 at 7.48.46 PM

According to the Ngram, the word “government” was mentioned fairly frequently. Its frequency decreased a little from 1860 to 1940. My explanation is that between 1860 and 1940 there were the Civil War, World War I, and the beginning of World War II. Much of the world was in chaos and countries were fighting each other, so war dominated what was written and less was written about government.

Screen Shot 2016-01-27 at 7.48.02 PM

This Ngram of the word “powers” shows a decline in its use in literature. I think this is because society has gradually moved towards equality and balance, so authors are less inclined to discuss, complain about, or comment on “powers,” which is why the word has appeared less and less in recent years.

  • Extra credit:

http://www.davidicchiasmus.com/blog/authors-non-lds/martin-luther-king-jr-dream

What prosodic elements does the author of this site identify?  In what way does this add to the power of King’s speech?

The author of this site identifies a repeated thematic pattern. The content of the speech goes deeper and deeper, starting from the topic of “the greatest demonstration for freedom”. This makes King’s speech easy to follow.