Digital Duel – Voyant vs. Antconc

Corpus Construction

Since my previous blog post on February 15, my corpus has grown incrementally, and I have narrowed its focus for better results.

Throughout the process of constructing my corpus, I have made great strides and a few errors as well. The best term to describe creating a corpus is “iterative.” Having realized the magnitude of my task, I decided to narrow my focus, for now, to comparing the Renaissance and Enlightenment texts in my class, HUMN 150. As I proceed toward the final weeks of class, I will add the remaining texts and perform the respective analyses too (see Future Decisions).

The initial steps of creating a corpus are indubitably the hardest. The first step is asking which general question you would like to research. The next is obtaining the documents needed for that research. I decided to create a corpus of all the readings from my Comparative Humanities class, HUMN 150. Over the duration of our class, HUMN 100, I intend to compare the 2015 syllabus with the first syllabus from 2000.

Cleaning my corpus was another difficult task. I was able to get the books from Project Gutenberg and one from Professor Faull; however, I acquired the supplementary readings in PDF form through my HUMN 150 instructor, Professor Shields.

I used Adobe Acrobat Pro to convert the PDFs into text files (.txt), and I saved them in my Google Drive. My Google Drive is organized into folders for “Renaissance&Enlightenment” texts, “Text Files from Gutenberg,” and “PDFs.” With these folders, I can keep track of which text files I am using. I also keep notes on my corpus construction that indicate what I keep and delete in each text file. Then, I cleaned each file by using Spellcheck.net, text fixer.com, and text cleansr.com. Using these websites, I removed line breaks, paragraph breaks, HTML markup, and extra white spaces. Additionally, I manually cleaned each file, correcting spelling and removing footnotes, some chapter titles, names of authors, and page numbers.
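For readers who would rather script the automatic part of this cleaning, here is a minimal sketch in Python. It is not what the websites above do internally. The function name clean_text and the sample string are my own, and the sketch only covers HTML remnants, line breaks, and extra white space. Spelling, footnotes, and page numbers would still need the manual pass.

```python
import html
import re


def clean_text(raw: str) -> str:
    """Strip HTML remnants and collapse line breaks and extra white space."""
    text = re.sub(r"<[^>]+>", " ", raw)  # replace anything that looks like an HTML tag
    text = html.unescape(text)           # turn entities such as &amp; into characters
    text = re.sub(r"\s+", " ", text)     # line breaks, paragraph breaks, repeated spaces -> one space
    return text.strip()


# Made-up sample, only to show the effect.
sample = "<p>Of the   dignity\nof man</p>\n\n<p>God &amp; knowledge</p>"
print(clean_text(sample))
```

Collapsing every break into a single space mirrors the choice described above to remove paragraph breaks. If paragraph boundaries mattered for a later analysis, that step would have to be changed.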

 

Overall Research Question:

How did the syllabus from 2000 change in terms of genres and authors (gender differences)?

Also, the course was originally titled “Art, Nature, and Knowledge” and is now “Enlightenments.” Which title is the most accurate, or what should it be?

General Questions: 

Are “God” and “knowledge” prevalent terms throughout the Renaissance and Enlightenment texts? What is dominant?

Is there a gender bias towards female and male authors? Do their writings have gender-preference pronouns?

Does the authors’ lexicon reveal that they are true humanists?

 

Differential Analysis & Analytical searches with Voyant and Antconc

Voyant was the first platform on which I performed an analytical search. It is visually appealing software that allows you to upload your corpus and perform word frequency, collocation, and a multitude of other searches.

Some of my recent searches include the terms “he,” “she,” “her,” and “him.” I was looking for gender bias in my readings. I created a cirrus, or word cloud, using the maximum of 500 words. I noticed that “he” and “him” were overwhelmingly the most frequent terms used throughout my corpus (see Translation Problems).

 

Then, I entered the terms “he” and “him” into Antconc as “he|him” to combine the search results. Antconc produced 2249 hits for “he” and “him” combined. As for “she” and “her,” there were only 176 hits.
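The same kind of combined count can be sketched in a few lines of Python. This is only an illustration. The function name count_hits and the sample sentence are made up, and I am assuming whole-word, case-insensitive matching, which may not mirror Antconc's own search settings.

```python
import re


def count_hits(text: str, pattern: str) -> int:
    """Count whole-word, case-insensitive matches of an alternation such as 'he|him'."""
    regex = re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)
    return len(regex.findall(text))


# Made-up sample sentence, not taken from the corpus.
sample = "He gave him the book. She thanked him, and the author signed her copy."
print(count_hits(sample, "he|him"))
print(count_hits(sample, "she|her"))
```

The word boundaries (\b) matter here. Without them, “he” would also be counted inside words like “the” and “she,” which would inflate the male pronoun count.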

This is a differential search from Voyant looking at some KWIC (keyword in context) results. I used both “he|him” and “she|her” to research the gender issue in depth. Usually, “he” is used to refer to a general population. The most interesting case is the use of “her” and “she.” The lexicon surrounding these terms is significantly negative, such as “folly,” “mistress,” and “bitter.” Also, “her” and “she” are frequently used to stand in for “nature,” which suggests purity, and for “law” and “deliberations,” which suggest mutability. When I refer to purity, I am speaking of the “sexually pure good girl.” When I use the word “mutable,” I am speaking of the way in which men view women as objects, and how men force women into the image they want to see.

On the other hand, the terms surrounding “he” and “him” allude to power. Some of the lexicon items surrounding these terms are “God,” and “Lord.”
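To show what a KWIC view does behind the scenes, here is a rough sketch in Python. The function name kwic, the window width, and the sample sentence are all hypothetical, and the tokenizer is a simple one that only keeps ASCII letters and apostrophes. Voyant's own tokenization is surely more careful.

```python
import re


def kwic(text: str, pattern: str, width: int = 3) -> list[str]:
    """Return each keyword match with `width` words of context on either side."""
    tokens = re.findall(r"[A-Za-z']+", text)
    node = re.compile(pattern, re.IGNORECASE)
    lines = []
    for i, token in enumerate(tokens):
        if node.fullmatch(token):  # the whole token must be one of the keywords
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}")
    return lines


# Made-up sentence, only to show the output format.
sample = "Nature is a mistress and she rewards those who study her laws."
for line in kwic(sample, "she|her", width=2):
    print(line)
```

Reading the words to the left and right of each hit is what let me notice terms like “mistress” near “she” and “God” near “he.”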

 

This screenshot shows the collocates of “her|she.” It also enables me to see the negative lexicon surrounding the terms “her” and “she” throughout all my texts. There are terms like “submission,” “prostituting,” and “virginity.” This lexicon gives the impression that women are represented in my texts only as pure or “dirtied” and as objects. The greater part of this collocate list consists of words with negative connotations and denotations.
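A collocate list like this one can be approximated by counting the words that fall within a few positions of each hit. The sketch below is my own simplification. The function name collocates, the span of four words, and the sample sentence are assumptions, and it reports raw counts only, without the statistical scores a concordancer can add.

```python
import re
from collections import Counter


def collocates(text: str, pattern: str, span: int = 4) -> Counter:
    """Count words appearing within `span` tokens of any keyword match."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
    node = re.compile(pattern, re.IGNORECASE)
    counts = Counter()
    for i, token in enumerate(tokens):
        if node.fullmatch(token):
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            # Skip the keywords themselves so they do not dominate the list.
            counts.update(w for w in window if not node.fullmatch(w))
    return counts


# Made-up sentence, not a quotation from the corpus.
sample = "Folly speaks, and she praises folly; her folly is her crown."
print(collocates(sample, "she|her", span=2).most_common(3))
```

A word that sits near two different hits is counted once for each hit, so frequent neighbors rise to the top of the list quickly.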

 

Comparison of Voyant & Antconc

In this blog, I have differential searches from both Voyant and Antconc. Each platform has its own strengths and weaknesses. Voyant is useful for a brief and illustrative analysis of your corpus. It includes many different tools to view your corpus; however, it is easy for scholars to challenge your research by calling the illustrations “pretty pictures” and stating that they are nothing more. This is not true. They are visual representations of your corpus statistics. For example, you can enter a single word like “God” and see how it is distributed throughout all your texts with a single tool, such as bubblelines or a scatter plot.

Antconc allows you to view the statistics behind the illustrations of Voyant. It is easy to see the collocations of a word, like “God.” Moreover, Antconc reports the frequency of a word (hits) and the terms used around that word (collocates), allowing you to click on a term and see its context in each reading. Antconc also allows you to have a reference corpus. A reference corpus allows you to see the keyness of a certain group of texts with regard to another corpus. For example, I used my Enlightenment texts as the reference corpus for my Renaissance texts to see the keyness of the Renaissance texts. The key words were obvious ones like “prince” and “painter,” rather than words that belong to the Enlightenment like “time” and “motion.”
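Keyness can be computed in several ways. One common statistic is log-likelihood, and the sketch below assumes that measure. I did not check which statistic my Antconc settings used. The function name and the counts in the example are hypothetical, not figures from my corpus.

```python
import math


def log_likelihood(freq_target: int, size_target: int,
                   freq_ref: int, size_ref: int) -> float:
    """Log-likelihood keyness of one word: target corpus versus reference corpus."""
    total = freq_target + freq_ref
    # Expected counts if the word were spread evenly across both corpora.
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    score = 0.0
    if freq_target:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_ref:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


# Hypothetical counts: a word used 30 times in a 1,000-word target corpus
# and 5 times in a 1,000-word reference corpus.
print(round(log_likelihood(30, 1000, 5, 1000), 2))
```

A score near zero means the word is about equally common in both groups of texts. A high score only says the word is unusually distributed, so you still have to check in which corpus it is more frequent.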

 

Pragmatics

The platforms have given me great insight into the gender markers of my texts. The usual stop words like “he” and “she” give me a chance to analyze the gender bias in the 2015 syllabus. However, to answer the overall research questions, I will need to upload the corpus of the syllabus from 2000. As of now, I can conclude that there is a male bias in my texts and that there are no female authors among the Renaissance and Enlightenment texts.

The terms from the old title of the course, “Art, Nature, and Knowledge,” are prevalent in the texts as well.

Translation Problems

The biggest issue with my corpus is that most of the texts are translations from German, Italian, Chinese, and other languages. Thus, I am not always certain whether a term rendered as “he” was, for example in German, actually in the neuter form. If it was, I have to take the translator’s gender bias into account too. Also, a word like “Menschheit” in German, which means “mankind,” is a feminine noun; in English, however, it may be interpreted as a masculine inclusive noun. I will need to take the translations and the translators’ abilities into account as I progress with my research.

Future Decisions

As I mentioned above, in the following weeks I will add the remaining texts from my HUMN 150 class and the texts from the syllabus from 2000. Then, I will be able to answer my lingering questions with more evidence. I believe that I am on the right track with my research. I am glad I have narrowed my focus and made the decision to gradually add material. I have learned that digital work takes time. If you wish to have a solid foundation for your corpus, you need to properly collect your materials and clean them carefully. As of now, I am excited to learn new platforms and collect new analyses!

 

I Get by with a Little Help from Distant Reading

Distant Reading

Franco Moretti, an Italian literary scholar, pioneered a new research technique within the digital humanities known as “distant reading.” Distant reading, as Moretti told the New York Times, is “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data” (Schulz 3). It is also referred to as “textual/text analysis.”

Distant reading extracts the text from the page and illustrates it through digital tools. This provides a clear overall reading that is not always evident through close reading, the traditional process of reading materials. Distant reading is the ability to draw our attention away from what traditional reading teaches and to uncover the patterns that emerge both at a distance and close up. Tanya Clement describes this process as turning a magnifying glass upside down. As Clement states, it’s a method to “defamiliarize texts, making them unrecognizable in a way…that helps scholars identify features they might not otherwise have seen, make hypotheses, generate research questions, and figure out prevalent patterns and how to read them” (Clement 3). It is important to emphasize the new outlook that text analysis can give us on an otherwise standard reading.

The practice of distant reading is becoming increasingly popular as new technologies emerge and questions are asked.

Distant Reading Changing Perspectives

Pretend you have just finished all of Shakespeare’s plays. You find each to be fascinating in its own right, but you forget the main concepts. Distant reading allows you to pull each play apart, extracting hidden information without having to read every play for the second time. Then, you can compare Hamlet, Macbeth, Othello, etc. and draw conclusions about each text or about Shakespeare’s ideas.

A digital humanist, Ted Underwood, claims that you can “identify distinctive vocabulary” (Underwood 15). Therefore, this would give you an opportunity to pick out Shakespeare’s most used words and analyze his diction/lexicon for each play. Consequently, you can make inferences on why Shakespeare wrote the way he did or what his ultimate goal was in a piece of writing.

For example, this is my Wordle on Emily Dickinson’s poem entitled “I cannot live with You.”

This picture depicts all of the words in her poem without commonly used English words. The most dominant word in the Wordle is presumably “Life.” From this I can make some inferences and conclusions about this poem. For instance, the main idea surrounding the poem is “Life” according to the Wordle. However, just looking at the words without context makes it difficult to understand the tone of the text. That is when “differential reading” becomes important.

 

Differential Reading

Clement explains differential reading as “…close and distant reading practices as both subjective and objective methodologies” (Clement 2). Thus, close reading allows a critical analysis of a literary work, while distant reading acts as an “upside down magnifying glass,” illustrating hidden patterns to scholars. 

For instance, with my Wordle of Emily Dickinson’s poem, it would help to read the poem myself and synthesize my own interpretation of it. Thereafter, I can put together my ideas with the digital techniques to make a more accurate hypothesis.

Another digital humanist, David Hoover, points to the possibility of “Investigating how and the extent to which authors differentiate the voices of characters or narrators…” (Hoover 3). This could be in a novel, play, poem, etc. Taking Hoover’s point into consideration, I could compare the tones and rhetoric of multiple Dickinson poems. Furthermore, I could even break a poem up into stanzas and look for different tones between the stanzas. The possibilities are endless with distant reading; new information and new approaches are always waiting to be discovered.

Challenges to Distant Reading

There are some significant disadvantages to distant reading.

  1. Copyright Laws – As Hoover stated, “For texts not available in digital form, an electronic text can be created by scanning and OCR. Unfortunately, it is not entirely clear that this is legal for texts in copyright” (Hoover 13).
  2. Finding the Text – It is incredibly hard to find some texts online. Even if you are fortunate enough to find your text, sometimes there are different editions, authors, and publishers. It can be extremely difficult to choose the text that best suits your research.
  3. Texts not in Digital Form – In this case, you can perform an OCR scan. However, you must keep copyright laws in mind. Moreover, if there are any additional drawings or markings on the original, they might not copy.
  4. Sentiment – It is difficult for a computer to distinguish between emotions. As the reader, you have your own perspective and develop emotions from that.
  5. Expansiveness of Archives – The collections of certain digital archives may be too small for a complete analysis.

This is only a small list of disadvantages. Unfortunately, there are some more, but in most cases the pros outweigh the cons.

 

Example of a Challenge to Distant Reading

First, I did a Wordle of the Preamble of the United States Constitution.

Next, I used the three most dominant terms, “establish,” “United,” and “States.” However, when I entered these three words into the Google Ngram Viewer, I combined “United” and “States” into one phrase, which is an advantage of close reading.
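The two steps so far can be imitated with a short Python sketch: count single words after dropping common words, then count “United States” as one 2-gram. This is only an approximation of what Wordle and an n-gram tool do. The stop word list is hand-picked for this example, and the variable names are my own.

```python
import re
from collections import Counter

PREAMBLE = (
    "We the People of the United States, in Order to form a more perfect Union, "
    "establish Justice, insure domestic Tranquility, provide for the common defence, "
    "promote the general Welfare, and secure the Blessings of Liberty to ourselves "
    "and our Posterity, do ordain and establish this Constitution for the "
    "United States of America."
)

# Hand-picked stop words for this example only; a real tool would use a much longer list.
STOP_WORDS = {"we", "the", "of", "in", "to", "a", "more", "for", "and",
              "our", "ourselves", "do", "this"}

tokens = re.findall(r"[a-z]+", PREAMBLE.lower())

# Single words with stop words removed, roughly what sizes the words in a word cloud.
word_counts = Counter(t for t in tokens if t not in STOP_WORDS)

# 2-grams are pairs of neighboring tokens, so "united states" is counted as one unit.
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(word_counts.most_common(3))
print(bigram_counts[("united", "states")])
```

Counting pairs of neighboring words is what makes “United States” one searchable unit. A tool that only indexes single words can see “United” and “States” but has no way of knowing they belong together.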

I set the years between 1700 and 2008, to see the frequency of the terms used in literature.

Then, I used Culturomics, or Bookworm: ChronAm, to plot the same terms over time. However, Bookworm: ChronAm was not reading “United States,” even though it said you could enter a 2-gram (two-word phrase). So I graphed “establish.”

Then I tried to graph “UnitedStates.” I did get a graph, but when I looked at the source texts, the only word highlighted was “United” in the articles. This shows that not all digital tools will work properly with what you want to do.

Finally, I tried “The United States” and received an oddly shaped graph. The articles highlighted words like “here” and “mistakes,” which have nothing to do with “The United States.”

 

Summary of Distant Reading

Text analysis is based on the use of both subjective and objective practices. While the objective practices require a mathematical output of word frequency, etc., there is a certain subjectivity in interpreting the meanings from a graph, based on knowledge of history, philosophy, and other unquantifiable subjects.

Underwood describes it as “…an interdisciplinary conversation about methods…” (Underwood 5). He also states that you may get sucked in and come across new territory not yet discovered. Fortunately, that is where the fun lies, daring to climb to new heights and to make superior breakthroughs.

Works Cited

Schulz, Kathryn. “What Is Distant Reading?” The New York Times. The New York Times, 24 June 2011. Web. 31 Jan. 2016.

Clement, Tanya. “Literary Studies in the Digital Age.” Literary Studies in the Digital Age. 2013. Web. 31 Jan. 2016.

Hoover, David L. “Literary Studies in the Digital Age.” Literary Studies in the Digital Age. 2013. Web. 31 Jan. 2016.

Underwood, Ted. “Seven Ways Humanists Are Using Computers to Understand Text.” The Stone and the Shell. 04 June 2015. Web. 31 Jan. 2016.

 

“Bookworm.” Bookworm. Web. 31 Jan. 2016.

“Google Ngram Viewer.” Google Ngram Viewer. Web. 31 Jan. 2016.

“Wordle – Beautiful Word Clouds.” Wordle – Beautiful Word Clouds. Web. 31 Jan. 2016.