How Well Does Your Computer “Read” You?

What if your computer could read your emotions?

This question has been posed for many years, but recently sentiment analysis and machine reading have received renewed attention. Upon hearing it, some people react with panic and distress: they assume their computer knows more than they do and that control is passing from human to mechanical hands. Nevertheless, I believe that machines only “attempt” to read emotion, and that emotion is far more complex than any machine, or even any human, can fully define.

Emotionless Machines

According to Stephen Ramsay, “In an age when the computer itself has gone from being a cold arbiter of numerical facts to being a platform for social networking and self-expression,” it is important to understand these programs as a means of facilitating self-expression and networking (Ramsay 81). The programs you use online do not create sentiment for you; as the writer, you must create the emotion through your language. But can these programs recognize it?

For example, text messaging or email allows you to talk to someone without having a face-to-face or ear-to-phone conversation. You can type what you are feeling and send it to that person. But how can the person receiving your message understand the tone or inflection of your voice from text alone? It is up to the receiver to interpret your message. This is one example of where sentiment analysis fits within the Digital Humanities: you can let the computer determine the sentiment of a text systematically, but it is up to the researcher to know the material beforehand and to draw sound conclusions based on both close and distant reading.

Text and email messages adopted emojis – “any of various small images, symbols, or icons used in text fields in electronic communication (as in text messages, e-mail, and social media) to express the emotional attitude of the writer…” (Merriam-Webster Dictionary). Emoticons and emojis help the receiver of a message grasp the sender’s intended tone through their expressions.

Is it possible for computers/machines to read emotion?

There are algorithms and platforms that can help a computer read emotion. The algorithms are “trained” through experimentation. One way an algorithm can identify the sentiment of a document or sentence is through a vocabulary, or sentiment lexicon: the computer selects the most frequent words from the document or sentence and categorizes them as adjectives, verbs, or negations, using these context clues to locate the emotion. Another way an algorithm can identify sentiment is by applying machine learning (ML) techniques: it treats sentiment as a classification problem, compiles the data, and sets up features that signal sentiment. Thus, it is possible for machines to read sentiment systematically, although the real question is how well they can read emotion.
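To make the lexicon approach concrete, here is a minimal sketch in Python. The tiny word lists, the negation rule, and the example sentences are my own illustrative assumptions, not the lexicon any particular platform actually uses; real systems rely on much larger vocabularies and handle negation far more carefully.

```python
import re

# Tiny illustrative word lists -- real sentiment tools use far larger lexicons.
POSITIVE = {"kindness", "peace", "holy", "happy", "good"}
NEGATIVE = {"unworthy", "haughty", "bitter", "sad", "bad"}
NEGATORS = {"not", "no", "never"}

def sentiment_score(sentence: str) -> int:
    """Count positive and negative words, flipping the sign when the
    previous word is a negator (e.g. "not happy" counts as negative)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = 0
    for i, word in enumerate(words):
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity and i > 0 and words[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score

print(sentiment_score("The peace and kindness were holy"))  # positive score
print(sentiment_score("He was not happy, only bitter"))     # negative score
```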

Within the Digital Humanities, we look to step back from the close reading of a text and observe it from a distance, or “in another light,” which allows us to draw differential conclusions. Similarly, the data produced by sentiment analysis still needs a human to interpret and categorize it. As Ramsay states, “Such numbers are seldom meaningful without context, but they invite us into contexts that are possible only with digital tools” (Ramsay 75). It is also important to remember that algorithms are written by humans.

Topic Modeling 

Topic modeling is a way of breaking texts down into individual words and discerning the semantic roles of those words from their context. The machine chops your text into parts and searches for patterns: specifically, it looks for words that co-occur within documents, and it can compare those patterns across a set of different documents.
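As a rough illustration of that co-occurrence search, here is a minimal sketch using the Python library gensim; the toy documents and the choice of two topics are my own assumptions, and this is not necessarily how any specific topic-modeling tool works internally.

```python
from gensim import corpora, models  # pip install gensim

# Toy documents -- a real run needs a much larger corpus.
docs = [
    "the prince must secure his state through war and arms",
    "nature reveals motion and time to the careful observer",
    "war and arms decide the fate of the prince and his state",
]
tokenized = [doc.split() for doc in docs]

# Map each word to an id and turn every document into a bag of word counts.
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Ask LDA for two co-occurrence patterns ("topics") across the documents.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=20)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```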

When I went through and highlighted my own version of Lincoln’s Gettysburg Address, I realized that my own experiences, shaped by culture, religion, education, and so on, determined whether I marked a word as related to governance (in blue) or war (in red). Likewise, after Ramsay’s students realized their mistakes in categorizing the books by word density, he stated, “They have arrayed the objects of their intellectual life in categories that correspond, among other things, to the cultural penumbras in which texts are disseminated and taught” (Ramsay 73). The manner in which a teacher or professor discusses a book with his or her students can shape the students’ perceptions of that book.

The topic modeling system gains its own “experience” from a set of documents. The bigger the corpus, the better the program performs, because it has more reliable information from which to determine word patterns. Small sample sizes are not ideal; the program will not be able to locate patterns accurately.

Topic modeling also has its problems. The program can appear to find patterns where none exist, and you may be tempted to extrapolate meaning from the output when, in fact, no sound conclusions can be drawn.


It is clear that my own list of topics resembles the keywords not only from my earlier word cloud but also from the keyness I tested for in Antconc. For each topic line, I predicted which text would be most significant and was right every time. It is very interesting how the computer can surface similarity between texts through “topics.” Topic modeling helped me understand just how closely related some of the keywords across my texts are.

Jigsaw & AlchemyAPI

I received some interesting sentiment analysis from Jigsaw. For example, Pico’s Oration on the Dignity of Man was the “happiest” of my documents. Jigsaw was accurate in identifying keywords and even more impressive with its summaries. Kant, meanwhile, was categorized as the “saddest/angriest” of my texts. I know from reading both that Pico has a “happier” tone than Kant; however, neither is necessarily a “happy” text. Thus, you have to look at the vocabulary Jigsaw was using to arrive at its sentiment results.

On closer inspection, Pico uses terms such as “kindness,” “peace,” “natural,” and “holy,” whereas Kant’s reading contains terms like “unworthy,” “tolerance,” and “haughty.” It is imperative that you have read the materials you are running through sentiment analysis, so that you can apply differential techniques when analyzing the data.

 

Digital Duel – Voyant vs. Antconc

Corpus Construction

Since my previous blog post on February 15, my corpus has grown incrementally, and I have narrowed its focus for better results.

Throughout the process of constructing my corpus, I have made great strides and a few errors as well. The best word to describe creating a corpus is “iterative.” That being said, I realized the magnitude of my task and decided to narrow my focus, for now, to comparing the Renaissance and Enlightenment texts in my class, HUMN 150. As I proceed toward the final weeks of class, I will add the remaining texts and perform the corresponding analyses (see Future Decisions).

The initial steps of creating a corpus are indubitably the hardest. The first step is asking which general question you would like to research; the next is obtaining the documents needed for that research. I decided to create a corpus of all the readings from my Comparative Humanities class, HUMN 150. Over the course of HUMN 100, I intend to compare the 2015 syllabus with the first syllabus from 2000.

Cleaning my corpus was another difficult task. I was able to get the books from Project Gutenberg and one from Professor Faull; however, I acquired the supplementary readings in PDF form through my HUMN 150 instructor, Professor Shields.

I used Adobe Acrobat Pro to convert the PDFs into text files (.txt) and saved them in my Google Drive, which is organized into “Renaissance&Enlightenment” texts, “Text Files from Gutenberg,” and “PDFs.” With these folders, I can keep track of which text files I am using, and I keep notes on my corpus construction indicating what I keep and delete in each file. I then cleaned each file using Spellcheck.net, TextFixer.com, and TextCleanr.com, removing line breaks, paragraph breaks, HTML markup, and extra white space. Additionally, I manually cleaned each file, correcting spelling and removing footnotes, some chapter titles, authors’ names, and page numbers.
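For anyone curious what that cleaning looks like in code rather than in a browser tool, here is a rough Python equivalent of the same steps; the folder names and the page-number pattern are placeholders for my own files, not part of any of the websites mentioned above.

```python
import re
from pathlib import Path

def clean_text(raw: str) -> str:
    """Roughly mimic the browser tools: drop HTML tags, bare page numbers,
    line and paragraph breaks, and extra white space."""
    text = re.sub(r"<[^>]+>", " ", raw)                            # leftover HTML
    text = re.sub(r"^\s*\d+\s*$", " ", text, flags=re.MULTILINE)   # bare page numbers
    text = text.replace("\r", " ").replace("\n", " ")              # line/paragraph breaks
    return re.sub(r"\s+", " ", text).strip()                       # collapse white space

src = Path("Renaissance&Enlightenment")  # placeholder folder of raw .txt files
dst = Path("cleaned")                    # placeholder output folder
dst.mkdir(exist_ok=True)
for path in src.glob("*.txt"):
    cleaned = clean_text(path.read_text(encoding="utf-8", errors="ignore"))
    (dst / path.name).write_text(cleaned, encoding="utf-8")
```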

 

Overall Research Question:

How has the syllabus changed from 2000 to 2015 in terms of genres and authors (including gender differences)?

Also, the course was originally titled “Art, Nature, and Knowledge” and is now “Enlightenments.” Which is the more accurate title, and what should it be?

General Questions: 

Are “God” and “knowledge” prevalent terms throughout the Renaissance and Enlightenment texts? Which is dominant?

Is there a gender bias between female and male authors? Do their writings show a preference for gendered pronouns?

Does the authors’ lexicon reveal that they are true humanists?

 

Differential Analysis & Analytical searches with Voyant and Antconc

Voyant was the first platform on which I performed an analytical search. It is visually appealing software that allows you to upload your corpus and run word frequency, collocation, and a multitude of other searches.

Some of my recent searches include the terms “he,” “she,” “her,” and “him”; I was looking for gender bias in my readings. I created a cirrus, or word cloud, using the maximum of 500 words and noticed that “he” and “him” were overwhelmingly the most frequent terms throughout my corpus (see Translation Problems).
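Voyant builds the cirrus for you, but a rough stand-in can be scripted. This sketch assumes the third-party wordcloud package and a single combined corpus file, both of which are my own placeholders rather than anything Voyant requires.

```python
from wordcloud import WordCloud  # pip install wordcloud

# Assumes the whole corpus has been concatenated into one plain-text file.
text = open("corpus_combined.txt", encoding="utf-8").read()

# Keep the 500 most frequent words, mirroring the cirrus setting above, and
# pass an empty stopword set so pronouns like "he" and "him" survive.
cloud = WordCloud(max_words=500, stopwords=set(), width=800, height=600).generate(text)
cloud.to_file("cirrus.png")
```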

 

Then, I searched the terms in Antconc, entering “he|him” to combine the results: Antconc produced 2,249 hits for “he” and “him” combined. For “she” and “her,” there were only 176 hits.
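The same “he|him” versus “she|her” comparison can be reproduced outside Antconc with a short script; the folder name below is a placeholder, and the exact counts will depend on how the corpus was cleaned.

```python
import re
from pathlib import Path

masculine = re.compile(r"\b(?:he|him)\b", re.IGNORECASE)
feminine = re.compile(r"\b(?:she|her)\b", re.IGNORECASE)

m_hits = f_hits = 0
for path in Path("corpus").glob("*.txt"):  # placeholder folder of cleaned texts
    text = path.read_text(encoding="utf-8", errors="ignore")
    m_hits += len(masculine.findall(text))
    f_hits += len(feminine.findall(text))

print("he|him hits:", m_hits)
print("she|her hits:", f_hits)
```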


                    

This is a differential search in Voyant looking at some KWIC (keyword in context) results. I used both “he|him” and “she|her” to research the gender issue in depth. Usually, “he” is used to refer to a general population. The most interesting case is the use of “her” and “she”: the lexicon surrounding these terms is significantly negative, with words such as “folly,” “mistress,” and “bitter.” Also, “her” and “she” are frequently used to stand in for “nature” (purity) and for “law” and “deliberations” (mutability). When I refer to purity, I mean the “sexually pure good girl”; when I use the word “mutable,” I mean the way in which men view women as objects and press women into the image they want to see.

On the other hand, the terms surrounding “he” and “him” allude to power; among them are “God” and “Lord.”

 

This screenshot shows a collocation search for “her|she,” which also lets me see the negative lexicon surrounding “her” and “she” throughout all my texts. There are terms like “submission,” “prostituting,” and “virginity.” This lexicon gives the impression that women are represented in my texts only as pure or “dirtied,” and as objects. Most of these collocates carry negative connotations and denotations.
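A collocation search like this one can also be approximated by counting the words that fall within a small window around “she” or “her”; the window size and folder name here are assumptions rather than Antconc’s or Voyant’s actual settings.

```python
import re
from collections import Counter
from pathlib import Path

WINDOW = 5                     # assumed span: five words to the left and right
targets = {"she", "her"}
collocates = Counter()

for path in Path("corpus").glob("*.txt"):  # placeholder folder of cleaned texts
    words = re.findall(r"[a-z']+", path.read_text(encoding="utf-8", errors="ignore").lower())
    for i, word in enumerate(words):
        if word in targets:
            span = words[max(0, i - WINDOW):i] + words[i + 1:i + 1 + WINDOW]
            collocates.update(w for w in span if w not in targets)

print(collocates.most_common(20))  # the most frequent collocates of "she"/"her"
```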

 

Comparison of Voyant & Antconc

In this blog, I have differential searches from both Voyant and Antconc, and each platform has its own strengths and weaknesses. Voyant is useful for a brief, illustrative analysis of your corpus and includes many different tools for viewing it; however, it is easy for scholars to challenge your research by calling the illustrations “pretty pictures” and nothing more. In fact, this is not true: they are visual representations of your corpus statistics. For example, you can enter a word like “God” and see the density of this word throughout all your texts with a single tool, such as bubblelines or a scatter plot.


Antconc allows you to view the statistics behind Voyant’s illustrations. It is easy to see the collocations of a word like “God.” Antconc also reports the frequency of a word (hits) and the terms used around it (collocates), and lets you click on a term to see its context in each reading. In addition, Antconc supports a reference corpus, which lets you measure the keyness of one group of texts with respect to another. For example, I used my Enlightenment texts as a reference for my Renaissance texts to compute the keyness of the Renaissance texts. The keywords were obvious words like “prince” and “painter” rather than words that belong to the Enlightenment, like “time” and “motion.”
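Antconc computes keyness for you, but the underlying statistic can be sketched by hand. Below is the standard log-likelihood comparison of a word’s frequency in a study corpus against a reference corpus; the counts in the example are invented purely for illustration.

```python
import math

def log_likelihood(freq_study, study_size, freq_ref, ref_size):
    """Log-likelihood keyness: how surprising a word's frequency in the study
    corpus is, given its frequency in the reference corpus."""
    total = study_size + ref_size
    expected_study = study_size * (freq_study + freq_ref) / total
    expected_ref = ref_size * (freq_study + freq_ref) / total
    ll = 0.0
    if freq_study:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Invented counts: "prince" in the Renaissance texts vs. the Enlightenment texts.
print(log_likelihood(freq_study=120, study_size=150_000,
                     freq_ref=10, ref_size=200_000))
```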

 

Pragmatics

The platforms have given me great insight into the gender markers of my texts. Words like “he” and “she,” which are usually treated as stop words, give me a chance to analyze gender bias in the 2015 syllabus. To answer the overall research questions, however, I will need to upload the corpus for the syllabus from 2000. As of now, I can conclude that there is a male bias in my texts and that there are no female authors among the Renaissance and Enlightenment texts.

The terms from the old title of the course, “Art, Nature, and Knowledge”, are prevalent in the text as well.

Translation Problems

The biggest issue with my corpus is that most of the texts are translations from German, Italian, Chinese, and other languages. Thus, I am not always certain whether an English “he,” for example, was actually a neuter form in the German; if it was, I have to take the translator’s gender bias into account as well. Likewise, the German word “Menschheit,” meaning “mankind,” is a feminine noun, yet in English it may be rendered with a masculine, supposedly inclusive term. I will need to take the translations and the translators’ abilities into account as I progress with my research.

Future Decisions

As I mentioned above, in the coming weeks I will add the remaining texts from my HUMN 150 class and the texts from the 2000 syllabus. Then I will be able to answer my lingering questions with more evidence. I believe I am on the right track with my research, and I am glad I narrowed my focus and decided to add material gradually. I have learned that digital work takes time: if you want a solid foundation for your corpus, you need to collect your materials properly and clean them carefully. For now, I am excited to learn new platforms and gather new analyses!

 

Corpus Creation In Action

Initial Thoughts

Creating a corpus is not the easiest project; it requires time, patience, and a lot of effort. The process begins by asking which general question you would like to research. At first, it can be daunting to narrow your research to a specific category. I started by thinking about authors and genres that I enjoy reading. I wrote down all my ideas and considered the different approaches I could take. Eventually, I thought about researching certain literary writers – Emily Dickinson, Shakespeare, Thoreau, Emerson. Then, I picked out philosophical interests I have – Existentialism, Objectivism, Platonism. However, I wanted to save those ideas for a later project.

Instead, I decided to create a corpus of all the readings from my Comparative Humanities class, HUMN 150 Enlightenments. I also want to compare the new 2015 syllabus to the first syllabus from 2000; my goal is to see how the course has shifted in genres, authors (and their gender), and number of texts. I also hope to do a sentiment analysis to see whether most of our texts are positive, negative, or neutral.

HUMN 150 highlights some of the most important intellectual, political, and literary trends from the European Renaissance to the beginnings of “modernity” in the late 19th century. There are fourteen supplementary readings along with ten books from the 2015 syllabus.

The books on the 2015 syllabus:

  1. Oration on the Dignity of Man by Giovanni Pico della Mirandola
  2. The Prince by Niccolò Machiavelli
  3. The Essential Galileo by Galileo Galilei
  4. The Narrow Road to the Deep North by Basho
  5. Discourse on the Origin of Inequality by Jean-Jacques Rousseau
  6. Frankenstein by Mary Shelley
  7. A Narrative of the Life of Frederick Douglass by Frederick Douglass
  8. The Communist Manifesto by Karl Marx and Friedrich Engels
  9. The Origin of Species and The Descent of Man by Charles Darwin
  10. The Home and the World by Rabindranath Tagore

 

It was extremely difficult to find the supplementary readings online, so I asked Professor Shields, my HUMN 150 instructor, for a PDF copy of each reading. I was then able to get Oration on the Dignity of Man from Professor Faull as a Word document. For the other nine books, I used Project Gutenberg, a digital library of free eBooks. However, there were issues with the Project Gutenberg books: they were not the exact translations and editions we are using for class, which could have a meaningful impact on my research.

Problems with Wrong Translations

  • Word Cloud – The frequency of words could be different depending on the style of the translator’s lexicon.
  • The word density will be different depending on the number of words in each text.
  • There will be disparities in rhetoric if I choose to look at literary devices like alliteration, anaphora, and allusion.

Current Creation

Currently, I am cleaning and parsing my texts and am almost finished removing filler words, page numbers, and extra spaces. I still need to scrape the text out of the supplementary readings’ PDF files (a step sketched below) and then OCR the texts from the old syllabus. After I finish those three processes, my corpus will be complete and ready for textual analysis.
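For the PDF step, a script can do the conversion instead of Adobe Acrobat Pro. This sketch assumes the pdfminer.six package and placeholder folder names, and it only pulls the embedded text layer; scanned pages would still need a separate OCR pass.

```python
from pathlib import Path
from pdfminer.high_level import extract_text  # pip install pdfminer.six

pdf_dir = Path("PDFs")        # placeholder: folder of supplementary-reading PDFs
out_dir = Path("TextFiles")   # placeholder: output folder for .txt versions
out_dir.mkdir(exist_ok=True)

for pdf in pdf_dir.glob("*.pdf"):
    text = extract_text(str(pdf))  # pull the embedded text layer from the PDF
    (out_dir / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
```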

Finalizing My Corpus

I am keeping a metadata sheet in Google Sheets for all the data I am using; I know it is crucial to pay attention to detail and stay organized throughout this process. I will be thinking about the translation and edition problems along the way and will record my findings accordingly. I hope to discover gaps between the two syllabi, such as missing genres and gender preferences among authors. I would also like to concentrate on certain texts, like Oration on the Dignity of Man and Leonardo da Vinci’s Notebooks, to illustrate similarities and differences between the two writers’ thought processes. Even though creating a corpus is an arduous task, the new discoveries you can make are groundbreaking.