Sentiment Analysis

Machines can read emotions if their sentiment lists and grading are tailored to a certain type of work. A program cannot be trusted wholeheartedly or blindly, and the user has to have prior knowledge of the corpus before running it through a program in case the machine does make a mistake. Prior knowledge allows the user to understand whether a result is skewed by a writing style or time period, and it also helps the user understand the context of why a text has a certain sentiment. It is also important to ensure that the list of words the program uses for “happy” and “sad” or “calm” and “angry” analysis is relevant to the specific corpus, and to create a separate list or modify the existing one to get the best results. For students who may be using translated or older texts, it is necessary to find or create a list that has words from that time period, or that takes into account a translator’s choice to translate a text into an older, unused English when a program’s list only has “new” English.

Programmers can create various programs that take into account the different ways sentiment can be read in a text. A program can analyze a text using word frequency alone, or word frequency and collocations, which may make it more accurate. In class we spoke about taking the word “not” into account: a document can say “not tragic” and “not angry”, but if a program considers “not” a stop word, it will miss that the phrase means the opposite of the sentiment word. The programmer may also want to take into account colloquial sayings that do not have one explicit meaning: a computer without specific programming is as clueless about US sayings as a foreigner who misunderstands being told to “break a leg”. Ramsey discusses creating a program that can “even distinguish [a] noun from the verb” for a word like “love”, as well as using programs to measure the richness of documents in a corpus and to “rank them according to ‘vocabulary richness’ (defined as the largest number of different words per fifty-thousand-word block)”.
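To make the “not” problem concrete, here is a minimal sketch of a word-list sentiment scorer in Python. It is only an illustration under my own assumptions: the tiny word lists, the NEGATORS set, and the function names are all made up for this example and do not come from any of the tools discussed in class. It flips the sign of a sentiment word when the word right before it is a negator, and the drop_negators option mimics a program that throws “not” away as a stop word.

```python
import re

# Hypothetical toy word lists; a real project would tailor these to its corpus.
POSITIVE = {"happy", "calm", "love"}
NEGATIVE = {"tragic", "angry", "sad"}
NEGATORS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score(text, drop_negators=False):
    """Add +1 or -1 per sentiment word, flipping the sign right after a negator.

    With drop_negators=True the negators are removed first, which imitates
    a program that treats "not" as a stop word.
    """
    words = tokenize(text)
    if drop_negators:
        words = [w for w in words if w not in NEGATORS]
    total = 0
    previous = ""
    for word in words:
        value = (word in POSITIVE) - (word in NEGATIVE)
        if previous in NEGATORS:
            value = -value
        total += value
        previous = word
    return total


print(score("The ending was not tragic"))
print(score("The ending was not tragic", drop_negators=True))
```

The two calls score the same sentence differently: once “not” is dropped, “tragic” is counted as plainly negative. A sketch like this only looks one word back, so it would still miss phrases like “not at all tragic”.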

Another possibility is creating or using a program that “learns” from experience. Like Mallet, which works best and is more accurate with bigger texts, a sentiment analysis program that adds to its database with every search might be useful, although this may result in disasters like Microsoft’s “Tay” Twitter personality. “Tay” was a Microsoft project that was supposed to imitate a teenage girl to ease customer service and make it seem more personable. Tay was quickly corrupted by internet trolls and in the course of a day went from tweeting her love for humans to posting racist, Nazi-supporting tweets. Her story shows that even the best program can have difficulty mimicking human thoughts and conscience. This is an argument used against sentiment analysis: it is hard to trust the results of something as cold and unemotional as a computer when it’s analyzing something as “human” as emotions. We trust our computers to give us accurate results for maths and science, shown clearly in a student’s dependence on his or her calculator, but it is difficult for people to accept the same calculations for emotion. According to Ramsey this is because of “fears of an inhumanistic technology” and the fact “that we might ‘lose the text’ frightens many”.

Reflection

The project I have been working on this semester is an analysis of a list of top 10 feminist speeches from Marie Claire’s “The 10 Greatest Speeches of All Time by 10 Inspirational Women”. The list is not very homogeneous, with one speech dating back to 1588 and the most recent one from 2012. I wanted to know if there was a pattern that I could find in the speeches, something that would connect them and demonstrate why these speeches ended up becoming the top speeches when there are so many contenders to choose from. I also wanted to see if there was a change in the way the women spoke as they gained more rights and equality in society.

My project began very differently. I first decided on civil rights as a whole, toying with ideas of race and feminism the most before finally deciding on feminism. I had been going to come up with my own compilation of speeches throughout history to analyze before deciding to narrow it down to the one list I have now. Eight of the ten speeches were easily available; the two that gave me the most trouble were the longest and shortest pieces: a book and a 30-second clip. The clip took so much time because when I searched for it I thought it was part of a longer speech before realizing that it was not, and after more searching I gave up looking for it and wrote it down myself. The entire text was only about four sentences. The book was also a little more difficult to find than the speeches, but I eventually was able to find a complete pdf that I could then convert to a .txt file.

Before using tools to analyze my documents I wanted to get more insight into the demographics of my speakers. Specifically, I was interested in the time period they lived in and their country of origin. The site listed all the years, so I only had to search for each speaker’s country. What I found was that five of the speakers were from the United States, three were from the United Kingdom, one was from Myanmar and one was from Australia.

Voyant was the first program I used for corpus analysis, and I found it very easy to use and intriguing. I would definitely agree with Professor Faull that it is a good entry drug to analyzing a corpus. The friendly colors and interesting graphics made it easy to find some of the more basic patterns in the text. For me it was also useful in that it brought to light some issues I would be facing with the dissimilar lengths in my corpus. The book was larger than all the other speeches combined, inevitably skewing the data, and the smallest, four-sentence speech had no representation because of how short it was: although it had powerful words like “humanism”, “sex” and “race”, they only appeared once. The best solution I could think of was to run my corpus through Voyant with all ten documents and then with only the nine smaller documents. I also searched the book alone to understand some of the skewing when searching all ten. Of all the features Voyant has, the collocation map was by far the most interesting representation of stereotypes and violence against women. I searched certain terms because of a discussion we had in class about each gender’s usage of stereotypical words when referring to men and women; I was curious to see if feminist women would show the same trends in speeches written to battle those stereotypes. It was interesting and a little frightening to see some of the words that came up: women were connected to “emotional”, “Christ”, “kill” and “families”, while men were connected to “careers”, “misogyny” and “sexism”. While women were still connected to “families” and “emotional”, the terms for men show the speaker’s indignation and anger at how she is treated by men because of her femininity.

Antconc helped me unpack some of my results a bit more and gave more context and insight. When I used collocation in Antconc I continued to see the use of stereotypes in the feminist speeches, with phrases like “man predominates” and “men guffawed” versus “women cannot”, “women forced” and “women should”. Antconc’s way of presenting results did show me a change in the manner of speaking based on the times. The earliest speech, Elizabeth I’s speech to the troops, does not have the same tone of fighting for equality. She accepts the stereotype that women are weak and feeble and proceeds to say she is different because she has the heart of a king. In other words, she is not a good leader generally: she is a good leader because of her masculine qualities that boost her up. She is not strong because women are strong; she is strong because she has a man’s strength.

Jigsaw had the potential to work really well for what I needed, but I think it would have worked better if my corpus were expanded. I realized very quickly that many of the results I had been getting were only relevant to a specific speech, one where the speaker repeated a certain phrase or idea multiple times to drive a point home. “Opposition” is important in the misogyny speech because Julia Gillard is speaking to the “Leader of the Opposition”, Mr. Slippers (another word that came up on the word clouds), so she says opposition many times, not necessarily in the context someone would expect. Jigsaw also struggled to recognize sentiments in my speeches. None of the speeches were happy, but Jigsaw labeled some happier than others.

Continuing from here, I think I will expand my corpus, maybe by finding another, more focused list or just by adding more speeches to the ones I have now. I am also deciding whether I should add more of the longer books or remove the one I have now from my corpus.

Comparison of Antconc and Voyant

 

Figure 1 – Corpus Creation process

My corpus has not changed much since my last blog post: it is a collection of feminist speeches from Marie Claire. I found the collection in an article called “The 10 Greatest Speeches Of All Time By 10 Inspirational Women”:

  1. Virginia Woolf, ‘A Room of One’s Own’ (1928)
  2. Emmeline Pankhurst, ‘Freedom or Death’ (1913)
  3. Elizabeth I, ‘Speech to the Troops at Tilbury’ (1588)
  4. Hillary Clinton, ‘Women’s Rights are Human Rights’ (1995)
  5. Sojourner Truth, ‘Ain’t I a Woman’ (1851)
  6. Nora Ephron, ‘Commencement Address to Wellesley Class of 1996’ (1996)
  7. Aung San Suu Kyi, ‘Freedom From Fear’ (1990)
  8. Gloria Steinem, ‘Address to the Women of America’ (1971)
  9. Julia Gillard, ‘The Misogyny Speech’ (2012)
  10. Maya Angelou, ‘On the Pulse of Morning’ (1993)

Most of the speeches had transcripts available online, making my corpus creation pretty simple: copying and pasting them into TextEdit and making the files plain text. The only text that was not available was Gloria Steinem’s ‘Address to the Women of America’, a thirty-second clip I was able to type up myself.

I chose this corpus because of the huge difference in time and location between the speakers. To have feminist speeches dating back to 1588 was a surprise, and I wanted to understand what it was that got these speeches on the list when there are so many other potential speeches in between. I knew beforehand that some of the main words in these speeches would be man/men and woman/women; I guessed, considering the topic, that other top words would be “rights” and “feminist”, maybe “vote”. Two of my speeches were actually literary works, a poem and a book, so I considered the possibility of that affecting my corpus, since poems use different devices and words than speeches do. The book was also much larger than the rest of the corpus combined and my smallest piece was only a 30-second speech: this would definitely skew my data, something I would have to figure out how to deal with once I started my analysis.

Figure 2 – Word cloud. Top: full corpus. Bottom: without the longest doc

The first thing I did was search for the most common words in all 10 texts. I noticed that, as I had predicted, my data was greatly skewed, with the most common words in my word cloud only showing high frequencies in the book and not in the rest of the texts. “Woman” showed up in many of the texts, but “mind” and “like” were only significant words in A Room of One’s Own. In Figure 2 you can see the huge difference that results from removing the book from the corpus: more expected words showed up, such as “rights” and “opposition”. I think my searches also very much solidified the theory that while computer analysis can be very telling about a text, context is also necessary. I was surprised by “slippers” showing up on the word cloud, unable to imagine how slippers could be relevant in a feminist speech, but when I clicked on it I was taken to a speech addressed to and about “Mr. Slippers”, in which the speaker says his name frequently throughout the text.
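Running the corpus with and without the book was my way of handling the skew. Another approach I could imagine, sketched here in Python purely as an illustration, is to compare relative frequencies (occurrences per 1,000 words) inside each document instead of raw counts, so that a long book and a four-sentence speech are measured on the same scale. The function name and the per-1,000 scaling are my own assumptions, not something Voyant is confirmed to do in this way.

```python
import re
from collections import Counter


def relative_frequencies(text, per=1000):
    """Return each word's count scaled to occurrences per `per` words."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count * per / total for word, count in counts.items()}


# Made-up mini "document" just to show the call; real input would be a .txt file.
sample = "women women men rights"
print(relative_frequencies(sample)["women"])
```

With numbers like these, a word that appears once in a very short speech can still stand out, although a single occurrence in four sentences is not much evidence on its own.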

Figure 3 – Full corpus analysis
Figure 4 – Nine shorter docs

The class’s conversation about the words most closely connected to words referring to men and women also sparked an interest in me. I wondered if those same patterns would be evident even in speeches by feminist women. When I did my first search with all ten documents, I came up with the words “families”, “children”, “emotional” and, scarily, “beating”. For the nine shorter documents I got “families” and “children” again, as well as another violent word, “killing”.

I used the collocates tool a lot more in Antconc than I did in Voyant. I think in Voyant I was distracted by the image-producing tools: I spent some time trying to figure out how to make the “knots” tool at all relevant to my research. The collocates tool in Antconc is important in giving context to those visual representations. Some important terms I saw were “man predominates” and “men guffawed”, terms that give a certain visual, versus “women should”, “women cannot” and “women forced”. To me these observations represent the speakers’ belief in a powerful man overpowering or lording over trapped women.
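For anyone curious what a collocates search is doing underneath, here is a rough Python sketch that counts the words appearing right next to a search term. It is a simplification under my own assumptions: it uses raw counts within a small window, whereas a tool like Antconc can also rank collocates with statistical measures, and the function name and sample sentence are made up for the example.

```python
import re
from collections import Counter


def collocates(text, node, window=1):
    """Count the words that appear within `window` tokens of `node`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == node:
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Made-up sample text; a real run would read one of the speech files.
sample = "Women cannot vote. Women should vote."
print(collocates(sample, "women").most_common())
```

Widening the window picks up more distant neighbors, at the cost of more noise from words that just happen to be nearby.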

Figure 5 – excerpt from Elizabeth I’s speech

Another interesting find from my search was the “feminism-levels” of women based on the time period. The earliest speech, Elizabeth I’s speech to the troops, shown in Figure 5, shows that she is aware of the men’s views and opinions of her but does not seem to see an issue with them. Instead of speaking of women as equals to men, she speaks as if she is an exception in her likeness to a man. It’s not “I am strong because women can be strong” but rather “I am strong because I have a man’s heart”.

While Voyant was nice to use as a way to begin analyzing the corpus, it was easy to get distracted by the visuals it provided. Antconc, although a little boring in its visual representation, allows a researcher to focus on the key terms and patterns uncovered by Voyant and analyze them more deeply. Antconc seemed to me to dive more into the text, going back to find the context of what certain results mean and to confirm assumptions made from the visuals of Voyant. Both look at unique words and vocabulary, but the way they represent them is very different even though some tools are similar.

The process has definitely helped me find some interesting insight into the texts, showing me patterns that were completely unexpected as well as confirming results I had predicted. My main goal had been to find similarities and patterns in what makes a great feminist speech, but the differences between the texts in the corpus make it difficult to analyze that well. Despite fears of skewing, however, I was able to find data and the beginnings of a pattern in my texts.

My pictures are acting up, so here’s a link to the presentation: https://docs.google.com/a/bucknell.edu/presentation/d/1EnOfkOXKJg8R43H3nE3OP6nDSA0d7pyB6wzHodHvwwo/edit?usp=sharing