
Final Project

My final project is hosted on GitHub.

Please visit this page: http://taylorty.github.io/HUMN100-Final-Project/

A Google Drive link for my corpus: https://drive.google.com/a/bucknell.edu/folderview?id=0B-tJKv_fCxZUQU9iZnpMeUxQa3M&usp=sharing


Machine Reading Text

During the past two weeks, in order to investigate text, we've tried out different platforms and techniques by which machines read texts. On a macro level, we've learned about stylometry and produced dendrograms based on delta and zeta values in our corpus. On a micro level, we've begun to learn XML-compliant TEI markup.

From last Monday's lecture, we learned about stylometry, which is used to attribute authorship to anonymous or disputed documents. Stylometry also enables us to think, at a comparative and macro level, about how an author expresses himself or herself. I remember that during the presentation, Dr. James O'Sullivan showed an example of using dendrograms to identify one author's work among other authors' works. Since Lexos provides cluster analysis based on delta values (computed from the most frequent words) and zeta values (computed from distinctive words), it can quantify how one file differs from another and show us the relationships across our corpus.
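Lexos does all of this in the browser, but the same delta-style pipeline can be approximated in a few lines of Python. This is a rough sketch of the idea, not what Lexos actually runs, and the corpus/ folder and stopwords.txt file names are placeholders for my own files:

```python
import glob

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.feature_extraction.text import CountVectorizer

# Load the corpus; "corpus/" and "stopwords.txt" are placeholder names.
paths = sorted(glob.glob("corpus/*.txt"))
texts = [open(p, encoding="utf-8").read() for p in paths]
stop_words = open("stopwords.txt", encoding="utf-8").read().split()

# Relative frequencies of the 500 most frequent remaining words.
vec = CountVectorizer(max_features=500, stop_words=stop_words)
counts = vec.fit_transform(texts).toarray().astype(float)
freqs = counts / counts.sum(axis=1, keepdims=True)

# Burrows's Delta in spirit: z-score each word column, then compare
# texts by mean absolute difference of z-scores (cityblock distance).
z = (freqs - freqs.mean(axis=0)) / freqs.std(axis=0)
tree = linkage(z, method="average", metric="cityblock")

dendrogram(tree, labels=[p.split("/")[-1] for p in paths])
plt.tight_layout()
plt.show()
```

The dendrogram then joins the texts whose most-frequent-word profiles are closest, which is roughly the relationship the Lexos graphs below are showing.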

First, I uploaded my screenplay corpus. For cleaning, I uploaded my own stop-word list, based on the one we had from Jigsaw (in the resources folder). The resulting dendrogram is shown in Figure 1.

Figure 1: Dendrogram created for the screenplay corpus

This graph seems interesting, but I still could not figure out why the dendrogram groups these screenplays the way it does.

In order to understand what the dendrogram means, I thought I could upload my novels corpus to see whether there were similarities between the two dendrograms. However, I got another graph that I do not understand (see Figure 2 below).

Figure 2: Dendrogram created for the novels corpus

I carefully checked the y-axis of these two dendrograms, and I did not find any two novels/screenplays that are relatively similar (located at relatively close positions on the y-axis) or that have similar delta or zeta values. My only finding is that Figure 1 (the screenplay dendrogram) shows just two colors, and the relationships between the screenplays appear very simple, whereas Figure 2 (the novels dendrogram) shows four colors, and the files colored red are noticeably different from the rest. I think the two graphs differ in these ways because screenplays follow a more fixed and consistent format than novels do. (Screenplays have their own writing conventions.)

I believe there are two main reasons why I did not get interesting results. One is that these screenplays and novels were written by different authors, so they have extremely diverse writing styles and diction; the other is that I do not yet have a good macro-level understanding of these files. Knowing only the plot of each movie is far from enough.

During the process of learning text encoding, we first got to know about transcription. According to Pierazzo, the main purpose of transcription is to "reproduce as many characteristics of the transcribed document as allowed by the characters used in modern print." But there are many features one might want to consider out of the infinite set of "facts" that can be found on a written page. As Pierazzo also notes, one of the downsides of the traditional transcription method is that editors must make judgments about what to include, and in order to reach a consensus across academia, they need rules and guidelines that establish best practice. However, it is difficult for scholars to agree on common guidelines, and the transcription process itself involves subjective interpretation.

That is where the advantages of using a markup language come into play. Following the TEI guidelines, editors can easily keep a record of meta-information. We have two separate objects: the data model (the source) and the publication (the output). As Pierazzo mentions, "one of the reasons why the TEI model [is] effective is because it enables the encoding and transcription of several alternatives for the same segment," so the source file can contain not only a diplomatic edition but other editions as well. The advantage of encoding with TEI is that "to all intents and purposes there is no limit to the information one can add to a text—apart, that is, from the limits of the imagination." With XML, people can really move to the analytic level of the editing process.

As Pierazzo states, if the editor uses an XML (TEI)-based system, "the editor's goal needs no longer be 'to reproduce in print as many of the characteristics of the document as he can' but rather to achieve the scholarly purpose of the edition—a purpose which, by definition, varies." TEI markup enables scholars to perform different analyses based on their own research purposes. For example, a researcher interested in identifying different types of rhyme in poems could mark up the relevant words; a researcher curious about the diction and writing style of a particular novelist could mark up specific diction and syntax.

As a "researcher" myself, I intend to find the relationships between screenplays and their corresponding novels. XML can really help me mark up different entities in the screenplays, such as characters, dialogue, scenes, and descriptive and background metadata, since a screenplay is a semi-structured textual document.
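To make this concrete, here is a minimal sketch of the kind of TEI-style encoding I have in mind, together with a toy query against it. The element names follow TEI's drama conventions, but the scene content is an invented placeholder, not my actual markup:

```python
# A toy TEI-style fragment and a query against it, to illustrate the
# "analytic level" that markup enables.
import xml.etree.ElementTree as ET
from collections import Counter

scene = """
<div type="scene">
  <stage>INT. CAMBRIDGE LECTURE HALL - DAY</stage>
  <sp who="stephen"><speaker>STEPHEN</speaker>
    <l>Placeholder line of dialogue.</l></sp>
  <sp who="jane"><speaker>JANE</speaker>
    <l>Another placeholder line.</l></sp>
  <sp who="stephen"><speaker>STEPHEN</speaker>
    <l>A placeholder reply.</l></sp>
</div>
"""

root = ET.fromstring(scene)
# Once dialogue is wrapped in <sp who="..."> elements, questions like
# "who speaks most in this scene?" become one-liners:
print(Counter(sp.get("who") for sp in root.iter("sp")))
# Counter({'stephen': 2, 'jane': 1})
```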

I marked up a small part of the screenplay of The Theory of Everything based on the hamlet_scene.xml provided.

[Figures: my TEI markup of a scene from The Theory of Everything]

My next step will be to see how the scriptwriter uses verbs, and I will manipulate CSS to make the patterns observable.
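As a rough sketch of how the verb-finding step might work before any styling, NLTK's part-of-speech tagger is one possible tool (not something we used in class, and the sample line is a placeholder):

```python
# Pull the verbs out of a speech so they can later be wrapped in
# elements and styled with CSS.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

speech = "He runs the numbers again and checks the result."  # placeholder
tokens = nltk.word_tokenize(speech)
verbs = [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("VB")]
print(verbs)  # e.g. ['runs', 'checks'] (output depends on the tagger model)
```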

Sentiment Analysis by Machines

I think machines can read emotions, but definitely not as well as human beings.

"Sentiment" is a very ambiguous concept. I took CSCI 203 last year, and our final project was to analyze the sentiment of tweets. What we did was build a dictionary mapping words to floating-point numbers representing their relative sentiment: positive values meant happiness and negative values meant sadness. The program calculated each tweet's sentiment by adding up the sentiment value of every word, and we plotted the results across the nation based on the tweets' locations. The results of our design turned out to be roughly accurate and very interesting, and we discovered both expected and unexpected things. On reflection, however, there were many problems with this method of calculating sentiment. As Ramsay asks in his book, "Who decides what sentimentality is?" I am sure there are many versions of sentiment dictionaries online, but I do not think there is one accurate dictionary that could serve as a reference for all sentiment analysis. "Sentimentality" is an extremely subjective concept that varies from person to person, and pinning it down would require a great deal of data analysis and data mining. Even if we did manage to create a very accurate dictionary, machines still could not analyze the sentiment of a whole sentence or paragraph as humans do. People use metaphor, sarcasm, comparison, and so on to convey their sentiments implicitly, so it is really difficult for machines to analyze sentiment. Perhaps machines trained on enormous amounts of data through machine learning will someday read emotions more accurately.
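A minimal sketch of the dictionary approach we used (the lexicon values here are invented placeholders, not our actual CSCI 203 dictionary):

```python
# Each word carries a float "sentiment"; a tweet's score is the sum.
lexicon = {"happy": 2.0, "love": 1.5, "sad": -2.0, "terrible": -1.8}

def tweet_score(text: str) -> float:
    # Unknown words contribute 0.0 to the total.
    return sum(lexicon.get(word, 0.0) for word in text.lower().split())

print(tweet_score("So happy today, love it"))   # 3.5
print(tweet_score("sad and terrible weather"))  # -3.8
```

Even this sketch shows the fragility Ramsay points to: punctuation sticks to tokens, and a phrase like "not happy" scores as positive because negation is invisible to a word-by-word sum.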

Even we humans can disagree about concepts. From the topic-modeling exercise in class, where we colored the words related to "war," "government," or both, we can conclude that even we ourselves cannot reach a consensus about what we see or read. Ramsay states that "'meaning' is itself a shifting, culturally located concept incapable of precise definition or stable articulation." People with different experiences and different cultural and academic backgrounds will understand and interpret the same texts differently.

The sentiment analysis of machines means nothing if humans do not intervene. Ramsay concludes in the last chapter of his book that "we will have understood computer-based criticism to be what it has always been: human-based criticism with computers." For a researcher analyzing the sentiment of a corpus, the results mean nothing if he or she does not really know that corpus: first, it is impossible to judge whether the results are correct; second, he or she cannot fully interpret the results or explain interesting discoveries or details within them.

From the tools that I used, such as the topic-modeling tool Mallet and the sentiment-analysis tool Alchemy, I also discovered the downsides of machine reading. I first put my whole scripts corpus into Mallet and did not get any useful results (see the figure below): it displays a lot of character names and stop words from the screenplays as its list of topics.

[Figure: Mallet topic keys for the screenplay corpus]

I also put in my corpus of novels, and similar things happened.

[Figure: Mallet topic keys for the novels corpus]

But when I put in an individual file, such as the Alan Turing biography, it did show some useful results.

[Figure: Mallet topic keys for Alan Turing: The Enigma]

It displayed many of the book's keywords and provided something like its main ideas. From my experiment, I can roughly conclude that Mallet is very useful for an individual text file but not for a mixed, complicated corpus of large size.
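Part of the problem is preprocessing: a topic model echoes whatever dominates the token stream, so character names and function words have to be filtered out before training. As a rough stand-in for the Mallet workflow (using gensim's LDA in Python rather than Mallet itself; the folder and stop-word file names are placeholders):

```python
import glob

from gensim import corpora, models

# Filter stop words and non-alphabetic tokens before training; this
# is what keeps the topic keys from degenerating into character names.
stop = set(open("stopwords.txt", encoding="utf-8").read().split())
docs = []
for path in glob.glob("corpus/*.txt"):
    words = open(path, encoding="utf-8").read().lower().split()
    docs.append([w for w in words if w.isalpha() and w not in stop])

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(bow, num_topics=10, id2word=dictionary, passes=10)
for topic in lda.print_topics(num_words=8):
    print(topic)
```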

As for the sentiment-analysis function of Jigsaw, I tested it in the past, and its output corresponded to my understanding of the files.

I have seen both movies (Silver Linings Playbook and Atonement), and I agree with the sentiment results I got, since Atonement is indeed the sadder, more tragic film, while Silver Linings Playbook ends happily. It is also interesting that there is no negative sentiment in the screenplays.

For Alchemy, I found that it is a very professional and useful tool for sentiment analysis. Its entity analysis contains more advanced and accurate subtypes and linked data than Jigsaw's, since Alchemy is a web-based tool that can draw on up-to-date information.

A big advantage of Alchemy is that it does not impose a low size limit on uploaded text, so I could easily upload one novel from my corpus. For the test, I uploaded Alan Turing: The Enigma to the API to see what results it would give. Before uploading, I first did my own sentiment analysis of the book: I think the main figure, Alan Turing, carries mixed sentiment, since he successfully broke the Enigma but could not let the Germans learn that their messages were being read if the Allies were to achieve ultimate victory. After I loaded the text, Alchemy did mark Alan Turing as a mixed-sentiment entity, which matches my understanding. However, it counted "ALAN TURING," "Alan Mathison Turing," and "Alan Turing" as three separate entities, which I think is a shortcoming of the Alchemy API; as a web-based tool, it should be able to use online resources to do a better job of unifying entities. Other than that, I feel its sentiment analysis is very useful because it provides "mixed" and "neutral" values, which are more informative than only "positive" and "negative." It gives us more insight into the sentiment of the loaded file, not only for the entire document but also for its individual entities. Jigsaw only provides sentiment analysis on the whole document; Alchemy, by contrast, provides sentiment analysis on each identified entity. In this way, we can easily see how a particular person feels about, or views, a particular thing in the text.
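The duplicate-entity problem can at least be patched downstream by normalizing mentions before aggregating sentiment. A minimal sketch (the alias table is an invented example, not part of Alchemy):

```python
# Map surface variants of a name onto one canonical entity.
aliases = {
    "alan turing": "Alan Turing",
    "alan mathison turing": "Alan Turing",
}

def canonical(mention: str) -> str:
    key = mention.strip().lower()
    return aliases.get(key, mention.strip())

for mention in ["ALAN TURING", "Alan Mathison Turing", "Alan Turing"]:
    print(canonical(mention))  # all three print "Alan Turing"
```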

[Figures: Alchemy entity, sentiment, taxonomy, and emotion results for Alan Turing: The Enigma]

It also provides useful taxonomy and emotion sections, which are very accurate and interesting to look at.

Another area where Alchemy could improve is allowing multiple files to be uploaded; right now it only lets users enter one text at a time.

After trying the tool, I grew curious, searched for Alchemy online, and found out that it is an API that uses machine learning (deep learning) to do natural language processing. So it is no surprise that it provides more accurate analysis than Jigsaw, since machine learning means the API has been trained, under human supervision, on real data. It should therefore be able to provide sentiment analysis that adheres much more closely to real human judgment.

In order to test whether Alchemy is relatively accurate, I also loaded the text file of the screenplay of The Imitation Game (the movie adaptation of Alan Turing: The Enigma) to check whether there are correlations between the two. Below are the results I got:

[Figures: Alchemy entity, sentiment, and emotion results for The Imitation Game screenplay]

It also identifies Alan Turing as a mixed-sentiment entity. But I found really interesting results when I looked at Document Sentiment and Document Emotions: Alchemy gives the screenplay a negative sentiment score, compared to the positive score of the original, and gives the screenplay an anger score of 0.48445, compared to 0.091749 for the book. I will look into this interesting result to investigate whether screenplays usually display relatively extreme or exaggerated sentiment compared to the original novels.

Text analysis is a fairly new field, and I believe machines will someday read emotions as well as humans do.

References:
Pierazzo, Elena. "A Rationale of Digital Documentary Editions." Literary and Linguistic Computing 26.4 (2011): 463-477.
Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Urbana: U of Illinois P, 2011. Print.