text encoding – Introduction to Text Analysis:

Throughout this semester, we have learned quite a few tools that help us analyze text through “distant reading.” Recently, we learned a way to let machines read our corpus and, using algorithms, characterize it on a macro level. Based on these measurements, the tools can show the stylometry of our corpus. Stylometry allows me to consider, at a comparative and macro level, how different cultures and origins are related in terms of cooking.

Lexos

In order to analyze my corpus on a macro level, I worked with a platform called Lexos. Lexos is a tool for cleaning, visualizing, and comparing the texts in a corpus. It has several features that enable me to establish relationships between cultures. When I used Lexos, I first scrubbed my corpus using a list of stopwords provided by Jigsaw (I also expanded the stopword list based on my corpus). After cleaning up my corpus, I tried multiCloud to get a nice visualization of my corpus, as in Voyant:

Screen Shot 2016-04-18 at 11.18.26 PM

These word clouds show nothing more than the key terms within each cookbook. However, I was looking for something more specific that could show whether there is some kind of relationship between different cookbooks. So I went on to use another tool to create a dendrogram, which Dr. James O’Sullivan talked about last Monday. Here is the result I obtained:

Screen Shot 2016-04-18 at 1.26.53 PM

From the dendrogram, I found that the Chinese, Egyptian, Thai, and Korean cookbooks share a lot of similarities. This result is not surprising for the Chinese, Thai, and Korean cookbooks, since these cultures are geographically close to each other, although Egypt is the exception here. It is quite interesting to find that the Italian cookbook and the Australian cookbook are similar. I was expecting the British cookbook to share more similarities with the Australian cookbook, since Australia used to be a colony of Britain. With this question in mind, I used another Lexos tool to compare the documents in my corpus. Here is the result I obtained:

Screen Shot 2016-04-19 at 12.27.55 AM

The figure above shows the similarity rankings among all the cookbooks in my corpus. The higher the value, the closer the comparison document’s vector is to that document’s vector as opposed to the other documents’ vectors. This result really shocked me (without exaggerating). I was really expecting to see many similarities between Britain and Australia regarding cooking. It seems that the colonial effects are gradually fading away as time passes. However, the Australian cookbook is very similar to the Indian cookbook, and Australia and India both used to be British colonies. Therefore, these results suggest that the colonial effects perhaps still exist in both cultures; Britain may simply be the country that has since evolved or improved its recipes.
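To make the idea of comparing document vectors concrete, here is a minimal Python sketch. It assumes the score is a cosine similarity over raw word counts, which may differ from what Lexos computes internally. The three tiny documents and the function names cosine and rank are invented for illustration and are not my actual data.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented toy documents, NOT the real cookbooks.
docs = {
    "doc_a": "lamb butter flour curry rice",
    "doc_b": "curry rice lamb cumin ginger",
    "doc_c": "butter flour beef pudding cream",
}
vectors = {name: Counter(text.split()) for name, text in docs.items()}

def rank(target):
    """Rank every other document by similarity to the target, highest first."""
    scores = [(other, cosine(vectors[target], vectors[other]))
              for other in vectors if other != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for name, score in rank("doc_a"):
    print(name, round(score, 3))
```

A higher score means the two documents share more of their vocabulary in similar proportions, which is how I read the ranking table.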

TEI and XML

Finally, I also learned a way of analyzing my corpus on a micro level. I can use XML to mark up important entities within the text of my corpus. I actually have some experience with marking up entities within a large-scale corpus. I used Python scripts to extract personal information (name, age, occupation, father’s name, mother’s name, children’s names, and grandfather’s name) from thousands of Chinese biographies in XML format. After locating the text I was interested in, I wrapped entity tags around that part of the text. Here is a piece of text from these Chinese biographies:

Screen Shot 2016-04-19 at 12.58.21 AM

As you can see, some parts of the text are wrapped in a <grandfather> tag; this means that this specific part of the text describes the person’s grandfather. Because of these tags, it becomes much easier to let the machine pull out small fractions of text for me to analyze instead of reading a humongous text file. This past experience makes me excited about marking up my cookbook corpus. I believe marking up cookbooks will take my textual analysis to a whole new level. I have marked up all the proteins and vegetables in my corpus. Here is one example:

Screen Shot 2016-04-19 at 1.21.46 AM
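Once ingredients are tagged, pulling them back out is straightforward. The following is only a sketch using Python’s built-in xml.etree.ElementTree module. The recipe text and the tag names <recipe>, <step>, <protein>, and <vegetable> are assumptions made up for this example and may not match the tags in my actual files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample markup; the tag names are hypothetical.
recipe_xml = """<recipe>
  <title>Stir-fried beef with broccoli</title>
  <step>Slice the <protein>beef</protein> thinly.</step>
  <step>Blanch the <vegetable>broccoli</vegetable> and <vegetable>carrot</vegetable>.</step>
  <step>Fry the <protein>beef</protein>, then add the vegetables.</step>
</recipe>"""

def tagged_terms(xml_text, tags):
    """Collect the text inside each requested tag, in document order."""
    root = ET.fromstring(xml_text)
    found = defaultdict(list)
    for tag in tags:
        for element in root.iter(tag):
            found[tag].append(element.text.strip().lower())
    return dict(found)

terms = tagged_terms(recipe_xml, ["protein", "vegetable"])
print(terms)
```

The same pattern is what made the <grandfather> tags in the biographies useful: the script only has to ask for one tag instead of reading the whole file.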

For now, the generated HTML file does not display the different tags in any special way. I will tweak the CSS to make the different kinds of ingredients stand out. I’m quite confident about the results I will get using text markup.

During the past two weeks, in order to investigate text, we’ve tried out different platforms and techniques through which machines read texts. On a macro level, we’ve learned about stylometry and produced dendrograms based on delta and zeta values in our corpus. On a micro level, we’ve begun to learn XML-compliant TEI markup.

From the lecture last Monday, we learned about stylometry, which is used to attribute authorship to anonymous or disputed documents. Stylometry also enables us to think, at a comparative and macro level, about how an author expresses himself or herself. I remember that during the presentation Dr. James O’Sullivan showed an example of identifying one author’s work among other authors’ works by using dendrograms. Since Lexos provides cluster analysis based on delta (computed from the most frequent words) and zeta (computed from distinctive words), it can identify how one file differs from another and also show the relationships across our corpus.
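To see what a delta value measures, here is a minimal Python sketch of Burrows’s Delta as it is usually described: take the most frequent words in the corpus, turn each word’s relative frequency into a z-score across the documents, and average the absolute z-score differences between two documents. This is my own simplified version, not the Lexos code. The toy texts, the function name delta_function, and the use of the population standard deviation are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev

def delta_function(docs, top_n=3):
    """Build a delta(a, b) function over docs, a dict of name -> token list."""
    totals = Counter(tok for toks in docs.values() for tok in toks)
    vocab = [word for word, _ in totals.most_common(top_n)]
    # Relative frequency of each frequent word in each document.
    freqs = {name: [toks.count(w) / len(toks) for w in vocab]
             for name, toks in docs.items()}
    # Z-score each word's frequency across all documents.
    zscores = {name: [] for name in docs}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in docs]
        mu, sigma = mean(column), pstdev(column)
        for name in docs:
            z = (freqs[name][i] - mu) / sigma if sigma else 0.0
            zscores[name].append(z)

    def delta(a, b):
        """Mean absolute difference of z-scores; smaller means more alike."""
        return mean([abs(x - y) for x, y in zip(zscores[a], zscores[b])])

    return delta

# Invented toy texts, only to exercise the function.
docs = {
    "text_a": "the cat and the dog and the bird".split(),
    "text_b": "the fox and the hen and the owl".split(),
    "text_c": "of of of mice and men of the".split(),
}
delta = delta_function(docs)
print(delta("text_a", "text_b"), delta("text_a", "text_c"))
```

Here text_a and text_b use the frequent words in exactly the same proportions, so their delta is zero even though their content words differ, while text_c stands apart. That is why delta is said to capture style rather than topic.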

First, I uploaded my screenplay corpus. For cleaning, I uploaded my own version of the stopword list, based on the one we had from Jigsaw (in the resources folder). The resulting dendrogram is shown in Figure 1.
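The cleaning step can be sketched in a few lines of Python. This is only an illustration of what removing stopwords means, not what Lexos does internally. The short stopword set and the sample sentence are made up; my real list is the Jigsaw list plus my own additions.

```python
import re

# Hypothetical, very short stopword list for illustration only.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "into", "she", "he"}

def scrub(text, stopwords=STOPWORDS):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]

sample = "She opens the door and walks into the lab."
print(scrub(sample))
```

Only the content words survive, so the later frequency counts are not dominated by words like “the” and “and.”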

Screen Shot 2016-04-17 at 9.49.10 PM

Figure 1: Dendrogram created for screenplay corpus

This graph seems interesting, but I still could not work out why the dendrogram groups these screenplays the way it does.

In order to understand what the dendrogram means, I thought I could upload my novels corpus to see whether the two dendrograms are similar. However, I got another graph that I do not understand (see Figure 2 below).

Screen Shot 2016-04-18 at 12.30.17 AM

Figure 2: Dendrogram created for novels corpus

I carefully checked the y-axis of these two dendrograms, and I did not find any two novels or screenplays that are relatively similar (located at relatively close positions on the y-axis) or that have similar delta or zeta values. My only finding is that Figure 1 (the screenplay dendrogram) shows only two colors, and the relationships between the screenplays tend to be very simple. In Figure 2 (the novels dendrogram), however, four colors are used to display the relationships between the novels, and the files colored in red are comparatively different from the rest. I think the two graphs differ in these ways because screenplays have a more fixed and consistent format than novels. (Screenplays have their own writing conventions.)
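Part of my confusion was about how a dendrogram is built in the first place. As far as I understand, the tree comes from hierarchical clustering: the two closest items are merged first, then the next closest, and the height of each join is the distance at which the merge happened. The Python sketch below assumes average linkage, which may not be the setting I used in Lexos. The four item names and their positions on a number line are invented stand-ins for real document distances.

```python
from itertools import combinations

def average_linkage(names, distance):
    """Repeatedly merge the two closest clusters; return (a, b, height) merges."""
    clusters = [(name,) for name in names]
    merges = []

    def gap(pair):
        # Average distance between all members of the two clusters.
        a, b = pair
        return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((a, b, gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Invented positions standing in for stylistic distances between four texts.
positions = {"s1": 0.0, "s2": 1.0, "s3": 5.0, "s4": 7.0}
merges = average_linkage(list(positions),
                         lambda x, y: abs(positions[x] - positions[y]))
for a, b, height in merges:
    print(a, "+", b, "joined at height", height)
```

Read this way, two texts are similar when the branch joining them is low, not when their labels sit next to each other on the axis.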

I believe there are two main reasons why I did not get interesting results. One is that these screenplays and novels were written by different authors, so their writing styles and diction are extremely diverse. The second is that I do not have a good macro-level understanding of these files: knowing only the plot of each movie is far from enough.

While learning text encoding, we first got to know about transcription. According to Pierazzo, the main purpose of transcription is to “reproduce as many of the characteristics of the transcribed document as allowed by the characters used in modern print.” But there are many features one might want to consider from the infinite set of “facts” that can be found on the written page. Pierazzo also states that one of the downsides of the traditional transcription method is that editors must make judgements about what to include, and to reach a consensus across academia they need rules and guidelines that define best practice. However, it is difficult for scholars to agree on common guidelines, and the transcription process itself involves subjective interpretation. That’s where the advantages of using a markup language come into play. Following the TEI guidelines, editors can easily keep a record of meta-information. We have two separate objects: one is the data model (the source) and the other is the publication (the output). As Pierazzo mentions, one of the reasons the TEI model is effective is that “it enables the encoding and transcription of several alternatives for the same segment,” so the source file can contain not only a diplomatic edition but also other editions. Another advantage of encoding with TEI is that “to all intents and purposes there is no limit to the information one can add to a text,” apart from the limits of the imagination. By utilizing XML, people can really move to the analytic level of the editing process.

As Pierazzo states, if the editor uses an XML (TEI)-based system, “the editor’s goal needs no longer be ‘to reproduce in print as many of the characteristics of the document as he can’ but rather to achieve the scholarly purpose of the edition,” a purpose which, by definition, varies. TEI markup enables scholars to perform different analyses based on their own research purposes. For example, a researcher interested in identifying different types of rhymes in poems could mark up the relevant rhyming words; a researcher curious about the diction and writing style of a particular novelist could mark up specific diction and syntax.

As a “researcher” myself, I intend to find the relationships between screenplays and their corresponding novels. Since a screenplay is a semi-structured textual document, XML can really help me mark up its different entities, such as characters, dialogue, scenes, and descriptive and background metadata.

I marked up a small part of the screenplay of The Theory of Everything, using the provided hamlet_scene.xml as a model.

Screen Shot 2016-04-18 at 1.59.35 AM

Screen Shot 2016-04-18 at 2.00.24 AM
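Once the speeches are tagged, a short script can already answer simple questions, such as who speaks most. The sketch below uses Python’s built-in xml.etree.ElementTree module on a tiny scene that follows the TEI drama elements <sp>, <speaker>, <stage>, and <p>. The scene, the characters, and the lines are invented and are not taken from the film, and I left out the TEI namespace and header to keep the example short.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented scene in simplified TEI-style markup (no namespace, no header).
scene_xml = """<div type="scene">
  <stage>Interior. A kitchen at night.</stage>
  <sp who="#alice"><speaker>ALICE</speaker><p>Did you finish the paper?</p></sp>
  <sp who="#bob"><speaker>BOB</speaker><p>Almost. One more page.</p></sp>
  <sp who="#alice"><speaker>ALICE</speaker><p>Then I will make tea.</p></sp>
</div>"""

scene = ET.fromstring(scene_xml)

# Number of speeches per character.
speeches = Counter(sp.findtext("speaker") for sp in scene.iter("sp"))

# Number of spoken words per character.
words = Counter()
for sp in scene.iter("sp"):
    words[sp.findtext("speaker")] += len(sp.findtext("p").split())

print(speeches)
print(words)
```

The same counts could be computed for the matching chapter of the novel, which is one way I could start comparing a screenplay with its source.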

My next step will be to look at how the screenwriter uses verbs, and I will adjust the CSS to make the patterns visible.