Categories
"machine reading"

A Reflection on Machine Reading and Subjectivity

After a long period of pensive contemplation, I have emerged from my thoughts with a newfound perspective not only on the digital markup of texts, but on digital text reading as a discipline. The prompt for this blog asked, “How has the process of stylometry and marking up affected your understanding of the corpus?”, and I would like to interpret this as my understanding of “the” corpus, not “my” corpus.

Regarding this, Pierazzo wrote that “what we choose to represent and what we do not depends either on the particular vision that we have of a particular manuscript”. This raises a brilliant point about all the work we are doing: everything we do in this field is highly subjective. We choose the parameters, we choose the document, we choose the outcome. Much like statisticians, we set out to find evidence that validates answers to questions. We propose research questions to give our work motivation and a frame, but the honest truth is that in order to propose a research question, one must already know something about the works in question, and therefore must have biases. I would argue that the vast majority of the time a research question is proposed, the researcher already has an answer in mind, and this can shade their process.

Pierazzo also wrote about documentary digital editions of texts in her article, stating, for example, that they function “as the recording of as many features of the original document as are considered meaningful by the editors, displayed in all the ways the editors consider useful for the readers”. This is blatantly subjective, but not to a fault. It is simply a fact that the digital model of a text does not exist to escape subjectivity; it exists to simplify analysis: “a model must be simpler than the object it models, otherwise it will be unusable for any practical purpose”.

This is not meant as a verdict that digital humanists will never escape the binds of subjectivity. Instead, and I think Pierazzo would agree, it is meant as a simple reminder to keep in mind both when reading analyses and when analyzing texts ourselves. Each researcher's individuality is what allows us to take on so many perspectives and attempt to understand them.

Keeping all this in mind, how have these processes affected my understanding of the corpus? Primarily, I have learned that everything, including the corpus, is subject to biases. To illustrate this, I produced a dendrogram of a particular section of Lord of the Rings. The section was divided into sub-sections in two different ways: once by the original creator of this digital text, and once by me, after I combined all of those sub-sections and had Lexos cut the text into segments using its own system. These were the results:

[Screenshot: dendrogram of the Lord of the Rings sub-sections]

The higher links are unimportant; focus instead on the fact that all of the sub-sections created by Lexos are clustered together, and all of the sub-sections that came pre-divided are clustered together. Keep in mind that these are all the same sections of text. This is strong evidence that the corpus itself is an extremely subjective part of the research, just as much as the outcomes.
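To see concretely why two cuts of the same text can cluster apart, here is a minimal sketch of the word-frequency-plus-cosine-distance pipeline that Lexos-style dendrograms are typically built on. The toy text and the cut points are invented, not taken from the actual Lord of the Rings files:

```python
from collections import Counter
import math

def freq_vector(text):
    """Relative word-frequency vector for one segment."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse frequency vectors."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v.values()))
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    return 1.0 - dot / (norm(a) * norm(b))

words = ("the old forest lay dark and deep while the river "
         "ran slow past the old willow and the dark water").split()

# The *same* text, segmented two different ways:
hand_cut = [" ".join(words[:10]), " ".join(words[10:])]  # the original creator's split
tool_cut = [" ".join(words[:14]), " ".join(words[14:])]  # an automatic, Lexos-style split

# A segment compared with itself is distance 0; the "same" opening
# segment under a different cut point is not.
print(cosine_distance(freq_vector(hand_cut[0]), freq_vector(hand_cut[0])))
print(cosine_distance(freq_vector(hand_cut[0]), freq_vector(tool_cut[0])))
```

Because moving the cut point changes every segment's frequency vector, segments produced by one segmentation scheme end up closer to each other than to their counterparts from the other scheme, which is exactly the pattern in the dendrogram above.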

Next, I went to put this to work myself. I worked within the Tom Bombadil sections of Lord of the Rings and marked up the nature-related terms in his speech.

[Screenshot: nature-related terms marked up in Tom Bombadil's speech]

I then wanted to compare the density of these references to the density in his songs, but this is where I hit my big realization about digital markup in Oxygen: what is the endgame? How would I complete my analysis? I think Oxygen needs to embed a way to analyze your marked-up work within the program itself.
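One workaround, at least for a quick answer, is to take the XML saved from Oxygen and count the markup outside the editor with Python's standard ElementTree. This is only a sketch: the `<nature>` tag, the `<speech>`/`<song>` structure, and the snippet itself are hypothetical stand-ins for my actual markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: nature-related terms wrapped in <nature> tags
xml = """<chapter>
  <speech>Old Tom walked by the <nature>river</nature>
    under the eaves of the <nature>forest</nature>.</speech>
  <song>Hey dol! merry dol! by <nature>water</nature>, wood and hill!</song>
</chapter>"""

root = ET.fromstring(xml)

def nature_density(elem):
    """Nature-term references per word of text inside this element."""
    refs = len(elem.findall(".//nature"))
    words = len("".join(elem.itertext()).split())
    return refs / words

# Compare reference density in speech versus song
for part in root:
    print(part.tag, round(nature_density(part), 3))
```

A per-word density like this is what would let the speech/song comparison actually be completed once the markup exists.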

Coming full circle, I think my work with Oxygen and Lexos has taught me something important that Professor Jakacki stressed during her presentation. Because one can mold the corpus in any way imaginable, there is no end to the possible analyses one could run. It is important to keep the research question in perspective and ask: when is enough enough?


Machine Reading

Throughout the past two weeks our class has focused heavily on exploring texts on a macro and a micro level. With the use of different digital platforms, we have been able to explore our corpora in a more complex and sophisticated manner than ever before. Part of the reason this exploration was so difficult to interpret is that these platforms are very statistical and numerical rather than user-friendly like Voyant, for example. So the first step in really understanding what we were looking at was adapting ourselves to the visual complexity of the output.

Throughout the entire process, though, it was important to remember that our research was not computer-driven; rather, the computer was the tool that drove us to our answers. Pierazzo says, “The challenge is therefore to select those limits that allow a model which is adequate to the scholarly purpose for which it has been created” (Pierazzo). The scholarly purpose is for us to build a bigger picture of the text and draw inferences, but it is our job to decide how we use these digital platforms to do so. By using them, we are able to look at our texts either zoomed in or zoomed out, in order to ultimately support our interpretations.

When I first looked at the dendrograms to focus on the similarities and differences between the political speeches I was using, I was frazzled, to say the least. So I started off by inputting only Obama's speeches and seeing what I could come up with. A score of 1 indicates tight, distinct clusters, while values closer to 0 indicate overlapping clusters. This score shows how similar each speech is to the set of speeches as a whole. As we can see below, most values are much closer to 0, indicating that the speeches may have overlapping and similar word use.

[Screenshot: dendrogram of Obama's speeches]
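For what it's worth, a 0-to-1 cluster-quality score of this kind behaves like the standard silhouette coefficient, which can be computed by hand. Here is a sketch on invented one-dimensional data, under the assumption that the score reported with the dendrogram really is a silhouette: values near 1 mean tight, well-separated clusters, while values near or below 0 mean the clusters overlap.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points with cluster labels.
    Assumes every cluster has at least two members."""
    def dist(a, b):
        return abs(a - b)
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of p's own cluster
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        a = sum(same) / len(same)
        # b: mean distance to the nearest *other* cluster
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

tight = silhouette([1.0, 1.1, 9.0, 9.1], [0, 0, 1, 1])    # well separated
overlap = silhouette([1.0, 2.0, 1.5, 2.5], [0, 0, 1, 1])  # interleaved
print(round(tight, 2), round(overlap, 2))
```

The well-separated example scores close to 1, and the interleaved one scores at or below 0, which matches the reading of the speech dendrograms above.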

I then went on to look at Hillary's speeches in the dendrogram. Her values tended to be around the .05 mark, but were overall higher than Obama's. This suggests to me that Hillary's speeches vary more in word use and stylometry.

[Screenshot: dendrogram of Hillary's speeches]

When I put in all of Hillary's and Obama's speeches together, I got even more complex results. The different colors and high values suggest to me that their speeches are quite different and don't have much in common. My results were not especially interesting because they are predictable: it is only natural that Hillary's speeches would differ from Obama's, because they are different people who speak differently. Still, the fact that Obama's speeches were seemingly more similar to one another could reflect the way he imposes and reiterates his ideas to the American people, as opposed to Hillary and the sharper differences among her speeches.

[Screenshot: dendrogram of Obama's and Hillary's speeches combined]

Continuing our exploration of text analysis, we moved on to Oxygen, which was an even more complicated notion to me than dendrograms. With Oxygen we look at each word closely and actually tell the computer how to interpret the text. This is important because it further shows that computer programs are only a tool to deepen our understanding rather than the entire purpose of our corpus. For example, I labeled below each time “Ophelia” was referred to by her name or as “her,” “she,” etc., to understand how people view her.

[Screenshot: marked-up references to Ophelia in Oxygen]
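As with the nature terms above, the actual counting can happen outside Oxygen once the markup is saved. A sketch with ElementTree; the `<ref type="…">` convention and the snippet are hypothetical, not my actual tagging scheme:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each reference to Ophelia tagged with a type attribute
xml = """<scene>
  <l>How now, <ref type="name">Ophelia</ref>! You need not tell us
     what Lord Hamlet said; we heard it all.</l>
  <l><ref type="pronoun">She</ref> speaks much of
     <ref type="pronoun">her</ref> father.</l>
</scene>"""

root = ET.fromstring(xml)
counts = {}
for ref in root.iter("ref"):
    kind = ref.get("type")
    counts[kind] = counts.get(kind, 0) + 1

print(counts)  # {'name': 1, 'pronoun': 2}
```

Tallying name versus pronoun references like this is one way to turn the markup into the "how do people view her" comparison the tagging was meant to support.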

Pierazzo says, “It is the argument of this article that editions as we know them from print culture are substantially different from the ones we find in a digital medium” (Pierazzo). It was therefore essential that we use these editions to extract greater meaning and form a deeper understanding of this text as well as our own corpora. Pierazzo also states that it is difficult to choose “which features of the primary source are we to reproduce in order to be sure that we are following ‘best practice’” (Pierazzo). This, too, shows how subjective a practice this is, because the computer is a tool that we are using rather than something that simply gives us all the answers. Moreover, in using all these digital platforms, I was able to see their flaws as well as all that they have to offer.