
A Reflection on Machine Reading and Subjectivity

After a long period of pensive contemplation, I have emerged from my thoughts with a newfound perspective on not only the digital markup of texts, but also digital text reading as a discipline. The prompt for this blog asked, “How has the process of stylometry and marking up affected your understanding of the corpus?”, and I would like to interpret this as my understanding of “the” corpus, not “my” corpus.

Regarding this, Pierazzo wrote that “what we choose to represent and what we do not depends either on the particular vision that we have of a particular manuscript”. This raises a brilliant point about all the work we are doing: everything we do in this field is highly subjective. We choose the parameters, we choose the document, we choose the outcome. Much like the field of statistics, we set out to find proof to validate answers to questions. We propose research questions in order to give ourselves motivation and a frame to work in, but the honest truth is that in order to propose a research question, one must already know about the works in question, and therefore must have biases. I would argue that 99% of the time a research question is proposed, the researcher already has an answer in mind, and this can shade their process.

Pierazzo also wrote about documentary digital editions of texts in her article, for example, stating that they function “as the recording of as many features of the original document as are considered meaningful by the editors, displayed in all the ways the editors consider useful for the readers”. This is blatantly subjective, but not to a fault. It is simply a fact that the digital model of a text does not exist to escape subjectivity; it exists to simplify analysis: “a model must be simpler than the object it models, otherwise it will be unusable for any practical purpose”.

This is not meant as a verdict that digital humanists will never escape the binds of subjectivity; instead, and I think Pierazzo would agree, it is a simple reminder that we must keep in mind both when reading analyses and when analyzing ourselves. Each researcher’s individuality is what allows us to take on so many perspectives and attempt to understand them.

Keeping all this in mind, how have these processes affected my understanding of the corpus? Primarily: everything, including the corpus, is subject to biases. To illustrate this, I have a dendrogram of a particular section of Lord of the Rings. The section is divided into sub-sections in two ways: once by the original creator of the digital text, and once by me, after I combined all of those sub-sections and had Lexos cut the text into segments using its own system. These were the results:

[Screenshot: dendrogram comparing the two sets of sub-sections]

The higher links are unimportant; focus instead on the fact that all of the sub-sections created by Lexos cluster together, and all of the sub-sections as they were originally distributed cluster together. Keep in mind that all of these contain the same text. This is simply proof that the corpus is an extremely subjective part of the research, just as much as the outcomes are.
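To make the idea concrete, here is a minimal sketch of how cutting the same text two different ways produces two different sets of “documents” for any clustering step. This is not my actual workflow and not Lexos’s code; the file name, the blank-line section breaks, and the 1,000-word chunk size are all assumptions for illustration.

```python
# Minimal sketch: the same text cut two ways becomes two different sets of
# segments for clustering. File name, blank-line breaks, and chunk size are
# illustrative assumptions, not the actual corpus settings.
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

with open("lotr_section.txt", encoding="utf-8") as f:   # hypothetical file
    text = f.read()

# Cutting 1: keep the sub-sections supplied with the digital text
# (here assumed to be separated by blank lines).
original_cuts = [seg for seg in text.split("\n\n") if seg.strip()]

# Cutting 2: combine everything and let a cutter split it into fixed-size
# chunks, roughly what the Lexos Cut tool does with a tokens-per-segment setting.
tokens = tokenize(text)
recut_segments = [" ".join(tokens[i:i + 1000]) for i in range(0, len(tokens), 1000)]

# Same words overall, but different per-segment term counts, so a dendrogram
# built on these segments reflects the cutting choice as much as the text itself.
print(len(original_cuts), "original segments;", len(recut_segments), "re-cut segments")
print(Counter(tokenize(original_cuts[0])).most_common(5))
print(Counter(tokenize(recut_segments[0])).most_common(5))
```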

Next I went to put this to work myself. I worked within the Tom Bombadil sections of Lord of the Rings and marked up nature-related terms in his speech.

[Screenshot: the marked-up Tom Bombadil passage in oXygen]

I then wanted to compare the density of these references to the density in his songs, but this is when I reached my big realization about digital markup in oXygen: what is the end game? How would I complete my analysis? I think oXygen needs to embed a way to analyze your marked-up work into the program itself.
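One workaround, at least for now, is to take the XML out of oXygen and count the markup with a small script. Here is a minimal sketch, assuming a hypothetical scheme in which nature references are tagged as `<term type="nature">`; the file name and element names are made up for illustration, not taken from my actual markup.

```python
# Minimal sketch: compute the density of marked-up nature terms in a speech.
# The file name and the <term type="nature"> tagging scheme are hypothetical.
import re
import xml.etree.ElementTree as ET

tree = ET.parse("bombadil_speech.xml")          # hypothetical export from oXygen
root = tree.getroot()

nature_terms = root.findall(".//term[@type='nature']")
all_words = re.findall(r"\w+", " ".join(root.itertext()))

density = len(nature_terms) / len(all_words) if all_words else 0.0
print(f"{len(nature_terms)} nature terms in {len(all_words)} words ({density:.1%})")
```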

Coming full circle, I think my work with oXygen and Lexos has taught me something important that Professor Jakacki stressed during her presentation. Because one can mold the corpus in any way possible, there is no end to the possible analyses one could do. It is important to keep the research question in perspective and to ask: when is enough, enough?


Stylometry

Lexos

Recently, we have been analyzing the style of our corpora using Lexos and XML.  Right now, Lexos is one of my favorite tools that we have used.  Style is a very important part of song lyrics, and I am glad to have a better way to look at the style from song to song.  As I scrolled through the many tools that Lexos has, I was most interested in the Statistics tab.  It gives the number of distinct terms and the average term frequency, along with a few other stats.  The number of distinct terms can tell me how diverse the language used in American songs is compared to the language used in songs from the UK.  I took the average number of distinct terms from the top 40 in each country and found that the UK averages about 100 distinct terms per song while the USA averages 146.  When it comes to repetition in these songs, the UK’s average term frequency is 3.8 while the USA’s is 3.3.  This suggests that the songs in the UK’s top 40 use a smaller vocabulary and repeat their terms more often than those in the USA’s top 40.  I did expect one country to have simpler rhetoric and more repetition, but I had sort of expected the USA’s top 40 to have these qualities.
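For anyone who wants to check numbers like these by hand, here is a rough sketch of the two statistics, assuming that “average term frequency” means total tokens divided by distinct terms. The folder names are hypothetical, and this is not the Lexos source code.

```python
# Rough sketch of the two statistics: distinct terms per song and average term
# frequency (total tokens / distinct terms). Folder names are hypothetical.
import re
from collections import Counter
from pathlib import Path
from statistics import mean

def song_stats(path):
    tokens = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
    counts = Counter(tokens)
    return len(counts), (len(tokens) / len(counts) if counts else 0.0)

for corpus in ("usa_top40", "uk_top40"):        # hypothetical folders of lyric files
    stats = [song_stats(p) for p in Path(corpus).glob("*.txt")]
    print(corpus,
          "avg distinct terms:", round(mean(s[0] for s in stats), 1),
          "avg term frequency:", round(mean(s[1] for s in stats), 2))
```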

When it comes to comparing the UK and USA top 40s to the Happy and Sad corpora, there are some interesting results.  Using the same statistics, we can compare the countries to the moods.  The group of happy songs has an average of 108 distinct terms per song.  This is closer to the UK top 40, which could be an indication that the UK top 40 has more “happy” lyrics than the USA’s top 40.  The happy songs also have an average term frequency of 3.2, which is closer to the USA top 40’s average term frequency.  That would lead me to believe that the USA’s top 40 is more “happy”.  The same split appears with the “sad” group of songs: the UK top 40 matches it in terms of distinct terms, and the USA top 40 matches it in terms of average term frequency.

These statistics did not actually help me out too much, which is not what I was expecting.  I will continue to keep these findings in mind, however, because I think that they could become significant if I have some findings that support them.

Another interesting tool in Lexos is the dendrogram view.  Dendrograms are tree diagrams that show relationships between texts based on style.

[Screenshots: dendrograms of the USA top 40 (left) and the UK top 40 (right)]

The dendrogram on the left is for the USA top 40 and the one on the right is for the UK top 40.  It looks like the USA top 40 has more songs with a similar style than the UK top 40 does; we can see that more songs sit under the blue links in the UK top 40.
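For reference, a dendrogram like these can be rebuilt outside Lexos from plain-text lyric files. The sketch below uses SciPy with Euclidean distance and average linkage, which is one of the option combinations Lexos offers, though not necessarily the one behind the screenshots; the folder name is hypothetical.

```python
# Minimal sketch: build a dendrogram from a folder of lyric files.
# Folder name, distance metric, and linkage method are assumptions.
import re
from collections import Counter
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

paths = sorted(Path("usa_top40").glob("*.txt"))     # hypothetical folder
counts = [Counter(re.findall(r"[a-z']+", p.read_text(encoding="utf-8").lower()))
          for p in paths]

# Document-term matrix of relative frequencies over the shared vocabulary.
vocab = sorted(set().union(*counts))
matrix = np.array([[c[w] / sum(c.values()) for w in vocab] for c in counts])

# Cluster and draw the tree, with each file name as a leaf label.
links = linkage(pdist(matrix, metric="euclidean"), method="average")
dendrogram(links, labels=[p.stem for p in paths])
plt.tight_layout()
plt.show()
```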

XML oXygen

Using oXygen to ‘mark up’ texts has also been a useful way to look at text.  When I began marking up a poem by Henry Reed, it made me think of Jigsaw.  Jigsaw focuses on entities, and in doing markup we were basically picking out all of the entities.

[Screenshots: two views of the marked-up Henry Reed poem in oXygen]

These screenshots are of my markup of this poem.  It is a good way to simplify the poem and look at what it is talking about.  It is then easier to see what is being focused on and what metaphors are being used.

I think that I will continue to explore the tools in Lexos to analyze my texts, and I do not plan on using XML as much, although it is useful.  I think that once I can get Jigsaw working as well, I will have a much better time analyzing the sentiments of the songs.  This, along with style analysis, will go a long way in teaching me about my corpus.


Is the reader dead?!

During these past weeks we learned about a lot of new platforms that we can use for our own text analysis. Using different platforms, you discover new ways of reading and analyzing your corpora. As Dr. Diane Jakacki said in class about distant and close reading, “Imagine you are an eagle, and from the height of your flight you see down in the prairie a little mouse.”

Stylometry and Lexos

Stylometry helps you explore patterns in texts through stylistic analysis. On the macro level (when we are the eagle looking down from high above), Lexos and its dendrograms are very useful. First of all, Lexos helps you with cleaning your text and editing your stop-word list. For me it was a big surprise that the multicloud can read Russian.

[Screenshots: two Lexos multicloud word clouds of the Russian corpus]

Since my texts are small, the stop-word list and cleaning were not a problem for me; the main words like “I”, “monument”, “alive”, and “die” still appear, so I didn’t change anything there. Still, the results of this word cloud are different from the one I got in Voyant (the only other option that doesn’t ignore Russian), which means that different platforms also read the same texts “differently”.
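The reason a word cloud can “read” Russian at all is that counting words only needs Unicode-aware tokenization. Here is a tiny sketch; the stop-word list is illustrative, and the sample text is just the opening line of Pushkin’s Monument, not my cleaned corpus.

```python
# Tiny sketch: Unicode-aware tokenization makes Cyrillic counting work.
# The stop-word list and the sample line are illustrative only.
import re
from collections import Counter

text = "Я памятник себе воздвиг нерукотворный"    # opening line of Pushkin's Monument
stop_words = {"я", "себе", "и", "в", "не"}         # illustrative stop-word list

tokens = re.findall(r"\w+", text.lower())          # \w matches Cyrillic letters in Python 3
counts = Counter(t for t in tokens if t not in stop_words)
print(counts.most_common(10))
```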

Then I decided to see how dendrograms work.

[Screenshot: MFW dendrogram of the corpus]

I got the results that I expected: 1. it ignored Russian; 2. Lomonosov’s and Horace’s works were similar, as they should be, because Lomonosov translated Horace into Russian, including the poem Monument. However, there were unexpected results too: 1. the style of Vysotsky (who lived in the USSR in the 1970s-80s) was also close to Lomonosov (who lived in the Russian Empire in the eighteenth century), which I can hardly believe and which I am going to double-check. My guess for now is that the program probably caught the MFW in those two poems as “Muse” and other old Greek names and words; 2. Pushkin (who lived in the nineteenth century) was also close to Vysotsky’s style, and that is at least understandable, because both poems are about building a monument to themselves through their works, and both poets were writing against the existing system in the country. For me, it is amazing that the machine could see details that I had not paid attention to before. This reminded me of the Skype talk with Dr. O’Sullivan, when we were talking about certain authors’ styles and how we can compare them. Of course, most of the comparative analysis will still be done by the researcher, not the machine, using the advantages of close reading (when you can see the little mouse in the prairie).
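One way to double-check the Vysotsky-Lomonosov pairing by hand is the standard most-frequent-word comparison: measure how far apart the relative frequencies of the shared MFW are in the two poems. The sketch below mirrors that general idea, not the exact calculation Lexos performed, and the word counts passed in would have to come from the poems themselves.

```python
# Sketch of a hand-rolled MFW check: distance between the relative frequencies
# of the most frequent words of two poems. Not Lexos's exact computation.
from collections import Counter
import numpy as np

def mfw_distance(counts_a, counts_b, n_mfw=50):
    """Euclidean distance between relative MFW frequencies of two texts."""
    rel_a = {w: n / sum(counts_a.values()) for w, n in counts_a.items()}
    rel_b = {w: n / sum(counts_b.values()) for w, n in counts_b.items()}
    pooled = Counter(counts_a) + Counter(counts_b)      # MFW drawn from both poems
    mfw = [w for w, _ in pooled.most_common(n_mfw)]
    vec_a = np.array([rel_a.get(w, 0.0) for w in mfw])
    vec_b = np.array([rel_b.get(w, 0.0) for w in mfw])
    return float(np.linalg.norm(vec_a - vec_b))

# e.g. mfw_distance(Counter(vysotsky_tokens), Counter(lomonosov_tokens))
```

A smaller distance between two poems means their most frequent words are used in more similar proportions, which is the kind of signal a dendrogram clusters on.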

TEI – oXygen

You can get more sophisticated about how you want to look at your text in the browser – and why you want others to look at certain things, too. As Pierazzo mentions in her work, marking up the text is an “interactive act”; I would add that it could also be “a complete disaster” or “making love with your text”.

Using this tool, we need to know which parts of the text we want to focus on. Here we can come back to Pierazzo and her question: “Which features of the primary source are we to reproduce in order to be sure that we are following ‘best practice’?” Marking up the text is very convenient: you decide how many topics you want to focus on in your text, for instance, only names and places, and then you mark up all the particular words related to names and places. But here is the trick: working with poems, I can’t say that this is simply “black or white”. Marking up the text is your own personal way of seeing it. In my corpus I differentiate several topics: life, death, monument, glory, time, nation, sound, etc. Each author talks about these differently and brings up new metaphors for them, and I need to rely only on my “feeling” for the text and my background knowledge. After we worked in class on Keats’s poem, I decided to mark up my own texts, but I wasn’t very successful with that 🙁

My final thought about the work we have done these past two weeks was “Is the reader dead?” Just as in Barthes’s “The Death of the Author”, we are now experiencing the “death of the reader”. However, working closely with these tools, I realized that the reader is not dead; he is very active, he “interacts” with the machine and produces a new text that gives us many answers or leaves us with many questions.