Corpus Creation 2.0

As expected, the direction of my corpus has changed. Originally, my plan was to analyze the lyrics of songs that are considered either happy or sad. My goal was to see which factors in these songs lead them to be considered happy or sad. This would be interesting on its own, but I decided to go a step further and add the top 40 songs in the US, the United Kingdom, and Australia to my corpus. Now, once I gather the trends that make a song happy or sad, I can apply this knowledge to the top 40s of each country and see which trends are most common in each. Fortunately, Spotify has playlists with the top 40s for these countries. I picked these countries because they are all English-speaking, which gets rid of the translation issue that I would have had with most other countries. I will try to keep Hofstede’s Cultural Dimensions in mind as I analyze my results to see how they are intertwined with the type of lyrics that are used.


Using both AntConc and Voyant to analyze these texts has led to some interesting findings. So far, on AntConc, I have only used the word frequency tools on words that I thought would be significant. This is not a great way to analyze the texts, because my own bias decides which words I look for. Right now, the only significant finding I have from AntConc is that more female pronouns are used in “happy” songs than in “sad” songs. There are actually almost no female or male pronouns in “sad” songs. This could be a result of songwriters realizing that people can be sad about a lot of things and deciding to make these songs less about specific people and more internalized.
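For anyone who wants to reproduce this kind of pronoun comparison outside of AntConc, here is a minimal Python sketch. The pronoun lists and the function names (`tokenize`, `pronoun_rates`) are my own assumptions for illustration, not anything AntConc provides, and the rate is reported per 1,000 words so that corpora of different sizes can be compared.

```python
import re
from collections import Counter

# Hypothetical pronoun lists chosen for illustration; adjust as needed.
FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}

def tokenize(text):
    """Lowercase the text and split it into words, keeping contractions like "i'm"."""
    text = text.lower().replace("’", "'")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)

def pronoun_rates(text):
    """Return female and male pronoun counts per 1,000 words of the text."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on an empty text
    female = sum(counts[w] for w in FEMALE)
    male = sum(counts[w] for w in MALE)
    return {"female": 1000 * female / total, "male": 1000 * male / total}
```

Running `pronoun_rates` once on the combined “happy” lyrics and once on the combined “sad” lyrics would give two sets of rates that can be compared directly.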

I found Voyant’s collocation tool to be particularly appealing. This tool was able to show me the words that are most frequently repeated. The words repeated in the “sad” songs included “know”, “cry”, “love”, “used”, “mind”, and “I’m”. Most of these words tell us a little about what the songs are about. “Cry” is a pretty standard word that would be expected in a sad song. The word “used” is also frequently repeated, which could be telling us one of two things. The songs could be about someone or something feeling or being used, which could very well be an element of a sad song. Alternatively, the songs could be talking about the way things “used” to be, targeting the listener by triggering old memories that may cause them to consider the song “sad”. Words repeated in the “happy” songs include “la”, “love”, “time”, “I’m”, “say”, and “like”. The frequent repetition of “la” suggests that lyrics may be less important in these “happy” songs, which are more about melody and catchiness. I also noticed that “love” is repeated frequently in both groups of songs.

“Because love makes you happy and love makes you sad” – Professor Katie Faull


This is a screenshot of the “trends” tool on Voyant. It is showing the trends of the words “love”, “like”, “come”, “way”, and “la” in the Happy songs.

 

This is a screenshot of the “trends” tool on Voyant. It is showing the trends of the words “know”, “love”, “just”, “like”, “I’m” in the Sad songs.

 

Corpus Creation

I have been gathering a corpus of song lyrics. I intend to analyze the differences in words used, word frequency, word dependence, and anything else I can measure between songs that are supposed to provoke certain emotions. I decided to use Spotify as a resource for grouping songs. There is a section of Spotify that organizes playlists based on mood. The initial playlists that I chose were called “Don’t Worry Be Happy” and “Down in the Dumps.” One thing that I noticed when trying to find these playlists was that there were many more “happy” playlists than “sad” playlists, and of the few playlists that were “sad” themed, none actually had the word “sad” in its title. This made me think about social standards and whether Spotify was encouraging being happy and discouraging being sad. This is something that I will continue to keep in mind as I gather my results.

When choosing songs, I decided to leave out some from the “Don’t Worry Be Happy” playlist. The songs that I did not include are: Don’t Worry Be Happy, Happy, Oh Happy Day, Shiny Happy People, If It Makes You Happy, and Happy Go Lucky Me. I thought that these songs, having the word “happy” in the title, would skew the results by making the word “happy” appear too often. All of the songs from the “Down in the Dumps” playlist were used because none of their titles or choruses overused certain words.

I googled every single song on each playlist for its lyrics and put them into a document. After that, I cleaned them. Many times, there is a heading before the chorus that denotes it as the chorus. The same goes for verses, and if there is a different singer, that is noted too. I deleted all of these headings. In most of the lyrics, repetitions were marked with numbers. For example, if a chorus is repeated twice, it would say [2x]. When this happened, I would delete the number and repeat the chorus manually by copying and pasting it. At first, I was unsure whether I should keep repetitions. I decided to keep them because this is obviously something that the artist thought was important, and it could be a good indicator of what creates the mood in the song. Because the lyrics are written records of words that are sung, there are often grunts or slang (e.g. “cause” instead of “because”). These slang words are spelled consistently across the lyric sheets, so I decided to leave them in unchanged to keep the lyrics as natural as possible.
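I did this cleaning by hand, but the same steps could be scripted. The sketch below is only an illustration under some assumptions: headings are on their own line in square brackets (like [Chorus]), a repeat marker like [2x] appears somewhere in the stanza it applies to, and stanzas are separated by blank lines. Real lyric sites format things in many different ways, so the patterns would need adjusting. The name `clean_lyrics` is made up for this example.

```python
import re

# Assumed formats: "[2x]" marks a repeated stanza, "[Chorus]" is a heading line.
REPEAT = re.compile(r"\[\s*(\d+)\s*x\s*\]", re.IGNORECASE)
HEADING = re.compile(r"^\s*\[[^\]]*\]\s*$")

def clean_lyrics(raw):
    """Drop bracketed headings and write out repeated stanzas in full."""
    cleaned = []
    # Stanzas are assumed to be separated by one or more blank lines.
    for stanza in re.split(r"\n\s*\n", raw.strip()):
        match = REPEAT.search(stanza)
        times = int(match.group(1)) if match else 1
        stanza = REPEAT.sub("", stanza)  # delete the "[2x]" marker itself
        lines = [line.strip() for line in stanza.splitlines()]
        lines = [line for line in lines if line and not HEADING.match(line)]
        if lines:
            # Repeat the stanza text as many times as the marker says.
            cleaned.extend(["\n".join(lines)] * times)
    return "\n\n".join(cleaned)
```

Keeping the repetitions written out, rather than dropping the marker, matches the decision above to treat repetition as something the artist thought was important.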

I am really interested in seeing the results of the textual analysis done on these lyrics. As of right now, I am not sure what results I will get. I am hoping that there is some sort of correlation between words and emotion that will be surprising. I will try to analyze these lyrics in as many different ways as possible to catch any and every trend that there is.

 

WHAT IS DISTANT READING?

Distant reading is a term created by Franco Moretti that means “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data.” In other words, in order to learn more about literature, one must read fewer books. Moretti argues that close reading (as opposed to distant) cannot uncover the scope and nature of literature. Distant reading and text analysis tools have been referred to as “magnifying glasses” by William Gass. He acknowledges that taking a step back and reading from a distance can help him see things that he otherwise would not have seen. Carolynn Van Dyke also refers to using computers in text analysis as a magnifying glass of sorts, saying that having a computer version of a text can help to illuminate trends that are otherwise hidden in the text.


 

Distant reading is meant to support claims made about certain works or to help us understand trends that texts have in common. Hoover states, “Computer-assisted textual analysis is neither a panacea nor substitute for sound literary judgment, but its ability to refine, support and augment that judgement makes it an important analytic method for literary studies in the digital age.” This means that distant reading is not responsible for creating new ideas about texts, but it can be very helpful in making claims about texts credible by providing evidence.


 

Through distant reading, we can learn a lot about a text that would otherwise be lost. Syntax, word frequency, and other measurable features can easily go unnoticed by a reader. Distant reading allows one to analyze a text without the preconceived notions that one may have about it. We can learn a lot by reading from a distance because it defamiliarizes the text and allows us to analyze it from a different perspective.

Examples of texts being defamiliarized can be seen below using Google N Gram, Wordle, and Bookworm:


 

WORD FREQUENCY

Word frequency can be one of the biggest indicators of a text’s meaning. In his book Computation into Criticism, John Burrows writes, “From no other evidence than statistical analysis of the relative frequencies of the very common words, it is possible to differentiate sharply and appropriately among the idiolects of Jane Austen’s characters and even to trace the ways in which an idiolect can develop in the course of a novel.” Given this, the importance of word frequency in one of the most famous speeches in America, Martin Luther King Jr.’s “I Have a Dream,” is not surprising.

Wordle MLK

As you can see, “freedom” is the most used word.  Just by looking at this visual representation of his speech, one would be able to tell that King was advocating for freedom for black people.  His commonly used words are obvious in this representation, but it is not as easy to pick up on frequencies while performing a close reading.
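A word cloud like this is built on a simple word count. As a rough sketch of the idea (not how Wordle itself is implemented), the Python below counts words after dropping a small stop word list. The stop word list and the name `top_words` are assumptions made for this example; a real tool would use a much longer list.

```python
import re
from collections import Counter

# A tiny, hypothetical stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are",
              "that", "this", "it", "we", "i", "will", "be", "from"}

def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in the text, ignoring stop words."""
    text = text.lower().replace("’", "'")
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)
```

Calling `top_words` on the full text of a speech returns a list of (word, count) pairs, and the words at the top of that list are the ones a word cloud draws largest.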

 

 


 

NGRAM OF PATRIOT LEAGUE SCHOOLS

 

Patriot League NGram

The screen grab above is of an N Gram of all of the schools in the Patriot League. Not surprisingly, Navy was atop the list by a good amount. Unfortunately, Bucknell was second from the bottom. The Google N Gram shows how frequently the words you type in appear in its corpus, which is Google Books.

Patriot League

This graph is the same N Gram, but without Navy. We are able to see the trends of all of the other schools in the league that are not written about as frequently as Navy. We see that Lafayette had a spike in 2002. Another feature of N Gram is the ability to look at the works that include the word you entered. I learned that many of the works that mention Lafayette are about the Marquis de Lafayette and not about the school. This makes much more sense. Unfortunately, the computer is not able to detect such discrepancies, but we can still use this as a tool to see how rich one school’s history is compared to another’s.