Corpus Creation

Preparing your Corpus

As you work on preparing your corpus, here are a few things to bear in mind.

a) What kinds of texts are you selecting?

  • Literary?
  • Philosophical?
  • Journalistic?

What was the original format of those texts?

  • Are they digitized printed books or articles?
  • Are they born-digital?
  • Are they transcriptions of the spoken word?

b) How many texts will give you a representative sample? According to the literature, 10 text samples from each register should give you good data. As you select those 10, bear in mind the following:

  • Are you including a range of texts that show the full range of variability?
  • If you are constructing a corpus that contains a range of texts, what is that range? For example, if you are sampling journalistic prose, are you choosing from “highbrow” sources (NYT; BBC) and also popular media (CNN; U.S. News & World Report; Huffington Post)?
  • If you are comparing political speeches, are you comparing the two ends of the political spectrum?
  • If philosophical, are you comparing schools of thought? Centuries?
  • If literary, are you comparing literary genres, periods, authors?
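One way to keep the selection in (b) even-handed is to draw the texts for each register at random from a longer list of candidates. The sketch below is only an illustration: the function name, the candidate file names and the fixed seed are all hypothetical, and it assumes you already have a list of candidate texts labelled by register.

```python
import random
from collections import defaultdict


def sample_per_register(candidates, per_register=10, seed=42):
    """Draw up to `per_register` titles at random from each register.

    `candidates` is a list of (register, title) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the same draw can be repeated
    by_register = defaultdict(list)
    for register, title in candidates:
        by_register[register].append(title)

    sample = {}
    for register, titles in sorted(by_register.items()):
        k = min(per_register, len(titles))  # a register may have fewer than 10
        sample[register] = sorted(rng.sample(titles, k))
    return sample


# Hypothetical candidate lists: 25 news articles and 8 novels.
candidates = [("journalistic", f"news_{i:02d}.txt") for i in range(25)]
candidates += [("literary", f"novel_{i:02d}.txt") for i in range(8)]

sample = sample_per_register(candidates)
for register, titles in sample.items():
    print(register, len(titles))
```

A random draw does not guarantee that the full range of variability is covered, so it is still worth checking the result against the questions above and adjusting by hand where a source type is missing.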

c) How are you selecting these texts? You should document your process and your sampling decisions.
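A simple way to document the process in (c) is to keep a small table with one row per text. The sketch below writes such a table as CSV; the column names and the example record are hypothetical and should be adapted to whatever decisions you actually need to record.

```python
import csv
import io

# Hypothetical columns; choose the ones that match your own sampling decisions.
FIELDS = ["title", "register", "source", "original_format", "reason_selected"]


def write_manifest(records, handle):
    """Write one CSV row per selected text to an open file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


records = [
    {
        "title": "example_article_01",  # placeholder, not a real text
        "register": "journalistic",
        "source": "hypothetical newspaper",
        "original_format": "born-digital",
        "reason_selected": "random draw from the candidate list",
    },
]

# An in-memory buffer is used here; pass open("manifest.csv", "w", newline="")
# instead to write a file on disk.
buffer = io.StringIO()
write_manifest(records, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```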

d) How long are your texts? Voyant can manage large document collections well. Jigsaw can manage large collections of smaller documents; according to Jigsaw’s developers, the best size for each individual document is approximately 10 pages.
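If some of your texts are much longer than the size suggested for Jigsaw, you could split them into smaller documents before loading them. The sketch below is a rough illustration only: the figure of 300 words per page is an assumption made here, not a number from Jigsaw's developers, and the function name is hypothetical.

```python
def split_into_chunks(text, pages_per_chunk=10, words_per_page=300):
    """Split a long text into pieces of roughly `pages_per_chunk` pages.

    The page length in words is an assumed average; adjust it for your texts.
    """
    words = text.split()
    size = pages_per_chunk * words_per_page
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# A stand-in for a long document: 7,000 repetitions of one word.
text = "word " * 7000
chunks = split_into_chunks(text)
print(len(chunks), [len(c.split()) for c in chunks])
```

This version cuts purely by word count, so a chunk may end in the middle of a sentence or paragraph. Splitting at chapter or section boundaries, where the text has them, is usually easier to interpret.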

e) What are you looking for?  Are you looking for dominant terms?  Are you looking for vocabulary density? Are you looking for stylistic patterns?  Repeated phrases? Are you looking for connections and collocations of terms?
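To make the questions in (e) concrete, the sketch below computes three of these measures for a toy sentence: the most frequent terms, vocabulary density (taken here as the number of distinct words divided by the total number of words) and repeated two-word phrases. The tokenizer is deliberately crude and the function names are hypothetical; tools such as Voyant may count words differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_density(tokens):
    """Distinct words divided by total words (a type-token ratio)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def repeated_phrases(tokens, n=2, min_count=2):
    """Return n-word phrases that occur at least `min_count` times."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


text = "The cat sat on the mat. The cat slept on the mat."
tokens = tokenize(text)

top_terms = Counter(tokens).most_common(3)
density = vocabulary_density(tokens)
phrases = repeated_phrases(tokens)

print(top_terms)
print(density)
print(phrases)
```

In this toy example the most frequent term is "the", which shows why function words are often filtered out with a stopword list before looking for dominant terms.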

Remember, Jigsaw is designed to look for connections between entities (people, places, organizations; you can also define your own set of entities, such as “cars”). When you are creating your corpus, think about how “entity rich” your documents might be.

Remember that this is a sequential and iterative process:

Initial formulation of research question --> Corpus design --> Compilation of corpus --> Empirical investigation --> repeat