Collocation
In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a sub-type of phraseme. An example of a phraseological collocation, as propounded by Michael Halliday,[1] is the expression strong tea. While the same meaning could be conveyed by the roughly equivalent *powerful tea, this expression is considered incorrect by English speakers. Conversely, the corresponding expression for computer, powerful computer, is preferred over *strong computer. Phraseological collocations should not be confused with idioms, whose meanings are conventional rather than derived from their parts; collocations are mostly compositional.
There are six main types of collocations: adjective+noun, noun+noun (such as collective nouns), verb+noun, adverb+adjective, verb+prepositional phrase (phrasal verbs), and verb+adverb.
Collocation extraction is the task of automatically identifying collocations in a corpus using techniques from computational linguistics. (source: Wikipedia)
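To make the idea concrete, here is a minimal Python sketch of collocation extraction that scores adjacent word pairs by pointwise mutual information (PMI), one common association measure for finding pairs that co-occur more often than chance; the toy corpus is invented for illustration.

    # Score adjacent word pairs by PMI so that pairs co-occurring
    # more often than chance would predict rank highest.
    import math
    from collections import Counter

    corpus = "strong tea and strong coffee but a powerful computer runs strong tea".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n = len(corpus)

    def pmi(pair):
        """Log of the observed bigram probability over the chance prediction."""
        w1, w2 = pair
        p_pair = bigrams[pair] / (n - 1)
        return math.log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

    # Rank candidate collocations, strongest association first.
    for pair in sorted(bigrams, key=pmi, reverse=True):
        print(pair, round(pmi(pair), 2))

On a real corpus one would also filter out low-frequency pairs, since PMI is unreliable for bigrams seen only once or twice.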
Concordance
A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era, only works of special importance, such as the Vedas,[1] the Bible, the Qur'an, or the works of Shakespeare or of classical Latin and Greek authors,[2] had concordances prepared for them. A concordance is more than an index; the additional material it carries, such as commentary, definitions, and topical cross-indexing, makes producing one a labor-intensive process, even when assisted by computers. (source: Wikipedia)
Concordance is also the name of a proprietary concordance program, a comprehensive application with a number of powerful features, including multiple language support, user-definable alphabets, user-definable contexts, multiple-pane viewing, the ability to statistically analyze selected texts, and the ability to export concordance results as text, HTML, or Web Concordance files.
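As an illustration of the underlying technique, here is a minimal keyword-in-context (KWIC) concordancer in Python; the sample sentence and window size are arbitrary choices, not features of the Concordance program.

    # List every occurrence of a keyword with a fixed window of
    # surrounding words, the classic KWIC display.
    def kwic(text, keyword, window=3):
        words = text.split()
        for i, w in enumerate(words):
            # Compare case-insensitively, ignoring trailing punctuation.
            if w.lower().strip('.,;:!?') == keyword.lower():
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                print(f"{left:>30}  [{words[i]}]  {right}")

    kwic("To be, or not to be, that is the question.", "be")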
Entity recognition
Lemmatisation
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.[1]
In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.
In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The combination of the base form with the part of speech is often called the lexeme of the word.
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications. (source: Wikipedia)
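A short sketch of the contrast, using NLTK's WordNetLemmatizer and PorterStemmer; this assumes NLTK is installed and the WordNet data has been downloaded.

    from nltk.stem import WordNetLemmatizer, PorterStemmer

    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()

    for form in ["walk", "walked", "walks", "walking"]:
        # The lemmatiser needs the part of speech ('v' = verb) to pick the
        # lemma; the stemmer just strips suffixes with no grammatical knowledge.
        print(form, lemmatizer.lemmatize(form, pos="v"), stemmer.stem(form))

    # 'meeting' shows why context matters: as a noun its lemma is 'meeting',
    # but the context-blind stemmer reduces it to 'meet' regardless.
    print(lemmatizer.lemmatize("meeting", pos="n"), stemmer.stem("meeting"))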
n-gram
In linguistics, a sequence of n items from a given sequence of text or speech. N-grams can be any combination of letters, phonemes, syllables, or words. A bigram sequence of the phrase “to be or not to be,” for instance, would break down as follows: to be, be or, or not, not to, to be. N-grams are regularly used in natural language processing and speech recognition.
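A minimal Python sketch of n-gram generation that reproduces the bigram breakdown above:

    # Slide a window of size n across the token list.
    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    words = "to be or not to be".split()
    print(ngrams(words, 2))
    # [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]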
OCR (optical character recognition)
The use of computer technologies to convert scanned images of typewritten, printed, or handwritten text into machine-readable text. This conversion allows physical texts to be turned into formats suitable for digital storage, search, and display. Adobe Acrobat Professional supports OCR processes, as does Microsoft Office for Windows (see Microsoft Office Document Imaging).
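For a scripted alternative to those desktop applications, the open-source Tesseract engine can be driven from Python via the pytesseract wrapper; this sketch assumes both are installed, and page.png is a hypothetical scanned image.

    from PIL import Image
    import pytesseract

    # Run OCR over a scanned page image and get back plain text.
    text = pytesseract.image_to_string(Image.open("page.png"))
    print(text)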
Tagging
TEI (Text Encoding Initiative)
A consortium that collectively develops and maintains standards for the representation of texts in digital form. In practice, the organization is chiefly concerned with producing and maintaining the TEI Guidelines for encoding texts in the humanities, social sciences, and linguistics. The TEI Guidelines, unlike many other formats for preserving text, are a primarily semantic system; textual units are encoded according to what they are rather than how they appear.
Text encoding
Broadly considered, the process of putting text in a special format for preservation or dissemination. In the digital humanities, textual encoding nearly always refers to the practice of transforming plain text content into XML. The TEI Guidelines are often followed when encoding textual materials in the arts, humanities, and social sciences. See TEI.
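As a toy illustration of semantic markup, here is a simplified TEI-style fragment (not a complete, valid TEI document) parsed with Python's standard library; <lg> and <l> are genuine TEI elements for a line group and a verse line.

    import xml.etree.ElementTree as ET

    fragment = """
    <lg type="stanza">
      <l n="1">Shall I compare thee to a summer's day?</l>
      <l n="2">Thou art more lovely and more temperate:</l>
    </lg>
    """

    root = ET.fromstring(fragment)
    # The elements name what the text is (a stanza, numbered verse lines),
    # not how it should look on the page.
    for line in root.findall("l"):
        print(line.get("n"), line.text)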
Text mining
The process of automatically deriving previously unknown information from written texts using computational techniques. Text-mining tools facilitate researchers’ discovery of patterns within large bodies of unstructured text.
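One toy example of such pattern discovery, surfacing the terms most distinctive of each document in a small corpus via a crude relative-frequency ratio; the two-document corpus is invented.

    from collections import Counter

    docs = {
        "a": "tea trade tea ships tea tax colonies",
        "b": "ships harbour trade cargo ships sailors",
    }

    counts = {name: Counter(text.split()) for name, text in docs.items()}
    total = sum(counts.values(), Counter())  # corpus-wide word counts

    for name, c in counts.items():
        # A word is "distinctive" if most of its corpus occurrences
        # fall inside this one document.
        distinctive = sorted(c, key=lambda w: c[w] / total[w], reverse=True)
        print(name, distinctive[:3])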
Tokenizing
Wordle
A simple text-visualization tool that produces a word cloud, in which the size of each word corresponds to its frequency of appearance in a given corpus. The font, layout, and color scheme of the resulting display can be altered by the user. Wordle is also accessible through TAPoR.
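The counts behind such a display can be computed in a few lines of Python; a word cloud like Wordle's sizes each word in proportion to a count like this (the sample sentence is illustrative).

    from collections import Counter

    text = "the quick brown fox jumps over the lazy dog the fox"
    freq = Counter(text.split())
    print(freq.most_common(3))   # [('the', 3), ('fox', 2), ('quick', 1)]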
Word frequency