Resources

An excellent resource for getting to know the field of textual analysis is at Duke University library’s site, found here.

Text analysis

Network visualization of terms in Wikipedia’s RPG game descriptions (by Jiayu Huang for HUMN 270 Fall 2015)

Voyant is the best multi-tool text analysis platform to start with. The version online is the earlier release, available as part of the suite of tools found here. A newer version of Voyant brings these separate tools into one interface, so you don’t need to switch between them. If you want to use it, ask me; I have a copy on my thumb drive.

Voyant is very good as a concordance and frequency analysis visualization tool. It can work with large amounts of text in multiple files, and you can easily compare aspects of different texts. For example: which words come up most frequently in which texts? Which terms are collocated? What are the vocabulary densities of the different texts?
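
To make those three questions concrete, here is a minimal sketch in plain Python. It is not how Voyant works internally. The function names are made up for illustration, the tokenizer is deliberately crude, and vocabulary density is assumed here to mean unique words divided by total words.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out words (with an optional internal apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def top_words(text, n=5):
    """Return the n most frequent words with their counts."""
    return Counter(tokenize(text)).most_common(n)

def vocabulary_density(text):
    """Assumed definition: number of unique words divided by total words."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def collocates(text, term, window=2):
    """Count the words that appear within `window` words of each occurrence of `term`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            start = max(0, i - window)
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

sample = "the cat sat on the mat the end"
print(top_words(sample, 3))
print(vocabulary_density(sample))
print(collocates(sample, "cat", window=1))
```

Running the same functions over several texts and comparing the numbers is a rough, hand-rolled version of the comparisons Voyant shows visually.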

Here is a tutorial for Voyant 2.0

There are also sites and tools for analyzing large amounts of text data from a macro, or high-level, perspective. For example, the Google Ngram Viewer visualizes word frequencies in the corpus of Google’s digitized books (in multiple languages), and Bookworm visualizes trends in repositories of digitized texts.

Topic Modeling

Topic modeling is a method by which your text is chunked into pieces and a computer works out what the most important topics are in the chunks. The algorithm is not interested in meaning, only in which words tend to occur together. The best tool for this is MALLET, a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.
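
The chunking step can be sketched in a few lines of Python. This is only an illustration of the preparation, not MALLET itself and not a topic model: the function names and the chunk size are assumptions. It splits a text into fixed-size pieces and reduces each piece to word counts, which is the kind of order-free input a topic model works from.

```python
import re
from collections import Counter

def chunk_words(text, chunk_size=1000):
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def bag_of_words(chunk):
    """Count each lowercased word in one chunk, ignoring word order and punctuation."""
    return Counter(re.findall(r"[a-z]+", chunk.lower()))

sample = "Dragons guard gold. Heroes fight dragons. Gold tempts heroes."
chunks = chunk_words(sample, chunk_size=3)   # tiny chunks, just for the demo
counts = [bag_of_words(c) for c in chunks]
for chunk, count in zip(chunks, counts):
    print(chunk, dict(count))
```

For real texts you would use much larger chunks, for example a few hundred or a thousand words, and save each chunk as its own file or line before handing the collection to a topic modeling tool.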

But if you are not comfortable working at the command line, there is also an online version that can handle smaller amounts of text. It can be found here.

There is also a nice demo tool at AlchemyAPI that can be used to identify topics, themes, sentiment, and concepts.

Miriam Posner has written a great blog post about how to interpret the results of topic modeling.

Here is the MALLET tutorial most recently used by the TAMU workshop.

Directory of Tools

DIRT

Natural Language Toolkit (NLTK)

An introductory textbook on NLTK can be found here.

Stop words

Lists of stop words in different languages can be found here.
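
As a rough illustration of what a stop word list is for, the sketch below filters common function words out of a list of tokens before counting. The tiny English list and the function name are placeholders for illustration only; in practice you would load one of the full lists.

```python
# A deliberately tiny, illustrative stop list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The history of the novel in Europe".split()
print(remove_stop_words(tokens))
```

Filtering like this before a frequency count or a topic model keeps words such as "the" and "of" from crowding out the terms you actually care about.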

Parts of Speech Tags

POS tags for Stanford NLP: tagguide copy

Readability score

http://read-able.com/
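
To give a sense of what a readability score measures, here is a minimal sketch of one widely used formula, Flesch Reading Ease: 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). Higher scores mean easier text. The syllable counter below is a crude vowel-group heuristic and the function names are invented for this example, so expect the numbers to differ somewhat from what an online tool reports.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels (including y), minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease computed from words per sentence and syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

easy = "The cat sat. The dog ran."
hard = "Computational methodologies facilitate interdisciplinary investigation."
print(round(flesch_reading_ease(easy), 2))
print(round(flesch_reading_ease(hard), 2))
```

Short sentences made of short words score high; long sentences full of polysyllabic words score low, and can even go below zero.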