Introduction to Text Analysis

HUMN 100-04 Spring 2016

Text sources

Native digital text
- Email
  - (Thunderbird extension, MUSE*)
- HTML
- RSS feeds
- Sample specific services:
- Tutorials for data collection from various services
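
As a rough illustration of what collecting native digital text can involve, the sketch below pulls plain text out of an HTML page and out of the items of an RSS feed, using only the Python standard library. It is a minimal example under stated assumptions, not a recommended pipeline: the names `TextExtractor`, `html_to_text`, `rss_items`, `SAMPLE_HTML`, and `SAMPLE_RSS` are hypothetical, the inputs are inline sample strings rather than anything downloaded, and the feed is assumed to follow the common RSS 2.0 layout with `item`, `title`, and `description` elements.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample inputs; real projects would read files or fetch pages.
SAMPLE_HTML = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Moby Dick</h1><p>Call me <b>Ishmael</b>.</p></body></html>"
)
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example feed</title>
<item><title>First post</title>
<description>&lt;p&gt;Hello, &lt;em&gt;world&lt;/em&gt;!&lt;/p&gt;</description>
</item></channel></rss>"""


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    # Tags treated as block-level here (an assumption, not a complete list).
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")  # keep adjacent blocks from running together

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def rss_items(xml_text):
    """Return (title, plain-text description) pairs for each RSS <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        description = item.findtext("description", default="")
        items.append((title, html_to_text(description)))
    return items


if __name__ == "__main__":
    print(html_to_text(SAMPLE_HTML))
    print(rss_items(SAMPLE_RSS))
```

Real pages and feeds are usually messier than these samples (navigation menus, malformed markup, Atom instead of RSS), so a dedicated parsing library may be a better fit for larger collections.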

Digitized
- Internet Archive
- Project Gutenberg
- Google Books
- HathiTrust (Hathi Download Helper)
- JSTOR Data for Research* (includes the Early Journal Content bundle, which is also available from archive.org)
- PubMed Open Access Subset
- MONK Workbench*
- DocumentCloud*
- Open American National Corpus (collection of American English from various sources)
- WordHoard* (tagged literary texts)
- Corpus of Contemporary American English
- British National Corpus
- Europeana

* Also has some processing/analysis capabilities

Alan Liu’s collection of datasets:
- http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets

Google Ngram Viewer:
- https://books.google.com/ngrams
- Culturomics: http://bookworm.culturomics.org/