- Native digital text
- HTML
- RSS feeds
- Sample specific services:
- Tutorials for data collection from various services
- Digitized
- Internet Archive
- Project Gutenberg
- Google Books
- HathiTrust (Hathi Download Helper)
- JSTOR Data for Research* (includes the Early Journal Content bundle, which is also available from archive.org)
- PubMed Open Access Subset
- MONK Workbench*
- DocumentCloud*
- Open American National Corpus (a collection of American English from various sources)
- WordHoard* (tagged literary texts)
- Corpus of Contemporary American English
- British National Corpus
- Europeana
* Also has some processing/analysis capabilities
- Alan Liu’s collection of datasets at
- Google Ngram: