TEXT ANALYSIS WITH VOYANT 2.0
(This handout is adapted from Sinclair’s own workshop documentation and Dr. Jakacki’s handout from a faculty workshop at Bucknell this summer)
For this course module we will be using the second major release of Voyant Tools (2.0), which addresses several of the major shortcomings and irritants of version 1.0 (http://voyant-tools.org/).
In addition to performance improvements throughout, the search and filtering functionality has been vastly enhanced, and Voyant now supports proximity and n-gram operations.
Outline of this document
- Getting Set Up
- First Steps: Cirrus
- Next Steps: The Full Environment
- Bring Your Own Texts
- Getting to Know the Tools
- Exploring More Tools
- Advanced Search Functionality
- Exporting URLs, Tools & Data
- Voyant Tools Roadmap
One of the strengths of Voyant Tools has always been that it’s freely and conveniently accessible online – there’s a hosted version that anyone can use (at voyant-tools.org, though we’ll be using a more recent beta version). There’s also a downloadable version of Voyant Tools that can be run locally and that has several potential advantages:
- You can keep your texts confidential as they will not be cached on our server.
- You can restart the server if it slows down or crashes.
- You can handle larger texts without the connection timing out.
- You can work offline (without an Internet connection).
- You can have participants in a group (like in this workshop) run their own instance without encountering load issues on our server.
For this module, we will be using the BETA version available on the “Golden Key” that was distributed to participants of Sinclair’s workshop at last summer’s DH 2015 conference in Sydney, Australia.
- Insert the Golden Key into one of your USB drives and copy the zipped folder onto your desktop.
- Double-click on the zip archive to expand its contents.
- Double-click on VoyantServer.jar.
- On Mac, because of security restrictions on applications that aren’t signed and approved by Apple, you may need to Ctrl-click on the VoyantServer.jar file, select Open from the menu, and then click Open (not the default button) in the next dialog box.
- You’ll need Java 1.7+ for this; your computer will tell you if you need to download Java.
You can find more information about Running VoyantServer, including tips in case of problems.
You should see this page:
- Click the “Continue” button in the blue strip at the bottom of the dialog box.
- You should now see Voyant’s default home page
Voyant allows the user to open preloaded files, upload single or multiple text files (via file or cut-and-paste text), or input URL’s for texts that are available online. For the purposes of this workshop we’ll use files that have come bundled with Voyant.
Later on this week we will upload multiple text files that are available in our shared Google folder.
Experimentation Part 1, Austen’s Novels
- Click the “Open” button. From the new dialog box use the pulldown menu to choose the “Austen’s Novels” corpus. Click “Open.”
- You should now see Voyant’s default dashboard.
Eight Jane Austen novels have been loaded:
- Love and Friendship
- Lady Susan
- Sense and Sensibility
- Pride and Prejudice
- Mansfield Park
- Emma
- Northanger Abbey
- Persuasion
Voyant is now using all eight novels to present corpus-wide analysis. The algorithm has been set to remove stop words like “a”, “the”, “and”, etc. Even so, you’ll see from Cirrus and Summary that the five most frequently used words across the eight novels are “mr”, “mrs”, “said”, “miss”, and “think”. These may or may not be useful to you. We can adjust and edit the stop words, which we’ll do in a moment. But first let’s examine the different panels.
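To make the idea of stop words concrete, here is a minimal Python sketch of what a word-frequency count with a stop word filter might look like. It is not Voyant’s own code: the tiny stop word list, the tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stop word list (an assumption); real lists are much longer.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "it", "is", "was"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def top_terms(text, stopwords=STOPWORDS, n=5):
    """Return the n most frequent terms that are not stop words, with counts."""
    counts = Counter(t for t in tokenize(text) if t not in stopwords)
    return counts.most_common(n)

# Made-up sample text, not taken from the corpus.
sample = ("Mr Darcy said the ball was dull, and Mrs Bennet said it was not. "
          "Mr Bingley said nothing.")
print(top_terms(sample, n=2))
```

Adding “mr” and “said” to the stop word set and running the function again gives a feel for how editing the list changes what rises to the top.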
Voyant Tools is an environment that can host different individual tools (like Cirrus) in different views and layouts. The default view of Voyant is composed of 5 panels where the tools interact with one another. Try opening the Austen corpus. If you click on a word in Cirrus, the Trends graph will update. If you click on a node in the Trends graph, the Contexts tool will update. Here’s a summary of the 5 visible tools:
- Cirrus: a simple wordcloud that displays the highest frequency terms in the corpus (that aren’t in the stopword list)
- Reader: an infinite scrolling reader for the actual text in the corpus (this fetches the next part of the text as needed)
- Trends: a visualization of word frequency across the corpus or within each document (depending on the mode)
- Summary: a high-level summary of data from the corpus
- Contexts: a list of occurrences of a specified word (this is sometimes called a concordance or a keyword in context)
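If the notion of a concordance (keyword in context) is new to you, the following Python sketch may help. It is only an approximation of what the Contexts tool displays; the function name, the simple tokenizer, and the sample sentence are assumptions for illustration.

```python
import re

def kwic(text, keyword, context=5):
    """Return (left, keyword, right) tuples for each occurrence of keyword.

    `context` is the number of words kept on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text.lower())
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            rows.append((left, tok, right))
    return rows

# Made-up sample sentence.
sample = "My dear friend, you are a true friend to this family."
for left, kw, right in kwic(sample, "friend", context=2):
    print(f"{left:>15} | {kw} | {right}")
```

Increasing `context` is roughly what the context size option does in the Contexts tool: more words are shown on each side of the keyword.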
Explore the visible tools (we’ll come back to the other tools later):
- what happens when you hover over the help icon? what if you click it?
- which tools trigger responses from which other tools?
- what scale is each tool (entire corpus, entire document, part of a document, etc.)?
- what is the visualization in the bottom of the Reader (middle-top) panel?
- try a simple search in the Reader panel
- what is relative frequency in the Trends tool?
- what are vocabulary density and distinctive words in the Summary tool?
- what does the plus icon do in the Contexts tool?
- what is the difference between context and expand in the Contexts tool?
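Two of the measures mentioned above can be sketched in a few lines of Python. This assumes that relative frequency means the number of occurrences of a term divided by the total number of words (tools often rescale this, for example per million words), and that vocabulary density means the ratio of unique word forms (types) to total words (tokens). Voyant’s exact calculations may differ.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"\w+", text.lower())

def relative_frequency(term, tokens):
    """Occurrences of term divided by the total number of tokens."""
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0

def vocabulary_density(tokens):
    """Number of unique word forms (types) divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Made-up sample text.
tokens = tokenize("the cat sat on the mat and the dog sat too")
print(relative_frequency("the", tokens))
print(vocabulary_density(tokens))
```

A lower density means more repetition of the same words; longer texts tend to have lower density, so compare documents of similar length with care.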
** Note that the Reader, Trends, and Contexts panels all have search boxes that allow you to look for a particular word. Type “friend” into the Reader and Contexts panels – you can see visually and textually where that word appears across the novels. The Trends panel allows you to add multiple words by separating them with a comma. Currently this adding feature seems to reflect the corpus as a whole rather than an individual document.
To edit stop words, roll your cursor over the area above the Cirrus. You’ll see an icon with pop-up text that reads “Define options for this tool”. Click it, and a dialog box will appear that allows you to choose and edit stop word lists and then apply them to the Cirrus and/or to other panels.
Voyant offers other tools that you can use to analyze text in different ways: in the upper left panel you can toggle between Cirrus and Corpus Terms; in the upper right panel you can toggle between Trends, Links, and Collocates. In the lower left panel you can toggle between Summary, Documents, and Phrases. And in the lower right panel you can toggle between Contexts and Bubblelines. Voyant also allows you to customize the dashboard so that you can reorder or choose which tools you want to work with. There are other tools to choose from by moving your cursor over the panel icon at the top right. You can also choose one tool to fill the entire dashboard area.
** If you want to return to the default dashboard view and to the original dataset, refresh your browser.
Experimentation Part 2, Dickens
- Clear your browsing data/cache and close the browser tab. This will remove the Austen corpus.
- Return to the Voyant Server dialog box. Click the “Stop Server” button, and then the “Start Server” button. When you return to the Voyant home page, click on the “upload” button.
- Navigate to https://drive.google.com/drive/folders/0B_v8yVuozRJ1flpoRDJ2T2I0YmJSZkx4T1ZJTXk1THRYdW9oajZITzR4QkFtRkh5UTFSdzg folder and choose the Dickens sub-folder. These are texts Dr. Jakacki has downloaded from Project Gutenberg and has “cleaned” by removing all metadata that might skew analysis.
- Choose Hard Times and click “Open”. Now you’ll see the dashboard with just one document open. Look for keywords.
- Now clear Voyant again (close tab, stop and start server) and upload all fifteen Dickens novels. When the dashboard opens again, you’ll see a corpus similar in scope to the Austen example above.
Again, look for keywords and word combinations. What patterns emerge across all the texts? Try comparing just two texts by uploading a selection to Voyant. When uploading files, you can now select multiple files at once by using the Ctrl and/or Shift keys.
Getting to Know the Tools
Each of the several tools in Voyant has its own particularities and peculiarities, but here are some general principles that apply to several tools.
Options. Many of the tools provide parameters directly visible (usually in the bottom part of the tool). The Contexts tool for instance (bottom right-hand corner of the default skin) has options for searching, for the context size (how many words to show on each side of the keyword in the table), and for expand size (how many words to show on each side of the keyword when you expand the occurrence by clicking on the plus icon in the first column of the row). In addition to these visible options, some tools also have additional options that can be accessed through the options icon in the top header. The Cirrus tool, for instance, has an option for modifying the stopword list.
Stopwords. The stopword list contains words that carry little meaning on their own and are very common in most texts, such as determiners (“the”, “a”) and prepositions (“to”, “in”, “from”), etc. One person’s stopword is another person’s treasure, and it may be worth looking at the list of words to see if there are ones you’d prefer to show or if there are words that you don’t want to show and that should be added to the stopword list. You can edit the list by clicking on the options icon (in Cirrus, for instance) and clicking the edit button. Note that you can apply the newly selected or edited list to the current tool only or globally to all tools that support stopwords (globally is the default).
Voyant 2.0 now uses auto-detect by default so it’s no longer necessary to choose a stopword list (unless the auto-detect option doesn’t work for you).
Table/Grid Headers. The column headers in table/grid views include functionality that may not be obvious. First, a help tip will appear when you hover over most column headers to briefly explain what that column is showing. Next, a down arrow will appear in the right part of the column header; clicking on the down arrow will allow you to sort by that column (when possible) and to toggle the visibility of columns. Finally, if a column is sortable, you can also click on the header to toggle between ascending and descending order for sorting the table by that column.
Infinite Scrolling Tables/Grids. Tables can sometimes contain a huge number of logical items (for instance, tens of thousands of terms in a document) which would be impractical to load at once. In Voyant 1 there was a paging mechanism that allowed the user to see 50 items at a time by advancing or rewinding by “page”. In Voyant 2 items are loaded on-demand as the user scrolls through the table – in most cases that should happen fairly seamlessly.
Corpus/Document Modes. Some of the tools can operate at variable scale, either showing data at the corpus level or at the individual document level – this can be a bit confusing if you’re not sure what you’re seeing. For instance, by default Cirrus shows top frequency terms for the entire corpus, but you can also generate a Cirrus from the terms of an individual document – one way to do this is to click on the Documents tab in the lower left-hand panel and click on one of the document rows. The Cirrus that appears will be for just one document, and if you want to revert to Corpus mode you can click on the “reset” button that appears in the lower right-hand corner of the Cirrus tool.
Resizing. The individual tool panels are resizable: the mouse pointer should change to a resize icon when you hover over the inner borders between tools, and you can drag the border to resize. Similarly, the columns in table/grid tools are resizable.
Exploring More Tools
The way you access other tools in Voyant 2.0 has been improved and simplified, particularly with the introduction of tabs (multiple tools available from each panel) and the introduction of the tool switching menu.
In addition to the five tools that are displayed by default (Cirrus, Reader, Trends, Summary and Contexts), each of the five panels makes it easy to access additional tools, some of which we’ve mentioned already. Here are the other tools available from the tabs:
- Corpus Terms: displays frequency and distribution information for terms (types or unique words) in the corpus
- Links: displays a network graph of the collocates of keywords (the highest frequency terms that occur close to the specified search terms) – you can click on individual terms to fetch more terms and you can drag terms off the tool to remove them
- Collocates: similar to Links, but this presents collocates of search terms in a table form
- Documents: lists the documents in the corpus, including some metadata (where available, such as title and author), as well as counts of words/tokens, types and a ratio of types to tokens
- Phrases: lists the recurring phrases in the corpus (though any phrase must be repeated in a document before it is counted at the corpus level); this is a new tool in Voyant 2.0 and one of the most useful functions can be to see the longest repeating phrases (without having to specify a search query); note that there are different options for handling overlapping phrases
- Bubblelines: this is another representation of the distribution of terms within each document in the corpus; it can be helpful for perceiving where different terms appear together (overlap)
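For readers who want a sense of what Collocates and Phrases are counting, here is a rough Python sketch. It is an illustration under simplifying assumptions (a naive tokenizer, no stop word filtering, a fixed window), not a description of how Voyant computes these values; the function names and sample text are made up.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"\w+", text.lower())

def collocates(tokens, keyword, window=2):
    """Count the words that occur within `window` tokens of each keyword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

def repeated_phrases(tokens, n=2):
    """Return the n-word sequences that occur more than once, with counts."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {" ".join(g): c for g, c in grams.items() if c > 1}

# Made-up sample text.
tokens = tokenize("time enough for tea, time enough for talk, and time for bed")
print(collocates(tokens, "time", window=1).most_common(3))
print(repeated_phrases(tokens, n=3))
```

In this sample, “enough” is the strongest collocate of “time”, and “time enough for” is the only three-word phrase that repeats.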
All of these tools can be accessed through the tabs, but they can also be invoked from the tool switching menu (a windows-like icon) that appears when you hover over the header of any tool.
If you click on the tool switching icon a nested menu will appear. The first items will be a list of one or more tools that fit most naturally in that tool panel, but you can also navigate tools by scale (corpus or document) or by tool type (visualizations, tables/grids, other).
The skin header (the blue bar at the top) also has a tool switching menu which allows you to replace the entire page with one tool.
This is also a convenient way to access the ScatterPlot tool which provides a visualization of Correspondence Analysis or Principal Component Analysis (more complex analysis of how terms are shared between documents).
Note that some of the tools from the current 1.0 version of Voyant have not yet been implemented in version 2.0, such as TermsRadio, Knots, and Bubbles. Those should be implemented in the coming months, though some of the other tools may be abandoned, especially those that rely on Flash or Java.
Advanced Search Functionality
Much of the advanced search functionality is new in Voyant 2.0 – we’ll go through some highlights below.
Help with the search syntax is displayed when you hover over the question mark icon in a search box. The hovering tip box will disappear after a few seconds, and you can click on the question mark to have a dialog box appear until you dismiss it.
Search functionality is fairly consistent in all tools that support search. For experimentation, let’s work in the Corpus Terms tool (which is the second tab in the upper left-hand panel where the Cirrus wordcloud is displayed by default). These examples use the Austen corpus.
- exact match: think this searches the exact word (though it’s case insensitive, there’s currently no way to perform a case-sensitive search)
- wildcard match: think* this matches the root of a word and includes variants as a single term (think, thinks, thinking, etc.); note that for now wildcards can’t be used at the beginning of words and produce inconsistent results when used in the middle of words
- expanding wildcard match: ^think* this is similar to the previous wildcard match but this time each variant is counted and displayed as a separate term (this can be useful for seeing what terms are actually included in a wildcard match)
- multiple matches: think*, ^think* you can search multiple terms (two or more) by separating them by commas – a simple search might be for exact matches think, thinking, but you can also use more complex searches like think*, ^think* to get the best of both worlds from wildcard matches (counting the total wildcard matches as one term and also seeing the individual matches).
- combined matches: think|thinking use a combined match to merge two or more search terms into one result – this might be useful for counting singular and plural forms of a word, but not all wildcard forms (time|times but not timely, etc.)
- phrase match: “time enough” this matches an exact phrase or sequence of words – note the use of quotes (if you exclude the quotes you’re essentially performing a combined match for time|enough, though that may change in the future)
- proximity match: “time enough”~10 this is essentially a NEAR match, where the terms in quotes (there can be more than two) must occur within a specified number of words (in this case within 10 words, but you can specify a different number for the proximity); note that words can appear in any order, so enough might occur before time; it’s not possible to expand the match with the ^ operator like with wildcard searches, but you can use the Contexts tool to see the actual occurrences that are being matched
- multiple matches: time*, time|times, “time enough”~10 it’s possible to mix and match the different syntaxes, as with this example that has a wildcard match, multiple matches, combined matches, and a proximity match
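The differences between these search types can be easier to see in code. The Python sketch below mimics each kind of match on a list of words. It is a simplified illustration, not Voyant’s search engine: the function names are invented, the tokenizer is naive, and the proximity rule in particular is an assumption (it counts occurrences of the first word that have every other word within the given distance, in any order), so counts may not agree with what Voyant reports.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"\w+", text.lower())

def exact(tokens, term):
    """Exact, case-insensitive match, as in: think"""
    return tokens.count(term.lower())

def wildcard(tokens, prefix):
    """Wildcard match, as in: think* (all variants counted as one term)."""
    return sum(1 for t in tokens if t.startswith(prefix.lower()))

def expanded_wildcard(tokens, prefix):
    """Expanding wildcard, as in: ^think* (each variant counted separately)."""
    return Counter(t for t in tokens if t.startswith(prefix.lower()))

def combined(tokens, terms):
    """Combined match, as in: think|thinking (terms merged into one count)."""
    wanted = {t.lower() for t in terms}
    return sum(1 for t in tokens if t in wanted)

def phrase(tokens, words):
    """Phrase match, as in: "time enough" (the words in this exact order)."""
    words = [w.lower() for w in words]
    n = len(words)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)

def proximity(tokens, words, distance):
    """Simplified proximity match, as in: "time enough"~10 (any order)."""
    first, others = words[0].lower(), [w.lower() for w in words[1:]]
    hits = 0
    for i, tok in enumerate(tokens):
        if tok == first:
            window = tokens[max(0, i - distance):i] + tokens[i + 1:i + 1 + distance]
            if all(w in window for w in others):
                hits += 1
    return hits

# Made-up sample text.
tokens = tokenize("I think there is time enough. She thinks, thinking aloud, "
                  "that enough of her time was lost.")
print(wildcard(tokens, "think"), dict(expanded_wildcard(tokens, "think")))
print(phrase(tokens, ["time", "enough"]), proximity(tokens, ["time", "enough"], 3))
```

Note how the phrase match finds only “time enough” in that order, while the proximity match also picks up “enough of her time”, where the words appear in the opposite order.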
Exporting URLs, Tools & Data
A distinguishing feature of Voyant Tools is its ability to generate URLs that can be bookmarked or shared and that point to a specific corpus with specific parameters.
The URL in the browser location bar will now update automatically after you create a corpus – you can bookmark or share this URL directly.
To export the URL from the current skin (combination of tools, not just one tool), click on the export icon from the top blue header bar.
This will cause a dialog box to appear with various export options, the first of which is a simple link that can be copied into the clipboard or clicked to open the URL in a new window.
The same basic process works for individual tool panels as well (if you just want to export or share, say, the Cirrus visualization), except that additional parameters are usually included with the tool panels (specific search terms that have been selected, for instance).
In addition to exporting a URL, you can also generate a bibliographic entry for Voyant Tools (if you wish to cite it, which would be awfully kind of you :), or export a live dynamic tool panel. The exported tool works much like a YouTube clip that can be embedded into any website – it pulls interactive content from a remote site. For both of these options, expand the “Export View” menu (see the image above).
The HTML snippet for a live tool might look something like this:
<!-- Exported from Voyant Tools: http://voyant-tools.org/.
Please note that this is an early version and the API may change.
Feel free to change the height and width values below: -->
<iframe style='width: 100%; height: 400px' src='http://beta.voyant-tools.org:80/?corpus=austen&view=Cirrus'></iframe>
Which should produce a live tool like this:
Important notes about URLs and embedded tools:
- During this workshop we’re using special instances of Voyant Tools that may not be accessible to others – that’s certainly true for a standalone (local) instance of Voyant running on your machine, but it’s also true for the workshop and beta URLs where corpora are less likely to remain accessible, unlike the current production version of Voyant Tools where corpora remain accessible as long as they’re visited regularly (at least once every three weeks).
- Embedding the HTML snippet may be a bit trickier with some Content Management Systems. In WordPress for instance, if you’re not an administrator, you may want to install a plugin like iframe.
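If you need to embed several tools, you could generate snippets like the exported one above with a short script. The Python sketch below is only a convenience illustration: the function name is made up, and it assumes the URL pattern seen in the exported snippet (a `corpus` parameter and a `view` parameter). Since the handout notes that the API may change, treat the parameters as provisional.

```python
from html import escape
from urllib.parse import urlencode

def voyant_iframe(base_url, corpus, view, width="100%", height="400px"):
    """Build an iframe snippet modelled on the one Voyant exports.

    base_url, corpus and view are supplied by the caller; nothing is
    checked against a running Voyant server.
    """
    query = urlencode({"corpus": corpus, "view": view})
    # Escape the URL for use inside an HTML attribute (& becomes &amp;).
    src = escape(f"{base_url}/?{query}", quote=True)
    return f"<iframe style='width: {width}; height: {height}' src='{src}'></iframe>"

print(voyant_iframe("http://beta.voyant-tools.org:80", "austen", "Cirrus"))
```

The output differs from the exported snippet only in that the ampersand in the URL is written as `&amp;`, which is the stricter HTML form; browsers treat both the same way.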
In addition to exporting a URL or embedding an interactive tool, Voyant provides some additional data exporting features, depending on the tool. For instance, some visualizations (like Cirrus, Trends, and Links) allow you to export data as graphics (a PNG or SVG), while the table-oriented tools (like Corpus Terms, Contexts and Phrases) allow you to export data in different formats (HTML, tab-separated values, and JSON). The tab-separated values can be especially useful since you can copy the generated output into a clipboard and paste it directly into a spreadsheet program (like Excel or Google Spreadsheets).
Note that in the current beta version it’s only possible to export the currently visible/loaded data, but that in a near-future release it will be possible to export full datasets.
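Tab-separated values are also easy to read in a script if you would rather not use a spreadsheet. The Python sketch below parses a small made-up export. The column names (“Term”, “Count”) and the numbers are invented for illustration and will not match what a given tool actually exports.

```python
import csv
import io

# Hypothetical export: column names and values are invented for illustration.
exported = "Term\tCount\nmr\t120\nmrs\t95\nsaid\t88\n"

def read_tsv(text):
    """Parse tab-separated text into a list of dictionaries keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

rows = read_tsv(exported)
total = sum(int(r["Count"]) for r in rows)
print(rows[0]["Term"], total)
```

To work with a real export, paste the copied text into a file and read that file’s contents instead of the `exported` string, adjusting the column names to match the header row.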
Voyant Tools Roadmap
Voyant Tools is an ongoing project and we’ll continue to improve and enhance the platform. Here’s a tentative roadmap for future development:
- by fall 2015 we hope to release Voyant Tools 2.0 and replace the current 1.0 version – some of the major remaining work includes:
- fixing various bugs
- allowing documents to be added and reordered in existing corpora
- adding password protection for corpora
- resolving backwards compatibility issues to ensure that existing Voyant URLs continue to function correctly
- during fall 2015 and winter 2016 work will resume on Voyant Notebooks, a literate programming environment that allows a combination of writing, code snippets, dynamic tools, and other data output (more here). Voyant Notebooks is intended to leverage the existing analytic and visualization capabilities of Voyant while allowing users to customize some functionality and include a narrative description of their work
- ongoing work to summer 2016 on the next version of Voyant Tools that will include functionality for part-of-speech tagging, lemmatization, and topic modelling (some work has already been done on each of these, but was put on hold to ensure that Voyant 2.0 could be released)
Please feel warmly encouraged to help improve and guide further development of Voyant Tools by providing us with feedback, including bug reports and feature requests. You can follow the developments on Github, Twitter, or contact us directly (sgsinclair at Google’s email service).