Using Jigsaw
Overview
Jigsaw is a visual analytics system designed to help people browse, explore, analyze, understand and make sense of collections of text documents. Jigsaw presents multiple visualizations of the documents and the entities within them, with a special focus on showing connections between entities (entities that appear together in some document). Because Jigsaw provides many different visualizations of the documents and entities, you should ideally have a large amount of screen real estate to show the views.
This documentation is also available on-line in HTML format at http://www.cc.gatech.edu/gvu/ii/jigsaw/tutorial/manual.
If you prefer interactive video assistance over a printed document (and who doesn’t these days), you should watch the tutorial videos on the web page http://www.cc.gatech.edu/gvu/ii/jigsaw/tutorial. They should be helpful in learning how to use the system. The “System Requirements” video, in particular, provides assistance with getting the system set up and running on your machine.
Getting Started
You must have Java Version 6 (or higher) installed in order to run Jigsaw.
Download the program at http://www.cc.gatech.edu/gvu/ii/jigsaw/
This site also has video tutorials and a page showing the system views: http://www.cc.gatech.edu/gvu/ii/jigsaw/views.html
- On a Windows machine, double-click on the Jigsaw.bat script file in the Jigsaw folder to start the system.
- On a Mac, double-click on the Jigsaw.command script file in order to start the system.
- On Linux/UNIX, execute the shell command file Jigsaw.sh.
This should bring up this Jigsaw control panel:
1.2 Reading in (or loading) a Set of Documents
Jigsaw can read in (and store) documents from a variety of formats. It can read original documents such as text, csv, html, pdf, Word, and Excel files. We also have created a Jigsaw Datafile format using xml that can be read in. Additionally, there are a few specific, proprietary document formats that Jigsaw can read.
To import a source document that has not been processed at all yet, use the File menu’s Import command. This will bring up the Import dialog box with tabs for the different types of documents that can be read in.
The main tab here is Files. It allows you to read in plain text (.txt), MS Word (.doc), PDF (.pdf), html (.htm or .html), comma-separated value (.csv), and MS Excel (.xls) files. To read in multiple files at once, use the Browse button and select multiple files in the chooser dialog box.
When you import a document or a set of documents, you also can choose to perform entity identification on the documents if you would like. This is done via a second dialog box that will pop up. (If you have many files and they are relatively large, entity identification can be time-consuming, so be patient.)
Entity Identification
When importing text files or spreadsheets, you can choose to have the system automatically identify entities. For automated entity identification, Jigsaw can apply one of four possible packages.
The developers generally use the Illinois NER system and have found it to be quite good. I have found that LingPipe also works well.
The Process
When Jigsaw imports a set of documents, it builds an analysis database for those documents on disk so that Jigsaw can scale up to large document collections. The first time a set of documents is imported, building this database can be time-consuming, perhaps taking quite a few minutes. Once built, this analysis database is called a Jigsaw Project. Subsequent analysis sessions start much faster because the project/database is simply read in from disk.
A Jigsaw Project is stored as a file (.jp) in the Projects folder of the system. It encapsulates a set of documents that have been read into Jigsaw along with any entity identification that has been performed on them. Jigsaw also uses the concept of a Workspace. A Workspace includes all the information of a Project plus the Views that may have been active during an investigation. Jigsaw Workspaces are stored as files (.jws) in the workspaces folder. The File menu in the Jigsaw control panel includes commands for opening and saving Projects and Workspaces.
1.3 Displaying Views
To begin analysis, you likely want to start with a set of views. Go to the Views menu and choose whichever ones you want. Note that you can create multiple instances of any view type. We highly recommend having at least one Document View open all the time. Almost all Jigsaw views begin empty (the Document Cluster View is an exception). This is normal. You must perform a search query or issue a command in a view in order to look for entities and/or documents and to begin to populate the views.
1.4 Start analysis and exploration
To begin exploration, you can enter a search term in the Control Panel. Jigsaw will look for any identified entities containing that text and will display an appropriate representation in each of the views present. This is the default Entities search mode which is initially selected. The Documents search mode is useful when you want to search for a plain word (e.g., dog, car) that is not necessarily an entity. Jigsaw acts more like a simple search engine for this, bringing up the documents that include the search string.
In general, there are three ways to populate and add information to the views:
- Issue search queries from the Jigsaw control panel.
- Right-click on an entity or document and issue the Show command. This pushes that item out to be displayed in all the other views that are listening.
- Double-click on an entity or document. This issues an Expand command, which shows connected items in that view and in other listening ones.
1.5 Saving a session
You can save an analysis session already underway by saving it as either a Project or a Workspace. These commands are available under the File menu. See the next section for additional details about projects and workspaces.
2 Exploring and Analyzing a Document Collection
Once you have imported a document collection, you are ready to explore, investigate, and analyze the documents and their entities. In all likelihood, you want to create a number of different views to show the documents and entities. Remember that you can have any number of views of any of the existing view types present.
Views show entity-document and entity-entity connections. A document and an entity are connected if the entity appears in the document. Two entities are considered to be connected if they appear in at least one document together. As the number of documents in which they appear together increases, so does the quantitative connection strength.
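The connection rule above can be made concrete with a small sketch. This is not Jigsaw's implementation; it only illustrates the stated definition, assuming each document is already reduced to the list of entities found in it. The function name connection_strengths and the toy collection are hypothetical.

```python
from collections import Counter
from itertools import combinations


def connection_strengths(doc_entities):
    """Count, for every pair of entities, the documents in which both appear."""
    strengths = Counter()
    for entities in doc_entities.values():
        # A document contributes at most 1 to a pair, no matter how often
        # the two entities are mentioned inside it.
        for pair in combinations(sorted(set(entities)), 2):
            strengths[pair] += 1
    return strengths


# Hypothetical toy collection: document id -> entities identified in it.
collection = {
    "doc1": ["Alice", "Boston", "Acme Corp"],
    "doc2": ["Alice", "Boston"],
    "doc3": ["Bob", "Acme Corp"],
}

strengths = connection_strengths(collection)
print(strengths[("Alice", "Boston")])     # 2: together in doc1 and doc2
print(strengths[("Acme Corp", "Alice")])  # 1: together in doc1 only
print(strengths[("Alice", "Bob")])        # 0: never in the same document
```

Pairs are stored in alphabetical order, so look them up the same way. A strength of zero means the two entities are not connected at all.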
A single mouse click on an item (document or entity) selects that item. All the other visible items then update their appearance to show how they are related to the selected item. A double-click on an item expands it, which typically shows the items connected to it. User mouse actions such as selections and expansions are also transmitted to other active views, which update their representations accordingly.
You can turn off/on event listening in each view by clicking on the little satellite dish in the upper right. Turning off listening essentially freezes the view, that is, user actions such as clicks and double-clicks in other views will not affect this view. This capability is very useful to lock a view at an interesting state. Note that frozen views also are not affected by the Clear All Views command in the Views menu. To push an item (entity or document) out to all other open and listening (non-frozen) views, right-click on the item and use the Show command. Similarly, doing a search on a string through the Control Panel will push out all entities containing that string to all the active (listening) views. To examine a document or the set of documents containing an entity in an empty new Document View, right-click on the item and use the Show in new Document View command.
There are two search/query modes available through the two checkboxes under the query entry region in the Jigsaw Control Panel window. In Entities mode, Jigsaw will seek out documents containing the words from the query string in already identified entities within those documents. That is, only documents (and entities) will be retrieved that have the search terms in existing entities in those documents. This is the default mode. In Documents mode, which is accessed by selecting the Documents checkbox, Jigsaw simply retrieves documents that contain words from the search query somewhere in the document text. Note that different kinds of Boolean searches can be performed in Jigsaw. (For those in the know, we use Lucene to perform search.) When your search query has multiple words such as John Mary Bill you will be doing an “or”-based search, that is, finding documents that mention one or more of those words. You can also do other Boolean operations such as “john smith” AND mary which searches for documents having both “john smith” and “mary” in them.
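To illustrate the difference between the default “or”-based search and an AND search, here is a minimal sketch of the two behaviors. It does not use Lucene and is not how Jigsaw evaluates queries; the helper names (normalize, contains, search_any, search_all) and the sample documents are assumptions made for this example, and a quoted phrase is treated as a sequence of whole words.

```python
import re


def normalize(text):
    """Lower-case the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def contains(text, term):
    """True if the word or phrase occurs in the text as whole words."""
    return f" {normalize(term)} " in f" {normalize(text)} "


def search_any(docs, terms):
    """'or'-style search: documents mentioning one or more of the terms."""
    return sorted(d for d, text in docs.items()
                  if any(contains(text, t) for t in terms))


def search_all(docs, terms):
    """AND-style search: documents mentioning every one of the terms."""
    return sorted(d for d, text in docs.items()
                  if all(contains(text, t) for t in terms))


# Hypothetical sample documents.
docs = {
    "d1": "John Smith met Mary in Atlanta.",
    "d2": "Bill called John.",
    "d3": "Mary wrote to Smith, John.",
    "d4": "No names here.",
}

print(search_any(docs, ["John", "Mary", "Bill"]))  # ['d1', 'd2', 'd3']
print(search_all(docs, ["john smith", "mary"]))    # ['d1']
```

Document d3 mentions both names but not the phrase “john smith” in that order, so only d1 satisfies the AND query.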
3 System Views
Jigsaw presents the individual reports in a document collection and the entities within those reports through a series of visualizations. We call these visualizations the system views. Below, we describe each view provided by the system and its characteristics. A tutorial video also illustrates the different views, and the interactive behavior of each view can be seen on the video tutorial page.
All views share a Bookmarks menu which has commands to save any window and its state for resumption later. Also, in the upper right corner of a view is a small icon showing a satellite dish. This icon indicates that the view is listening for system events and will update its presentation as new events occur. When the user clicks on this icon, a red line is drawn through it to indicate that the view is no longer listening to system events and thus will only change what is shown by direct interaction from the user. The icon is a toggle button so that clicking on it again will turn event listening back on.
Views in Jigsaw often show connections between entities across the document collection. Two entities are considered to be “connected” if they appear in at least one document together. Entities are considered more strongly connected as they appear in more and more documents together.
Control Panel – The Control Panel provides a variety of menu commands for use in the system and a search bar in which the user can enter strings to be searched, either as parts of entity names or as plain text in the documents. When a valid entity from the system is queried, all the visible views display that entity in the appropriate context of that view. When a plain text term is entered, all documents containing that term are loaded in the Document View. The Control Panel also displays the number of documents in the collection being investigated, the different types of entities (each assigned a unique color), and the number of entities found of each type.
Document View – The Document View presents a set of documents from the collection. A list of the loaded documents is shown to the lower left, and the one currently selected for viewing is highlighted in yellow (its text is shown to the right). Every time a document is viewed, a counter increments to help the investigator keep track of readings. All the documents with grayed-in clouds in the left list contribute toward the word cloud at the top of the view which presents the key terms being mentioned across this set of documents. In the actual selected document view, named entities are colored in a background pastel shade of the entity color type shown in the Control Panel. The one sentence from the document that “best summarizes” the document is shown above the actual document text.
List View – The List View presents a set of lists of entities of different types. The user can add and remove lists through a menu command. Thus, a wider view window can support the display of more lists. At the top of each list is a set of buttons and a menu for controlling the appearance of the list. The menu allows the user to designate what entity type should be shown in that list. Note that the same entity type can be shown in multiple lists. Different buttons control features such as the justification of entities in the list (left, center, right) and the ordering of entities. Entities can be listed alphabetically, by frequency of appearance across the document collection, or by strength of connection to the selected entities. The small black bars to the left of the entities also indicate each entity’s frequency of appearance across the collection.
When the user clicks on an entity, it is “selected” and shown with a yellow background. Multiple entities can be selected within and across lists using control-click and shift-click as well. When an item or items are selected, all of the other entities update their appearance. If an entity is not connected to any of the selected entities, it is shown in the default white background. Entities that are connected to at least one of the selected items are shown with an orange background. Stronger connections are indicated by darker shades of orange. In addition, connected items in neighboring lists can be joined by lines to further indicate individual connections. As a list becomes longer and longer, many items may not be visible in the view. Consequently, a button is provided at the top of each list to bring all selected and connected items up to the top of the list.
The default display mode is “OR”. That is, when multiple entities are selected, other entities connected to any one or more of those selected ones are colored in orange. The viewer can change the mode to “AND” via a button in the upper right which means that only entities connected to each and every one of the selected items will be colored in orange.
Document Cluster View – The Document Cluster View represents all the documents in the collection as small rectangles. The user can drag and move individual documents or sets of documents to make different clusters. In addition, each query issued in the control panel adds a filter to the upper left region. The documents then can be segregated depending upon which of those terms they contain (different groups are assigned different colors). The View also contains buttons in the lower left to automatically cluster the documents by similarity, using either the source text of each document or the set of entities per document. The button in the top left will highlight (via a yellow outline) all the documents in the collection that the analyst has read so far.
Graph View – The Graph View presents documents and the entities within them through a traditional node-link graph visualization. Rather than drawing the entire document/entity collection through one graph layout, Jigsaw provides an interactive exploration-style Graph View. Documents are slightly larger white rectangles and entities are slightly smaller circles, colored by the entity type. The entities within a document are usually drawn as a cloud around the document in which they appear. An entity is only ever drawn once, however, so entities in multiple documents are indicated by one circle that is connected to different documents (rectangles). When the user searches for an entity or issues an entity “show” command, that entity is added to the view.
The view is interactive so that the user can click on any document or entity and drag it to a new location. Dragging a document brings with it all the entities only connected to it. (Entities connected to other documents as well retain their position during such a move, however.) Double-clicking on a document is a toggle-style command that either shows or hides the entities connected to that document. Double-clicking on an entity displays all the different documents in which it appears.
When new entities are added to a crowded view, they may be positioned outside the current visible area, but the Jigsaw Graph View will automatically zoom out to make sure all are visible after the command. The Graph View also contains one special layout command, “Circular Layout”, that will reposition all the items in the view. Document rectangles are drawn at equally spaced positions around a large logical circle in the view. All the entities only appearing in one document are drawn outside the logical circle but near that document. Entities appearing in more than one document are drawn inside the circle. Entities appearing in the most documents are drawn closer to the center. The view contains many menu commands for filtering (showing and hiding) different types of entities as well.
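The placement rules of the Circular Layout can be sketched as follows. This is only an illustration of the rules as described, not Jigsaw's layout code. The function circular_layout, the 1.2 offset for single-document entities, and the choice of radius / (number of documents) for shared entities are all assumptions made for this example; entities that share a single document would overlap here, which a real layout would have to spread out.

```python
import math


def circular_layout(doc_entities, radius=1.0):
    """Return (document positions, entity positions) as (x, y) tuples."""
    doc_ids = list(doc_entities)
    n = len(doc_ids)

    # Documents sit at equally spaced angles on the logical circle.
    doc_pos = {}
    for i, doc_id in enumerate(doc_ids):
        angle = 2 * math.pi * i / n
        doc_pos[doc_id] = (radius * math.cos(angle), radius * math.sin(angle))

    # Invert the mapping: entity -> documents it appears in.
    entity_docs = {}
    for doc_id, entities in doc_entities.items():
        for entity in set(entities):
            entity_docs.setdefault(entity, []).append(doc_id)

    entity_pos = {}
    for entity, docs in entity_docs.items():
        if len(docs) == 1:
            # Single-document entity: just outside the circle, near its document.
            x, y = doc_pos[docs[0]]
            entity_pos[entity] = (1.2 * x, 1.2 * y)
        else:
            # Shared entity: inside the circle, in the general direction of its
            # documents, and closer to the center the more documents it is in.
            cx = sum(doc_pos[d][0] for d in docs)
            cy = sum(doc_pos[d][1] for d in docs)
            angle = math.atan2(cy, cx)
            r = radius / len(docs)
            entity_pos[entity] = (r * math.cos(angle), r * math.sin(angle))
    return doc_pos, entity_pos


# Hypothetical toy collection: document id -> entities identified in it.
collection = {
    "d1": ["Alice", "Boston"],
    "d2": ["Boston", "Bob"],
    "d3": ["Boston", "Carol"],
    "d4": ["Bob", "Dave"],
}
doc_pos, entity_pos = circular_layout(collection)
for name in ("Alice", "Bob", "Boston"):
    print(name, round(math.hypot(*entity_pos[name]), 3))
```

In this toy collection, Alice (one document) lands outside the circle, Bob (two documents) lands inside it, and Boston (three documents) lands closest to the center.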
Document Grid View – The Document Grid View represents all the documents in the collection as small rectangles in a grid. The analyst can control the ordering of rectangles (top-left to bottom-right) and the shading/color of the rectangles. Each of these attributes can be mapped to document attributes such as the size, number of entities, or date. Additionally, a particular document can be selected as the focus, and then all other documents’ similarity to it becomes another attribute that can be visualized. Jigsaw also can perform a sentiment analysis of the documents and this attribute can be represented (blue-positive, red-negative). By selecting the button in the upper left, the grid is segregated into regions corresponding to different clusters of documents as computed by Jigsaw, and then the color and ordering are shown within each cluster.
Calendar View – The Calendar View presents different documents and entities from the data set in the context of a familiar calendar view. In the detailed view mode, the view shows years, months, weeks and days. In the more coarse view (shown here), the view just shows months and years. The small diamond items drawn on a particular day/month represent documents (gray) or entities (color mapping) in the context of the date(s) noted in the document in which they appear. Documents or entities that are available to be shown in the Calendar are listed in the upper left. The default for an item is not to be visible. By clicking on the item’s name, the user can make it visible and add it to the calendar. The color of an item can be changed from its default entity type color to help differentiate different entities too. When the number of items associated with a day is too large to all be drawn in that region, a number is drawn indicating how many others appear on that day. As the user moves the mouse over that day, a larger rectangle pops up and shows all the items. When the user moves the mouse cursor over a document-representation diamond drawn in the calendar, all the entities appearing in that document are shown on the lower left.
Timeline View – The Timeline View shows documents in the context of a timeline representation. Each document is represented by a “tower” of segments; each segment (a thin horizontal slice) represents the entities within that document. When the viewer sweeps out a smaller region on a timeline with the mouse, that region is drawn above in more detail. This operation can be repeated multiple times to allow the viewer to see finer and finer context of a particular segment of time.
WordTree View – The WordTree View is adapted from the Word Tree visualization introduced by IBM researchers in the Many Eyes system. At the top, the viewer can enter a word or words that appear in the document collection. The view then shows the context of that word, that is, the view shows all the trailing words that follow the search term(s) anywhere in the collection. Size indicates frequency, so larger branches indicate more repeated text usage. The Jigsaw WordTree View allows the user to see all trailing expressions (so the view may need to scroll vertically) or the results can be compressed and filtered to all fit in the current view without scrolling. This view helps the investigator to understand the context of a particular word or set of words in the document collection.
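The data behind a word tree can be sketched as a count of the word sequences that follow the search term. The sketch below is an illustration only, not the Jigsaw or Many Eyes implementation. The function trailing_phrases, its depth parameter, and the sample sentences are hypothetical, and sentence boundaries are ignored for simplicity.

```python
import re
from collections import Counter


def trailing_phrases(texts, term, depth=3):
    """Count the word sequences (up to `depth` words) that follow `term`."""
    term_words = term.lower().split()
    n = len(term_words)
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        for i in range(len(words) - n + 1):
            if words[i:i + n] == term_words:
                following = words[i + n:i + n + depth]
                # Count every prefix so that a shared first word accumulates
                # weight, which is what makes a branch look larger in the tree.
                for k in range(1, len(following) + 1):
                    counts[tuple(following[:k])] += 1
    return counts


# Hypothetical sample texts.
texts = [
    "The suspect was seen in Atlanta.",
    "The suspect was arrested.",
    "A suspect fled north.",
]

branches = trailing_phrases(texts, "suspect", depth=2)
print(branches[("was",)])            # 2: the largest branch
print(branches[("was", "seen")])     # 1
print(branches[("fled", "north")])   # 1
```

Each key is one path from the root of the tree, and its count corresponds to the size at which that branch would be drawn.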
Scatterplot View – The Scatterplot View allows an analyst to place two different entity types on the two axes. Individual entities then can be filled in on the axes through search queries and interactive “show” commands. When a plotted entity on the x axis and a plotted entity on the y axis are connected (i.e., the two entities appear together in a document or documents), a diamond is drawn at the crossing of their respective horizontal and vertical positions to represent the document containing both. The user can also assign particular colors to the different documents so that s/he can more easily see the different entity-entity pairings in a document. When an axis becomes crowded from too many entities being drawn on it, the user can use the two range sliders to narrow in on a particular region of the axis.
Circular Graph View – The Circular Graph View plots different entities from the collection around the circumference of a circle. Different entity types are grouped in different regions of the circumference (indicated by color). By clicking on an entity name, the investigator selects it (shown in bold) and lines are drawn to all of the connected entities. Multiple entities can be selected via control-click of the mouse button.
Tablet – The Tablet is not really a document/entity view like the others above. Instead, it is a window within Jigsaw that provides some basic evidence marshalling support and functions as an electronic notebook or tablet where the analyst can take notes, develop hypotheses, and organize his/her thoughts. The investigator can add relevant entities and documents from other views to the Tablet. These added items then can be linked together via lines (e.g., to show a social network), can be connected to a timeline, or can have notes connected to them. Additionally, the state of different views in Jigsaw can be “bookmarked” and added to the Tablet. That state then can be recreated via this item. The Tablet also supports multiple tabs/windows to manage different parts of the investigation.
But wait, there’s more!!!
4 Automated Computational Analysis
Jigsaw provides a number of different automated computational analyses that can help you explore the document collection.
It provides four important capabilities: document summarization, document similarity, document clustering, and sentiment analysis.
To employ these analyses, you must first instruct Jigsaw to calculate them. To do this, choose the appropriate command(s) from the Tools menu in the Jigsaw control panel. If you want to employ these analyses, we strongly recommend that you calculate them immediately after importing your documents and performing entity identification.
The “Compute All” command from the Tools menu will perform all of these analyses and when it completes, they will all be available for use. By default it uses clusters of size 20. Alternatively, you can compute each of the analysis measures by itself. When you do this for the document clustering, for example, you are presented with more control options, i.e., how many clusters and whether the clustering is text- or entity-based. (Note that if not enough documents or entities are present, Jigsaw may create a smaller number of clusters than what was requested.) Whenever you subsequently save your analysis in a Jigsaw Project, all the analyses will be there for the next time you invoke Jigsaw.
Note that while the computational analyses are running, Jigsaw blocks and you cannot perform any other operations. The analyses can also take a significant amount of time. For a document collection of five thousand documents or for larger documents, the analyses may take hours. In a situation like this, we recommend that you start the analyses and then do something else in the interim, maybe even run the analyses overnight and return to the investigation the next day.
Below we describe each of the analyses and how Jigsaw presents it.
Document Summarization
Document summarization is integrated in different ways in Jigsaw. The Document View shows a word cloud (at the top) of selected documents loaded in the view. The word cloud helps you to quickly understand themes and concepts within the documents by presenting the most frequent words across the selected documents. Jigsaw removes frequent, simple words but does not combine words like “make”, “makes”, and “making” (stemming) in order to be able to highlight identified entities in the word cloud. The number of words shown can be adjusted interactively with the slider above the cloud. Additionally, the Document View provides a one sentence summary (most significant sentence) of the displayed document. This one sentence summary of a document is available in all other Jigsaw views as well. It can be displayed through a tooltip wherever a document is presented as a symbol or its name. The Document Cluster View also provides keyword summaries for the clusters.
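The word cloud computation described above amounts to counting words after dropping frequent, simple ones. The following sketch shows that idea under stated assumptions: the stop-word list is a tiny hypothetical sample (Jigsaw's actual list is not given here), and the names STOP_WORDS and cloud_terms are invented for this example. As in Jigsaw, no stemming is applied, so “makes” and “making” are counted separately.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it"}


def cloud_terms(texts, max_words=10):
    """Return up to `max_words` (word, count) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(max_words)


# Hypothetical selected documents.
texts = [
    "The bank makes a loan.",
    "Making a loan is risky.",
    "The bank makes money.",
]

for word, count in cloud_terms(texts, max_words=5):
    print(word, count)
```

The max_words argument plays the role of the slider above the cloud: raising it admits less frequent words.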
Document Similarity
In Jigsaw, document similarity can be measured relative to complete document text or just to the entities connected to a document. These different similarity measures are of particular interest for semi-structured document collections, such as publications, in which metadata-related entities (e.g. authors or conferences) are not mentioned in the actual document text. The Document Grid View can provide an overview of all the documents’ similarity (compared to a selected document) via the order and color of the documents in the grid representation. To do this, click on a document to select it, then open the right-click menu and choose the command to make it the basis for similarity. Then go to the upper right and base the order and/or the shading of documents in the grid on similarity. In all other views, the five most similar documents can be retrieved with a right mouse button command on a document representation. Note that we have found that the entity-based similarity computation sometimes crashes if some of the documents have a small number of (or no) entities.
Document Clustering by Theme or Topics
Jigsaw also can group similar documents together. Like the calculation for document similarity, document clustering also can be based on either the document text or on the entities connected to a document. Computed clusters are shown in the Document Cluster View or the Document Grid View. Within the Cluster View, there is a chooser for selecting which clustering is to be shown in the view. Each cluster is labeled by three words/terms that describe some of the main concepts within the cluster. Within the Grid View, select the option in the upper left to organize documents within the grid by cluster.
Document Sentiment Analysis
A document’s sentiment is its general tone or mood – is it positive and upbeat or is it negative and angry? Metrics about a document’s sentiment, subjectivity, and polarity can be displayed in the Document Grid View. Choose the appropriate metric from the menu selections in the upper right. One metric can be represented by the order of the documents, and a second metric (or the first metric again) can be encoded by the document color. To calculate the sentiment of a document, we use lists of “positive” and “negative” words and count the number of occurrences in each document. Jigsaw represents positive documents in blue (more positive is indicated by darker blue) and negative documents in red.
You can also use your own set of words to determine the positive or negative sentiment of a document. Within the Tools menu is a command to alter the sentiment dictionary. To use it, you simply create text (.txt) files with one word per line. The command then allows you to either replace Jigsaw’s own set of sentiment words with your own or to augment Jigsaw’s set of words with yours. Note that these two sets of words need not be related to sentiment at all. You might, for example, create one set of words related to baseball and one set related to football, and then the sentiment analysis view in the Document Grid can show whether a document is more baseball-oriented or football-oriented.
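The word-counting approach to sentiment can be sketched as below. This is a minimal illustration, not Jigsaw's code: the function names load_word_list and sentiment_score are hypothetical, and the normalization to a score between -1 and 1 is an assumption, since the manual only says that occurrences of listed words are counted. The loader reads the one-word-per-line .txt format described above.

```python
import re


def load_word_list(path):
    """Read a .txt file with one word per line into a set of lower-case words."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def sentiment_score(text, positive, negative):
    """Score in [-1, 1]: above 0 leans positive, below 0 leans negative.

    The normalization is an assumption made for this sketch.
    """
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Hypothetical in-memory word lists; load_word_list could supply these instead.
positive_words = {"good", "great", "upbeat"}
negative_words = {"bad", "angry"}

report = "A good meeting with a great outcome, despite one bad phone call."
print(sentiment_score(report, positive_words, negative_words))
```

Because the two lists are just sets of words, the same function works for the baseball versus football example: pass the baseball words as one list and the football words as the other.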
4.5 Gathering Evidence with the Tablet
Jigsaw provides a window called the Tablet that can help an investigator organize his or her thoughts, take notes, gather evidence, develop hypotheses, and so on. Below is a picture of the Tablet with some simple information inside.
An analyst can add entities and documents to the Tablet through right-click menu commands in the other system views. Simply right-click on an item and then choose the “Add to Tablet” operation. Entities are shown as small circles in their appropriate entity type color and documents are small rectangles. Notes (in pastel yellow) can be attached to entities and documents in the Tablet via a right-click menu command, or notes can simply be placed anywhere in the window by clicking and typing.
The analyst can manipulate objects in the Tablet via the usual Cut (ctrl-x or cmd-x), Copy (ctrl-c or cmd-c), Paste (ctrl-v or cmd-v), and Delete (delete key) commands.
The Connect command allows you to connect any two items with a line.
The Tablet also supports the creation of timelines (an example is shown toward the bottom here). To create one, select the Create Timeline operation at the top, then click down in the window to start one endpoint of the timeline, drag to a position for the other endpoint, and release the mouse button. Events can be explicitly added to the timeline (right mouse click on the timeline and choose the Add Event operation), or other items in the window can be connected to the timeline (just drag and drop the item onto the timeline).
Additionally, bookmarks of Jigsaw views can be added to the Tablet window. Here, you see a Document View bookmark to the lower left and a List View bookmark to the right. You can add a view bookmark by invoking the “Add as Bookmark to Tablet” command from the Bookmark menu in any view.
The final operation at the top, Add Page, allows you to add new pages/tabs to the Tablet for your analysis. Below is shown a Tablet with multiple pages, two of which are illustrated. The first shows how you can construct a social network-style diagram from your analysis and the second shows a timeline-focused analysis display.
The Tablet contains a command for exporting the current tab to a PNG image file.
5 Help/Comments
To read more about how Jigsaw works and to see a video demo, please refer to the web page http://www.cc.gatech.edu/gvu/ii/jigsaw. The web pages there, in particular the System Views page, tell more about the views. We recommend reading the 2008 Information Visualization and the 2013 IEEE Trans. on Visualization and Computer Graphics journal papers (available at the website above) about the system for further help and explanation of Jigsaw’s purpose and how it works. The overview, example scenario, and tutorial videos on the top Jigsaw web page also should be especially useful in understanding how the system and views work (although the overview video is a bit dated now). The Tutorial Videos page on the Jigsaw website has many useful how-to videos about the system.
If you would like help using Jigsaw, please send email to stasko@cc.gatech.edu and CARSTEN.GOERG@ucdenver.edu. Also feel free to call John Stasko at (404) 894-5617 if an interactive dialog would be more helpful.
We would definitely like to hear comments and thoughts about the system. We are particularly interested to hear about the way that you are using the system and if it is beneficial to you. Please do let us know about this.