1. overview and purpose
WebJigsaw is a visual analysis system that helps people search, explore, analyze, and understand collections of text documents. WebJigsaw presents several visualizations of the documents and the entities they contain, with a special focus on representing connections between entities (entities that appear together in a document).
This web version of Jigsaw is designed to work best with collections of many documents that are relatively short. By many documents, we mean collections of up to about 2,500-5,000 documents, each ideally about 1-6 paragraphs long, or about one or two pages. Most important here is the number of named entities per document. This number should probably be less than about 50-75 entities for WebJigsaw to be most helpful.
WebJigsaw is not intended to analyze a small number of extremely large documents such as books or scientific papers. This type of document should be divided into smaller units such as sections, subsections, or pages, and then each of these units becomes a separate document.
Since WebJigsaw offers many different visualizations of documents and entities, you should ideally have a good amount of screen real estate to display the views in your browser. You can still run the system on a smaller monitor, but you may be limited in the number of views that you can easily manipulate.
We have tried to keep this tutorial relatively short so that you can easily read and browse it, while at the same time providing the most important information necessary for effective use of the system.
2. first steps with WebJigsaw
This section will help you quickly familiarize yourself with WebJigsaw operation.
2.1 System requirements
We have thoroughly tested WebJigsaw on Google Chrome v55. It should work on all modern browsers, but the recommended browsers are Chrome and Firefox.
2.2 Initiating a session
Go to the WebJigsaw URL in your browser. The tab for uploading documents to the server should appear, as shown below.
2.3 Reading several documents
WebJigsaw can read (and save) documents in various formats. It can read original documents such as text and CSV. We have also created an XML-based Jigsaw datafile format that it can read. In addition, there are some specific proprietary document formats that WebJigsaw can read.
To import a source document that has not yet been edited, click the “Select Files” button. The Browser Import dialog box for selecting the documents to be imported opens. Alternatively, you can also drag the files to the “Store Files” area.
The main tab here is File Import. It allows reading plaintext (.txt), comma separated values (.csv), jigsaw datafiles (.jig). You can read multiple files in “txt” format, either all at once or by selecting them separately. Alternatively, you can upload a zip file (.zip), i.e. a compressed collection of “.txt” files. We did this specifically because uploading a single zipped file to the server is generally faster than uploading multiple txt files. For csv, a special mapping process starts, allowing you to specify what each column in the file means (more on this later in this document). We hope to soon have the ability to read pages and websites from webcrawls, web search pages, and bibliographic style pages.
The files you import can be plain ASCII text or Unicode. Since WebJigsaw can now read Unicode, texts from international (non-English) languages can also be processed in WebJigsaw.
We have created a simple proprietary XML file format for Jigsaw. If you have a specific type of data that you want to analyze in WebJigsaw, one way is to translate it to WebJigsaw’s datafile format first. We have included some sample datafiles in the distribution, such as the 2007 VAST Symposium Contest documents, all paper abstracts from InfoVis and VAST papers, a sample of PubMed papers on breast cancer, and the Bible. You can read more about this in the next section and in the appendix.
When you import a document or set of documents, you can also perform entity tagging of the documents if you want. This is done on the Entity Analysis tab after clicking the Next button. You can simply leave the default selection or select the options according to your needs. (If you have many files and they are relatively large, the identification of entities can be time consuming, so be patient.) More information about entity identification can be found in Section 4.
In addition, you can also run computational analyses on the documents if you wish. WebJigsaw provides document summary, document similarity, document clustering and sentiment analysis by default. The available analyses are on the Calculations tab, which appears when you click the Next button after finishing entity identification. To learn more about these calculations, see Section 5.4.
When WebJigsaw imports a set of documents, it builds an analysis database for those documents on the server. This is done so that WebJigsaw can scale up to large document collections. Note, however, that uploading the documents and creating this database can be time-consuming for larger collections and may take several minutes. When the analysis is complete, this analysis database is referred to as the WebJigsaw project.
2.4 Displaying views
To start the analysis, you should probably start with a set of views. Once the preprocessing is complete, you will be directed to the Visualization tab. Here you will find a menu with the different views. You can choose which view you want. Note that you can create multiple instances of any view type. We recommend that you have at least one document view open at all times.
2.5 Start of Analysis and Exploration
To begin exploration, you can perform a query, make a selection, or execute a command in a view. When you type a search term in the search box, WebJigsaw searches for that text and displays a document view that contains the documents in which the text occurs. The Documents search mode is useful if you want to search for a simple word (e.g., dog, car) that is not necessarily an entity. In this mode WebJigsaw behaves like a simple search engine, displaying the documents that contain the search term.
2.6 Saving a session
You can save an already running analysis session by saving it as a jig file. This can be done with the commands available in the right menu below.
3. import and save documents
3.1 Import documents
WebJigsaw can import several types of text files. Currently it can read plain ASCII or Unicode text (.txt) and comma-separated values (.csv). Simple ASCII or Unicode text files are the most reliable way to import documents, so we recommend that you use text files, or convert your documents to text files, whenever possible.
WebJigsaw treats the entire textual content of a source document as the document. Generally, any text within the file is considered to be the body of the document. However, there are two exceptions. If WebJigsaw finds the string Date: or Source: followed by other text on a line within the first five lines of a file, it interprets this as a special metadata line and uses the text that follows as the corresponding special field for the document (the document date, or the document source, <DocSource>).
To read multiple files at once, simply select multiple files in the File Selection dialog box using the Shift or Control Selection for your specific browser and operating system.
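The Date: / Source: rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this tutorial, not WebJigsaw's actual importer: the function name split_metadata is hypothetical, and removing the metadata lines from the body is an assumption.

```python
def split_metadata(text, max_header_lines=5):
    """Separate Date:/Source: metadata lines from the document body.

    Only the first `max_header_lines` lines are inspected, mirroring the
    "top five lines" rule. Dropping matched lines from the body is an
    assumption made for this sketch.
    """
    metadata = {}
    body = []
    for index, line in enumerate(text.splitlines()):
        stripped = line.strip()
        matched = False
        if index < max_header_lines:
            for key in ("Date", "Source"):
                prefix = key + ":"
                value = stripped[len(prefix):].strip()
                # The marker must be followed by some other text.
                if stripped.startswith(prefix) and value:
                    metadata[key.lower()] = value
                    matched = True
                    break
        if not matched:
            body.append(line)
    return metadata, "\n".join(body)
```

A Date: line that appears after the fifth line is left in the body as ordinary text.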
Import CSV files
WebJigsaw can also import .csv files. It is easy to generate .csv files from your .xls or .xlsx files.
Since the primary unit of analysis in WebJigsaw is a document, you may wonder how such files are handled. In general, WebJigsaw treats each row of a sheet as a separate document. The columns can hold attributes of the document (row) such as its ID, date, or running text, or they can hold some kind of entity. It is your responsibility to map the columns to the relevant attributes. When you import one or more CSV files, the file contents are displayed so that you can define this mapping.
When you define a mapping, you will see options as described below. You can define the attribute specified in each column by selecting the pull-down menu above that column. The menu contains items for the document ID, date, text, and common entity types such as person, location, and organization. This menu also allows you to create a new type of entity to be specified in a column. In the upper part of the dialog box, you can specify the row in which the actual data starts, ignoring some headers.
Some important points concerning spreadsheets:
- WebJigsaw can only read CSV (.csv) files, not Excel (.xls or .xlsx) files. We recommend that you convert Excel files to .csv and import those instead.
- When you create a new entity type in a column, the name of that entity type must contain only letters and numbers and must begin with a letter. Other characters are not allowed.
- If some of your cells are empty, the results may be unpredictable. Most of the time we believe that they will simply be skipped and it will “work correctly”, but to ensure success, try to have content for all cells.
- If possible, specify both the document ID and the document text attributes. Even selecting a simple text column as the document text is helpful. You can also create a new column in your spreadsheet that concatenates several other columns.
- If WebJigsaw finds duplicate document IDs in a sheet to be read, the last one will be used, the previous ones ignored.
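To make the row-to-document mapping concrete, here is a minimal Python sketch under stated assumptions. The function rows_to_documents and the role names "id", "date" and "text" are hypothetical, chosen for illustration. The sketch follows the points above: each row becomes one document, new entity type names must start with a letter and contain only letters and numbers, empty cells are skipped, and a later row with a duplicate ID replaces an earlier one.

```python
import csv
import io
import re

# Entity type names: letters and numbers only, starting with a letter.
ENTITY_TYPE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")
RESERVED_ROLES = ("id", "date", "text")


def rows_to_documents(csv_text, column_roles, first_data_row=1):
    """Turn each CSV row into one document, using a per-column role list.

    `column_roles` holds "id", "date", "text", an entity type name, or
    None (ignore the column). `first_data_row` is 1-based, so 2 skips a
    single header row.
    """
    for role in column_roles:
        if role is None or role in RESERVED_ROLES:
            continue
        if not ENTITY_TYPE_NAME.fullmatch(role):
            raise ValueError(f"invalid entity type name: {role!r}")

    documents = {}
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if row_number < first_data_row:
            continue
        doc = {"id": None, "date": None, "text": "", "entities": {}}
        for role, cell in zip(column_roles, row):
            cell = cell.strip()
            if role is None or not cell:
                continue  # unmapped column or empty cell: skip it
            if role in RESERVED_ROLES:
                doc[role] = cell
            else:
                doc["entities"].setdefault(role, []).append(cell)
        if doc["id"] is None:
            doc["id"] = f"row{row_number}"  # fallback ID (assumption)
        documents[doc["id"]] = doc  # a duplicate ID overwrites the earlier row
    return documents
```

For example, calling rows_to_documents(text, ["id", "person", "text"], first_data_row=2) reads a three-column file with one header row.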
Jigsaw datafiles
We have developed a proprietary XML-based file format for storing collections of documents. In addition to the text content of a document, this format can contain meta information about the document, such as an ID and a date, and it can contain a list of identified entities for each document. We call these proprietary files ‘Jigsaw Datafiles’ (.jig). We have provided a number of examples for you to view on the website.
If you have your own data in an XML format, in a database, or in another format, it is not too difficult to translate it into WebJigsaw’s datafile format. Read the appendix to this tutorial for more information and instructions on how to handle your own data. Trust us, it’s really not that bad. We have done this ourselves to convert other XML files to WebJigsaw format, and to scrape web pages and create ‘Jigsaw Datafiles’ from them. Remember, however, that this is XML, so you can’t have raw characters like &, %, <, or > in your text. See the appendix for more information.
The first line of a Jigsaw datafile can be a file type specification (e.g. Unicode UTF-8). WebJigsaw will read this specification and interpret the file correctly.
Note that if you create your own jigsaw data file and try to import it and the process fails or hangs, you are likely to have a syntax error in the file, such as an illegal character, a missing bracket, a mismatching open/close tag, etc.
As another option, if you have your own specific file format and are not sure how to insert it into WebJigsaw, please contact us and we may be able to write an importer for that file format or a translator from it into the WebJigsaw datafile format.
If you have imported documents from text files, spreadsheets, etc. and want to see them in Jigsaw Datafile format, you can use the Export command to output the current project as a Jigsaw Datafile.
3.2 WebJigsaw projects and workspaces
When a set of documents has been successfully read and, optionally, entity identification has been performed, the resulting set of information is called a project. A WebJigsaw project encapsulates a set of documents that have been read into WebJigsaw, along with any entity identification that has been applied to them. You can save a project as a .jig file and reopen it later in WebJigsaw.
When WebJigsaw imports a set of documents, it builds an analysis database for those documents on the server. This is done to allow WebJigsaw to scale to larger document collections. Note, however, that building this database the first time you import a set of documents can be time-consuming and may take several minutes. Subsequent analysis sessions will start much faster because the existing WebJigsaw project/database is simply loaded. These databases are kept on our server for a maximum of one day after the last activity.
4. identify and work with entities
The Preprocessing page contains an Entity Analysis tab that provides operations for the various entity processes described below.
4.1 Entity detection
When importing text files or spreadsheets, you can choose whether the system should automatically identify entities. Currently, WebJigsaw offers three main mechanisms for identifying entities in documents. First, it includes third-party software libraries to perform automated (statistical) entity discovery. Second, it includes the ability to perform some basic pattern matching of text to identify entity types such as dates, phone numbers, postal codes, email addresses, URLs, and IP addresses. Third, you can specify an entity type (name) and a list of values for that entity type. We describe each of these options in more detail below.
For automated entity detection, WebJigsaw can use one of three packages: Polyglot, Stanford NER and spaCy. All three are included in the distribution, so the entity identification process is performed on the server. All packages have strengths and weaknesses, so we recommend that you try each one to see which works best for your documents. We generally use Polyglot or spaCy and have found them to be quite fast.
WebJigsaw also includes features that can help you identify certain types of strings such as dates, phone numbers, postal codes, email addresses, URLs, and IP addresses in the text of documents. This code performs basic regular-expression matching, so it’s not perfect. For example, any 5-digit number is identified as a postal code; we don’t validate it against the actual list of US postal codes.
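The following Python sketch shows what this kind of basic pattern matching can look like. The regular expressions are illustrative assumptions, not the patterns WebJigsaw actually uses, and they share the limitation described above: any 5-digit number counts as a postal code.

```python
import re

# Illustrative patterns only; deliberately simple and therefore imperfect.
PATTERNS = {
    "zipcode": re.compile(r"\b\d{5}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url": re.compile(r"https?://[^\s<>\"]+"),
}


def find_pattern_entities(text):
    """Return every match of each pattern, keyed by entity type name."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
```

Because nothing is validated against real postal codes, a string such as "order 12345" also yields a zipcode match.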
Finally, you can use WebJigsaw to create a new entity type and specify any valid strings that are the instances of that entity. For example, you can create a new entity type, Auto, and specify a range of possible values, such as Ford, Chevrolet, Honda, Hyundai, and so on. To do this, you must create a text file (.txt) that contains each possible entity value on a different line of the file. (Note that an entity value doesn’t have to be just one word, it can have multiple words.)
To then add this new entity type to WebJigsaw, use the bottom area of the Entity Identification tab. Simply type the name of the entity type on the left, and then search for the text file that contains the list of entity values. Note that entity type names (such as “Auto” in the example above) are case-sensitive, can contain only letters and numbers, and must begin with a letter.
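As a rough sketch of how such a value list could be applied, the Python below checks a type name against the naming rule, reads one entity value per line, and reports which values occur in a document. All function names are hypothetical, and case-sensitive whole-word matching is an assumption; the tutorial does not say how WebJigsaw matches the values.

```python
import re


def is_valid_type_name(name):
    """Letters and numbers only, beginning with a letter."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", name) is not None


def load_entity_values(lines):
    """One entity value per line; blank lines are ignored."""
    return [line.strip() for line in lines if line.strip()]


def find_listed_entities(text, values):
    """Return the listed values that occur in `text` as whole words.

    Multi-word values such as "Land Rover" are supported. Longer values
    are checked first, so results are ordered by decreasing length.
    """
    found = []
    for value in sorted(values, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(value) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(value)
    return found
```

With a list file opened as f, load_entity_values(f) gives the values to pass to find_listed_entities.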
The entity identification can be performed at the beginning of an examination after the first import of the documents.
4.2 Correction of incorrect entity identification
The process of automated entity detection is not perfect. Many false positives (strings identified as entities that really are not entities) and false negatives (valid entities that are missed entirely) can occur, especially in documents with many spelling mistakes introduced by processes like OCR.
WebJigsaw provides the ability to correct incorrect entity labels when you are on the Visualization tab. In the Document View, you can double-click an entity and use the menu at the top to change its type. You can also drag the mouse over words in a document to select them, and then use the menu to add the words as an entity. You can select one of the existing entity types or create a new entity type.
The list view also provides a Delete command in its right-click menu, which allows you to correct incorrect entity labels by removing an entity or entities. You can select multiple entities with Shift-click or Control-click to remove them all at once.
New entity types (names) are case-sensitive and must not contain spaces or other special characters. The entity type may only contain letters and numbers and must begin with a letter.
4.3 Aliasing of entities
WebJigsaw also allows you to create aliases for entities. Suppose a person’s name is written in three different ways in a document collection, but you know they all refer to the same person. Or suppose a person is using an alias, i.e. there is another name they go by. WebJigsaw allows you to alias entities to deal with either of these situations. Entity aliases are defined interactively via the list views.
To create an alias interactively, select two or more objects in the list view and right-click to open a menu containing the Make Aliases command. Select this, and the system asks which of the entity names should be the main name to use for that alias. Once you have done this, all other child objects will be removed from the views and only this main name will be used. This “winning” entity name is underlined to indicate that it has aliases. When you move the mouse pointer over such an object, a pop-up view appears with the other aliases.
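Conceptually, aliasing folds several entity names into one main name. The Python sketch below illustrates that idea on a simple mapping from entity name to the set of document IDs in which it appears. The data structure and the function make_alias are assumptions for illustration; WebJigsaw's internal representation is not described in this tutorial.

```python
def make_alias(entity_docs, names, main_name):
    """Merge the entities in `names` under `main_name`.

    `entity_docs` maps an entity name to the set of document IDs that
    contain it. Returns a new mapping plus the sorted list of names that
    became aliases; the input mapping is left unchanged.
    """
    if main_name not in names:
        raise ValueError("main_name must be one of the aliased names")
    merged = dict(entity_docs)
    combined = set()
    for name in names:
        combined |= merged.pop(name, set())
    merged[main_name] = combined
    aliases = sorted(name for name in names if name != main_name)
    return merged, aliases
```

After the merge only the main name remains in the mapping, and the returned alias list corresponds to what a pop-up would show for that entity.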
5. exploration and analysis of the document collection
Once you have imported a document collection, you are ready to explore, examine and analyze the documents and their entities. Usually you will want to create a set of different views to display the documents and entities. Note that you can have any number of views of any of the existing view types.
5.1 General information
- Views show connections between entities and documents and between entities. A document and an entity are connected when the entity appears in the document. Two entities are considered connected if they appear together in at least one document. The more documents in which they occur together, the greater the connection strength.
- A simple mouse click on an element (document or entity) selects this element. All other visible elements then update their appearance to show how they relate to the selected element. User mouse actions such as selections and expansions are also communicated to the other active views, which update their display accordingly.
- You can turn event listening on or off in any view by clicking on the small satellite dish in the upper right corner. Turning off listening essentially freezes the view, i.e. user actions such as clicks and double clicks in other views have no effect on this view. This function is very useful for locking a view in an interesting state. Note that frozen views are also not affected by the Clear All Views command in the Views menu.
- To examine a document or set of documents containing an entity in an empty new document view, right-click the item and use the Show in new document view command.
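The connection model described in the first point above can be sketched as a co-occurrence count. In this Python illustration the connection strength of two entities is simply the number of documents that contain both; using the raw count is an assumption, since the tutorial only says that strength grows with the number of shared documents.

```python
from collections import Counter
from itertools import combinations


def connection_strengths(doc_entities):
    """Count, for each pair of entities, the documents containing both.

    `doc_entities` maps a document ID to the entities found in it. Pairs
    are stored as alphabetically ordered tuples.
    """
    strengths = Counter()
    for entities in doc_entities.values():
        for first, second in combinations(sorted(set(entities)), 2):
            strengths[(first, second)] += 1
    return strengths
```

Pairs that never share a document have strength 0, which is what a Counter returns for a missing key.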
5.2 Search tips
In document mode, which is invoked by checking the Documents checkbox, WebJigsaw simply retrieves documents that contain words from the search query anywhere in the document text.
5.3 View-specific application tips
The following sections briefly describe some of the utilities, commands, and features of the various views in WebJigsaw.
Note that each view has its own menus that provide useful functions for that view. For example, some of the views have filter functions that allow you to restrict the display. All views offer Change Title, Minimize / Maximize, Open in new Tab and the event-listening toggle.
Document view
The document view is the core view in WebJigsaw for reading document content. The list at the bottom left contains the documents loaded into this view. All documents are placed there by default. A document view can also be populated in response to search queries in the control panel, by show commands from other views, or by expand commands from other views. In addition, the Add All button at the bottom left brings all documents in the collection into the view. Be careful when using this command with extremely large document collections.
Click any document name to select it and display its text in the focus area on the right. The number by the document ID indicates how often a document was viewed. All documents listed in this view participate in the word cloud at the top, which indicates the keywords used in this set of documents.
In the region above the actual document content is the “Document Summary”, which is a phrase from the document WebJigsaw selected to illustrate what the document is about. This can be useful for quickly evaluating multiple long documents.
Within the document focus area, the text of the document is displayed at the top, with all connected entities that do not appear in the document text listed below it. Entities are colored in a pastel shade of their default color. When you click an entity, it is selected. You can perform manual entity identification by dragging the mouse over a word or words to select them and then adding the selection as a new entity using the menu. You can also right-click an existing entity to access commands to remove it as an entity, change its entity type, or open a new document view that contains only the documents that contain the entity.
Larger documents load more slowly in the document view.
List view
We find that the list view is the most powerful and useful view in WebJigsaw. It makes it very easy to search, select, filter and investigate all entities and documents in the collection being analyzed.
The view starts with three columns, but you can add or remove lists (columns) using commands from the Lists menu in the view, so that you can fill a large view with any number of lists. The view scrolls horizontally if there is not enough space.
Each column contains elements of a certain type; the type can be changed via the menu at the top of each list. The same entity type can also be shown in different columns. Be careful, however, with very large document collections that have a great many entities of a certain type. This can lead to a very long scrolling list.
The bar to the left of an entity is a frequency counter across the entire document collection. If you move the mouse pointer over this small bar, you will find the exact number of documents in the collection in which this entity appears.
The buttons and menus above a column control how that particular list is displayed. The first three buttons sort the list in different ways:
2) according to frequency of occurrence in the entire collection or
3) according to the connection strength to the selected elements.
Other buttons control the alignment of elements and allow you to delete a list.
When you click on an entity, it is selected; Shift-click and Control-click allow you to select multiple entities. Selected entities are displayed in yellow. The elements connected to the selected elements are displayed in orange, with darker shades indicating stronger connections. Unconnected entities are displayed with a white background. When multiple entities are selected, the fourth button above the list controls whether connected entities are determined by or’ing or by and’ing the selected entities. In “And” mode, for example, connected entities (the orange ones) must appear together with all of the selected entities in some document.
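The difference between the two modes can be illustrated with a short Python sketch. The function connected_entities and the document-to-entities mapping are assumptions for illustration: in "or" mode a document counts if it contains any selected entity, while in "and" mode it must contain all of them.

```python
def connected_entities(doc_entities, selected, mode="or"):
    """Return the entities connected to the selection.

    `doc_entities` maps a document ID to the entities found in it.
    mode="or":  documents containing at least one selected entity count.
    mode="and": only documents containing every selected entity count.
    """
    selected = set(selected)
    if not selected:
        return set()
    connected = set()
    for entities in doc_entities.values():
        present = set(entities)
        if mode == "and":
            hit = selected <= present
        else:
            hit = bool(selected & present)
        if hit:
            connected |= present - selected
    return connected
```

With a larger selection, "and" mode usually returns far fewer connected entities than "or" mode.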
A right click on one or more selected entities provides a menu with a number of useful functions such as Display, Alias and Delete.
Dictionary tree View
This view is a version of the WordTree visualization introduced by IBM on its Many Eyes visualization site and in an IEEE InfoVis 2008 paper. Here the WordTree applies to all documents in the collection. This view helps you understand the context of different words in the collection.
When you enter a term in the upper text input area, the system displays all subsequent words/contexts that follow it in a document. You can restrict the view to compress the entire string.
Document grid view
This view displays a sorted and shaded list of all documents in the collection, where the order and shading can communicate different metrics about the documents. The view starts empty, but documents can be added via show operations in other views, via search queries, or via the Add All button in this view. Each document is displayed as a small rectangle within the view. The documents are ordered row by row from top left to bottom right. You can use different metrics to control this order and the shading of each document’s rectangle. When you mouse over a document rectangle, its document ID and the value of the metric that controls the sort order are displayed. Currently, the following metrics are available: the size of a document, the number of entities in a document, the document date, the sentiment of the document, and the similarity of the documents to a selected document. If you select the checkbox at the top left, you can organize the documents by clusters (if calculated) and then sort and shade accordingly within these clusters.
The document grid view has a menu command at the top to write all the documents in the view, in the order they appear and with the metric value for each document, to a file.
Document cluster view
This view can be accessed from the Grid View tab. It provides a quick overview of the entire document collection by displaying each document in the collection as a small rectangular icon in the window. If you have performed a document clustering calculation, the documents will be arranged in clusters.
5.4 Automated computational analysis
WebJigsaw offers a range of automated computational analyses that can help you explore the document collection. It offers four important features: Document Summary, Document Similarity, Document Clustering and Sentiment Analysis.
To run them, select the appropriate commands on the Computational Analysis tab in the preprocessing phase. If you wish to use these analyses, we strongly recommend that you calculate them after entity identification. By default, clusters with a size that depends on the number of documents are used. Note that WebJigsaw is blocked while it performs computational analyses, so you cannot perform any other operations. The analyses can also take a considerable amount of time. With a collection of five thousand documents, or with larger documents, the analysis can take hours. In such a situation, we recommend that you start the analyses and then do something else in the meantime, perhaps even run the analyses overnight and return to the investigation the next day. Below we describe each of the analyses and how WebJigsaw presents them.
Document summarization
Document summarization is integrated in WebJigsaw in several ways. The document view shows a word cloud (at the top) of selected documents loaded in the view. The word cloud helps you to quickly understand topics and concepts within the documents by presenting the most common words in the selected documents. WebJigsaw removes common, simple words, but does not combine words such as “make”, “makes” and “making” (stemming) to highlight identified entities in the word cloud. The number of words displayed can be adjusted interactively with the slider above the cloud. In addition, the document view provides a one-sentence summary of the displayed document (its most important sentence). This one-sentence summary is also available in all other WebJigsaw views. It can be displayed via a tooltip wherever a document is displayed as a symbol or name. The Document Cluster View also provides keyword summaries for the clusters.
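The word-cloud counting described above can be sketched roughly as follows in Python. The stop-word list and the tokenizer are illustrative assumptions, not WebJigsaw's own. The sketch removes common words, performs no stemming (so "makes" and "making" stay separate), and uses max_words to play the role of the slider.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "was", "were", "it"}


def word_cloud_counts(documents, max_words=10):
    """Return the `max_words` most frequent non-stop words with their counts."""
    counts = Counter()
    for text in documents:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(max_words)
```

Lowering max_words corresponds to showing fewer words in the cloud.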
Document similarity
In WebJigsaw, document similarity can be measured for the entire document text or only for the entities associated with a document. These different similarity measures are particularly interesting for semi-structured document collections, such as publications in which metadata-related entities (e.g. authors or conferences) are not mentioned in the actual document text. The Document Grid View can give an overview of the similarity of all documents (in comparison to a selected document) through the order and color of the documents in the grid view. To do this, click on a document to select it, then go to the right menu and select the command to use it as the basis for similarity. Then, at the top right, make sure that the order and/or shading of documents in the grid is based on similarity. In all other views, the five most similar documents can be retrieved with a right-click command on a document representation. Note that we have noticed that the entity-based similarity calculation sometimes crashes when some of the documents have a small number of (or no) entities.
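The tutorial does not say which similarity measure WebJigsaw uses, so the Python sketch below substitutes a common choice, cosine similarity over count vectors, purely for illustration. The same function works for word counts (text-based similarity) and entity counts (entity-based similarity). A document with no features gets similarity 0.0 instead of causing a division by zero.

```python
import math


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity of two {feature: count} mappings."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document without features is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(doc_features, target_id, top_n=5):
    """Return the IDs of the `top_n` documents most similar to `target_id`."""
    target = doc_features[target_id]
    scored = [
        (cosine_similarity(target, features), doc_id)
        for doc_id, features in doc_features.items()
        if doc_id != target_id
    ]
    # Highest similarity first; ties are broken by document ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]
```

The default top_n=5 mirrors the "five most similar documents" command mentioned above.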
Document grouping according to subject areas
WebJigsaw can also cluster similar documents. Like the document similarity calculation, document clustering can be based either on document text or on the entities associated with a document. Calculated clusters are displayed in the Document Cluster View or the Document Grid View. Within the cluster view, you can select which clustering is displayed in the view. Each cluster is identified by three words/terms that describe some of the most important concepts within the cluster. In the grid view, select the top left option to organize documents by cluster within the grid.
Document sentiment / subjectivity analysis
The sentiment of a document is its general tone or mood: is it positive and optimistic, or is it negative and angry? Subjectivity is the simple classification of a sentence or clause as subjective or objective. Metrics for the sentiment, subjectivity, and polarity of a document can be displayed in the document grid view. Select the appropriate metric from the drop-down menus at the top right. One metric can be represented by the order of the documents, and a second metric (or the first metric again) can be encoded by the document color. To calculate the sentiment of a document, we use lists of “positive” and “negative” words and count the number of occurrences in each document. WebJigsaw displays positive documents in blue (more positive is indicated by darker blue) and negative documents in red.
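The word-list approach to sentiment can be sketched in Python as below. The two word lists are tiny made-up examples, and combining the counts as positive minus negative is an assumption; the tutorial only states that occurrences of listed words are counted.

```python
import re

# Illustrative word lists only; real lists contain thousands of entries.
POSITIVE_WORDS = {"good", "free", "safe", "success"}
NEGATIVE_WORDS = {"disease", "slaughter", "risk", "angry"}


def sentiment_score(text):
    """Count positive minus negative word occurrences in `text`.

    A score above zero suggests a positive document, below zero a
    negative one, and zero a neutral or undetermined one.
    """
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    return positive - negative
```

A score like this could then drive the blue and red shading, with larger magnitudes shown in darker shades.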
If you need help using WebJigsaw, please email firstname.lastname@example.org.
We welcome comments and thoughts about the system in any case. We are particularly interested in finding out how you use the system and whether it benefits you. Please let us know.
7. future work
The following enhancements are planned for the new release:
- Recording and reviewing the analysis history
WebJigsaw Datafile Format
Jigsaw datafiles (with the suffix .jig) are XML files that encapsulate a set of one or more documents. Currently, for each document, the file contains the document ID, its date, any other documents it refers to, the source of the document and the actual text content of the document, as well as any entities identified in the document.
A Jigsaw datafile contains an outermost <documents> tag that encloses several <document> elements. Each <document> should contain a <docId> and has an optional <docDate> and other reference fields. The plain-text source/content of the document should be in the <docText> field, and the identified entity values appear in tags such as <organization>. Note that you can also add other entity types in this section.
There are some rules for entity types, values and other text in project files. Entity types must not contain spaces. Entity values and the report description text must not contain the &, <, >, and % characters because they are illegal in xml content. To insert these characters in text areas, use the following abbreviations:
- & – &amp;
- > – &gt;
- < – &lt;
- % – &#37;
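If you generate datafiles with a script, the escaping can be automated. The Python sketch below is an illustration under stated assumptions: xml_safe and document_xml are hypothetical helper names, and placing the entity tags directly inside <document> after <docText> is an assumption about the layout. The standard library function xml.sax.saxutils.escape replaces &, < and > with &amp;, &lt; and &gt;; the percent sign is written as the numeric character reference &#37;.

```python
from xml.sax.saxutils import escape


def xml_safe(text):
    """Escape &, <, > and % so the text is safe inside a datafile."""
    return escape(text, {"%": "&#37;"})


def document_xml(doc_id, text, date=None, entities=()):
    """Build one <document> element as a string.

    `entities` is a sequence of (entity_type, value) pairs. Entity type
    names are used as tag names, so they must not contain spaces.
    """
    parts = ["  <document>", f"    <docId>{xml_safe(doc_id)}</docId>"]
    if date:
        parts.append(f"    <docDate>{xml_safe(date)}</docDate>")
    parts.append(f"    <docText>{xml_safe(text)}</docText>")
    for entity_type, value in entities:
        parts.append(f"    <{entity_type}>{xml_safe(value)}</{entity_type}>")
    parts.append("  </document>")
    return "\n".join(parts)
```

Wrapping the joined results of document_xml in a <documents> element would give a complete file body.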
An example of a Jigsaw datafile containing one document is shown below.
<docDate>Feb 18 2004</docDate>
In the first action of its kind this winter, 18 bison were captured outside Yellowstone National Park on Tuesday and were being tested for brucellosis. Those that have signs of the disease will be sent to slaughter and the rest will be marked and set free, according to Karen Cooper, a spokeswoman for the Montana Department of Livestock. The bison, a mix of calves, yearlings and adults, were hazed into a pen just before noon Tuesday near Horse Butte, west of Yellowstone. The bison were then loaded onto trailers and trucked to another holding pen to be tested for brucellosis. Cooper said some of the bison had been hazed back into the park on Jan. 28, Feb. 5 and Feb. 13. "These were some of the same animals. We could not get them back in the park so today it was a capture operation," Cooper said. Several agencies participated in the capture, including the Department of Livestock, Montana Fish, Wildlife and Parks, National Park Service and the U.S. Forest Service. Through a state and federal bison management plan, government agents haze and sometimes capture bison that leave Yellowstone. The plan is intended to reduce the risk that bison will transmit brucellosis to cattle in the area.
<place>Yellowstone National Park</place>
<organization>Department of Livestock</organization>
<organization>Montana Department of Livestock</organization>
<organization>National Park Service</organization>
<organization>U.S. Forest Service</organization>