Text Passages Tutorial

The text passages search gives you a new way to find smoking guns in the millions of tobacco industry documents. It is a search engine with two twists:

It uses short text passages instead of complete documents.

When we search for several terms at once, we usually care only about places where they appear close together. For instance, if we’re looking for news reports that mention “nicotine” and “addiction” or “donald trump” and “tobacco,” we’re not interested in full documents that contain these terms somewhere but in short text passages where they appear side by side. Identifying these short sections is what the text passages search is for.

It identifies distinctive authors, terms, and topics.

In addition to returning the text passages that mention your search terms, the text passage search also analyzes the matching sections and generates a list of distinctive authors and terms as well as a topic model. These tools allow you to quickly narrow down your search or identify new topics worth exploring.

This document provides an overview of how the input parameters work, and how to use the various text analytics tools. If you're interested in a more applied introduction, have a look at this short case study.

Parameters

Let's have a look at the parameters of the text passages search. Here's what the interface looks like.

Search Terms

Enter search terms, or expressions of up to five words, in lowercase and without quotation marks. To search for multiple terms, separate them with commas.

Here are some examples of what you can search for:

  • cancer: text passages containing the term “cancer”
  • lung cancer: text passages containing the term “lung cancer”
  • lung, cancer: text passages containing the terms “lung” and “cancer”, not necessarily next to each other (e.g. “cancer of the lung”)

You can also use an asterisk for wildcard searches:

  • carcin*: wildcard search, returns text passages that contain a word that starts with “carcin” like “carcinoma,” “carcinogens,” “carcinogenic” and so forth
  • *genic: wildcard search, returns text passages that contain a word that ends with “genic” like “carcinogenic,” “mutagenic,” “transgenic” etc.
  • lung, carcin*: returns text passages that contain both “lung” and a term starting with “carcin”
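The matching rules above can be sketched in a few lines of Python. This is only an illustration of the behavior described in this section, not the code that Tobacco Analytics runs. The function names `parse_query`, `term_to_regex`, and `passage_matches` are hypothetical, and treating an asterisk as "any run of word characters" is an assumption.

```python
import re

def parse_query(query):
    """Split a comma-separated query into lowercase search terms."""
    return [term.strip().lower() for term in query.split(",") if term.strip()]

def term_to_regex(term):
    """Turn a term such as 'carcin*' into a whole-word regular expression."""
    # Escape the term, then let each asterisk stand for any run of word characters
    # (assumed wildcard semantics).
    pattern = re.escape(term).replace(r"\*", r"\w*")
    return re.compile(r"\b" + pattern + r"\b")

def passage_matches(passage, query):
    """Return True if the passage contains every term of the query."""
    text = passage.lower()
    return all(term_to_regex(term).search(text) for term in parse_query(query))

# "lung, cancer" matches even when the terms are not adjacent.
print(passage_matches("cancer of the lung", "lung, cancer"))
# "lung cancer" is a single expression and needs the exact phrase.
print(passage_matches("cancer of the lung", "lung cancer"))
```

In this sketch a comma separates independent terms that must all be present, while a space keeps words together as one expression.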

Date Range

Range: 1901 to 2016.      Default: 1970 to 1990.

The text passages search lets you search documents dated from 1901 to 2016. The year 1900 is excluded because it used to be assigned to undated documents. Note that many of the documents in the industry's archives are misdated. Hence, it's always a good policy to open the PDF of the document and look around for dates: if a document from 1956 mentions the year 1978, it's probably misdated.

Passage Length

Range: 10 to 1000.     Default: 600.

The passage length parameter defines the length of the returned passages in characters. Each passage is centered on the first search term, so at the maximum length of 1000 it extends 500 characters before and 500 characters after that term. By default, passage length is set to 600.
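As a rough illustration of how a passage can be cut out around a search term, here is a minimal Python sketch. It assumes that the window is split evenly around the first match and clipped at the document boundaries. The function name `extract_passage` and these details are assumptions, not the actual implementation.

```python
def extract_passage(text, term, passage_length=600):
    """Return a window of up to passage_length characters centered on the first match."""
    position = text.find(term)
    if position == -1:
        return None  # the term does not appear in this document
    # Center the window on the middle of the matched term (assumed behavior).
    center = position + len(term) // 2
    start = max(0, center - passage_length // 2)
    end = min(len(text), start + passage_length)
    return text[start:end]

document = "x" * 1000 + "cancer" + "y" * 1000
passage = extract_passage(document, "cancer", passage_length=600)
print(len(passage))
```

Near the beginning or end of a document, the window in this sketch is simply cut short rather than shifted.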

Passages Per Year

Range: 1 to 2000.     Default: 100.

Passages per year defines the maximum number of passages to be returned for every year.

By default, the text passages search returns 100 sections per year. If you want more detailed results, it can often be useful to increase this number to 1000. Note, however, that this slows down processing and may require loading up to 100 MB of data.
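The per-year cap can be pictured with the following Python sketch. It assumes that passages arrive as (year, text) pairs and that the first passages encountered for a year are the ones kept. Both the function name `cap_passages_per_year` and the selection rule are assumptions made for illustration.

```python
from collections import defaultdict

def cap_passages_per_year(passages, max_per_year):
    """Keep at most max_per_year passages for every year."""
    kept = defaultdict(list)
    for year, text in passages:
        # Assumed rule: keep passages in the order they arrive until the cap is hit.
        if len(kept[year]) < max_per_year:
            kept[year].append(text)
    return dict(kept)

passages = [(1950, "a"), (1950, "b"), (1950, "c"), (1951, "d")]
capped = cap_passages_per_year(passages, max_per_year=2)
print({year: len(texts) for year, texts in capped.items()})
```

The per-year counts of the capped result are what a passages per year chart would display, which is why such a chart flattens out at the cap.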

Minimum Legibility

Range: 0.00 to 1.00.      Default: 0.85.

Minimum legibility defines how “legible” each passage should be. Operationally, it is the share of terms in a passage that must be valid English words.

The rationale for this parameter is that the tobacco documents are full of OCR errors and misspellings, which lead to text passages that are almost illegible like “al t ourcc a of cancer xt c y ir cludo.” You can use the Minimum Legibility parameter to avoid getting such passages. For example, the default value of 0.85 means that 85% of the terms in a passage need to be valid English.
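A legibility score of this kind can be sketched as follows. The tiny `vocabulary` set stands in for a real English word list, and both it and the function name `legibility` are hypothetical. How the actual search tokenizes passages and which dictionary it uses are not documented here.

```python
import re

def legibility(passage, vocabulary):
    """Return the share of tokens in the passage that appear in the vocabulary."""
    tokens = re.findall(r"[a-z]+", passage.lower())
    if not tokens:
        return 0.0
    valid = sum(1 for token in tokens if token in vocabulary)
    return valid / len(tokens)

# Hypothetical miniature word list; a real one would hold many thousands of words.
vocabulary = {"a", "of", "cancer", "the", "risk", "lung"}

garbled = "al t ourcc a of cancer xt c y ir cludo"
clean = "the risk of lung cancer"

for passage in (garbled, clean):
    score = legibility(passage, vocabulary)
    print(passage, round(score, 2), score >= 0.85)
```

With this word list, only 3 of the 11 tokens in the garbled passage count as valid, so it would fall well below the 0.85 threshold.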

Filters

Finally, you can use filters to select only specific collections, document types, or availabilities. Clicking on any of the filter types opens a drop down menu where you can make your selection.

For more information on filters, have a look at this tutorial.

There are three types of filters.

  • Collection filters allow you to only select documents from a certain collection, be it one of the tobacco companies or the trial documents held in the DATTA collection.
  • Document type filters limit the search to certain types of documents. Note: We aggregated individual document types into groups so you don't have to grapple with over 200 document types. However, if you want to select individual document types, click on one of the groups to open a drop-down that lists them.
  • The availability filters can be useful to search for formerly restricted documents by letting you select only documents that were previously confidential or under attorney-client privilege.

Usage

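Let’s say we’re looking for the term “cancer” from 1901 to 2016. When we execute the search, we will get five items: the passages per year chart, the text passages, the frequent authors, the distinctive terms, and the topic model.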

Clicking on any of the blue boxes will show or hide them.

Passages Per Year Chart

The passages per year chart shows how many sections were returned for each year.

Note that this chart does not indicate the number of times a search term appears across the tobacco documents, because the number of passages per year is capped. For example, in the image below, there were more than 1000 text passages mentioning “lung cancer” in the years after 1950, leading to a graph capped at 1000. To see how often a term actually appears, use the frequency charts.

Text Passages

For every text passage, the search returns:

  • Document Date
  • TID
  • Collection
  • Author
  • Title
  • Text

Clicking on the TID opens the document in the TTID.

By default, the documents are sorted by the document date. You can change that sorting by clicking on the small arrows next to the headings.

To look at more passages, you can load the next set by clicking on the next page at the bottom of the page.

Finally, you can search within the returned passages for those that contain a particular term.

Frequent Authors

The Frequent Authors window contains a list of the authors that appear the most often in the returned text passages.

Moving your mouse cursor over one of them indicates when they wrote those text passages. Clicking on any author loads only passages written by them.
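Counting and filtering by author can be sketched like this. The sketch assumes that each passage is a dictionary with an "author" field. That record layout, the placeholder author names, and the function names are hypothetical.

```python
from collections import Counter

def frequent_authors(passages, top_n=10):
    """Return the top_n authors with the most passages, most frequent first."""
    counts = Counter(p["author"] for p in passages if p.get("author"))
    return counts.most_common(top_n)

def passages_by_author(passages, author):
    """Keep only the passages written by the given author."""
    return [p for p in passages if p.get("author") == author]

passages = [
    {"author": "author a", "year": 1954, "text": "..."},
    {"author": "author b", "year": 1960, "text": "..."},
    {"author": "author a", "year": 1962, "text": "..."},
    {"author": "", "year": 1970, "text": "..."},
]
print(frequent_authors(passages, top_n=2))
print([p["year"] for p in passages_by_author(passages, "author a")])
```

Passages without a recorded author are skipped in the count in this sketch.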

Distinctive Terms

This window contains a list of terms that are distinctive for the text passages matching the search parameters. We calculate these words by comparing the selected text passages to the tobacco documents as a whole using Dunning’s Log-Likelihood ratio.

Note: For an overview of Dunning's Log-Likelihood ratio, see this blogpost by Ben Schmidt or have a look at the original paper by Ted Dunning.

Clicking on any of the terms limits the text passages to only those sections that contain the selected term.
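To give a feel for how a log-likelihood score separates distinctive terms from ordinary ones, here is a minimal Python sketch of the statistic for a single term, computed from a 2x2 table of counts. The function name, the sign convention, and the example counts are assumptions for illustration. The exact variant and preprocessing used by Tobacco Analytics may differ.

```python
import math

def log_likelihood(count_in_passages, total_in_passages, count_in_corpus, total_in_corpus):
    """Signed log-likelihood (G2) score of one term: passages vs. reference corpus."""
    observed = [
        [count_in_passages, total_in_passages - count_in_passages],
        [count_in_corpus, total_in_corpus - count_in_corpus],
    ]
    row_totals = [total_in_passages, total_in_corpus]
    col_totals = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    grand_total = total_in_passages + total_in_corpus

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if observed[i][j] > 0:  # a cell with zero count contributes nothing
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    g2 *= 2

    # Assumed convention: negative scores mark terms that are rarer in the passages.
    if count_in_passages / total_in_passages < count_in_corpus / total_in_corpus:
        g2 = -g2
    return g2

# Hypothetical counts: the term makes up 5% of the passages but 1% of the corpus.
print(log_likelihood(50, 1000, 100, 10000))
# Same relative frequency in both: the score is zero, so the term is not distinctive.
print(log_likelihood(10, 1000, 100, 10000))
```

Sorting all terms by this score in descending order would put the most distinctive ones at the top of the list.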

Topic Model

The job of a topic model is to identify clusters of words that often appear together in the documents. Each of these clusters is called a topic.

Technical Note: We use Non-negative Matrix Factorization (NMF) to create the topic model. NMF is a dimensionality reduction technique whose primary benefit for our purposes over probabilistic models like LDA is speed: it allows us to generate topic models with 100,000 passages in less than a minute.
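For readers curious about what NMF does, the following is a toy sketch in plain Python. It factors a small passage-by-term count matrix into passage-topic weights and topic-term weights using the standard multiplicative update rules. It is not the production code: the matrix, the term list, and the function names are made up, and a real system would use an optimized library and some form of term weighting.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0):
    """Factor v (passages x terms) into w (passages x topics) and h (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    eps = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        # Update h: h * (w^T v) / (w^T w h), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # Update w: w * (v h^T) / (w h h^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

# Made-up term counts for four passages.
terms = ["mice", "tar", "skin", "ets", "nonsmokers", "exposure"]
v = [
    [2, 1, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 2],
    [0, 0, 0, 1, 2, 1],
]
w, h = nmf(v, n_topics=2)
for k, row in enumerate(h):
    top = sorted(zip(terms, row), key=lambda pair: pair[1], reverse=True)[:3]
    print("topic", k + 1, [term for term, _ in top])
```

Each row of `h` holds the term weights of one topic, and each row of `w` says how strongly a passage scores on each topic.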

While a topic does not have a title, it's usually easy to figure out what the topic is about. For example, the terms "mice," "tar," "animals," and "skin" suggest that topic 1 is about skin painting experiments with animals. Similarly, topic 2 is probably about second hand smoke, also called environmental tobacco smoke (ETS). However, you might also encounter topics that do not seem to make sense until you start looking at the documents that make up the topic. This happens particularly often if the same expression or sentence is repeated across hundreds of documents.

Note: The weight of a term is only interpretable in relationship to other terms. For example, in topic 1, the term "smoke" with weight 2.35 has twice as much influence in the topic as "skin" with weight 1.16. The weights are useful to identify topics that are dominated by a single term. For example, in topic 5, "lung" has ten times more weight than the other four terms combined, which means that it's really a topic about "lung" and not about the other terms in the topic.
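The way term weights are read relative to each other can be made concrete with a short Python sketch. The weights for "smoke" and "skin" come from the example in the text. All other weights, the 0.5 threshold, and the function name `dominant_term` are made up for illustration.

```python
def dominant_term(topic, threshold=0.5):
    """Return (term, share) if one term holds more than threshold of the total weight."""
    total = sum(topic.values())
    term, weight = max(topic.items(), key=lambda item: item[1])
    share = weight / total
    return (term, share) if share > threshold else None

# "smoke" and "skin" use the weights quoted in the text; the rest are hypothetical.
topic_1 = {"smoke": 2.35, "mice": 1.2, "skin": 1.16, "tar": 1.1, "animals": 1.0}
# Hypothetical weights in which "lung" outweighs the other four terms ten to one.
topic_5 = {"lung": 10.0, "cancer": 0.3, "disease": 0.25, "risk": 0.25, "smokers": 0.2}

# Relative influence of two terms within the same topic.
print(round(topic_1["smoke"] / topic_1["skin"], 2))

print(dominant_term(topic_1))
print(dominant_term(topic_5))
```

A topic for which `dominant_term` returns a result is effectively about that single term, while a balanced topic returns nothing.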

The topic model provides a succinct summary of all the text passages. Its main purpose, however, is to give you an additional way of filtering the results. Clicking on any topic loads the passages that score the highest on that topic. Hence, clicking on topic 1 gives us passages that discuss animal experiments.

Tobacco Analytics generates twenty topics for every query. However, only five are displayed at a time. To look at the other ones, use the left and right arrows next to the chart.

This should cover all of the major functions of the text passages search. However, if you have further questions or comments, feel free to contact me at risi@stanford.edu.