Frequency Charts Tutorial
The frequency charts allow you to plot the history of terms and expressions across all of the tobacco documents. On the one hand, the frequency charts can be used to trace the development of key concepts over time. When, for example, did most smokers learn that smoking was addictive? With more than 14 million documents available, it is possible to find magazine articles from any given year that call smoking an addiction. The frequency charts make it possible to aggregate all of these individual documents and show that this transformation happened only in the late 1980s and early 1990s. If you are interested in learning more about this story, have a look at this case study. On the other hand, the charts can be used to trace the history of euphemisms, code words, secret research projects, and institutions.
This document gives an overview of the functionality that the frequency charts provide and how to use the
interface. In addition, it will cover some of the possible pitfalls when interpreting the charts.
Throughout, we'll follow the trajectory of the term "young adult" in the industry's documents. "Young adult" was a
euphemism that tobacco companies first started to use in the 1970s to hide their marketing to
For a longer history of this and other youth marketing euphemisms, have a look at "Think of the Children Young Adults!: Euphemisms in the Industry's Marketing to Minors."
The term provides us with an excellent example to demonstrate the possibilities of this charting tool.
Let's have a look at the parameters of the frequency charts. Here's what the interface looks like.
All search terms or expressions of up to five terms should be entered in lowercase and without quotation marks. To search for multiple terms, separate them with commas.
Here are some examples of what you can search for:
|young adults||plots the term "young adults"|
|teenagers, young adults||plots both the terms "young adults" and "teenagers"|
You can also use an asterisk for wildcard searches:
|young adult*||wildcard search, returns 2-grams that start with "young adult" including "young adult" and "young adults."|
|young adult *||Note the space between adult and *. This wildcard search returns 3-grams that start with "young adult" like "young adult smokers" or "young adult male."|
|* adult smokers||This wildcard search returns 3-grams that end with "adult smokers" like "young adult smokers" or "younger adult smokers."|
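The three wildcard forms above can be sketched in a few lines of Python. This is only an illustration of the matching rules as the table describes them, not the tool's actual implementation; the helper name `matches_wildcard` and the candidate list are made up for the example.

```python
def matches_wildcard(pattern: str, ngram: str) -> bool:
    """Check one n-gram against a search pattern (hypothetical helper)."""
    pattern_tokens = pattern.split()
    ngram_tokens = ngram.split()
    # Each pattern token stands for exactly one token, so the lengths must agree.
    if len(pattern_tokens) != len(ngram_tokens):
        return False
    for p, token in zip(pattern_tokens, ngram_tokens):
        if p == "*":
            continue  # a free-standing asterisk matches any single token
        if p.endswith("*"):
            # an attached asterisk matches any token with this prefix
            if not token.startswith(p[:-1]):
                return False
        elif p != token:
            return False
    return True


# Made-up candidate n-grams, taken from the examples in the table.
candidates = [
    "young adult",
    "young adults",
    "young adult smokers",
    "young adult male",
    "younger adult smokers",
]

two_grams = [g for g in candidates if matches_wildcard("young adult*", g)]
three_grams_start = [g for g in candidates if matches_wildcard("young adult *", g)]
three_grams_end = [g for g in candidates if matches_wildcard("* adult smokers", g)]

print(two_grams)
print(three_grams_start)
print(three_grams_end)
```

The point to notice is that the space before the asterisk changes the length of the n-grams that are returned: "young adult*" stays a 2-gram search, while "young adult *" becomes a 3-gram search.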
Finally, you can use filters to select only specific collections, document types, or availabilities. Clicking on any of the filter types opens a drop-down menu where you can make your selection.
There are four types of filters.
- Collection filters allow you to only select documents from a certain collection, be it one of the tobacco companies or the trial documents held in the DATTA collection.
- The document type filters enable similar limitations for the type of document that you're searching for. Note: We aggregated individual document types into groups so you don't have to grapple with over 200 document types. However, if you want to access individual document types, you can access them by clicking on one of the groups, which opens a drop-down that lists the individual document types.
- The availability filters can be useful to search for formerly restricted documents by letting you select only documents that were previously confidential or under attorney-client privilege.
- With the term filter, you can limit your search to only those 200-word passages that contain a specific term rather than searching across all documents. For example, by using "nicotine" as the term filter and "addictive" as the search term, we can get a sense of when these terms started to appear together. This is particularly useful when there are many ways of expressing the connection between the two terms. This search, for example, would capture expressions like "Nicotine is powerfully addictive." or "Experts question whether nicotine can be addictive."
The filters are documented in more detail in a separate tutorial.
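To make the term filter more concrete, here is a minimal sketch of the idea. It assumes that documents are cut into consecutive, non-overlapping 200-word passages and that tokens are simple lowercase words; how the actual tool splits and tokenizes passages is not documented here, and the function name `count_with_term_filter` is hypothetical.

```python
import re


def count_with_term_filter(text, search_term, filter_term, passage_length=200):
    """Count search_term only in passages that also contain filter_term.

    Assumption: passages are consecutive, non-overlapping word windows.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = 0
    for start in range(0, len(words), passage_length):
        passage = words[start:start + passage_length]
        if filter_term in passage:
            total += passage.count(search_term)
    return total


# Both terms occur in the same passage, so the mention is counted.
print(count_with_term_filter(
    "Experts question whether nicotine can be addictive.",
    "addictive", "nicotine"))

# No "nicotine" in the passage, so "addictive" is not counted.
print(count_with_term_filter("Smoking is addictive.", "addictive", "nicotine"))
```

Under these assumptions, a mention of "addictive" only counts when "nicotine" appears within the same 200-word window, regardless of how the sentence connecting the two terms is phrased.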
For every search that we execute, we get a chart. Since we're interested in the switch from "teenagers" to "young adults," let's start by searching for "teenager, teenagers, young adult, young adults".
Collections and Document Types
But wait, we first need to constrain our search to only the document types that we are interested in. We are interested in internal discussions about youth marketing. Hence, it makes sense to limit the analysis to the document type groups "internal communication" and "marketing documents." There is no reason, for example, why changing legal rules should lead to a decrease of "teenagers" in the news reports that tobacco companies compiled.
Furthermore, we are interested in the documents held in the archives of tobacco companies but not, for example, the E-Cigarette Marketing Web Archives. Hence, we'll limit this analysis to documents from American Tobacco, British American Tobacco, Brown & Williamson, Lorillard, Philip Morris, and RJ Reynolds.
This gives us the following chart:
The graph shows a clear break around 1970: Before 1970, "teenagers" appeared quite frequently in the internal communication and marketing documents of the major tobacco companies. After 1970, "young adults" and "young adult" (as in "young adult smokers") dominate.
If you move your mouse over the chart, you can see how often any of your search terms appear in any given year. This gives you a simple way to investigate if a spike is caused by a few dozen or a few thousand mentions of a term.
Let's now have a look at the different ways that we can display this data.
Display Lines: Stacked vs. Individually
By default, the display stacks the frequencies of the terms on top of each other to avoid displaying a labyrinth of criss-crossing lines.
However, this can make direct comparisons between two terms hard. For such comparisons, you can change the display mode to "Individually."
Frequencies, Counts, and Z-Scores
The frequency chart gives you three ways to display data: Relative frequencies, absolute counts, and z-scores.
By default, the term histories are displayed as relative frequencies. This means that the number of times a term appears in a given year is divided by the total number of terms that our selected corpus contains in that year. For example, in our selected corpus, "young adult" appeared 2,894 times in 1979. In total, our corpus contained 70,413,625 words in that year. This means that the relative frequency of "young adult" for 1979 is 2,894 / 70,413,625 or 0.00411%.
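The arithmetic for 1979 can be reproduced in a couple of lines. The two numbers come from the example in the text; the function name is just a label for this sketch.

```python
def relative_frequency(term_count, total_words):
    """Share of all words in a given year that are the search term."""
    return term_count / total_words


# Figures for "young adult" in 1979, as given in the text.
young_adult_1979 = 2_894
corpus_words_1979 = 70_413_625

frequency = relative_frequency(young_adult_1979, corpus_words_1979)
percent = frequency * 100
print(f"{percent:.5f}%")
```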
The advantage of this display mode is that it displays the data as if we had the same number of documents for every year. The tobacco documents are heavily slanted towards the 1980s and '90s. In fact, they contain more documents from the 1990s than from 1900 to 1989. Hence, displaying the history of a term in absolute numbers would make it appear as if the use of most terms increased in the 1980s and '90s.
However, in some cases, it can still be useful to see the absolute number of times a term appeared. To do so, you can click on "Counts."
Finally, you can compare the frequencies of a term in the tobacco documents to those in the Google Ngram Corpus. This display mode requires a slightly longer explanation. The basic idea is straightforward: You will often find all kinds of spikes in frequency charts. In some cases, they reflect actual changes in tobacco industry rhetoric. In other cases, in particular in years for which we have few documents, the spikes might be caused by a misdated document. And in yet other cases, a pattern change might be caused simply because a new term comes into general usage. But how can we tell these cases apart? For example, "young adults," one of the terms of our study, was only gradually adopted from the 1950s to the 1990s. So how do we know that the rising frequency of "young adults" in the tobacco documents isn't just a reflection of the general adoption of the term?
The z-score allows us to investigate these cases. It provides an answer to the question: Does a term appear unusually often or rarely in the tobacco documents compared to everyday language? The null hypothesis that we test against is the assumption that a term should appear with equal frequency in both the tobacco documents and everyday language. In many cases, of course, a term will appear more often or less often in the tobacco documents than in everyday language, and the z-score expresses this deviation as the number of standard deviations by which the term is over- or underrepresented in the tobacco documents.
Naturally, "everyday language" is hard to operationalize. For tobacco analytics, we use the Google Books corpus on which the Google Ngram Viewer is based. The major advantage of this corpus is that it gives us access to yearly data and enables us to compare the history of terms in the tobacco documents against the history of the same terms in the Google Ngram corpus. However, it is worth noting that the Google Ngram corpus is slanted towards scientific books which come with their own specialized vocabulary; i.e. it is a proxy for but not an actual representation of everyday language.
But enough with the theory, let's have a look at what happens to our graph when we display it by z-score.
There are a number of observations to be made.
First off, the lines of "teenager," "young adult," and "young adults" remain close to zero from 1940 to 1970. Values close to zero indicate one of two things: Either the frequency of the term in the tobacco documents is very similar to its frequency in the Google Ngram corpus, or the term is so rare that any spike that we might see in the frequency chart is not statistically significant. For example, five mentions of "young adults" cause a small spike in the frequency chart around 1950. The z-score graph shows, however, that this is a random fluke.
Secondly, the z-score graph shows that the rising frequency of "young adult" and "young adults" is not caused by the general adoption of these terms but rather that they are vastly overrepresented in our corpus. Namely, the frequency of "young adult" is up to 250 standard deviations from normal.
Third, the z-score of a term can be negative. This indicates that the term is underrepresented in the tobacco documents. In our case, this is most obvious for "teenager," which moves from close to zero in 1970 to minus 20 in 1995, indicating that the authors of the documents in our corpus avoided this term.
One final note: The usual cut-off point for statistical significance in a two-tailed test is a z-score of about 2, giving p < 0.05. However, we have found this to be too low because both the tobacco documents and the Google Ngram corpus contain tens or hundreds of millions of words per year. Hence, even small differences between these two datasets can reach a z-score of 2. We would expect reliable patterns to reach z-scores somewhere between 10 and 50.
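To see why corpus size matters so much for the cut-off, it helps to compute a z-score by hand. The sketch below uses a standard pooled two-proportion z-test, which matches the null hypothesis of equal frequencies described above. This is an assumption made for illustration: the exact formula behind the charts is not spelled out in this tutorial, and all counts below are invented.

```python
import math


def two_proportion_z(count_a, total_a, count_b, total_b):
    """Pooled two-proportion z-score (assumed formula, not the tool's own).

    a = tobacco documents, b = reference corpus, both for the same year.
    """
    p_a = count_a / total_a
    p_b = count_b / total_b
    pooled = (count_a + count_b) / (total_a + total_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if standard_error == 0:
        return 0.0
    return (p_a - p_b) / standard_error


# Invented counts: the term is 10% more frequent in corpus a in both cases.
small_corpora = two_proportion_z(11, 100_000, 10, 100_000)
large_corpora = two_proportion_z(11_000, 100_000_000, 10_000, 100_000_000)

print(f"100 thousand words each: z = {small_corpora:.2f}")
print(f"100 million words each:  z = {large_corpora:.2f}")
```

With these made-up numbers, the same 10% difference stays far below 2 for the small corpora but clears 2 comfortably for the large ones, which is why a higher threshold is more informative at this scale. A negative result would indicate that the term is underrepresented in the first corpus.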
Terms, Collections, and Document Types
By default, the frequency chart shows the history of each term. However, it is often useful to know if a given term appeared particularly often in one collection or document type. We might want to know, for example, if a euphemism was only used by one company or if an additive appeared particularly often in internal scientific reports.
Here, we'll look just at the term "young adult." In the archives of which companies and in what document types did it appear most often? We will include all document types in this analysis but only include the collections American Tobacco, British American Tobacco, Brown & Williamson, Lorillard, Philip Morris, and RJ Reynolds.
If we want to know in what collection "young adult" appeared most often, we can switch the display from "Terms" to "Collections."
This quite clearly shows that "young adult" (often meaning "young adult smokers") wasn't adopted across the industry, at least not at the same time. From 1970 to 1990, the term was used predominantly by RJ Reynolds employees, with a brief spike of interest from Brown & Williamson in the mid-1970s. After 1990, the distribution evened out with Philip Morris at the top. (You can find explanations for why these patterns appear in this essay on the industry's youth marketing efforts.)
Note: By default, the collections graph shows the frequency with which a search term appears in each collection. If you include small collections in your chart, this can cause misleading results. For instance, if we include the "Joe Camel Litigation" collection in the chart, it appears as if "young adult" appeared far more often in this collection than in any other one.
However, that spike appears because the Joe Camel collection contains documents related to RJ Reynolds' youth marketing efforts. Hence, it's not surprising that "young adult" appears very frequently in that collection. But it's also worth remembering that this collection is tiny compared to RJ Reynolds' main collection. To show that difference, it's useful to switch to absolute counts.
Finally, it's often useful to know in what kinds of documents a term appeared most often. When, for example, did "young adult" show up in marketing documents? And when did it come up in litigation documents? To answer these questions, you can switch the display to "Doc Types."
With the document types, as with the collections, it's often useful to switch to absolute counts. The two displays allow us to answer slightly different questions. The frequencies display allows us to explore, for example, the percentage of marketing documents that mention "young adult." In contrast, the absolute count display gives us a sense of what proportion of all mentions of "young adult" occur in marketing documents.
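The difference between the two displays is easy to see with a toy example. The sketch below treats the frequency display as mentions divided by the total number of words in each document type, in line with the relative frequencies described earlier; the numbers and the two document type names are invented for illustration.

```python
def frequency_within_type(mentions, total_words):
    """Relative frequency of the term inside each document type."""
    return {doc_type: mentions[doc_type] / total_words[doc_type] for doc_type in mentions}


def share_of_all_mentions(mentions):
    """Proportion of all mentions that fall into each document type."""
    total = sum(mentions.values())
    return {doc_type: count / total for doc_type, count in mentions.items()}


# Invented numbers: a large document type and a small one.
mentions = {"marketing": 900, "court": 100}
total_words = {"marketing": 9_000_000, "court": 500_000}

frequencies = frequency_within_type(mentions, total_words)
shares = share_of_all_mentions(mentions)

for doc_type in mentions:
    print(doc_type, f"frequency={frequencies[doc_type]:.6f}", f"share={shares[doc_type]:.0%}")
```

In this toy case, the small document type has the higher frequency even though the large one accounts for most of the mentions. The same effect explains why a tiny collection can tower over much larger ones in the frequency display.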
In this case, it's also worth expanding the chart to 2016 so we can see more clearly the rising number of court documents during the 1990s that mention "young adult," indicating that the industry's youth marketing efforts started to get used against them.
You can display the document types either "Grouped" or "Individually."
"Grouped," the default, means that the hundreds of individual document types are bundled together into major groups. The document type group "Scientific Publications," for example, contains the document types "Abstract," "Scientific Article," "Publication, Scientific," "Scientific Publication," "Technical & Scientific Publication," and "Article, Journal."
The goal of the document type groups is to provide a useful aggregation of the many individual document types found in the industry's archives. It also evens out differing document type designations. One company might label a scientific article "Publication, Scientific," another one "Article, Journal," and so on.
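Conceptually, grouping is just a lookup from individual document type to group, followed by a sum. The sketch below uses the "Scientific Publications" group listed above; the counts, the "Memo" type, and the "Other" fallback group are made up for the example and do not reflect the tool's real mapping.

```python
from collections import Counter

# Document types that the tutorial lists for the "Scientific Publications" group.
SCIENTIFIC_PUBLICATION_TYPES = [
    "Abstract",
    "Scientific Article",
    "Publication, Scientific",
    "Scientific Publication",
    "Technical & Scientific Publication",
    "Article, Journal",
]

DOC_TYPE_TO_GROUP = {doc_type: "Scientific Publications" for doc_type in SCIENTIFIC_PUBLICATION_TYPES}


def group_counts(counts_by_type, fallback_group="Other"):
    """Sum per-type counts into document type groups.

    Types without a mapping land in a hypothetical fallback group.
    """
    grouped = Counter()
    for doc_type, count in counts_by_type.items():
        grouped[DOC_TYPE_TO_GROUP.get(doc_type, fallback_group)] += count
    return dict(grouped)


# Invented counts: two companies labeling scientific articles differently.
example_counts = {"Publication, Scientific": 3, "Article, Journal": 2, "Memo": 5}
grouped = group_counts(example_counts)
print(grouped)
```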
However, if you want to see counts and frequencies of the individual document types, you can switch the display to "Individually."
Here, the chart shows the nine document types in which "young adult" appears most often. It is worth noting, though, that this display mode is normally less useful because the top hits tend to be ambiguous. We don't know in what kinds of reports or speeches "young adult" appeared. For example, were they marketing or scientific reports?
By default, the charts cover the years from 1940 to 1998. However, you can use the date slider to expand the range to a starting date of 1901 and an end date of 2016. We selected 1901 as a start date because in some cases, documents without a date were assigned the year 1900. At the other end, 2016 is the last date for which we had documents available.
There are a number of caveats to observe when expanding the date range beyond 1940-1998. To understand them, it's worth having a look at the total number of times the word "the" appears in the documents. Since the word "the" makes up about 4% of the total corpus, this gives a proxy to see how many documents belong to which year.
If you are thinking of expanding a graph to 1901, be aware that only a small number of documents are available for each of the early years. The main problem with that small number is that a single misdated document can cause a huge spike. For example, a collection of documents from 1988 was assigned the year 1912. [Author Files regarding the U.S. Surgeon General Report on Addiction] Since it contains more than 200 mentions of "addiction," the frequency chart of the term makes it appear as if addiction was an important term in that year.
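Both points can be illustrated with a rough calculation: the count of "the" gives an estimate of how many words a year contains, and the same 200 mentions look very different depending on that size. The 4% share comes from the text above; the yearly counts of "the" are invented, and the helper name is hypothetical.

```python
THE_SHARE = 0.04  # "the" makes up about 4% of the corpus, per the text above


def estimate_total_words(the_count, the_share=THE_SHARE):
    """Rough size of a year's corpus, estimated from its count of "the"."""
    return the_count / the_share


# Invented counts of "the" for a sparse early year and a well-covered year.
the_counts = {"sparse year": 4_000, "dense year": 4_000_000}
estimated_words = {year: estimate_total_words(count) for year, count in the_counts.items()}

# The same 200 mentions from one misdated collection, dropped into each year.
misdated_mentions = 200
spike = {year: misdated_mentions / words for year, words in estimated_words.items()}

for year in the_counts:
    print(year, f"~{estimated_words[year]:,.0f} words", f"frequency={spike[year]:.6%}")
```

In the sparse year, the 200 mentions produce a relative frequency that is about a thousand times higher than in the dense year, which is exactly the kind of spike that a misdated document can cause.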
Expanding graphs beyond 1998 can also be problematic. Most documents in the archive are the result of the 1998 Master Settlement Agreement. Hence, the mix of documents represented in the corpus changes considerably after 1998. The primary sources of documents after 1998 were a) the RICO suit against Philip Morris, which was decided in 2006, and b) documents relating to other lawsuits, primarily the Engle progeny cases in Florida.
As a result, after 1998, court documents dominate the overall corpus, which is great if you are interested in those but may be misleading if you are not.
I hope this covers most of the functionality of the frequency charts. However, if you have further questions or comments, feel free to contact me at firstname.lastname@example.org.