Comedy and AI: Analyzing the Phyllis Diller Gag File through Machine Learning

The American comedian Phyllis Diller donated a collection of 52,000 jokes to the Smithsonian Institution, known as the Gag File. Here, we use Python, machine learning, and natural language processing techniques to interrogate this large dataset and glean new insights.

Author

William J.B. Mattingly

Published

June 26, 2023

import json
import numpy as np
import regex as re
import dateparser
import pandas as pd
from umap import UMAP
from sentence_transformers import SentenceTransformer
import hdbscan

Introduction

Phyllis Diller (1917-2012) was an American actress and comedian of the mid-twentieth century. Her iconic comedic style was defined by her succinct, rapid one-liners that frequently challenged the inequities of her time, often through self-deprecation. Her career began in the mid-1950s and spanned nearly five decades until her retirement in 2002. During this time, she did live stand-up, starred in movies, and did voice-over work (she was the Queen in A Bug’s Life).

In 2003, shortly after her retirement, Diller reached out to the Smithsonian about donating some of her personal belongings. Soon afterwards, Dwight Blocker Bowers met with Diller at her home in California, where she presented him with her “Gag File”, a collection of approximately 52,000 3x5 index cards on which she prepared, modified, and recorded her jokes over her nearly 50-year career. (For more information on the Gag File and Diller’s career, see this blog post.)

After the Smithsonian obtained the index cards, Mike Wilkins and Sheila Duignan supported their digitization. At that point, the Smithsonian Transcription Center began crowd-sourcing their transcription. Today, the transcriptions are complete and available to the public.

In this blog post, we will examine these index cards through a lens of machine learning and natural language processing (NLP) to see what kind of insights we can glean. To do this, we will leverage Python, a common programming language used in data science, machine learning, and NLP. We will also work with several libraries that support machine learning approaches to data, namely Sentence Transformers, UMAP, and HDBSCAN. The steps covered in this blog have been modified and compiled into a custom Python library, LeetTopic.

This blog post will serve two functions at once. First, it will hopefully raise awareness about this rather unique dataset. Second, it will serve as a model for how to clean an existing dataset, give it new dimensionality, and glean new insights from it via machine learning.

To facilitate this, all the Python code used to clean, structure, and analyze the original dataset is included here. This entire blog post is written in a Jupyter Notebook, which means anyone can access the notebook via GitHub and reproduce the steps performed here.

The Problem

All good research questions begin with a problem. In the case of the Diller dataset, our problem is twofold. First, how can we realistically analyze approximately 52,000 index cards? Second, what, if anything, can we learn about the data if we read it from a distance? These questions shall frame our quest.

Exploring the Data

The Diller dataset is available as a JSON file, a common method of transmitting data via the web. We can examine the data as a truncated spreadsheet below with Pandas, a common library in Python used for viewing and manipulating tabular data.

WARNING

Before moving forward, I would like to issue a word of caution. The Diller dataset contains some jokes that are culturally insensitive. Having read several thousand of these jokes, I have noticed that some are specifically homophobic, sexist, and racist. To understand how and why these jokes were kept in their original form, we can consult the Smithsonian Dataset Card. As we will learn throughout this blog, the methods we use can be leveraged to identify and isolate these culturally insensitive jokes.

Working with Unclean Data

To begin, let’s load up and analyze the Diller Dataset. We can do so with the code below which opens up the JSON file in Python and transforms it into a Pandas DataFrame. This will allow us to view the data more easily.

with open("data/pd.json", "r") as f:
    data = json.load(f)
unclean_df = pd.DataFrame(data)
unclean_df
id url content
0 trl-1488576309868-1488576322718-5 transasset:9000:NMAH-AHB2016q108500 MOVIE STARS: Phyllis DillerDiller...
1 trl-1488907532569-1488907580088-4 transasset:9120:NMAH-AHB2016q120798 FAMOUS PEOPLE: Phyllis DillerDille…
2 trl-1490027110884-1490027127889-12 transasset:9374:NMAH-AHB2016q145481 LOSER GAGDiller/MAR/1978…
3 trl-1489523118765-1489523129036-17 transasset:9287:NMAH-AHB2016q135760 PHYLLIS DILLER: COOKINGDiller Gag...
4 trl-1489595133716-1489595169321-5 transasset:9299:NMAH-AHB2016q137367 PHYLLIS: HAIRDILLER GAGS/19…
52201 trl-1488734717316-1488734731044-5 transasset:9051:NMAH-AHB2016q114811 UnknownDaley/MAR/1967,…
52202 trl-1488763510416-1488763520708-10 transasset:9060:NMAH-AHB2016q115715-01 UnknownLucas/DEC/1963…
52203 trl-1488792310527-1488792322189-0 transasset:9075:NMAH-AHB2016q116828-01 Unknown. L. Parker/MAR/1966…
52204 trl-1488921928068-1488921967987-7 transasset:9123:NMAH-AHB2016q121288 UnknownPayne/OCT/1969A…
52205 trl-1489094722980-1489094749205-4 transasset:9212:NMAH-AHB2016q125674 UnknownDaley/MAR/1967,…

52206 rows × 3 columns

In the above spreadsheet (in the bottom-left corner), we can see that we have 52,206 index cards, each with three pieces of metadata:

field description
id the unique id for the index card
url the url for the index card
content the raw text of the card

Each of these approximately 52,000 rows corresponds to a specific index card. For example, the first index card in this collection is the following:

Since we are primarily concerned with reading these index cards, we will be specifically focusing on the content category, or the main text area of the card. To get a better sense of these cards, let’s explore the raw text of the above image’s content.

print(unclean_df.iloc[0].content)
MOVIE STARS: Phyllis Diller
Phyllis Diller
10/MAR/1978
The place where Phyllis Dillers' star is on Hollywood Blvd went out of business.

Each line of this data is a different piece of metadata. The data on these cards usually consists of four sections:

line description
category (subcategory) These are the categories that Diller used. Occasionally, we see a subcategory given.
author This is the attribution of the joke.
date This is the date recorded on the card, which is when the joke was first put into the Gag File.
content This is the joke itself. Sometimes it spans multiple lines.

Unfortunately, the original JSON dataset does not separate out these important pieces of metadata, meaning that in its current form we cannot, for example, map the way in which Diller developed her style or the types of subjects she discussed.

When working in fields like data science, this is a common occurrence. Most of the time in data analysis is spent cleaning the data.

Through Python, we can automate the conversion of the original dataset into a new one whose content section is broken down into four new structured pieces of metadata for each entry.

Cleaning and Structuring the Transcribed Data

We can clean the data from the content field using Python, Pandas, and Regular Expressions (RegEx). The rules below correctly format all but 21 of the index cards. For our purposes, we will treat these as rare exceptions.

First, let’s grab all the jokes in the entire dataset. We will call this object jokes.

jokes = [item["content"] for item in data]
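
Before writing those rules, we can get a rough sense of how many cards deviate from the expected four-line layout. The snippet below is a minimal sketch: it simply counts cards whose content splits into fewer than four lines, which is the same condition the cleaning code below will set aside as an exception.

# Minimal sanity check: count cards that do not split into at least four
# lines (category, author, date, joke text). These are the rare exceptions
# that the cleaning rules below will set aside.
malformed = [joke for joke in jokes if len(joke.split("\n")) <= 3]
print(len(malformed))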

Our goal will be to create a new, structured version of the original dataset. For this reason, we will create a separate list called structured_data to which we can append a structured representation of the original raw text, with each line separated out as a unique piece of metadata.

As we iterate over each joke, we first need to gather the individual lines. Let’s examine what this process looks like with the first index card. Remember, Python is zero-indexed, meaning the first item in a sequence is item 0, not 1. As we can see below, there are four distinct lines, each of which contains a unique piece of data: the category of the joke, the author, the date, and then the content of the joke.

for joke in jokes[:1]:
    joke = joke.split("\n")
    for i, line in enumerate(joke):
        print(i, line)
0 MOVIE STARS: Phyllis Diller
1 Phyllis Diller
2 10/MAR/1978
3 The place where Phyllis Dillers' star is on Hollywood Blvd went out of business.

Since we know that each piece of metadata appears on its own line (except for 21 index cards), we can write rules to automate the process of assigning each line index to its appropriate metadata category. Let’s see what this would look like in practice.

for joke in jokes[:1]:
    joke = joke.split("\n")
    if len(joke) > 3:
        header = joke[0].replace("\r", "").strip()
        author = joke[1].strip()
        date = joke[2]
        text = " ".join(joke[3:])
        
        print(f"Category: {header}")
        print(f"Author: {author}")
        print(f"Date: {date}")
        print(f"Content: {text}")
Category: MOVIE STARS: Phyllis Diller
Author: Phyllis Diller
Date: 10/MAR/1978
Content: The place where Phyllis Dillers' star is on Hollywood Blvd went out of business.

As we can see in the above example, we were able to correctly parse each line of the index card purely through Python. However, we have an issue with our date category. While we, as humans, can easily interpret the raw text 10/MAR/1978 as a specific date, there is nothing about this string of text that lets a computer know that this is a date. In other words, a valuable piece of information is entirely missing. In its current state, it would therefore be impossible to map all of Diller’s index cards over time. We can fix this, however, via the dateparser library in Python, specifically through its parse function, which takes in a string of text that contains a date and transforms it into consistent, structured time series data. While dateparser is a great first step, manual validation is essential; some of the data may be formatted in a way that prevents dateparser from parsing it accurately.

for joke in jokes[:1]:
    joke = joke.split("\n")
    if len(joke) > 3:
        header = joke[0].replace("\r", "").strip()
        author = joke[1].strip()
        date = joke[2]
        text = " ".join(joke[3:])
        
        #conversion to Time Series data
        date = dateparser.parse(date)
        
        print(f"Category: {header}")
        print(f"Author: {author}")
        print(f"Date: {date}")
        print(f"Content: {text}")
Category: MOVIE STARS: Phyllis Diller
Author: Phyllis Diller
Date: 1978-03-10 00:00:00
Content: The place where Phyllis Dillers' star is on Hollywood Blvd went out of business.

Notice the difference in the above output, specifically in the date section. Our date is no longer raw text but rather a proper representation of a date in a format that a computer can parse. We can combine this method with some basic RegEx-based text cleaning to format our data. Again, the RegEx patterns below are not perfect. Errors may (and likely do) remain and, again, manual validation is an essential step. For our purposes, however, we will work with the data in its current state.

structured_data = []
for joke in jokes:
    joke = joke.split("\n")
    if len(joke) > 3:
        header = joke[0].replace("\r", "").strip()
        author = joke[1].strip()
        date = joke[2]
        date = dateparser.parse(date)
        text = " ".join(joke[3:])
        text = text.replace("(RE: PD)", "").replace("RE: PD", "")
        text = re.sub(r'No\. \d{1,3}', '', text)
        text = text.replace("\r", " ").strip()
        structured_data.append({"header": header, "date": date, "author": author, "text": text})
    else:
        structured_data.append({"header": "UNKNOWN", "date": "UNKNOWN", "author": "UNKNOWN", "text": "UNKNOWN"})
print(len(structured_data))
52206
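
Before moving on, it is worth spot-checking these results. The snippet below is a minimal sketch of that validation: it counts the placeholder rows produced above for cards that did not match the four-line pattern, as well as rows where dateparser returned None (its value when it cannot interpret a date string). Any rows it flags would need to be inspected by hand.

# Manual validation sketch: flag rows where cleaning did not fully succeed.
# Cards that did not match the four-line pattern were stored with "UNKNOWN",
# and dateparser.parse() returns None when it cannot interpret a date string.
unknown_rows = [row for row in structured_data if row["header"] == "UNKNOWN"]
unparsed_dates = [row for row in structured_data if row["date"] is None]
print(len(unknown_rows), "cards did not match the four-line pattern")
print(len(unparsed_dates), "cards have dates that dateparser could not parse")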

As we can see, our new spreadsheet has the metadata correctly aligned to four new categories:

  • header
  • date
  • author
  • text
structured_df = pd.DataFrame(structured_data)
structured_df
header date author text
0 MOVIE STARS: Phyllis Diller 1978-03-10 00:00:00 Phyllis Diller The place where Phyllis Dillers’ star is on Ho…
1 FAMOUS PEOPLE: Phyllis Diller 1978-03-10 00:00:00 Phyllis Diller The place where Phyllis Dillers’ star is on Ho…
2 LOSER GAG 1978-03-10 00:00:00 Phyllis Diller The place where Phyllis Dillers’ star is on Ho…
3 PHYLLIS DILLER: COOKING 1982-07-26 00:00:00 Phyllis Diller Gag The kids call my gravy boat the Titanic.
4 PHYLLIS: HAIR 1984-07-26 00:00:00 PHYLLIS DILLER GAGS (Phyllis’ hair and appearance) MY FALL FELL!!!
52201 Unknown 1967-03-17 00:00:00 Bill Daley Hello, Sweetheart. Yes, Mother’s in New York….
52202 Unknown 1963-12-27 00:00:00 Joe Lucas Reporter: What do you think qualifies you to b…
52203 Unknown 1966-03-25 00:00:00 R. L. Parker Aunt Frank… She’s one of my favorite uncle’…
52204 Unknown 1969-10-14 00:00:00 Barrie Payne IF ABE’S WIFE HAD BEEN ONE OF THOSE Dialog be…
52205 Unknown 1967-03-17 00:00:00 Bill Daley Hello, Sweetheart. Yes, Mother’s in New York! …

52206 rows × 4 columns

We can now merge this new spreadsheet with the original data to have a new, structured dataset.

final_df = unclean_df.join(structured_df)
final_df
id url content header date author text
0 trl-1488576309868-1488576322718-5 transasset:9000:NMAH-AHB2016q108500 MOVIE STARS: Phyllis DillerDiller... MOVIE STARS: Phyllis Diller 1978-03-10 00:00:00 Phyllis Diller The place where Phyllis Dillers’ star is on Ho…
1 trl-1488907532569-1488907580088-4 transasset:9120:NMAH-AHB2016q120798 FAMOUS PEOPLE: Phyllis DillerDille… FAMOUS PEOPLE: Phyllis Diller 1978-03-10 00:00:00 Phyllis Diller The place where Phyllis Dillers’ star is on Ho…
2 trl-1490027110884-1490027127889-12 transasset:9374:NMAH-AHB2016q145481 LOSER GAGDiller/MAR/1978… LOSER GAG 1978-03-10 00:00:00 Phyllis Diller The place where Phyllis Dillers’ star is on Ho…
3 trl-1489523118765-1489523129036-17 transasset:9287:NMAH-AHB2016q135760 PHYLLIS DILLER: COOKINGDiller Gag... PHYLLIS DILLER: COOKING 1982-07-26 00:00:00 Phyllis Diller Gag The kids call my gravy boat the Titanic.
4 trl-1489595133716-1489595169321-5 transasset:9299:NMAH-AHB2016q137367 PHYLLIS: HAIRDILLER GAGS/19… PHYLLIS: HAIR 1984-07-26 00:00:00 PHYLLIS DILLER GAGS (Phyllis’ hair and appearance) MY FALL FELL!!!
52201 trl-1488734717316-1488734731044-5 transasset:9051:NMAH-AHB2016q114811 UnknownDaley/MAR/1967,… Unknown 1967-03-17 00:00:00 Bill Daley Hello, Sweetheart. Yes, Mother’s in New York….
52202 trl-1488763510416-1488763520708-10 transasset:9060:NMAH-AHB2016q115715-01 UnknownLucas/DEC/1963… Unknown 1963-12-27 00:00:00 Joe Lucas Reporter: What do you think qualifies you to b…
52203 trl-1488792310527-1488792322189-0 transasset:9075:NMAH-AHB2016q116828-01 Unknown. L. Parker/MAR/1966… Unknown 1966-03-25 00:00:00 R. L. Parker Aunt Frank… She’s one of my favorite uncle’…
52204 trl-1488921928068-1488921967987-7 transasset:9123:NMAH-AHB2016q121288 UnknownPayne/OCT/1969A… Unknown 1969-10-14 00:00:00 Barrie Payne IF ABE’S WIFE HAD BEEN ONE OF THOSE Dialog be…
52205 trl-1489094722980-1489094749205-4 transasset:9212:NMAH-AHB2016q125674 UnknownDaley/MAR/1967,… Unknown 1967-03-17 00:00:00 Bill Daley Hello, Sweetheart. Yes, Mother’s in New York! …

52206 rows × 7 columns

final_df.to_csv("data/pd_final.csv", index=False)

Converting Texts into Vectors

Once textual data is cleaned, it is often good to get a quick sense of that data by viewing it collectively. It is important to remember that studying text in a computer system introduces certain limitations. Computers cannot parse language directly; they require texts to be represented numerically. This is true for all non-numeric data. With images, for example, we represent the data not as a single picture but as a numerical value for each individual pixel of that image.

Language, however, presents certain challenges not found in image-based problems. Language is vastly more complex than an image: it is built upon words, and how those words function semantically can shift depending on word order and context. This means that if we want to represent text numerically and retain that semantic meaning, we must make sure that the numbers we choose to represent that text retain that semantic information. This is where natural language processing and machine learning come into focus.

Researchers have been exploring how to represent text numerically for decades. Modern methods center on capturing a word’s meaning in a given context and representing that word as a vector. Vectors are numerical representations of text that make it possible for a computer to parse words while also retaining semantic information.

There are many approaches for converting text to vectors. I have opted to do this via transformer-based models, the current state-of-the-art language models that can parse and understand some of the more challenging aspects of language, such as typographical errors, misspellings, and idiomatic expressions. I will specifically be using the all-MiniLM-L6-v2 model, which is available from Hugging Face, the main hub for working with transformer models in Python.

We can use language models, therefore, to create complex representations of our texts. We can then use machine learning algorithms to discover patterns, or clusters (topics) among those documents. To do this, we first need to load up our language model.

model = SentenceTransformer('all-MiniLM-L6-v2')

Once the model is loaded, we can use it to convert all of our texts into vectors.

doc_embeddings =  model.encode(final_df["text"])

Let’s take a look at the first text in the Diller dataset.

final_df["text"][0]
"The place where Phyllis Dillers' star is on Hollywood Blvd went out of business."

The all-MiniLM-L6-v2 model produces vectors with 384 dimensions. Let’s take a look at the first 10 dimensions of the first document to see what a vector looks like.

print(doc_embeddings[0][:10])
[-0.01073157 -0.08762725 -0.00820523 -0.00447153 -0.02199032 -0.01154062
 -0.04357602  0.03907308  0.06891748 -0.01843095]

Believe it or not, this is precisely the same text as above, just represented numerically. As we can see, this is not exactly easy for humans to understand, but for a computer this sequence of positive and negative numbers encodes the semantic meaning of the string of text.
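
To see what this numerical representation buys us, we can compare vectors directly. The snippet below is a minimal sketch using the cos_sim utility from Sentence Transformers (assuming the doc_embeddings array computed above): the first two cards in the dataset contain essentially the same Hollywood Blvd joke, so their vectors should be far more similar to one another than either is to the unrelated gravy boat joke in row 3.

from sentence_transformers import util

# Cosine similarity between document vectors: values closer to 1 mean the
# texts are semantically closer.
sim_same = util.cos_sim(doc_embeddings[0], doc_embeddings[1])  # two near-identical jokes
sim_diff = util.cos_sim(doc_embeddings[0], doc_embeddings[3])  # vs. an unrelated joke
print(float(sim_same), float(sim_diff))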

Once we have converted all our texts into vectors, we can then begin to explore them in a graph. Before we can do that, however, we must first make these vectors readable to humans. At this stage, our vectors are high-dimensional representations of our texts (imagine hundreds of graphs that must be read simultaneously). In order to make these texts more readable, we must reduce the dimensionality of these vectors into two dimensions. There are many ways to do this today, but I am opting for UMAP (Uniform Manifold Approximation and Projection), an algorithm developed in 2018.

umap_proj = UMAP(n_neighbors=10,
                 min_dist=0.01,
                 metric='correlation').fit_transform(doc_embeddings)

With the UMAP dimensionality reduction complete, we can send this output to HDBSCAN, an algorithm designed to find patterns amongst our data. It will automatically cluster and allocate all documents into numerical topics or assign them as outliers (-1).

hdbscan_labels = hdbscan.HDBSCAN(min_samples=15, min_cluster_size=15).fit_predict(umap_proj)
print(len(set(hdbscan_labels)))
553

With these two pieces of data, we can load them into our original DataFrame and print off the results.

# Apply coordinates
final_df["x"] = umap_proj[:, 0]
final_df["y"] = umap_proj[:, 1]
final_df["topic"] = hdbscan_labels
final_df.head(1)
id url content header date author text x y topic
0 trl-1488576309868-1488576322718-5 transasset:9000:NMAH-AHB2016q108500 MOVIE STARS: Phyllis DillerDiller... MOVIE STARS: Phyllis Diller 1978-03-10 00:00:00 Phyllis Diller The place where Phyllis Dillers’ star is on Ho… 9.916232 9.252374 219

Notice that our first document has a topic of 219. This means that it has been assigned to a specific topic alongside other documents.

The above steps are frequently used to perform transformer-based topic modeling. Topic modeling is a technique in NLP where we try to find patterns across a collection of documents to find latent, or hidden, topics. In other words, topic modeling allows us to place our texts into specific categories automatically. Visualizing these results is, however, quite challenging.
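
Before digging into a single cluster, it can help to see how the cards are distributed across the roughly 550 topics. The short sketch below counts how many cards landed in each topic and how many HDBSCAN marked as outliers (-1).

# Count how many cards were assigned to each topic, and how many were
# labeled as outliers (-1) by HDBSCAN.
print("outliers:", (final_df["topic"] == -1).sum())
print(final_df["topic"].value_counts().head(10))  # the ten most common labels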

Analyzing Clusters

We can use the result to analyze clusters of our data, also known as topics. Let’s take a look at one cluster, topic 10. As we can see from the output below, cluster 10 deals largely with Tarzan, the jungle, and monkeys (with a few unrelated jokes mixed in). This is an interesting category.

res_texts = set()
res_topics = set()
times = []
for idx, row in final_df.loc[final_df["topic"] == 10].iterrows():
    res_texts.add(row.text)
    res_topics.add(row.header)
    times.append(row.date)
for res_text in res_texts:
    print(res_text)
To make Tarzan feel at home, the pageant officials gave him a microphone shaped like a banana...He kept eating it....
Last New Years Eve I was at a party with Dean, Ed Mc Mahon and Foster Brooks and it was disastrous. They all exhaled at once and dissolved Tarzana.
Tarzan and Jane bit -- they are 100 years old -- It's a shame when your pet chimpanzee has to support you.
What the mighty lord of the jungle killed with his bare hands, field mouse stew.  (Tarzan)
The Miserable Person: One who would truly love to but can't fart at all.  The Sensitive Person: One who farts and then starts crying.
Originally I was scheduled to take off with a monkey... but he wanted to run the whole show... high I.Q. or not I can't stand a pushy monkey.
Last New Year's Eve I was at a party with Dean, Ed Mc Mahon and Foster Brooks and it was disastrous. They all exhaled at once and dissolved Tarzana.
The Unfortunate Person: One who tries awful hard to fart but shits instead.  The Nervous Person: One who stops in the middle of a fart.  The Honest Person: One who admits he farted but offers a good medical reason therefore.  The Dishonest Person: One who farts and then blames the dog.
Tarzan, as you know, is the son of a British lord and lady, but he was raised by apes. The English really do have a labor shortage, don't they?
In this version, Tarzan lis a literate, softspoken, everyday sort of fella who talks to animals. I know a lot of people who talk to animals, but Tarzan waits for an answer.
You remember Tarzan. The fella who's always running around in the leather jockey shorts? ....Why do you think he has that awful yell? (GIVE TARZAN CALL) They are two sizes too small!
You remember Tarzan. The fella who's always running around in the leather jockey shorts? .......Why do you think he has that awful yell? (GIVE TARZAN CALL) They are too sizes too small!.....
I think Tarzan has got to make it. It's the only show on television where the star runs around topless!
The Sadistic Person: One who farts in bed and then fluffs the covers over his bedmate.  The Intellectual Person: One who can determine from the smell of his neighbor's fart precisely the latest food items consumed.  The Athletic Person: One who farts at the slightest exertion.
Wouldn't it be funny if what we thought was Tiny Tim's talent turned out to be undersized jockey shorts.
For the first time in his life Tarzan will dress fromal loincloth and cummerbund
Remember the early Tarzan pictures when he only knew six words? "Me Tarzan, you Jane, where Boy?" It sounded a little like one of L.B.J.'s speeches.
Tarzan outscored Bo Derek in an I.Q. test -- so did the monkey.
(Tarzan and Jane) - Strange name for an 80 year old, - BOY
(Tarzan and Jane) - strange name for an 80 year old, - BOY
I was always a little suspicious of Tarzan's span of concentration. Like, in one picture he said: "Me Tarzan. You Jane." And she said: "You Tarzan?" And he said: "Who?"
In this version, Tarzan is a literate, soft-spoken, everyday sort of fella who talks to animals. I know a lot of people talk to animals, but Tarzan waits for an answer.
How about Ron Ely, the old Tarzan, being named to replace Bert Parks? Tarzan can't sing. So this year the Miss America theme will be hummed by Cheetah..
I've known Tiny Tim ever since he was the Wicked Witch of the North.
Originally I was scheduled to take off with a monkey...but he wanted to run the whole show...high I.Q. or not I can't stand a pushy monkey.
Tarzan has been guzzling so much jungle juice -- I made him join Apes Annonymous.
Tarzan is so old--he was wears a support loin cloth.
Every matinee, all the dropouts would go to see the latest Tarzan picture. It was good for their morale. Next to him they all felt like William Buckley!
The Proud Person: One who thinks his farts are exceptionally fine.  The Shy Person: One who releases silent farts and then blushes.  The Impudent Person: One who boldly farts out loud and then laughs.  The Scientific Person: One who farts regularly but is truly concerned about pollution.
Did I get a shock on Thursday night. I turned to the wrong channel, got Tarzan -- and thought Batman had gone nudist!
Hollywood just named their 14th Tarzan. One of the losers was a muscular friend of Fang's. His sister. She would have had the part too, except for one bad break. Alligators. They were afraid of her.
But, now, Tarzan is a lot more sophisticated than he used to be. For instance, if he has to go someplace, he does it with style. He turns to the monkey and says: "Cheetah, call me a vine!"
I am always a little suspicious of Tarzan's span of concentration. Like, in one picture he said: "Me Tarzan. You Jane." And she said: "You Tarzan?" And he said: "Who?"
But, now, Tarzan is a lot more sophisticated than he used to be.  For instance, if he has to go someplace, he does it with style.  He turns to the monkey and says:  "Cheetah, call me a vine!"
Tarzan has been guzzling so much jungle juice -- I made him join Apes Anonymous.
Tarzan with water wings.
What the mighty lord of the jungle killed with his bare hands, field mouse stew.   (Tarzan)
The Foolish Person: One who suppresses a fart for hours & hours.  The Thrifty Person: One who always has several good farts in reserve.  The Anti-Social Person: One who excuses himself and farts in complete privacy.  The Strategic Person: One who conceals his farts with loud coughing.

What is particularly interesting is that Tarzan is not a topic identified by Diller herself. Instead, it is a topic that naturally developed over the course of her career. These Tarzan texts span several different categories in the Diller collection.

res_topics
{'ANIMALS',
 'CAPT. BLI',
 'CELEBRITIES',
 'CELEBRITIES: Dean Martin, Ed Mc Mahon, Foster Brooks',
 'DIRTY JOKES',
 'DRINKING',
 'Drinking',
 'EATING',
 'Famous People',
 'Foundations',
 'Funny Names',
 'INSULTS',
 'Movie Stars',
 'Movies',
 'PD: Career',
 'PD: SMART',
 'SEX SYMBOLS',
 'SHOW BIZ',
 'Show Biz',
 'TELEVISION SHOWS',
 'TV SHOWS',
 'Topless',
 'Travel',
 'Unknown'}

Because we also have our date category, we can see when these jokes emerged in her career.

import plotly.graph_objects as go

tarzan_df = final_df.loc[final_df["topic"] == 10].dropna()
tarzan_df['date'] = pd.to_datetime(tarzan_df['date']).dt.year

# Group by date and collect all corresponding headers and texts for each date
grouped_df = tarzan_df.groupby('date').agg({'header': list, 'text': list}).reset_index()

# Define hover text
hover_text = []
for index, row in grouped_df.iterrows():
    hover_text.append('Date: {date}<br>'.format(date=row['date']) +
                      '<br>'.join('Header: {header}<br>Text: {text}'.format(header=h, text=t) 
                                  for h, t in zip(row['header'][:5], row['text'][:5])) + 
                      ('<br>...and more' if len(row['header']) > 5 else ''))

# Define trace
trace = go.Bar(x=grouped_df['date'], 
               y=grouped_df['header'].apply(len),  
               text=hover_text, 
               hoverinfo='text',
               marker=dict(color='rgb(158,202,225)', line=dict(color='rgb(8,48,107)', width=1.5),),
               opacity=0.6)

# Define layout
layout = go.Layout(title='Timeline Plot', xaxis=dict(title='Date'), yaxis=dict(title='Header Count'))

# Define figure and add trace
fig=go.Figure(data=[trace], layout=layout)

# Show figure
fig.show()

We can see from the graph above that Diller had three main periods in which she made the majority of her Tarzan jokes: 1963-1965, 1968, and 1982. The 1977 spike is a false positive that references Tarzana, likely the Los Angeles suburb. Each of these spikes corresponds to Tarzan appearing either in cinemas or on TV. In 1963-1965, three Tarzan movies were released; from 1966 to 1968, the Tarzan TV show ran; and in 1981, a new Tarzan movie hit theaters. This data allows us to draw a connection between the timing of a joke and its cultural relevance.

Conclusion

The above methods are two ways that we can analyze a large collection of documents with machine learning to glean new insights. Clustering our texts allows us to understand the larger trends within a dataset, as in the Tarzan example. Through semantic search engines, we can return more relevant results that go beyond keywords. Both of these approaches share two benefits: first, they work at scale. Once the data has been processed via machine learning, retrieving results or performing analysis on the vectors is computationally inexpensive. Second, both approaches are reproducible. These same methods can be applied (with potentially minor changes) to other datasets at other institutions.
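
As a closing illustration, here is a minimal sketch of that semantic-search idea, reusing the model and doc_embeddings from earlier; the query text is purely hypothetical. The semantic_search utility from Sentence Transformers ranks every card by the similarity of its vector to the query vector, so it can surface relevant jokes even when they share no keywords with the query.

from sentence_transformers import util

# Embed a free-text query and retrieve the five most semantically similar cards.
query = "jokes about bad cooking"  # hypothetical query
query_embedding = model.encode(query)
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=5)[0]
for hit in hits:
    print(round(hit["score"], 3), final_df["text"][hit["corpus_id"]])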

Returning to our initial prompt at the start of the blog, how do we make sense of large quantities of textual data? One potential answer is via machine learning which can provide new ways to interrogate documents, discover new patterns, and make a collection more discoverable.