NLP Analysis: National Public Radio Transcripts¶

By Eric Stromgren | November 5, 2021

Table of Contents¶

Executive Summary
Exploratory Data Analysis
NLP Analysis

Executive Summary ¶

Summary
This natural language processing (NLP) analysis of National Public Radio (NPR) transcripts samples 'All Things Considered' two-speaker conversations to reveal frequently discussed topics by show guests for the first quarter of 2019. It also classifies sentiment for hosts and guests during the sample period.

The analysis revealed a high frequency of discussions involving United States political topics. Frequent subjects included President Donald Trump, the Trump Administration, U.S. southern border security, federal minimum wage, Attorney General Bill Barr, Attorney General Rod Rosenstein, the special counsel investigation by Bob Mueller, North Korea, Saudi Arabia and the Islamic State.

Guest sentiment was more likely to be positive or neutral than negative, while host sentiment was more likely to be netural than positive or negative. A hypothesis for this result is guests responded more positively to a more objective host interview style.

Methodology
The primary method driving this NLP analysis is the data transformation of 'All Things Considered' program conversation speech into n-grams. This approach allows for identification of frequent content topics during the sample period.

The sample period, the first quarter of 2019, includes 201 'All Things Considered' episodes containing 6,926 guest utterances and 4,672 host utterances. The sample period of one quarter is chosen as a processing example and is narrow enough to provide relevant slice-of-time insights. Ultimately, other time slices can be used to analyze topic frequency shifts over time. For this case, processing the entire 20-year population of transcript data did not seem logical because doing so would likely hide seasonality. Topics discussed in 1999 may not be relevant in 2019, for example.

A sentiment analysis using the out-of-the-box VADER model from the NLTK Python library is used to reveal attitudes of the guests and hosts. One would anticipate a mainstream media outlet to practice objective journalism, and there is some support for this hypothesis as host sentiment was classified 34% positive, 49% neutral and 17% negative. Media guests likely have some agenda or bias to push, and there is some support for this hypothesis as guest sentiment was classified 39% positive, 39% neutral and 22% negative. In sum, hosts tended to be more neutral than guests while guests tended to react with more positivity than negativity.

In conclusion, the NPR dataset has similarities with a telecommunications company customer support call center. This analysis is viewed through the lens of a call center structure where company agents and customers interact through conversation to resolve account-related issues. NPR programs can be viewed as a proxy for the company's lines of business, NPR hosts can be viewed as a proxy for call center agents and NPR program guests can be viewed as a proxy as customers. This type of an NLP analysis focusing on topic frequency can help a telcommunications business surface potential product feature issues experienced by customers and identify sentiment.

Next Steps
Understanding what customers are frequently talking about – NPR guests in this case – is a logical first step in identifying product features impacting customers. The sentiment analysis can be improved by developing a customized sentiment dictionary tailored for this specific NPR dataset. A customized dictionary would likely improve sentiment scoring classification accuracy.

This analysis focuses on one line of business – 'All Things Considered' in this case – but could be easily applied to other lines of business across different time segments to answer various business questions. For NPR, an increase in topic frequency over time may reflect the movement of public interest and programming decisions. For a telecommunications call center, a topic frequency decrease may indicate lower call volume and decreased friction between product features and customers.

Exploratory Data Analysis ¶

Dataset Descriptions ¶

Source: https://www.kaggle.com/shuyangli94/interview-npr-media-dialog-transcripts/

episodes.csv: Metadata about all scraped episodes of NPR podcasts, from 1999 to 2019

headlines.csv: Headlines along with their ID.

host-map.json: Dictionary of host ID: name (lowercase name), episodes (list of episode IDs hosted), programs (list of programs hosted)

host_id.json: Dictionary of lowercase host name : host ID

splits-ns2.json: Dictionary of split ("train", "valid", or "test") : list of episode IDs for that split

utterances-sp2.csv: Utterance-level breakdown of all 2-speaker conversations

utterances.csv: Conversation turns for every episode (multi-speaker included)

Python Setup ¶

###libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud

#Increase cell width for browser view on Jupyter Notebook
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

#Display Max Rows on Dataframes = 100
pd.set_option('display.max_rows', 100)

C:\Users\Owner\Anaconda3\lib\site-packages\nltk\twitter\__init__.py:20: UserWarning: The twython library has not been installed. Some functionality from the twitter package will not be available.
  warnings.warn("The twython library has not been installed. "

#read JSON
with open('D:/data_projects/npr_analysis/data/host_map.json', 'r') as myfile:
    data=myfile.read()

# parse file
obj = json.loads(data)

#display(obj)

Raw Data ¶

##Read Data
episodes = pd.DataFrame(pd.read_csv('D:/data_projects/npr_analysis/data/episodes.csv'))
print("Episodes")
display(pd.DataFrame(episodes.head(n=3)))

headlines = pd.DataFrame(pd.read_csv('D:/data_projects/npr_analysis/data/headlines.csv'))
print("Headlines")
display(pd.DataFrame(headlines.head(n=3)))

utterances = pd.DataFrame(pd.read_csv('D:/data_projects/npr_analysis/data/utterances.csv'))
print("Utterances")
display(pd.DataFrame(utterances.head(n=3)))

utterances_2 = pd.DataFrame(pd.read_csv('D:/data_projects/npr_analysis/data/utterances-2sp.csv'))
print("2 Speaker Utterances")
display(pd.DataFrame(utterances_2.head(n=3)))

Episodes

Headlines

Utterances

2 Speaker Utterances

episodes.csv ¶

#Count of Shows
print("Episode Count: " + str(episodes.shape[0]))

#Program Counts with Visualization
display(pd.DataFrame(episodes['program'].value_counts()))

#Program Count Chart
display(episodes.program.value_counts().plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("Program Counts")

Episode Count: 105848

<matplotlib.axes._subplots.AxesSubplot at 0x16213b67a58>

Text(0.5,1,'Program Counts')

#Episode Dates
#There are 5,737 unique program dates. There are 7,300 calendar days in 20 years. That means there are 1,563 days with no shows, equivalent of 4.28 years and ~78 days per year without shows. 
display(pd.DataFrame(episodes['episode_date'].value_counts()))

#Convert Episode Date into datetime datatype
episodes['episode_date'] = pd.to_datetime(episodes['episode_date'])

headlines.csv ¶

#Count of Headlines
print("Headline Count: " + str(headlines.shape[0]))
#Takeaway: Not all episodes have headlines

Headline Count: 97437

NLP Analysis ¶

Data Prep ¶

#Join episodes and utterances table
df_episodes_utterances_2 = pd.merge(episodes, utterances_2, how='inner', left_on = 'id', right_on = 'episode')
df_episodes_utterances_2.head()

#Program Counts with Visualization
display(pd.DataFrame(df_episodes_utterances_2['program'].value_counts()))

#Program Count Chart
display(episodes.program.value_counts().plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("Program Counts")

<matplotlib.axes._subplots.AxesSubplot at 0x161ca265390>

Text(0.5,1,'Program Counts')

#Subset to All Things Considered Programs
df_atc = df_episodes_utterances_2[(df_episodes_utterances_2["program"] == "All Things Considered")]

#Subset All Things Considered Programs to Guest Speaker Utterances Only
df_atc = df_atc[df_atc["is_host"] == False]

df_atc.head(3)

#Subset All Things Considered Guest Utterances to first quarter 2019
df_atc_q1_2019_guests = df_atc[(df_atc['episode_date'] > '2019-01-01') & (df_atc['episode_date'] < '2019-03-31')]

print("All Things Considered Guest Utterances (Population): " + str(df_atc.shape[0]))
print()
print("2019 Q1 All Things Considered Guest Utterances: " + str(df_atc_q1_2019_guests.shape[0]))
print()
print("2019 Q1 All Things Considered Guest Utterances (% of Population): " + str(round(df_atc_q1_2019_guests.shape[0]/df_atc.shape[0], 3)))

df_atc_q1_2019_guests.head(3)

All Things Considered Guest Utterances (Population): 312093

2019 Q1 All Things Considered Guest Utterances: 6926

2019 Q1 All Things Considered Guest Utterances (% of Population): 0.022

#Utterances by Episode. Note, 201 total episodes.
display(pd.DataFrame(df_atc_q1_2019_guests['episode'].value_counts()))

#Utterances by Program
display(df_atc_q1_2019_guests.episode.value_counts().plot(kind = 'bar',figsize=(40,10)))
plt.style.use('seaborn')
plt.title("Utterances by Episode")

<matplotlib.axes._subplots.AxesSubplot at 0x161cac82240>

Text(0.5,1,'Utterances by Episode')

Sentiment Analysis ¶

Sentiment Analysis Scoring Samples ¶

#Isolate utterances into list to feed sentiment analyzer
sentences = df_atc_q1_2019_guests['utterance'].tolist()

#Sentiment Scoring Samples
print("Sentiment Scoring Samples (Guests):")
for sentence in sentences[:3]:
     sid = SentimentIntensityAnalyzer()
     print(sentence)
     ss = sid.polarity_scores(sentence)
     for k in sorted(ss):
         print('{0}: {1}, '.format(k, ss[k]), end='')
     print()

Sentiment Scoring Samples (Guests):
My pleasure.
compound: 0.5719, neg: 0.0, neu: 0.213, pos: 0.787, 
Everything is coming out, Michel.
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
This is really the first time that we've seen what the American government knows about the full operations of the Sinaloa drug cartel.
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0,

Sentiment Analysis Results: Guests ¶

#Compile Sentiment Scoring
analyzer = SentimentIntensityAnalyzer()
result = {'pos': 0, 'neg': 0, 'neu': 0}
for sentence in sentences:
    score = analyzer.polarity_scores(sentence)
    if score['compound'] > 0.05:
        result['pos'] += 1
    elif score['compound'] < -0.05:
        result['neg'] += 1
    else:
       result['neu'] += 1
    
#Put scoring results into dataframe
scoring = pd.Series(result, name='utterances')
scoring.index.name = 'sentiment'
scoring.reset_index()
scoring = scoring.to_frame()
scoring = scoring.sort_values(by=['utterances'], ascending=False)
scoring['pct of observations'] = round((scoring['utterances'] /  scoring['utterances'].sum()), 3)

#Show scooring Dataframe
display(scoring)

#Scoring Chart
display(scoring.utterances.plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("'All Things Considered' Guest Sentiment For 2019 Q1")

<matplotlib.axes._subplots.AxesSubplot at 0x161cf8d08d0>

Text(0.5,1,"'All Things Considered' Guest Sentiment For 2019 Q1")

Sentiment Analysis Results: Hosts ¶

#Subset to All Things Considered Programs
df_atc = df_episodes_utterances_2[(df_episodes_utterances_2["program"] == "All Things Considered")]

#Subset All Things Considered Programs to Host Speaker Utterances Only
df_atc_hosts = df_atc[df_atc["is_host"] == True]

df_atc_q1_2019_hosts = df_atc_hosts[(df_atc_hosts['episode_date'] > '2019-01-01') & (df_atc_hosts['episode_date'] < '2019-03-31')]

#Isolate utterances into list to feed sentiment analyzer
sentences = df_atc_q1_2019_hosts['utterance'].tolist()

#Compile Sentiment Scoring
analyzer = SentimentIntensityAnalyzer()
result = {'pos': 0, 'neg': 0, 'neu': 0}
for sentence in sentences:
    score = analyzer.polarity_scores(sentence)
    if score['compound'] > 0.05:
        result['pos'] += 1
    elif score['compound'] < -0.05:
        result['neg'] += 1
    else:
       result['neu'] += 1
    
#Put scoring results into dataframe
scoring = pd.Series(result, name='utterances')
scoring.index.name = 'sentiment'
scoring.reset_index()
scoring = scoring.to_frame()
scoring = scoring.sort_values(by=['utterances'], ascending=False)
scoring['pct of observations'] = round((scoring['utterances'] /  scoring['utterances'].sum()), 3)

#Show scooring Dataframe
display(scoring)

#Scoring Chart
display(scoring.utterances.plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("'All Things Considered' Host Sentiment For 2019 Q1")

<matplotlib.axes._subplots.AxesSubplot at 0x161cf9f1da0>

Text(0.5,1,"'All Things Considered' Host Sentiment For 2019 Q1")

Tokenization ¶

#Iterate over utterances to tokenize utterances
cols = ['tokens_list']
lst = []
 
for i in df_atc_q1_2019_guests['utterance']:
    lst.append([zip])
    zip = nltk.word_tokenize(i)
 
df_tokens = pd.DataFrame(lst, columns=cols)
df_tokens = df_tokens.iloc[1: , :] 
    
display(df_tokens.head(3))

Stopwords Removal ¶

#Concatenate tokens into one list to remove stopwords and punctuation
token_list = df_tokens.tokens_list.sum()
token_list

#Remove punctuation
words=[word.lower() for word in token_list if word.isalpha()]

#Remove stopwords; add custom stopwords
stopwords = nltk.corpus.stopwords.words('english')
new_stop_words = ['i', 
                  "and", 
                  "think", 
                  'hi', 
                  'know', 
                  'people', 
                  'well',
                  'going',
                  'like',
                  'would',
                  'really',
                  'one',
                  'say',
                  'get',
                  'right',
                  'way',
                  'lot',
                  'said',
                  'also']
stopwords.extend(new_stop_words)

#Words with stopwords edited out
filtered_words = [word for word in words if word not in stopwords]

n-grams ¶

Unigrams ¶

#Unigrams
unigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 1)).value_counts())[:100], columns=["unigram_frequency"])
display(unigrams)

#Unigram chart
display(unigrams.sort_values(by=['unigram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,20)))
plt.title('Unigrams: Top 100 by Frequency')
plt.ylabel('Unigram')
plt.xlabel('Frequency')

<matplotlib.axes._subplots.AxesSubplot at 0x161f9e52710>

Text(0.5,0,'Frequency')

#Wordcloud

#Include all grams
unigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 1)).value_counts()), columns=["unigram_frequency"])

#Index to column
unigrams['gram_tuples'] = unigrams.index

#Remove tuples on index column values for WordCloud ingest
unigrams['gram_tuples'] = unigrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = unigrams.set_index('gram_tuples').to_dict()['unigram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Unigram WordCloud by Frequency")
plt.show()

Bigrams ¶

#Bigrams
bigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 2)).value_counts())[:50], columns=["bigram_frequency"])
display(bigrams)

#Bigram chart
display(bigrams.sort_values(by=['bigram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,10)))
plt.title('Bigrams: Top 50 by Frequency')
plt.ylabel('Bigram')
plt.xlabel('Frequency')

<matplotlib.axes._subplots.AxesSubplot at 0x161f9dfd860>

Text(0.5,0,'Frequency')

#Wordcloud

#Include all grams
bigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 2)).value_counts()), columns=["bigram_frequency"])

#Index to column
bigrams['gram_tuples'] = bigrams.index

#Remove tuples on index column values for WordCloud ingest
bigrams['gram_tuples'] = bigrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = bigrams.set_index('gram_tuples').to_dict()['bigram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Bigram WordCloud by Frequency")
plt.show()

Trigrams ¶

#Trigrams
trigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 3)).value_counts())[:50], columns=["trigram_frequency"])
display(trigrams)

#Trigram chart
display(trigrams.sort_values(by=['trigram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,10)))
plt.title('Trigrams: Top 50 by Frequency')
plt.ylabel('Trigram')
plt.xlabel('Frequency')

<matplotlib.axes._subplots.AxesSubplot at 0x161fcb2ce48>

Text(0.5,0,'Frequency')

#Wordcloud

#Include all grams
trigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 3)).value_counts()), columns=["trigram_frequency"])

#Index to column
trigrams['gram_tuples'] = trigrams.index

#Remove tuples on index column values for WordCloud ingest
trigrams['gram_tuples'] = trigrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = trigrams.set_index('gram_tuples').to_dict()['trigram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Trigram WordCloud by Frequency")
plt.show()

Quadgrams ¶

#Quadgrams
quadgrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 4)).value_counts())[:50], columns=["quadgram_frequency"])
display(quadgrams)

#Quadgram chart
display(quadgrams.sort_values(by=['quadgram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,10)))
plt.title('Quadgrams: Top 50 by Frequency')
plt.ylabel('Quadgram')
plt.xlabel('Frequency')

<matplotlib.axes._subplots.AxesSubplot at 0x161fcb0aba8>

Text(0.5,0,'Frequency')

#Wordcloud

#Include all grams
quadgrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 4)).value_counts()), columns=["quadgram_frequency"])

#Index to column
quadgrams['gram_tuples'] = quadgrams.index

#Remove tuples on index column values for WordCloud ingest
quadgrams['gram_tuples'] = quadgrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = quadgrams.set_index('gram_tuples').to_dict()['quadgram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Quadgram WordCloud by Frequency")
plt.show()

Pentgrams ¶

#Pentgrams
pentgrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 5)).value_counts())[:50], columns=["pentgram_frequency"])
display(pentgrams)

#Pentgram chart
display(pentgrams.sort_values(by=['pentgram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,10)))
plt.title('Pentgrams: Top 50 by Frequency')
plt.ylabel('Pentgram')
plt.xlabel('Frequency')

<matplotlib.axes._subplots.AxesSubplot at 0x161fcaca0b8>

Text(0.5,0,'Frequency')

#Wordcloud

#Include all grams
pentgrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 5)).value_counts()), columns=["pentgram_frequency"])

#Index to column
pentgrams['gram_tuples'] = pentgrams.index

#Remove tuples on index column values for WordCloud ingest
pentgrams['gram_tuples'] = pentgrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = pentgrams.set_index('gram_tuples').to_dict()['pentgram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Quadgram WordCloud by Frequency")
plt.show()

#End of Script

	id	program	title	episode_date
0	98814	Morning Edition	Senate Ushers In New Year With 'Fiscal Cliff' ...	2013-01-01
1	98824	Morning Edition	Cheap Bubbly Or Expensive Sparkling Wine? Look...	2012-12-31
2	98821	Morning Edition	U.S. Gas Prices Reach Record Level In 2012	2013-01-01

	id	headline
0	524288	For Some, The Decision To Enlist Offers Direction
1	524289	Whither The Astronauts Without A Shuttle?
2	524292	Tour Winner May Not Be First Over Finish Line

	episode	episode_order	speaker	utterance
0	57264	9	Ms. LOREN MOONEY (Editor-in-Chief, Bicycling M...	It's a 2,200-mile race. To give some sense of ...
1	57264	10	Ms. LOREN MOONEY (Editor-in-Chief, Bicycling M...	So for a top competitor like Lance to try to m...
2	57264	11	NEAL CONAN, host	So in every team, presumably there's one star,...

	program
All Things Considered	41620
Morning Edition	29997
Talk of the Nation	9117
Day to Day	7847
Weekend Edition Saturday	6008
Weekend Edition Sunday	5971
News & Notes	5288

	episode_date
2005-08-23	38
2007-09-04	37
2005-10-12	37
2007-10-16	37
2009-01-23	37
2007-11-16	37
2005-12-08	37
2007-08-14	37
2005-12-20	37
2005-12-13	37
2007-07-06	37
2005-08-16	37
2005-12-07	36
2005-07-15	36
2007-07-24	36
2008-10-03	36
2007-06-28	36
2005-05-09	36
2007-10-26	36
2005-12-14	36
2005-05-30	36
2009-01-21	36
2007-01-26	36
2007-06-05	36
2008-09-18	36
2005-11-03	36
2005-12-19	36
2007-08-21	36
2005-09-15	36
2007-06-21	36
2007-05-25	36
2005-12-05	36
2005-05-02	36
2007-06-01	36
2006-01-04	36
2006-06-22	36
2006-09-15	36
2007-05-10	36
2006-05-09	36
2007-10-23	36
2007-08-07	36
2005-09-13	36
2005-11-14	36
2009-02-24	36
2006-09-19	36
2007-11-06	36
2006-06-07	36
2007-10-09	36
2006-06-23	36
2006-07-21	36
...	...
2004-05-11	1
2005-01-17	1
2004-12-18	1
2004-12-28	1
2004-11-25	1
2003-06-09	1
2003-12-21	1
2003-09-17	1
2004-12-19	1
2003-09-29	1
2003-07-29	1
2004-03-03	1
2003-04-29	1
2004-09-10	1
2003-09-10	1
2003-06-23	1
2003-09-23	1
2002-11-05	1
2002-12-22	1
2004-04-06	1
2005-01-18	1
2004-04-12	1
2003-03-18	1
2003-01-05	1
2005-01-12	1
2004-06-25	1
2003-12-03	1
2002-11-04	1
2004-03-14	1
2005-02-26	1
2005-04-01	1
2004-04-28	1
2003-09-27	1
2003-09-05	1
2002-12-15	1
2004-05-16	1
2003-10-18	1
2004-05-03	1
2003-01-12	1
2002-12-27	1
2003-07-19	1
2004-04-24	1
2003-08-21	1
2004-06-02	1
2003-05-21	1
2003-07-05	1
2002-12-26	1
2003-11-15	1
2004-07-23	1
2004-04-17	1

	episode	episode_order	turn_order	is_host	utterance
0	1	1	0	True	The impeachment inquiry picks up tomorrow wher...
1	1	1	1	True	Just this morning, the lawyer for the whistleb...
2	1	1	2	True	There's are a lot of moving parts.

	id	program	title	episode_date	episode	episode_order	turn_order	host_id	is_host	utterance
0	98820	Morning Edition	Significance Of Kwanzaa Changes Over The Years	2013-01-01	98820	1	0	14	True	Rounding out the holiday season, Kwanza comes ...
1	98820	Morning Edition	Significance Of Kwanzaa Changes Over The Years	2013-01-01	98820	1	1	14	True	It's the only official African-American holiday.
2	98820	Morning Edition	Significance Of Kwanzaa Changes Over The Years	2013-01-01	98820	1	2	14	True	And it began at the height of the 1960s black ...
3	98820	Morning Edition	Significance Of Kwanzaa Changes Over The Years	2013-01-01	98820	2	0	14	True	But the generation that helped create Kwanzaa ...
4	98820	Morning Edition	Significance Of Kwanzaa Changes Over The Years	2013-01-01	98820	2	1	14	True	Journalist Gene Demby recently joined NPR, to ...

	program
All Things Considered	507576
Morning Edition	196701
Weekend Edition Saturday	121117
News & Notes	120275
Talk of the Nation	109329
Weekend Edition Sunday	101975
Day to Day	83139

	id	program	title	episode_date	episode	episode_order	turn_order	speaker_order	host_id	is_host	utterance
372	98866	All Things Considered	In The Voice Of A Border District, A Story Of ...	2014-07-16	98866	2	0	1	-1	False	Well, there's portions of it that I certainly ...
373	98866	All Things Considered	In The Voice Of A Border District, A Story Of ...	2014-07-16	98866	2	1	1	-1	False	We've got to have facilities there, we have ne...
374	98866	All Things Considered	In The Voice Of A Border District, A Story Of ...	2014-07-16	98866	2	2	1	-1	False	And so I will tell you, that having visited de...

	id	program	title	episode_date	episode	episode_order	turn_order	speaker_order	host_id	is_host	utterance
352052	33909	All Things Considered	Vivid Details Revealed As El Chapo Trial Conti...	2019-01-05	33909	3	0	1	-1	False	My pleasure.
352054	33909	All Things Considered	Vivid Details Revealed As El Chapo Trial Conti...	2019-01-05	33909	5	0	1	-1	False	Everything is coming out, Michel.
352055	33909	All Things Considered	Vivid Details Revealed As El Chapo Trial Conti...	2019-01-05	33909	5	1	1	-1	False	This is really the first time that we've seen ...

	episode
31415	97
30714	91
31435	80
30449	80
32300	73
30457	72
32288	70
30165	68
33201	66
32709	65
31227	64
33221	63
32717	62
31928	61
33932	60
32684	57
33203	54
32524	53
32703	53
32693	53
33903	52
33205	52
20509	52
33430	51
30441	51
31436	50
33409	49
31240	49
30693	49
31646	48
32995	45
33920	44
31413	44
33220	44
31656	44
33681	41
33691	41
31407	41
32552	40
30164	40
30197	40
30179	40
19735	39
20488	39
32694	39
31638	39
20238	38
20236	38
31934	38
33685	38
...	...
33928	26
31429	26
33908	26
30690	26
31936	26
30442	25
30716	25
20240	25
20510	25
19732	25
33697	25
33411	24
33909	24
30695	24
30178	24
33006	24
32987	23
32301	23
30440	23
31648	23
31944	23
19725	23
33417	23
32679	23
33437	23
33431	23
33200	22
31953	22
33922	21
32686	21
30180	21
30444	21
32529	20
20248	20
33198	20
31652	20
33199	20
31959	19
32289	19
33011	19
20246	18
30193	18
33704	18
30706	18
33425	16
31961	15
20233	10
33019	7
30723	5
30718	2

	tokens_list
1	[My, pleasure, .]
2	[Everything, is, coming, out, ,, Michel, .]
3	[This, is, really, the, first, time, that, we,...

	unigram_frequency
(president,)	210
(thank,)	192
(time,)	189
(much,)	185
(things,)	162
(years,)	162
(sort,)	155
(yeah,)	155
(mean,)	151
(could,)	149
(see,)	147
(want,)	142
(kind,)	140
(something,)	140
(trump,)	137
(make,)	133
(actually,)	133
(back,)	131
(even,)	127
(first,)	118
(go,)	115
(two,)	113
(need,)	113
(come,)	112
(thing,)	111
(government,)	110
(new,)	109
(many,)	106
(women,)	103
(still,)	101
(take,)	100
(country,)	98
(states,)	95
(fact,)	95
(day,)	94
(us,)	94
(got,)	92
(look,)	91
(every,)	90
(border,)	90
(may,)	89
(house,)	87
(yes,)	86
(around,)	85
(public,)	84
(good,)	84
(question,)	84
(world,)	83
(saying,)	82
(part,)	82
(might,)	81
(work,)	81
(case,)	79
(united,)	79
(put,)	79
(point,)	78
(whether,)	78
(last,)	77
(never,)	76
(important,)	73
(end,)	73
(trying,)	73
(sure,)	71
(big,)	70
(state,)	70
(american,)	69
(different,)	67
(thanks,)	67
(course,)	66
(school,)	66
(money,)	66
(wall,)	65
(says,)	64
(feel,)	64
(system,)	64
(little,)	63
(seen,)	63
(absolutely,)	63
(another,)	63
(made,)	63
(white,)	63
(used,)	62
(today,)	61
(ca,)	61
(problem,)	61
(talking,)	58
(year,)	57
(getting,)	57
(able,)	56
(mueller,)	56
(went,)	55
(number,)	55
(deal,)	55
(place,)	54
(security,)	54
(justice,)	54
(called,)	53
(talk,)	53
(life,)	53
(believe,)	53

	bigram_frequency
(united, states)	73
(thank, much)	41
(president, trump)	40
(little, bit)	30
(minimum, wage)	30
(new, york)	29
(attorney, general)	29
(special, counsel)	26
(years, ago)	24
(thank, thank)	22
(make, sure)	22
(north, korea)	22
(two, years)	20
(every, day)	20
(white, house)	18
(trump, administration)	16
(donald, trump)	16
(national, security)	15
(yeah, yeah)	15
(last, year)	14
(first, time)	14
(thank, ari)	14
(social, media)	14
(border, patrol)	14
(every, single)	14
(islamic, state)	14
(world, cup)	14
(thank, michel)	14
(border, security)	13
(bob, mueller)	13
(supreme, court)	13
(el, paso)	13
(mary, louise)	12
(theresa, may)	12
(around, world)	12
(obstruction, justice)	12
(department, justice)	12
(go, back)	12
(even, though)	11
(high, school)	11
(yeah, mean)	11
(saudi, arabia)	11
(house, senate)	11
(pleasure, thank)	11
(fake, news)	11
(many, many)	10
(end, day)	10
(justice, department)	10
(michael, cohen)	10
(north, koreans)	10

	trigram_frequency
(new, york, times)	8
(kim, jong, un)	7
(love, love, love)	7
(north, korea, united)	6
(president, united, states)	6
(korea, united, states)	6
(thank, thank, much)	5
(attorney, general, barr)	5
(minimum, wage, increase)	4
(new, minimum, wage)	4
(last, two, years)	4
(attorney, general, rod)	4
(general, rod, rosenstein)	4
(government, united, states)	4
(every, single, day)	4
(federal, minimum, wage)	4
(special, counsel, mueller)	4
(minimum, wage, new)	4
(past, couple, years)	4
(yeah, yeah, yeah)	4
(new, york, city)	4
(thank, much, thanks)	4
(warm, ocean, water)	4
(black, brown, kids)	3
(order, new, election)	3
(percent, likely, receive)	3
(trump, inner, circle)	3
(department, homeland, security)	3
(three, years, ago)	3
(end, obama, administration)	3
(authorities, new, york)	3
(state, attorney, office)	3
(pleasure, thank, much)	3
(many, many, years)	3
(unidentified, actor, character)	3
(mary, louise, good)	3
(elite, public, schools)	3
(thank, much, thank)	3
(thank, much, pleasure)	3
(thank, audie, thank)	3
(declare, national, emergency)	3
(specialized, high, schools)	3
(every, day, wake)	3
(thanks, much, thank)	3
(attorney, general, bill)	3
(special, counsel, office)	3
(see, mueller, report)	3
(hey, mary, louise)	3
(general, bill, barr)	3
(house, oversight, committee)	3

	utterances	pct of observations
sentiment
neu	2721	0.393
pos	2697	0.389
neg	1508	0.218

	utterances	pct of observations
sentiment
neu	2273	0.487
pos	1582	0.339
neg	817	0.175

	quadgram_frequency
(love, love, love, love)	6
(north, korea, united, states)	6
(attorney, general, rod, rosenstein)	4
(attorney, general, bill, barr)	3
(special, counsel, bob, mueller)	2
(libertarian, argument, government, set)	2
(leads, reduction, job, growth)	2
(bill, likely, first, pieces)	2
(percent, extra, heat, trapped)	2
(eight, states, minimum, wage)	2
(standard, living, pretty, much)	2
(meaningful, grand, scheme, things)	2
(find, even, someone, working)	2
(important, recognize, economics, research)	2
(general, rod, rosenstein, took)	2
(million, workers, across, country)	2
(wage, set, state, legislature)	2
(government, set, standards, around)	2
(try, vote, minimum, wage)	2
(increase, new, york, city)	2
(established, back, intended, living)	2
(may, minimum, wage, increase)	2
(united, states, united, states)	2
(even, hour, basically, anywhere)	2
(point, probably, economically, meaningful)	2
(minimum, wage, popular, nationally)	2
(set, state, legislature, legislation)	2
(working, full, year, probably)	2
(consider, modest, adequate, standard)	2
(simply, businesses, choose, pay)	2
(costs, come, higher, minimum)	2
(likely, receive, abuse, white)	2
(general, bill, barr, deputy)	2
(water, percent, extra, heat)	2
(mean, special, counsel, mueller)	2
(policy, institute, called, family)	2
(basically, anywhere, country, next)	2
(white, women, black, women)	2
(minimum, wage, new, level)	2
(genuinely, concerned, increase, labor)	2

	pentgram_frequency
(love, love, love, love, love)	5
(living, pretty, much, jurisdiction, country)	2
(eight, states, minimum, wage, went)	2
(percent, likely, receive, abuse, white)	2
(ballot, box, chose, raise, state)	2
(unidentified, actor, character, speaking, german)	2
(minimum, wage, money, need, come)	2
(wage, went, simply, automatic, adjustment)	2
(energy, warm, ocean, water, percent)	2
(policy, institute, called, family, budget)	2
(passed, established, new, minimum, wage)	2
(percent, extra, heat, trapped, inside)	2
(folks, genuinely, concerned, increase, labor)	2
(working, full, year, probably, need)	2
(businesses, choose, pay, workers, purely)	2
(automatically, adjusted, account, increase, prices)	2
(minimum, wage, set, state, legislature)	2
(ones, try, vote, minimum, wage)	2
(words, voters, directly, ballot, box)	2
(wage, set, state, legislature, legislation)	2
(end, year, net, weeks, hours)	2
(economically, meaningful, grand, scheme, things)	2
(raised, federal, minimum, wage, infrequently)	2
(pay, workers, purely, libertarian, argument)	2
(vote, minimum, wage, popular, nationally)	2
(standard, living, pretty, much, jurisdiction)	2
(minimum, wage, increase, leads, reduction)	2
(argument, government, set, standards, around)	2
(gap, today, need, living, wage)	2
(modest, adequate, standard, living, pretty)	2
(economic, policy, institute, called, family)	2
(anything, increase, alaska, hour, increase)	2
(firing, former, fbi, director, jim)	2
(someone, working, full, year, probably)	2
(inadequately, gap, today, need, living)	2
(recognize, economics, research, minimum, wage)	2
(lift, wages, million, workers, across)	2
(pieces, legislation, house, democrats, introduce)	2
(could, first, ones, try, vote)	2
(wage, impact, jobs, fairly, small)	2