NLP Analysis: National Public Radio Transcripts

By Eric Stromgren | November 5, 2021

Executive Summary

This natural language processing (NLP) analysis of National Public Radio (NPR) transcripts samples two-speaker conversations from 'All Things Considered' to reveal the topics show guests discussed most frequently during the first quarter of 2019. It also classifies sentiment for hosts and guests during the sample period.

The analysis revealed a high frequency of discussions involving United States political topics. Frequent subjects included President Donald Trump, the Trump Administration, U.S. southern border security, federal minimum wage, Attorney General Bill Barr, Attorney General Rod Rosenstein, the special counsel investigation by Bob Mueller, North Korea, Saudi Arabia and the Islamic State.

Guest sentiment was more likely to be positive or neutral than negative, while host sentiment was more likely to be neutral than positive or negative. One hypothesis for this result is that guests responded more positively to the hosts' more objective interview style.

Methodology
The primary method driving this NLP analysis is the transformation of 'All Things Considered' conversation text into n-grams. This approach allows frequently discussed content topics to be identified during the sample period.
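
As a minimal sketch of that transformation (the utterance below is a hypothetical stand-in for real transcript text), a single utterance is tokenized and then paired into bigrams with NLTK:

import nltk

utterance = "Border security dominated the conversation."  # hypothetical example text
tokens = [w.lower() for w in nltk.word_tokenize(utterance) if w.isalpha()]  # assumes NLTK 'punkt' data is available
bigrams = list(nltk.ngrams(tokens, 2))
print(bigrams)  # [('border', 'security'), ('security', 'dominated'), ...]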

The sample period, the first quarter of 2019, includes 201 'All Things Considered' episodes containing 6,926 guest utterances and 4,672 host utterances. The sample period of one quarter is chosen as a processing example and is narrow enough to provide relevant slice-of-time insights. Ultimately, other time slices can be used to analyze topic frequency shifts over time. For this case, processing the entire 20-year population of transcript data did not seem logical because doing so would likely hide seasonality. Topics discussed in 1999 may not be relevant in 2019, for example.

A sentiment analysis using the out-of-the-box VADER model from the NLTK Python library is used to reveal the attitudes of guests and hosts. One would expect a mainstream media outlet to practice objective journalism, and there is some support for this hypothesis: host sentiment was classified as 34% positive, 49% neutral and 17% negative. Media guests likely have some agenda or point of view to advance, and there is some support for this hypothesis as well: guest sentiment was classified as 39% positive, 39% neutral and 22% negative. In sum, hosts tended to be more neutral than guests, while guests tended to react with more positivity than negativity.
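
As a sketch of the classification rule applied later in this notebook, a VADER compound score above 0.05 is counted as positive, below -0.05 as negative, and anything in between as neutral (the sentence below is a hypothetical example, not drawn from the dataset):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()  # assumes nltk.download('vader_lexicon') has been run
compound = sid.polarity_scores("Thanks so much for having me.")['compound']
if compound > 0.05:
    label = 'pos'
elif compound < -0.05:
    label = 'neg'
else:
    label = 'neu'
print(label)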

In conclusion, the NPR dataset has similarities with a telecommunications company's customer support call center, and this analysis is viewed through the lens of that structure: company agents and customers interact through conversation to resolve account-related issues. NPR programs can be viewed as a proxy for the company's lines of business, NPR hosts as a proxy for call center agents and NPR program guests as a proxy for customers. This type of NLP analysis, focused on topic frequency, can help a telecommunications business surface potential product feature issues experienced by customers and identify their sentiment.

Next Steps
Understanding what customers are frequently talking about – NPR guests in this case – is a logical first step in identifying product features impacting customers. The sentiment analysis can be improved by developing a customized sentiment dictionary tailored for this specific NPR dataset. A customized dictionary would likely improve sentiment scoring classification accuracy.
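
VADER exposes its lexicon as a plain Python dictionary, so a first pass at a customized dictionary could simply add or re-weight domain terms before scoring. A minimal sketch, with placeholder terms and scores chosen purely for illustration:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# Hypothetical domain adjustments; real values would come from annotating NPR transcripts
custom_lexicon = {
    'shutdown': -1.5,
    'bipartisan': 1.2,
}
sid.lexicon.update(custom_lexicon)

print(sid.polarity_scores("A bipartisan deal could end the shutdown."))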

This analysis focuses on one line of business – 'All Things Considered' in this case – but could be easily applied to other lines of business across different time segments to answer various business questions. For NPR, an increase in topic frequency over time may reflect the movement of public interest and programming decisions. For a telecommunications call center, a topic frequency decrease may indicate lower call volume and decreased friction between product features and customers.
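
A sketch of how such a time comparison could be made concrete, assuming df_atc (the 'All Things Considered' utterance subset built later in this notebook) is in scope; the term 'border' is a placeholder topic:

#Sketch: quarterly frequency of a single topic term across 'All Things Considered' utterances
term = 'border'  # placeholder topic of interest
counts = (df_atc
          .assign(quarter=df_atc['episode_date'].dt.to_period('Q'),
                  hits=df_atc['utterance'].str.lower().str.count(term))
          .groupby('quarter')['hits']
          .sum())
print(counts.tail(8))  # term frequency for the most recent eight quarters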

Exploratory Data Analysis

Dataset Descriptions

Source: https://www.kaggle.com/shuyangli94/interview-npr-media-dialog-transcripts/

episodes.csv: Metadata about all scraped episodes of NPR podcasts, from 1999 to 2019

headlines.csv: Headlines along with their ID.

host-map.json: Dictionary of host ID: name (lowercase name), episodes (list of episode IDs hosted), programs (list of programs hosted)

host_id.json: Dictionary of lowercase host name : host ID

splits-ns2.json: Dictionary of split ("train", "valid", or "test") : list of episode IDs for that split

utterances-2sp.csv: Utterance-level breakdown of all 2-speaker conversations

utterances.csv: Conversation turns for every episode (multi-speaker included)

Python Setup

In [1]:
###libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud

#Increase cell width for browser view on Jupyter Notebook
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

#Display Max Rows on Dataframes = 100
pd.set_option('display.max_rows', 100)
In [2]:
#read JSON
with open('D:/data_projects/npr_analysis/data/host_map.json', 'r') as myfile:
    data=myfile.read()

# parse file
obj = json.loads(data)

#display(obj)

Raw Data

In [3]:
##Read Data
episodes = pd.DataFrame(pd.read_csv('D:/data_projects/npr_analysis/data/episodes.csv'))
print("Episodes")
display(pd.DataFrame(episodes.head(n=3)))

headlines = pd.DataFrame(pd.read_csv('D:/data_projects/npr_analysis/data/headlines.csv'))
print("Headlines")
display(pd.DataFrame(headlines.head(n=3)))

utterances = pd.DataFrame(pd.read_csv('D:/data_projects/npr_analysis/data/utterances.csv'))
print("Utterances")
display(pd.DataFrame(utterances.head(n=3)))

utterances_2 = pd.DataFrame(pd.read_csv('D:/data_projects/npr_analysis/data/utterances-2sp.csv'))
print("2 Speaker Utterances")
display(pd.DataFrame(utterances_2.head(n=3)))
Episodes
id program title episode_date
0 98814 Morning Edition Senate Ushers In New Year With 'Fiscal Cliff' ... 2013-01-01
1 98824 Morning Edition Cheap Bubbly Or Expensive Sparkling Wine? Look... 2012-12-31
2 98821 Morning Edition U.S. Gas Prices Reach Record Level In 2012 2013-01-01
Headlines
id headline
0 524288 For Some, The Decision To Enlist Offers Direction
1 524289 Whither The Astronauts Without A Shuttle?
2 524292 Tour Winner May Not Be First Over Finish Line
Utterances
episode episode_order speaker utterance
0 57264 9 Ms. LOREN MOONEY (Editor-in-Chief, Bicycling M... It's a 2,200-mile race. To give some sense of ...
1 57264 10 Ms. LOREN MOONEY (Editor-in-Chief, Bicycling M... So for a top competitor like Lance to try to m...
2 57264 11 NEAL CONAN, host So in every team, presumably there's one star,...
2 Speaker Utterances
episode episode_order turn_order speaker_order host_id is_host utterance
0 1 1 0 0 0 True The impeachment inquiry picks up tomorrow wher...
1 1 1 1 0 0 True Just this morning, the lawyer for the whistleb...
2 1 1 2 0 0 True There's are a lot of moving parts.

episodes.csv

In [4]:
#Count of Shows
print("Episode Count: " + str(episodes.shape[0]))

#Program Counts with Visualization
display(pd.DataFrame(episodes['program'].value_counts()))

#Program Count Chart
display(episodes.program.value_counts().plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("Program Counts")
Episode Count: 105848
program
All Things Considered 41620
Morning Edition 29997
Talk of the Nation 9117
Day to Day 7847
Weekend Edition Saturday 6008
Weekend Edition Sunday 5971
News & Notes 5288
In [5]:
#Episode Dates
#There are 5,737 unique program dates. There are 7,300 calendar days in 20 years. That means there are 1,563 days with no shows, equivalent of 4.28 years and ~78 days per year without shows. 
display(pd.DataFrame(episodes['episode_date'].value_counts()))
episode_date
2005-08-23 38
2007-09-04 37
2005-10-12 37
2007-10-16 37
2009-01-23 37
2007-11-16 37
2005-12-08 37
2007-08-14 37
2005-12-20 37
2005-12-13 37
2007-07-06 37
2005-08-16 37
2005-12-07 36
2005-07-15 36
2007-07-24 36
2008-10-03 36
2007-06-28 36
2005-05-09 36
2007-10-26 36
2005-12-14 36
2005-05-30 36
2009-01-21 36
2007-01-26 36
2007-06-05 36
2008-09-18 36
2005-11-03 36
2005-12-19 36
2007-08-21 36
2005-09-15 36
2007-06-21 36
2007-05-25 36
2005-12-05 36
2005-05-02 36
2007-06-01 36
2006-01-04 36
2006-06-22 36
2006-09-15 36
2007-05-10 36
2006-05-09 36
2007-10-23 36
2007-08-07 36
2005-09-13 36
2005-11-14 36
2009-02-24 36
2006-09-19 36
2007-11-06 36
2006-06-07 36
2007-10-09 36
2006-06-23 36
2006-07-21 36
... ...
2004-05-11 1
2005-01-17 1
2004-12-18 1
2004-12-28 1
2004-11-25 1
2003-06-09 1
2003-12-21 1
2003-09-17 1
2004-12-19 1
2003-09-29 1
2003-07-29 1
2004-03-03 1
2003-04-29 1
2004-09-10 1
2003-09-10 1
2003-06-23 1
2003-09-23 1
2002-11-05 1
2002-12-22 1
2004-04-06 1
2005-01-18 1
2004-04-12 1
2003-03-18 1
2003-01-05 1
2005-01-12 1
2004-06-25 1
2003-12-03 1
2002-11-04 1
2004-03-14 1
2005-02-26 1
2005-04-01 1
2004-04-28 1
2003-09-27 1
2003-09-05 1
2002-12-15 1
2004-05-16 1
2003-10-18 1
2004-05-03 1
2003-01-12 1
2002-12-27 1
2003-07-19 1
2004-04-24 1
2003-08-21 1
2004-06-02 1
2003-05-21 1
2003-07-05 1
2002-12-26 1
2003-11-15 1
2004-07-23 1
2004-04-17 1

5737 rows × 1 columns
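
A quick sanity check on the coverage arithmetic in the comment above (a sketch using the episodes dataframe already loaded):

unique_dates = episodes['episode_date'].nunique()   # 5,737 unique program dates
missing_days = 20 * 365 - unique_dates              # ~1,563 calendar days with no shows
print(unique_dates, missing_days, round(missing_days / 365, 2), round(missing_days / 20, 1))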

In [6]:
#Convert Episode Date into datetime datatype
episodes['episode_date'] = pd.to_datetime(episodes['episode_date'])

headlines.csv

In [7]:
#Count of Headlines
print("Headline Count: " + str(headlines.shape[0]))
#Takeaway: Not all episodes have headlines
Headline Count: 97437
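
A short sketch of how one might confirm that takeaway, assuming the 'id' column in headlines.csv refers to episode IDs (which the count comparison above implicitly assumes):

#Sketch: episodes whose id has no matching headline id
missing_headlines = episodes[~episodes['id'].isin(headlines['id'])]
print("Episodes without a headline: " + str(missing_headlines.shape[0]))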

NLP Analysis

Data Prep

In [8]:
#Join episodes and utterances table
df_episodes_utterances_2 = pd.merge(episodes, utterances_2, how='inner', left_on = 'id', right_on = 'episode')
df_episodes_utterances_2.head()
Out[8]:
id program title episode_date episode episode_order turn_order speaker_order host_id is_host utterance
0 98820 Morning Edition Significance Of Kwanzaa Changes Over The Years 2013-01-01 98820 1 0 0 14 True Rounding out the holiday season, Kwanza comes ...
1 98820 Morning Edition Significance Of Kwanzaa Changes Over The Years 2013-01-01 98820 1 1 0 14 True It's the only official African-American holiday.
2 98820 Morning Edition Significance Of Kwanzaa Changes Over The Years 2013-01-01 98820 1 2 0 14 True And it began at the height of the 1960s black ...
3 98820 Morning Edition Significance Of Kwanzaa Changes Over The Years 2013-01-01 98820 2 0 0 14 True But the generation that helped create Kwanzaa ...
4 98820 Morning Edition Significance Of Kwanzaa Changes Over The Years 2013-01-01 98820 2 1 0 14 True Journalist Gene Demby recently joined NPR, to ...
In [9]:
#Program Counts with Visualization
display(pd.DataFrame(df_episodes_utterances_2['program'].value_counts()))

#Utterance Count by Program Chart
display(df_episodes_utterances_2.program.value_counts().plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("Utterances by Program")
program
All Things Considered 507576
Morning Edition 196701
Weekend Edition Saturday 121117
News & Notes 120275
Talk of the Nation 109329
Weekend Edition Sunday 101975
Day to Day 83139
In [10]:
#Subset to All Things Considered Programs
df_atc = df_episodes_utterances_2[(df_episodes_utterances_2["program"] == "All Things Considered")]

#Subset All Things Considered Programs to Guest Speaker Utterances Only
df_atc = df_atc[df_atc["is_host"] == False]

df_atc.head(3)
Out[10]:
id program title episode_date episode episode_order turn_order speaker_order host_id is_host utterance
372 98866 All Things Considered In The Voice Of A Border District, A Story Of ... 2014-07-16 98866 2 0 1 -1 False Well, there's portions of it that I certainly ...
373 98866 All Things Considered In The Voice Of A Border District, A Story Of ... 2014-07-16 98866 2 1 1 -1 False We've got to have facilities there, we have ne...
374 98866 All Things Considered In The Voice Of A Border District, A Story Of ... 2014-07-16 98866 2 2 1 -1 False And so I will tell you, that having visited de...
In [11]:
#Subset All Things Considered Guest Utterances to first quarter 2019 (note: the exclusive comparisons below drop episodes dated exactly 2019-01-01 and 2019-03-31)
df_atc_q1_2019_guests = df_atc[(df_atc['episode_date'] > '2019-01-01') & (df_atc['episode_date'] < '2019-03-31')]

print("All Things Considered Guest Utterances (Population): " + str(df_atc.shape[0]))
print()
print("2019 Q1 All Things Considered Guest Utterances: " + str(df_atc_q1_2019_guests.shape[0]))
print()
print("2019 Q1 All Things Considered Guest Utterances (% of Population): " + str(round(df_atc_q1_2019_guests.shape[0]/df_atc.shape[0], 3)))

df_atc_q1_2019_guests.head(3)
All Things Considered Guest Utterances (Population): 312093

2019 Q1 All Things Considered Guest Utterances: 6926

2019 Q1 All Things Considered Guest Utterances (% of Population): 0.022
Out[11]:
id program title episode_date episode episode_order turn_order speaker_order host_id is_host utterance
352052 33909 All Things Considered Vivid Details Revealed As El Chapo Trial Conti... 2019-01-05 33909 3 0 1 -1 False My pleasure.
352054 33909 All Things Considered Vivid Details Revealed As El Chapo Trial Conti... 2019-01-05 33909 5 0 1 -1 False Everything is coming out, Michel.
352055 33909 All Things Considered Vivid Details Revealed As El Chapo Trial Conti... 2019-01-05 33909 5 1 1 -1 False This is really the first time that we've seen ...
In [12]:
#Utterances by Episode. Note, 201 total episodes.
display(pd.DataFrame(df_atc_q1_2019_guests['episode'].value_counts()))

#Utterances by Episode Chart
display(df_atc_q1_2019_guests.episode.value_counts().plot(kind = 'bar',figsize=(40,10)))
plt.style.use('seaborn')
plt.title("Utterances by Episode")
episode
31415 97
30714 91
31435 80
30449 80
32300 73
30457 72
32288 70
30165 68
33201 66
32709 65
31227 64
33221 63
32717 62
31928 61
33932 60
32684 57
33203 54
32524 53
32703 53
32693 53
33903 52
33205 52
20509 52
33430 51
30441 51
31436 50
33409 49
31240 49
30693 49
31646 48
32995 45
33920 44
31413 44
33220 44
31656 44
33681 41
33691 41
31407 41
32552 40
30164 40
30197 40
30179 40
19735 39
20488 39
32694 39
31638 39
20238 38
20236 38
31934 38
33685 38
... ...
33928 26
31429 26
33908 26
30690 26
31936 26
30442 25
30716 25
20240 25
20510 25
19732 25
33697 25
33411 24
33909 24
30695 24
30178 24
33006 24
32987 23
32301 23
30440 23
31648 23
31944 23
19725 23
33417 23
32679 23
33437 23
33431 23
33200 22
31953 22
33922 21
32686 21
30180 21
30444 21
32529 20
20248 20
33198 20
31652 20
33199 20
31959 19
32289 19
33011 19
20246 18
30193 18
33704 18
30706 18
33425 16
31961 15
20233 10
33019 7
30723 5
30718 2

201 rows × 1 columns


Sentiment Analysis

Sentiment Analysis Scoring Samples

In [13]:
#Isolate utterances into list to feed sentiment analyzer
sentences = df_atc_q1_2019_guests['utterance'].tolist()

#Sentiment Scoring Samples
sid = SentimentIntensityAnalyzer()
print("Sentiment Scoring Samples (Guests):")
for sentence in sentences[:3]:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()
Sentiment Scoring Samples (Guests):
My pleasure.
compound: 0.5719, neg: 0.0, neu: 0.213, pos: 0.787, 
Everything is coming out, Michel.
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
This is really the first time that we've seen what the American government knows about the full operations of the Sinaloa drug cartel.
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 

Sentiment Analysis Results: Guests

In [14]:
#Compile Sentiment Scoring
analyzer = SentimentIntensityAnalyzer()
result = {'pos': 0, 'neg': 0, 'neu': 0}
for sentence in sentences:
    score = analyzer.polarity_scores(sentence)
    if score['compound'] > 0.05:
        result['pos'] += 1
    elif score['compound'] < -0.05:
        result['neg'] += 1
    else:
        result['neu'] += 1
    
#Put scoring results into dataframe
scoring = pd.Series(result, name='utterances')
scoring.index.name = 'sentiment'
scoring.reset_index()
scoring = scoring.to_frame()
scoring = scoring.sort_values(by=['utterances'], ascending=False)
scoring['pct of observations'] = round((scoring['utterances'] /  scoring['utterances'].sum()), 3)

#Show scoring Dataframe
display(scoring)

#Scoring Chart
display(scoring.utterances.plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("'All Things Considered' Guest Sentiment For 2019 Q1")
utterances pct of observations
sentiment
neu 2721 0.393
pos 2697 0.389
neg 1508 0.218

Sentiment Analysis Results: Hosts

In [15]:
#Subset to All Things Considered Programs
df_atc = df_episodes_utterances_2[(df_episodes_utterances_2["program"] == "All Things Considered")]

#Subset All Things Considered Programs to Host Speaker Utterances Only
df_atc_hosts = df_atc[df_atc["is_host"] == True]

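#Subset to first quarter 2019 (same exclusive date boundaries as the guest subset above)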
df_atc_q1_2019_hosts = df_atc_hosts[(df_atc_hosts['episode_date'] > '2019-01-01') & (df_atc_hosts['episode_date'] < '2019-03-31')]

#Isolate utterances into list to feed sentiment analyzer
sentences = df_atc_q1_2019_hosts['utterance'].tolist()

#Compile Sentiment Scoring
analyzer = SentimentIntensityAnalyzer()
result = {'pos': 0, 'neg': 0, 'neu': 0}
for sentence in sentences:
    score = analyzer.polarity_scores(sentence)
    if score['compound'] > 0.05:
        result['pos'] += 1
    elif score['compound'] < -0.05:
        result['neg'] += 1
    else:
        result['neu'] += 1
    
#Put scoring results into dataframe
scoring = pd.Series(result, name='utterances')
scoring.index.name = 'sentiment'
scoring.reset_index()
scoring = scoring.to_frame()
scoring = scoring.sort_values(by=['utterances'], ascending=False)
scoring['pct of observations'] = round((scoring['utterances'] /  scoring['utterances'].sum()), 3)

#Show scoring Dataframe
display(scoring)

#Scoring Chart
display(scoring.utterances.plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("'All Things Considered' Host Sentiment For 2019 Q1")
utterances pct of observations
sentiment
neu 2273 0.487
pos 1582 0.339
neg 817 0.175

Tokenization

In [16]:
#Tokenize each guest utterance into a list of tokens
tokens = [nltk.word_tokenize(utterance) for utterance in df_atc_q1_2019_guests['utterance']]
df_tokens = pd.DataFrame({'tokens_list': tokens})

display(df_tokens.head(3))
tokens_list
0 [My, pleasure, .]
1 [Everything, is, coming, out, ,, Michel, .]
2 [This, is, really, the, first, time, that, we,...

Stopwords Removal

In [17]:
#Concatenate tokens into one list to remove stopwords and punctuation
token_list = df_tokens.tokens_list.sum()

#Remove punctuation
words = [word.lower() for word in token_list if word.isalpha()]

#Remove stopwords; add custom stopwords
stop_words = nltk.corpus.stopwords.words('english')
new_stop_words = ['i', 
                  "and", 
                  "think", 
                  'hi', 
                  'know', 
                  'people', 
                  'well',
                  'going',
                  'like',
                  'would',
                  'really',
                  'one',
                  'say',
                  'get',
                  'right',
                  'way',
                  'lot',
                  'said',
                  'also']
stop_words.extend(new_stop_words)

#Words with stopwords edited out
filtered_words = [word for word in words if word not in stop_words]

n-grams

Unigrams

In [18]:
#Unigrams
unigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 1)).value_counts())[:100], columns=["unigram_frequency"])
display(unigrams)

#Unigram chart
display(unigrams.sort_values(by=['unigram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,20)))
plt.title('Unigrams: Top 100 by Frequency')
plt.ylabel('Unigram')
plt.xlabel('Frequency')
unigram_frequency
(president,) 210
(thank,) 192
(time,) 189
(much,) 185
(things,) 162
(years,) 162
(sort,) 155
(yeah,) 155
(mean,) 151
(could,) 149
(see,) 147
(want,) 142
(kind,) 140
(something,) 140
(trump,) 137
(make,) 133
(actually,) 133
(back,) 131
(even,) 127
(first,) 118
(go,) 115
(two,) 113
(need,) 113
(come,) 112
(thing,) 111
(government,) 110
(new,) 109
(many,) 106
(women,) 103
(still,) 101
(take,) 100
(country,) 98
(states,) 95
(fact,) 95
(day,) 94
(us,) 94
(got,) 92
(look,) 91
(every,) 90
(border,) 90
(may,) 89
(house,) 87
(yes,) 86
(around,) 85
(public,) 84
(good,) 84
(question,) 84
(world,) 83
(saying,) 82
(part,) 82
(might,) 81
(work,) 81
(case,) 79
(united,) 79
(put,) 79
(point,) 78
(whether,) 78
(last,) 77
(never,) 76
(important,) 73
(end,) 73
(trying,) 73
(sure,) 71
(big,) 70
(state,) 70
(american,) 69
(different,) 67
(thanks,) 67
(course,) 66
(school,) 66
(money,) 66
(wall,) 65
(says,) 64
(feel,) 64
(system,) 64
(little,) 63
(seen,) 63
(absolutely,) 63
(another,) 63
(made,) 63
(white,) 63
(used,) 62
(today,) 61
(ca,) 61
(problem,) 61
(talking,) 58
(year,) 57
(getting,) 57
(able,) 56
(mueller,) 56
(went,) 55
(number,) 55
(deal,) 55
(place,) 54
(security,) 54
(justice,) 54
(called,) 53
(talk,) 53
(life,) 53
(believe,) 53
In [19]:
#Wordcloud

#Include all grams
unigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 1)).value_counts()), columns=["unigram_frequency"])

#Index to column
unigrams['gram_tuples'] = unigrams.index

#Remove tuples on index column values for WordCloud ingest
unigrams['gram_tuples'] = unigrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = unigrams.set_index('gram_tuples').to_dict()['unigram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Unigram WordCloud by Frequency")
plt.show()

Bigrams

In [20]:
#Bigrams
bigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 2)).value_counts())[:50], columns=["bigram_frequency"])
display(bigrams)

#Bigram chart
display(bigrams.sort_values(by=['bigram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,10)))
plt.title('Bigrams: Top 50 by Frequency')
plt.ylabel('Bigram')
plt.xlabel('Frequency')
bigram_frequency
(united, states) 73
(thank, much) 41
(president, trump) 40
(little, bit) 30
(minimum, wage) 30
(new, york) 29
(attorney, general) 29
(special, counsel) 26
(years, ago) 24
(thank, thank) 22
(make, sure) 22
(north, korea) 22
(two, years) 20
(every, day) 20
(white, house) 18
(trump, administration) 16
(donald, trump) 16
(national, security) 15
(yeah, yeah) 15
(last, year) 14
(first, time) 14
(thank, ari) 14
(social, media) 14
(border, patrol) 14
(every, single) 14
(islamic, state) 14
(world, cup) 14
(thank, michel) 14
(border, security) 13
(bob, mueller) 13
(supreme, court) 13
(el, paso) 13
(mary, louise) 12
(theresa, may) 12
(around, world) 12
(obstruction, justice) 12
(department, justice) 12
(go, back) 12
(even, though) 11
(high, school) 11
(yeah, mean) 11
(saudi, arabia) 11
(house, senate) 11
(pleasure, thank) 11
(fake, news) 11
(many, many) 10
(end, day) 10
(justice, department) 10
(michael, cohen) 10
(north, koreans) 10
In [21]:
#Wordcloud

#Include all grams
bigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 2)).value_counts()), columns=["bigram_frequency"])

#Index to column
bigrams['gram_tuples'] = bigrams.index

#Remove tuples on index column values for WordCloud ingest
bigrams['gram_tuples'] = bigrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = bigrams.set_index('gram_tuples').to_dict()['bigram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Bigram WordCloud by Frequency")
plt.show()

Trigrams

In [22]:
#Trigrams
trigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 3)).value_counts())[:50], columns=["trigram_frequency"])
display(trigrams)

#Trigram chart
display(trigrams.sort_values(by=['trigram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,10)))
plt.title('Trigrams: Top 50 by Frequency')
plt.ylabel('Trigram')
plt.xlabel('Frequency')
trigram_frequency
(new, york, times) 8
(kim, jong, un) 7
(love, love, love) 7
(north, korea, united) 6
(president, united, states) 6
(korea, united, states) 6
(thank, thank, much) 5
(attorney, general, barr) 5
(minimum, wage, increase) 4
(new, minimum, wage) 4
(last, two, years) 4
(attorney, general, rod) 4
(general, rod, rosenstein) 4
(government, united, states) 4
(every, single, day) 4
(federal, minimum, wage) 4
(special, counsel, mueller) 4
(minimum, wage, new) 4
(past, couple, years) 4
(yeah, yeah, yeah) 4
(new, york, city) 4
(thank, much, thanks) 4
(warm, ocean, water) 4
(black, brown, kids) 3
(order, new, election) 3
(percent, likely, receive) 3
(trump, inner, circle) 3
(department, homeland, security) 3
(three, years, ago) 3
(end, obama, administration) 3
(authorities, new, york) 3
(state, attorney, office) 3
(pleasure, thank, much) 3
(many, many, years) 3
(unidentified, actor, character) 3
(mary, louise, good) 3
(elite, public, schools) 3
(thank, much, thank) 3
(thank, much, pleasure) 3
(thank, audie, thank) 3
(declare, national, emergency) 3
(specialized, high, schools) 3
(every, day, wake) 3
(thanks, much, thank) 3
(attorney, general, bill) 3
(special, counsel, office) 3
(see, mueller, report) 3
(hey, mary, louise) 3
(general, bill, barr) 3
(house, oversight, committee) 3
In [23]:
#Wordcloud

#Include all grams
trigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 3)).value_counts()), columns=["trigram_frequency"])

#Index to column
trigrams['gram_tuples'] = trigrams.index

#Remove tuples on index column values for WordCloud ingest
trigrams['gram_tuples'] = trigrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = trigrams.set_index('gram_tuples').to_dict()['trigram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Trigram WordCloud by Frequency")
plt.show()

Quadgrams

In [24]:
#Quadgrams
quadgrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 4)).value_counts())[:50], columns=["quadgram_frequency"])
display(quadgrams)

#Quadgram chart
display(quadgrams.sort_values(by=['quadgram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,10)))
plt.title('Quadgrams: Top 50 by Frequency')
plt.ylabel('Quadgram')
plt.xlabel('Frequency')
quadgram_frequency
(love, love, love, love) 6
(north, korea, united, states) 6
(attorney, general, rod, rosenstein) 4
(attorney, general, bill, barr) 3
(special, counsel, bob, mueller) 2
(libertarian, argument, government, set) 2
(leads, reduction, job, growth) 2
(bill, likely, first, pieces) 2
(percent, extra, heat, trapped) 2
(eight, states, minimum, wage) 2
(standard, living, pretty, much) 2
(meaningful, grand, scheme, things) 2
(find, even, someone, working) 2
(important, recognize, economics, research) 2
(general, rod, rosenstein, took) 2
(million, workers, across, country) 2
(wage, set, state, legislature) 2
(government, set, standards, around) 2
(try, vote, minimum, wage) 2
(increase, new, york, city) 2
(established, back, intended, living) 2
(may, minimum, wage, increase) 2
(united, states, united, states) 2
(even, hour, basically, anywhere) 2
(point, probably, economically, meaningful) 2
(minimum, wage, popular, nationally) 2
(set, state, legislature, legislation) 2
(working, full, year, probably) 2
(consider, modest, adequate, standard) 2
(simply, businesses, choose, pay) 2
(costs, come, higher, minimum) 2
(likely, receive, abuse, white) 2
(general, bill, barr, deputy) 2
(water, percent, extra, heat) 2
(mean, special, counsel, mueller) 2
(policy, institute, called, family) 2
(basically, anywhere, country, next) 2
(white, women, black, women) 2
(minimum, wage, new, level) 2
(genuinely, concerned, increase, labor) 2
(former, fbi, director, jim) 2
(businesses, choose, pay, workers) 2
(welfare, workers, may, minimum) 2
(could, first, ones, try) 2
(net, weeks, hours, working) 2
(year, net, weeks, hours) 2
(pass, budget, without, house) 2
(federal, minimum, wage, bill) 2
(much, prices, gone, preceding) 2
(likely, first, pieces, legislation) 2
In [25]:
#Wordcloud

#Include all grams
quadgrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 4)).value_counts()), columns=["quadgram_frequency"])

#Index to column
quadgrams['gram_tuples'] = quadgrams.index

#Remove tuples on index column values for WordCloud ingest
quadgrams['gram_tuples'] = quadgrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = quadgrams.set_index('gram_tuples').to_dict()['quadgram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Quadgram WordCloud by Frequency")
plt.show()

Pentgrams

In [26]:
#Pentgrams
pentgrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 5)).value_counts())[:50], columns=["pentgram_frequency"])
display(pentgrams)

#Pentgram chart
display(pentgrams.sort_values(by=['pentgram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,10)))
plt.title('Pentgrams: Top 50 by Frequency')
plt.ylabel('Pentgram')
plt.xlabel('Frequency')
pentgram_frequency
(love, love, love, love, love) 5
(living, pretty, much, jurisdiction, country) 2
(eight, states, minimum, wage, went) 2
(percent, likely, receive, abuse, white) 2
(ballot, box, chose, raise, state) 2
(unidentified, actor, character, speaking, german) 2
(minimum, wage, money, need, come) 2
(wage, went, simply, automatic, adjustment) 2
(energy, warm, ocean, water, percent) 2
(policy, institute, called, family, budget) 2
(passed, established, new, minimum, wage) 2
(percent, extra, heat, trapped, inside) 2
(folks, genuinely, concerned, increase, labor) 2
(working, full, year, probably, need) 2
(businesses, choose, pay, workers, purely) 2
(automatically, adjusted, account, increase, prices) 2
(minimum, wage, set, state, legislature) 2
(ones, try, vote, minimum, wage) 2
(words, voters, directly, ballot, box) 2
(wage, set, state, legislature, legislation) 2
(end, year, net, weeks, hours) 2
(economically, meaningful, grand, scheme, things) 2
(raised, federal, minimum, wage, infrequently) 2
(pay, workers, purely, libertarian, argument) 2
(vote, minimum, wage, popular, nationally) 2
(standard, living, pretty, much, jurisdiction) 2
(minimum, wage, increase, leads, reduction) 2
(argument, government, set, standards, around) 2
(gap, today, need, living, wage) 2
(modest, adequate, standard, living, pretty) 2
(economic, policy, institute, called, family) 2
(anything, increase, alaska, hour, increase) 2
(firing, former, fbi, director, jim) 2
(someone, working, full, year, probably) 2
(inadequately, gap, today, need, living) 2
(recognize, economics, research, minimum, wage) 2
(lift, wages, million, workers, across) 2
(pieces, legislation, house, democrats, introduce) 2
(could, first, ones, try, vote) 2
(wage, impact, jobs, fairly, small) 2
(preceding, year, minimum, wage, automatically) 2
(concerned, increase, labor, costs, come) 2
(remember, left, home, pearl, cadillac) 2
(wage, new, minimum, wage, set) 2
(earning, significantly, higher, wage, otherwise) 2
(tell, takes, consider, modest, adequate) 2
(ocean, water, percent, extra, heat) 2
(competitors, facing, additional, labor, costs) 2
(year, minimum, wage, automatically, adjusted) 2
(expect, federal, minimum, wage, bill) 2
In [27]:
#Wordcloud

#Include all grams
pentgrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 5)).value_counts()), columns=["pentgram_frequency"])

#Index to column
pentgrams['gram_tuples'] = pentgrams.index

#Remove tuples on index column values for WordCloud ingest
pentgrams['gram_tuples'] = pentgrams['gram_tuples'].str.join(' ').str.strip()

#Set dataframe for Wordcloud
data = pentgrams.set_index('gram_tuples').to_dict()['pentgram_frequency']

#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Quadgram WordCloud by Frequency")
plt.show()
In [28]:
#End of Script