By Eric Stromgren | November 5, 2021
Summary
This natural language processing (NLP) analysis of National Public Radio (NPR) transcripts samples 'All Things Considered' two-speaker conversations to reveal the topics show guests discussed most frequently during the first quarter of 2019. It also classifies host and guest sentiment over the sample period.
The analysis revealed a high frequency of discussions involving United States political topics. Frequent subjects included President Donald Trump, the Trump Administration, U.S. southern border security, the federal minimum wage, Attorney General Bill Barr, Deputy Attorney General Rod Rosenstein, the special counsel investigation led by Robert Mueller, North Korea, Saudi Arabia and the Islamic State.
Guest sentiment was more likely to be positive or neutral than negative, while host sentiment was more likely to be neutral than positive or negative. One hypothesis for this result is that guests responded more positively to a more objective host interviewing style.
Methodology
The primary method driving this NLP analysis is the transformation of 'All Things Considered' conversation text into n-grams, which allows the frequent content topics of the sample period to be identified.
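As a brief illustration (using the same nltk.ngrams call as the analysis code below), a short tokenized phrase yields the following bigrams; the phrase itself is only an example:
import nltk
list(nltk.ngrams(['southern', 'border', 'security', 'funding'], 2))
#Returns: [('southern', 'border'), ('border', 'security'), ('security', 'funding')]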
The sample period, the first quarter of 2019, includes 201 'All Things Considered' episodes containing 6,926 guest utterances and 4,672 host utterances. The sample period of one quarter is chosen as a processing example and is narrow enough to provide relevant slice-of-time insights. Ultimately, other time slices can be used to analyze topic frequency shifts over time. For this case, processing the entire 20-year population of transcript data did not seem logical because doing so would likely hide seasonality. Topics discussed in 1999 may not be relevant in 2019, for example.
A sentiment analysis using the out-of-the-box VADER model from the NLTK Python library is used to reveal the attitudes of guests and hosts. One would expect a mainstream media outlet to practice objective journalism, and there is some support for this hypothesis: host sentiment was classified 34% positive, 49% neutral and 17% negative. Media guests likely have some agenda or point of view to advance, and there is some support for this hypothesis as well: guest sentiment was classified 39% positive, 39% neutral and 22% negative. In sum, hosts tended to be more neutral than guests, while guests tended to react with more positivity than negativity.
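For reference, the classification rule applied in the code below labels each utterance by its VADER compound score using the conventional +/-0.05 cutoffs; the sample sentence here is illustrative only:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
score = sid.polarity_scores("I think the hearing went remarkably well.")
#compound ranges from -1 (most negative) to +1 (most positive)
label = 'pos' if score['compound'] > 0.05 else ('neg' if score['compound'] < -0.05 else 'neu')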
In conclusion, the NPR dataset has similarities with a telecommunications company customer support call center. This analysis is viewed through the lens of a call center, where company agents and customers interact through conversation to resolve account-related issues. NPR programs can be viewed as a proxy for the company's lines of business, NPR hosts as a proxy for call center agents and NPR program guests as a proxy for customers. An NLP analysis focused on topic frequency can help a telecommunications business surface potential product feature issues experienced by customers and gauge their sentiment.
Next Steps
Understanding what customers – NPR guests in this case – are frequently talking about is a logical first step in identifying product features that impact customers. The sentiment analysis could be improved by developing a customized sentiment dictionary tailored to this specific NPR dataset, which would likely improve classification accuracy.
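As a rough sketch of that idea (assuming NLTK's VADER implementation, whose SentimentIntensityAnalyzer exposes its lexicon as a plain dictionary), domain terms could be layered onto the stock lexicon; the words and valence scores below are placeholders rather than tuned values:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
#Placeholder domain terms and valence scores, for illustration only
sid.lexicon.update({'shutdown': -1.5, 'bipartisan': 1.0})
sid.polarity_scores("The shutdown ended with a bipartisan agreement.")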
This analysis focuses on one line of business – 'All Things Considered' in this case – but could be easily applied to other lines of business across different time segments to answer various business questions. For NPR, an increase in topic frequency over time may reflect the movement of public interest and programming decisions. For a telecommunications call center, a topic frequency decrease may indicate lower call volume and decreased friction between product features and customers.
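A minimal sketch of that kind of time-slice comparison, using the merged utterance dataframe built in the code below (the topic term is a placeholder):
topic = 'border'
mentions_by_quarter = (
    df_episodes_utterances_2
    .assign(quarter=df_episodes_utterances_2['episode_date'].dt.to_period('Q'),
            mentions=df_episodes_utterances_2['utterance'].str.lower().str.count(topic))
    .groupby('quarter')['mentions'].sum()
)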
Source: https://www.kaggle.com/shuyangli94/interview-npr-media-dialog-transcripts/
episodes.csv: Metadata about all scraped episodes of NPR podcasts, from 1999 to 2019
headlines.csv: Headlines along with their ID.
host-map.json: Dictionary of host ID: name (lowercase name), episodes (list of episode IDs hosted), programs (list of programs hosted)
host_id.json: Dictionary of lowercase host name : host ID
splits-ns2.json: Dictionary of split ("train", "valid", or "test") : list of episode IDs for that split
utterances-2sp.csv: Utterance-level breakdown of all 2-speaker conversations
utterances.csv: Conversation turns for every episode (multi-speaker included)
### Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud
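#NLTK resource downloads (assumed one-time setup): word_tokenize needs 'punkt',
#the stopword list needs 'stopwords' and VADER needs 'vader_lexicon'
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')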
#Increase cell width for browser view on Jupyter Notebook
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
#Display Max Rows on Dataframes = 100
pd.set_option('display.max_rows', 100)
#read JSON
with open('D:/data_projects/npr_analysis/data/host_map.json', 'r') as myfile:
data=myfile.read()
# parse file
obj = json.loads(data)
#display(obj)
##Read Data
episodes = pd.read_csv('D:/data_projects/npr_analysis/data/episodes.csv')
print("Episodes")
display(episodes.head(n=3))
headlines = pd.read_csv('D:/data_projects/npr_analysis/data/headlines.csv')
print("Headlines")
display(headlines.head(n=3))
utterances = pd.read_csv('D:/data_projects/npr_analysis/data/utterances.csv')
print("Utterances")
display(utterances.head(n=3))
utterances_2 = pd.read_csv('D:/data_projects/npr_analysis/data/utterances-2sp.csv')
print("2 Speaker Utterances")
display(utterances_2.head(n=3))
#Count of Shows
print("Episode Count: " + str(episodes.shape[0]))
#Program Counts with Visualization
display(pd.DataFrame(episodes['program'].value_counts()))
#Program Count Chart
display(episodes.program.value_counts().plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("Program Counts")
#Episode Dates
#There are 5,737 unique program dates. There are 7,300 calendar days in 20 years, which leaves 1,563 days with no shows, the equivalent of 4.28 years, or ~78 days per year without a show.
display(pd.DataFrame(episodes['episode_date'].value_counts()))
#Convert Episode Date into datetime datatype
episodes['episode_date'] = pd.to_datetime(episodes['episode_date'])
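#Quick sanity check: the episode dates should span the 1999-2019 range described above
print("Earliest episode date: " + str(episodes['episode_date'].min()))
print("Latest episode date: " + str(episodes['episode_date'].max()))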
#Count of Headlines
print("Headline Count: " + str(headlines.shape[0]))
#Takeaway: Not all episodes have headlines
#Join episodes and utterances table
df_episodes_utterances_2 = pd.merge(episodes, utterances_2, how='inner', left_on = 'id', right_on = 'episode')
df_episodes_utterances_2.head()
#Program Counts with Visualization
display(pd.DataFrame(df_episodes_utterances_2['program'].value_counts()))
#Program Count Chart
display(df_episodes_utterances_2.program.value_counts().plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("Program Counts")
#Subset to All Things Considered Programs
df_atc = df_episodes_utterances_2[(df_episodes_utterances_2["program"] == "All Things Considered")]
#Subset All Things Considered Programs to Guest Speaker Utterances Only
df_atc = df_atc[df_atc["is_host"] == False]
df_atc.head(3)
#Subset All Things Considered Guest Utterances to first quarter 2019
df_atc_q1_2019_guests = df_atc[(df_atc['episode_date'] >= '2019-01-01') & (df_atc['episode_date'] <= '2019-03-31')]
print("All Things Considered Guest Utterances (Population): " + str(df_atc.shape[0]))
print()
print("2019 Q1 All Things Considered Guest Utterances: " + str(df_atc_q1_2019_guests.shape[0]))
print()
print("2019 Q1 All Things Considered Guest Utterances (% of Population): " + str(round(df_atc_q1_2019_guests.shape[0]/df_atc.shape[0], 3)))
df_atc_q1_2019_guests.head(3)
#Utterances by Episode. Note, 201 total episodes.
display(pd.DataFrame(df_atc_q1_2019_guests['episode'].value_counts()))
#Utterances by Episode Chart
display(df_atc_q1_2019_guests.episode.value_counts().plot(kind = 'bar',figsize=(40,10)))
plt.style.use('seaborn')
plt.title("Utterances by Episode")
#Isolate utterances into list to feed sentiment analyzer
sentences = df_atc_q1_2019_guests['utterance'].tolist()
#Sentiment Scoring Samples
print("Sentiment Scoring Samples (Guests):")
sid = SentimentIntensityAnalyzer()
for sentence in sentences[:3]:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()
#Compile Sentiment Scoring
#Each utterance is labeled using VADER's compound score with the conventional +/-0.05 neutrality band
analyzer = SentimentIntensityAnalyzer()
result = {'pos': 0, 'neg': 0, 'neu': 0}
for sentence in sentences:
    score = analyzer.polarity_scores(sentence)
    if score['compound'] > 0.05:
        result['pos'] += 1
    elif score['compound'] < -0.05:
        result['neg'] += 1
    else:
        result['neu'] += 1
#Put scoring results into dataframe
scoring = pd.Series(result, name='utterances')
scoring.index.name = 'sentiment'
scoring.reset_index()
scoring = scoring.to_frame()
scoring = scoring.sort_values(by=['utterances'], ascending=False)
scoring['pct of observations'] = round((scoring['utterances'] / scoring['utterances'].sum()), 3)
#Show scoring Dataframe
display(scoring)
#Scoring Chart
display(scoring.utterances.plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("'All Things Considered' Guest Sentiment For 2019 Q1")
#Subset to All Things Considered Programs
df_atc = df_episodes_utterances_2[(df_episodes_utterances_2["program"] == "All Things Considered")]
#Subset All Things Considered Programs to Host Speaker Utterances Only
df_atc_hosts = df_atc[df_atc["is_host"] == True]
df_atc_q1_2019_hosts = df_atc_hosts[(df_atc_hosts['episode_date'] >= '2019-01-01') & (df_atc_hosts['episode_date'] <= '2019-03-31')]
#Isolate utterances into list to feed sentiment analyzer
sentences = df_atc_q1_2019_hosts['utterance'].tolist()
#Compile Sentiment Scoring
analyzer = SentimentIntensityAnalyzer()
result = {'pos': 0, 'neg': 0, 'neu': 0}
for sentence in sentences:
    score = analyzer.polarity_scores(sentence)
    if score['compound'] > 0.05:
        result['pos'] += 1
    elif score['compound'] < -0.05:
        result['neg'] += 1
    else:
        result['neu'] += 1
#Put scoring results into dataframe
scoring = pd.Series(result, name='utterances')
scoring.index.name = 'sentiment'
scoring.reset_index()
scoring = scoring.to_frame()
scoring = scoring.sort_values(by=['utterances'], ascending=False)
scoring['pct of observations'] = round((scoring['utterances'] / scoring['utterances'].sum()), 3)
#Show scoring Dataframe
display(scoring)
#Scoring Chart
display(scoring.utterances.plot(kind = 'bar'))
plt.style.use('seaborn')
plt.title("'All Things Considered' Host Sentiment For 2019 Q1")
#Tokenize each guest utterance into a list of word tokens
df_tokens = pd.DataFrame(
    {'tokens_list': [nltk.word_tokenize(u) for u in df_atc_q1_2019_guests['utterance']]}
)
display(df_tokens.head(3))
#Concatenate tokens into one list to remove stopwords and punctuation
token_list = df_tokens.tokens_list.sum()
token_list
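#Note: summing a Series of lists concatenates them into a single flat token list;
#for larger samples, itertools.chain.from_iterable would be a faster equivalent.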
#Remove punctuation
words=[word.lower() for word in token_list if word.isalpha()]
#Remove stopwords; add custom stopwords
stop_words = nltk.corpus.stopwords.words('english')
new_stop_words = ['i', 'and', 'think', 'hi', 'know', 'people', 'well', 'going', 'like',
                  'would', 'really', 'one', 'say', 'get', 'right', 'way', 'lot', 'said', 'also']
stop_words.extend(new_stop_words)
#Words with stopwords edited out
filtered_words = [word for word in words if word not in stop_words]
#Unigrams
unigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 1)).value_counts())[:100], columns=["unigram_frequency"])
display(unigrams)
#Unigram chart
display(unigrams.sort_values(by=['unigram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,20)))
plt.title('Unigrams: Top 100 by Frequency')
plt.ylabel('Unigram')
plt.xlabel('Frequency')
#Wordcloud
#Include all grams
unigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 1)).value_counts()), columns=["unigram_frequency"])
#Index to column
unigrams['gram_tuples'] = unigrams.index
#Remove tuples on index column values for WordCloud ingest
unigrams['gram_tuples'] = unigrams['gram_tuples'].str.join(' ').str.strip()
#Set dataframe for Wordcloud
data = unigrams.set_index('gram_tuples').to_dict()['unigram_frequency']
#Generate Wordcloud
wc = WordCloud(width=1000, height=1000).generate_from_frequencies(data)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title("Unigram WordCloud by Frequency")
plt.show()
#Bigrams
bigrams = pd.DataFrame((pd.Series(nltk.ngrams(filtered_words, 2)).value_counts())[:50], columns=["bigram_frequency"])
display(bigrams)
#Bigram chart
display(bigrams.sort_values(by=['bigram_frequency'], ascending=True).plot(kind = 'barh', figsize=(12,10)))
plt.title('Bigrams: Top 50 by Frequency')
plt.ylabel('Bigram')
plt.xlabel('Frequency')