Identifying bias in the media with sentiment analysis

07 February 2017

data-science


There is an age-old debate in the US about how the press and other media outlets may be politically biased. This has been particularly salient during the recent Clinton-Trump presidential campaign and since the election took place at the end of 2016. I have personally followed the whole campaign quite closely and have tried to regularly watch several TV news channels in order to absorb different views and perspectives. I've also developed my own subjective sense of what side of the political spectrum each channel tends to lean toward. Yet, most channels claim to be independent and unbiased (e.g. Fox News' slogan is "Fair & Balanced"). So I thought it'd be interesting to try and find objective ways to measure how biased media outlets actually are.

In this article, I describe a case study that I've recently worked on. First, I explain the study's methodology, which is based on the sentiment analysis of videos published on Youtube by a number of prominent American TV channels. I then present some details about the acquired dataset before laying out the study's results. Finally, I discuss some limitations of the study and possible areas for improvement.

All the code that I've written to support this study has been published on Github (https://github.com/jphalip/media-bias). Feel free to download that code if you're interested in running similar studies yourself.

Methodology

For this study I've decided to analyze videos posted on Youtube by some of the most prominent American TV news channels, including the so-called "Big Three" of cable news (Fox News, CNN & MSNBC) and CBS News. One may argue that not all content aired on a given TV channel necessarily ends up being published on that channel's Youtube account, and that this may therefore skew the results. My counter-argument is that, by curating videos on their Youtube account for online consumption, a TV channel likely exposes even greater bias, making the Youtube data all the more pertinent for the purposes of this study.

It was fairly easy to download the video metadata for all channels (titles, descriptions, publication dates, etc.) using the Youtube API: https://github.com/jphalip/media-bias/blob/master/code/youtube_api.py
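
To give a sense of how that works, here is a minimal sketch (assuming the google-api-python-client library and a hypothetical API key; the repository's actual code is linked above) that pages through a channel's "uploads" playlist:

from googleapiclient.discovery import build

# Hypothetical API key; the code linked above manages its own credentials.
youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')

def fetch_video_metadata(username):
    # Look up the channel's "uploads" playlist, which lists every published video.
    channel = youtube.channels().list(
        part='contentDetails', forUsername=username).execute()
    uploads_id = channel['items'][0]['contentDetails']['relatedPlaylists']['uploads']

    # Page through the playlist, collecting each video's ID, title,
    # description and publication date.
    videos, page_token = [], None
    while True:
        response = youtube.playlistItems().list(
            part='snippet', playlistId=uploads_id,
            maxResults=50, pageToken=page_token).execute()
        for item in response['items']:
            snippet = item['snippet']
            videos.append({
                'youtube_id': snippet['resourceId']['videoId'],
                'title': snippet['title'],
                'description': snippet['description'],
                'published_at': snippet['publishedAt'],
            })
        page_token = response.get('nextPageToken')
        if page_token is None:
            return videos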

Once all the video metadata was acquired, I selected a range of political topics ("Obama", "Clinton", "Trump", "Democrats", "Republicans", "Conservatives", "Liberals") and extracted the relevant videos that mentioned those topics. Titles that contained different variants of the same topic's word were flagged for that topic ("Mr. Obama Goes To Washington" and "The Obamas Vacation in N.C." were both flagged for the "Obama" topic). This approach also means that, for example, videos about Melania Trump were flagged for the "Trump" topic and videos about Bill Clinton were flagged for the "Clinton" topic. Also, people's first names were not used for filtering and flagging videos, as they are too generic and could have yielded false positives (e.g. "Donald" could have yielded videos about Donald Rumsfeld or Donald Sterling).
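
For illustration, here is a rough sketch of that flagging logic (the variant lists below are hypothetical examples; the study's actual lists and logic live in topics.csv and the create_topic_columns() function in the repository):

import re

# Hypothetical variant lists for two of the topics.
TOPIC_VARIANTS = {
    'obama': ['obama', 'obamas', "obama's"],
    'trump': ['trump', 'trumps', "trump's"],
}

def mentions_topic(title, variants):
    # True if any variant appears as a whole word in the title.
    pattern = r'\b(%s)\b' % '|'.join(re.escape(v) for v in variants)
    return re.search(pattern, title.lower()) is not None

for slug, variants in TOPIC_VARIANTS.items():
    all_videos[slug] = all_videos.title.apply(
        lambda title: mentions_topic(title, variants))
all_videos['relevant'] = all_videos[list(TOPIC_VARIANTS)].any(axis=1)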

Then I analyzed the sentiment of all relevant video titles. Oxford Dictionaries defines "sentiment analysis" as follows:

The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral. (Source: Oxford Dictionaries)

The goal is to find out whether sentiment analysis can help identify differences in how channels treat contentious political topics. This isn't the first time that sentiment analysis has been used for that purpose (see for example this study showcased in the Washington Post). However, I think that my study is somewhat unique in the way it breaks the data down into multiple topics.

When looking through the data I noticed that the video descriptions often contained some generic or irrelevant text (e.g. "Follow msnbc on Tumblr" or "Check out FOX 411 for more entertainment news and gossip"). I wanted to prevent that irrelevant text from influencing the sentiment analysis results. Cleaning up that data would have been extremely time-consuming and tedious, so I ignored all the video descriptions and only analyzed the sentiment of the video titles. I would posit that Youtube users generally pay much more attention to video titles than to their descriptions anyway.

The sentiment analysis was performed using the Google Natural Language API. It's pretty simple and straightforward: you send a piece of text to the API server, which then analyzes that text and returns a sentiment score between -1.0 (negative sentiment) and 1.0 (positive sentiment) as determined by Google's machine learning algorithms. The complete code for downloading sentiment scores for all videos and for saving the data to disk can be found here: https://github.com/jphalip/media-bias/blob/master/code/language_api.py
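
For reference, here is a minimal sketch of such a call, assuming a recent version of the google-cloud-language client library (which may differ from the client used in the linked code):

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def sentiment_score(text):
    # Send the text to the API and return the document-level sentiment
    # score, a float between -1.0 (negative) and 1.0 (positive).
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT)
    response = client.analyze_sentiment(request={'document': document})
    return response.document_sentiment.score

print(sentiment_score("Trump: It's amazing that I did so well in SC"))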

Note that the Google Natural Language API is not free. The sentiment analysis was run once for each of the ~30K relevant videos that I had collected, which in the end cost me around US$30. If you're interested in running similar experiments, make sure to first refer to the API's pricing page for adequate budgeting.

Alright, now we're ready to do some exploration. Let's jump right in!

Data exploration

First, let's import all the code libraries that we need in order to perform our work:

In [26]:
from __future__ import division
from IPython.display import display
from datetime import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from code.utils import show_videos
from code.plotting import plot_channel_stats, plot_compressed_channel_stats, plot_sentiment_series
%matplotlib inline

Here are the channels considered for this study:

In [3]:
channels = pd.read_csv('channels.csv')
channels[['title', 'url', 'color']]
Out[3]:
title url color
0 Fox News https://www.youtube.com/user/FoxNewsChannel #5975a4
1 CNN https://www.youtube.com/user/CNN #b55d60
2 MSNBC https://www.youtube.com/user/msnbcleanforward #5f9e6e
3 CBS News https://www.youtube.com/user/CBSNewsOnline #666666

(Note: The color attributes are just arbitrary colors that will be used to visually differentiate the channels in all the graphs featured later in this article.)

Here are the topics that I've chosen to focus on:

In [4]:
topics = pd.read_csv('topics.csv')
topics[['title', 'slug']]
Out[4]:
title slug
0 Obama obama
1 Clinton clinton
2 Trump trump
3 Democrats democrats
4 Republicans republicans
5 Liberals liberals
6 Conservatives conservatives

(Note: The slug attributes are the names used for the columns in the video dataset that flag videos relating to specific topics — See the video dataset in question right below.)

Now let's take a look at the video data. Here's a small sample below:

In [5]:
all_videos = pd.read_csv('videos.csv', parse_dates=['published_at'])
all_videos.head(3)
Out[5]:
channel_youtube_id description published_at title youtube_id channel obama clinton trump democrats conservatives relevant republicans liberals
0 UCXIJgqnII2ZOINSWNOGFThA Fox News contributor explains 2017-02-03 16:59:45 Turner: New sanctions show Trump's change of h... mMwcBRjOhTE Fox News False False True False False True False False
1 UCXIJgqnII2ZOINSWNOGFThA Police arrested protesters during the event 2017-02-03 16:59:19 Violent protests erupt over conservative speak... RJCvUuoyJvE Fox News False False False False True True False False
2 UCXIJgqnII2ZOINSWNOGFThA Controversy over president's call with Austral... 2017-02-03 16:00:55 Schlapp, Williams debate Trump's tone with for... v9cx1l6GJdI Fox News False False True False False True False False

The youtube_id is the unique ID assigned by Youtube to each video (Tip: you may watch any video by visiting the URL https://www.youtube.com/watch?v=[insert the video youtube_id here]). Note that there is a separate column for each topic that is named after the topic's slug (e.g. obama, democrats, conservatives) and that flags videos as being relevant to the corresponding topic (i.e. the value is True if the topic is mentioned in the video title, or False otherwise). The relevant column is a flag that is True only if at least one of the topics is mentioned in the video title — that column allows us to easily extract videos that are directly relevant to this study. You can see how all those columns were calculated and pre-processed by referring to the create_topic_columns() function published here: https://github.com/jphalip/media-bias/blob/master/code/utils.py

Now let's look at some general statistics about our dataset:

In [6]:
num_relevant = all_videos.relevant.sum()
num_total = all_videos.shape[0]
print('Number of relevant videos: %s' % num_relevant)
print('Total number of videos: %s' % num_total)
print('Percentage of relevant videos: %0.2f%%' % (100*num_relevant/num_total))
Number of relevant videos: 33710
Total number of videos: 186571
Percentage of relevant videos: 18.07%

So the chosen topics are covered in about 18% of all videos ever published by the selected channels, which I'd argue is sufficiently significant for the purposes of this study. Let's now drill down a bit further and see to what extent those topics are covered overall by each channel:

In [7]:
channel_stats = pd.DataFrame({
    'relevant': all_videos.groupby('channel').relevant.sum().astype(int),
    'total': all_videos.groupby('channel').size()
})
channel_stats['percentage_relevant'] = (100*channel_stats.relevant/channel_stats.total).round(2)
channel_stats.sort_values('percentage_relevant', ascending=False)
Out[7]:
relevant total percentage_relevant
channel
MSNBC 2606 6120 42.58
Fox News 10753 28231 38.09
CNN 14855 100000 14.86
CBS News 5496 52220 10.52

Fox News and MSNBC both cover those topics quite extensively (in about 40% of all their published videos), whereas CBS News and CNN both seem to cover many other kinds of topics as well (probably sports, science, entertainment, etc.). This indicates that Fox News and MSNBC are both quite focused on politics.

We can now see quantitatively how much (in absolute numbers) each individual topic is covered by those channels:

In [10]:
absolutes = all_videos.groupby('channel')[topics.slug].sum().astype(int)
display(absolutes)
obama clinton trump democrats republicans liberals conservatives
channel
CBS News 3141 913 1545 101 175 15 34
CNN 8235 2799 4123 149 216 19 72
Fox News 987 3304 6754 273 526 73 129
MSNBC 250 663 1773 74 82 1 16

Some initial observations can be made from the above table:

  • CNN talks a lot about Obama, a lot more so than other channels.
  • The term "Liberals" in video titles seems to be mostly used by Fox News. MSNBC almost never mentions it.
  • Trump has been covered about twice as much as Clinton.

One way to evaluate how important each topic is to each channel is to calculate the percentage of its videos covering that topic relative to the total number of videos that the channel has published:

In [11]:
totals = all_videos.groupby('channel').size()
relatives = 100 * absolutes.divide(totals, axis=0)
display(relatives)
obama clinton trump democrats republicans liberals conservatives
channel
CBS News 6.014937 1.748372 2.958637 0.193412 0.335121 0.028725 0.065109
CNN 8.235000 2.799000 4.123000 0.149000 0.216000 0.019000 0.072000
Fox News 3.496157 11.703447 23.924055 0.967022 1.863200 0.258581 0.456944
MSNBC 4.084967 10.833333 28.970588 1.209150 1.339869 0.016340 0.261438

Those percentages can be illustrated as follows:

In [12]:
plot_channel_stats(relatives, topics, channels, title='Relative topic coverage\n(% of total # of each channel\'s videos)')

Based on those graphs we can draw a couple more observations:

  • Obama is mentioned in 3-8% of videos from all channels. That is not too surprising given that he's been president for 8 years.
  • Fox News and MSNBC have mentioned Trump in about a quarter of their videos; they've mentioned Clinton only about half as often.

(Note: If you're interested in checking out the code written to generate those graphs, please refer to: https://github.com/jphalip/media-bias/blob/master/code/plotting.py)

Sentiment analysis

Overall sentiments

Alright, this is where things get much more interesting. As mentioned earlier, the sentiment of all relevant videos was calculated using the Google Natural Language API. The sentiment scores were stored in a separate CSV file. Here is a small sample:

In [13]:
sentiments = pd.read_csv('sentiments.csv')
sentiments[['youtube_id', 'sentiment_score']].head()
Out[13]:
youtube_id sentiment_score
0 rkLZEHl6gtc -0.7
1 v9zqWRzaE0c 0.2
2 Yv2OzJoZtzw -0.6
3 d9CdoVvG72U 0.4
4 fcCZunx-Ayw 0.3

The youtube_id column contains the unique video IDs. The sentiment_score column contains the scores (between -1 and 1) for all relevant videos. A score of 0 would correspond to neutral sentiment, a score of -1 to extremely negative sentiment, and a score of 1 to extremely positive sentiment.

Let's merge the sentiment scores into the main video dataset and then look at a small sample of videos with positive and negative sentiments:

In [14]:
videos = all_videos[all_videos.relevant].merge(sentiments, on='youtube_id')
In [16]:
# Some videos with negative sentiment:
videos.sort_values('sentiment_score')[['channel', 'title', 'sentiment_score', 'youtube_id']].head(4)
Out[16]:
channel title sentiment_score youtube_id
15112 CNN OBAMA STATEMENT ON FORCED BUDGET CUTS- WALK UP -0.9 oy1NAgE4j2U
11081 CNN Spicer: Russia-Trump report is disgraceful -0.9 -3xtbsxOnRo
32072 CBS News Huntsman: Obama has weakened the United States -0.9 LNIYsmIoYh8
16873 CNN WH BRIEFING-OBAMA WON'T HOLD US HOSTAGE -0.9 Q_1_l8IAO1I
In [17]:
# Some videos with positive sentiment:
videos.sort_values('sentiment_score', ascending=False)[['channel', 'title', 'sentiment_score', 'youtube_id']].head(4)
Out[17]:
channel title sentiment_score youtube_id
20521 CNN OBAMA CABINET MTG - BUDGET DEAL-VERY PLEASED 0.9 casIztwAQCc
20645 CNN OBAMA W COLOMBIAN PRES- MOVING BEYOND SECURITY 0.9 sr1tlS8pltY
11527 CNN Trump's Supreme Court pick coming right after ... 0.9 MSeQieB5PnM
8165 Fox News Trump: It's amazing that I did so well in SC 0.9 1f8PKgMCFbQ

Here is how the sentiment scores are distributed across the dataset:

In [21]:
sns.distplot(videos.sentiment_score, axlabel=False, ax=plt.gca())
plt.title('Sentiment scores distribution')
plt.gca().get_yaxis().set_visible(False)
plt.xlim(-1,1)
plt.show()

Note that a majority of video titles tend to have a positive sentiment in the 0-0.5 range.

We're now in a position to visualize the average sentiment of each channel toward each topic:

In [29]:
scores = pd.DataFrame(index=channels.sort_values('title').title, columns=topics.slug, )
for channel, group in videos.groupby('channel'):
    for topic in topics.slug:
        scores.loc[channel, topic] = group[group[topic]].sentiment_score.mean()
scores = scores.rename_axis('Topic', axis=1)
scores = scores.rename_axis('Channel', axis=0)
display(scores)
plot_channel_stats(scores, topics, channels, fig_height=10, y_center=True, title='Average sentiment by topic')
Topic obama clinton trump democrats republicans liberals conservatives
Channel
CBS News 0.0765998 -0.0346112 0.181424 -0.00594059 -0.0194286 -0.1 0.185294
CNN 0.0554827 -0.0035727 0.177813 0 -0.0180556 -0.173684 0.0347222
Fox News -0.0420466 -0.0848063 0.173912 -0.0355311 -0.0524715 -0.216438 0.0806202
MSNBC 0.1216 0.0307692 0.156458 0.108108 0.1 0 0.125

For a different perspective, the same data can be compressed into the following graph:

In [28]:
plot_compressed_channel_stats(scores, y_center=True, title='Average sentiment by topic')

The graphs above allow us to make a few interesting observations:

  • MSNBC seems to have a positive tone overall.
  • Obama has been generally spoken of in positive terms everywhere except on Fox News.
  • Clinton has been spoken of in negative terms everywhere except on MSNBC.
  • Trump has overall clearly been spoken of in positive terms across the board.
  • Conservatives have generally been spoken of in positive terms.
  • Each channel seems to cover Democrats with roughly the same tone as it does Republicans.
  • Liberals have been covered in negative terms (except on MSNBC, which as mentioned before hardly uses that word at all).

Evolution through time

The statistics that I've presented so far were all averages. While those give an idea of the overall sentiment on each topic, it would be quite interesting to also see how the sentiment has evolved through time, in particular over the past two years during the presidential campaign. Let's take a look at this evolution with the time-series graphs below:

In [32]:
plot_sentiment_series(videos, topics, channels, start_date=datetime(2015, 1, 1), title='Sentiment evolution during the presidential campaign')

A few observations can be made from the above graphs:

  • MSNBC appears to consistently cover all topics in a fairly positive tone.
  • The sentiment about Republicans and Democrats regularly oscillates between positive and negative for all channels (except MSNBC).
  • Obama is consistently spoken of negatively by Fox News, positively by MSNBC, and with mixed sentiment by CBS News and CNN.
  • Sentiment about Trump is starkly positive throughout.
  • Clinton was mostly spoken of in negative terms during the campaign. Even the sentiment on MSNBC, which is generally mostly positive, was just above neutral on Clinton.

Left-wing vs Right-wing

For better or worse, it is quite common in the US to consider the political spectrum as bi-modal: left-wing and right-wing. In other words: "Liberals vs Conservatives", "Democrats vs Republicans" or "Obama & Clinton vs Trump". To see how sentiments from our dataset are distributed across this bi-modal spectrum, we can separate left-oriented topics from right-oriented topics and then calculate the averages for each channel:

In [65]:
# Separate left-oriented topics from right-oriented topics
left_topics = ['obama', 'clinton', 'democrats', 'liberals']
right_topics = ['trump', 'republicans', 'conservatives']

# Create two new flag columns, one for each mode
videos['left'] = np.any(videos[left_topics], axis=1)
videos['right'] = np.any(videos[right_topics], axis=1)

# Calculate average sentiments for each channel
modes = ['left', 'right']
scores = pd.DataFrame(index=channels.sort_values('title').title, columns=modes)
for channel, group in videos.groupby('channel'):
    for mode in modes:
        scores.loc[channel, mode] = group[group[mode]].sentiment_score.mean()
scores = scores.rename_axis('Topic', axis=1)
scores = scores.rename_axis('Channel', axis=0)
display(scores)
Topic left right
Channel
CBS News 0.0495623 0.161105
CNN 0.0397833 0.166934
Fox News -0.0737636 0.158329
MSNBC 0.0597092 0.154019

The same results can be represented graphically as follows:

In [66]:
plot_compressed_channel_stats(scores, color=['#50AFE8', '#E61B23'], y_center=True, title='Average sentiment: Left-wing vs Right-wing')

If sentiment analysis is to be trusted, then those channels all appear to lean fairly conservative. The overall tone is also generally positive, with one notable exception: Fox News tends to cover left-oriented topics in negative terms.

Limitations and future improvements

I will be the first to admit that sentiment analysis, as conducted in this study, isn't perfect and does have some limitations.

On subjectivity

Sentiment analysis is sometimes criticized for its subjectivity, as two people may disagree on whether a given sentence is negative or positive.

See for example the following videos that were estimated by the Google API to have a positive sentiment score:

In [36]:
show_videos(videos, ['NDq3Ojmk0mI', 'nLU12dCJpZ8'])
Out[36]:
title sentiment_score channel published_at youtube_id
Can DC adjust to President Trump's swift speed? 0.5 Fox News 2017-01-28 23:59:53 NDq3Ojmk0mI
Did President-elect Trump inherit a divided America? 0.5 Fox News 2016-12-07 21:03:11 nLU12dCJpZ8

... and those videos with negative sentiment scores:

In [37]:
show_videos(videos, ['qwws4b22NIk', 'INNBVixrAgc'])
Out[37]:
title sentiment_score channel published_at youtube_id
How delegates felt about the Republican National Convention -0.5 Fox News 2016-07-23 22:46:41 INNBVixrAgc
"Obama Did Not Let Me Down" -0.5 CBS News 2009-12-27 17:17:29 qwws4b22NIk

You may have your own opinion on whether the videos above were accurately rated. Some level of faith must be placed in the algorithm used (in this case Google's Natural Language API) to produce more objective results overall when given a sufficiently large dataset. As data scientist Matthew Russell says:

It’s critical to mine a large — and relevant — sample of data when attempting to measure sentiment. No particular data point is necessarily relevant. It’s the aggregate that matters. (Source)

Here the hope is that the 30K videos that were analyzed constitute a large-enough dataset. But results could certainly be improved by integrating more videos from more channels into the research.

On context

The sentiment analysis was performed for each video title in a complete vacuum, without any context. Yet, context can sometimes be critical in accurately estimating sentiment, as entrepreneur AJ Bruno puts it:

Context also plays a big role in understanding a writer’s feelings on a subject. “That movie was bad!” is definitely negative sentiment from a 50 year old film critic, but it might be glowing praise from a 17 year old boy. (Source)

In this study, it's reasonable to assume that context is somewhat narrow anyway, as all data comes from news channels (i.e. not from comedy channels that may have used sarcasm, for example). However, the approach could perhaps be improved by taking notable political events into consideration in order to apply specific weights to certain sentiment scores.

On mixed topics

Sometimes the same video title may mention multiple opposite topics (e.g. Trump and Clinton, or Democrats and Republicans). When that occurs, the same sentiment score gets applied to each of the topics covered. Sometimes that is fine, if the topics are treated the same way, for example in the following videos where Trump and Clinton are both criticized, or both praised, at the same time:

In [38]:
show_videos(videos, ['qmNDKUk-JuE', 'pL3HOH2YG9I'])
Out[38]:
title sentiment_score channel published_at youtube_id
Why are Clinton and Trump doing so poorly with Millennials? -0.7 Fox News 2016-09-21 19:01:54 qmNDKUk-JuE
Polls show supporters of Clinton, Trump equally enthusiastic 0.7 Fox News 2016-09-16 14:02:32 pL3HOH2YG9I

This is more problematic if one topic is treated more favorably than the other in the same sentence. For example in these videos:

In [39]:
show_videos(videos, ['Wjtm14fDAjQ', 'JmCmNHEGWWs'])
Out[39]:
title sentiment_score channel published_at youtube_id
Trump slams Clinton's involvement in Wisconsin recount 0.7 Fox News 2016-11-28 15:00:29 JmCmNHEGWWs
Hillary Clinton: Happy to put my record against Trump's lies 0.6 CNN 2016-06-03 20:38:38 Wjtm14fDAjQ

For the above videos, a positive score will be applied to both topics "Clinton" and "Trump". Similarly a negative score will be applied to both topics for the following videos:

In [40]:
show_videos(videos, ['f5NSJStPxEw', 'tRQm6tGyGbs'])
Out[40]:
title sentiment_score channel published_at youtube_id
Trump: Clinton scandal worse than Watergate -0.7 Fox News 2016-10-18 13:59:05 tRQm6tGyGbs
Clinton: Trump promotes bigotry and paranoia -0.6 CBS News 2016-08-25 18:59:05 f5NSJStPxEw

Ideally, in each of those problematic instances, different scores should be applied to each topic. To implement this we could perhaps automatically analyze the syntax of each video title and determine how each topic is affected by the title's overall sentiment. The Google Natural Language API actually also provides a syntax analysis service that is worth looking into for that purpose.
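
As a related option, the API also provides entity sentiment analysis, which attempts to assign a separate score to each entity mentioned in the text. Here is a hedged sketch (assuming a recent google-cloud-language client; this is a suggested alternative, not the study's method) of how it might separate "Trump" from "Clinton" in a single title:

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def entity_sentiments(text):
    # Return one (entity name, sentiment score) pair per entity the API
    # detects in the text, instead of a single document-level score.
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT)
    response = client.analyze_entity_sentiment(request={'document': document})
    return [(entity.name, entity.sentiment.score) for entity in response.entities]

print(entity_sentiments("Trump slams Clinton's involvement in Wisconsin recount"))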

Currently, in our dataset, about 8% of video titles cover more than one topic at the same time:

In [42]:
(np.sum(videos[topics.slug], axis=1) > 1).sum() * 100 / len(videos)
Out[42]:
8.2198688855269797

Similarly, about 6% of video titles cover both left-oriented and right-oriented topics at the same time:

In [44]:
np.sum(videos.left & videos.right) * 100 / len(videos)
Out[44]:
6.422236065379253

If analyzing the syntax proves to be too difficult or tedious, then a simple solution would be to delete all problematic records in order to completely remove potential ambiguities.
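
With the flag columns defined earlier, that filtering would be straightforward, for example:

# Keep only titles that don't mix left- and right-oriented topics.
unambiguous_videos = videos[~(videos.left & videos.right)]

# Or, more strictly, keep only titles covering at most one topic.
single_topic_videos = videos[np.sum(videos[topics.slug], axis=1) <= 1]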

Other possible improvements

Other improvements could be considered in future revisions of this study:

  • In order to more accurately make comparisons between different channels, the data could be normalized by taking into account the overall sentiment of each channel, i.e. the average sentiment score for all videos ever published by each channel, even for videos that do not cover the studied topics (a rough sketch follows this list).
  • The algorithm I wrote to flag videos for each topic was fairly naive in that it only looked for exact matches. It's possible that some video titles contained typos and were therefore missed. The filtering algorithm could be improved by using fuzzy search instead (see the second sketch after this list).
  • Last but not least, it'd be interesting to extend the study to other news channels, for example: Bloomberg, ABC News, NBC News, Fusion, Free Speech TV, or the Blaze.
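
To make the first two ideas above more concrete, here are two rough sketches. First, baseline normalization (with the caveat that this study only collected sentiment scores for the relevant videos, whereas the idea above calls for first scoring all of a channel's videos):

# Subtract each channel's average sentiment from its sentiment scores.
# Caveat: this baseline is computed from the relevant videos only; the
# improvement described above would score ALL of a channel's videos first.
channel_baseline = videos.groupby('channel').sentiment_score.mean()
normalized_scores = scores.astype(float).sub(channel_baseline, axis=0)
display(normalized_scores)

Second, fuzzy topic matching, sketched here with the standard library's difflib (a dedicated fuzzy-matching library would likely perform better):

import difflib

def fuzzy_mentions_topic(title, variants, cutoff=0.85):
    # True if any word in the title closely resembles one of the variants.
    words = title.lower().split()
    return any(
        difflib.get_close_matches(variant, words, n=1, cutoff=cutoff)
        for variant in variants
    )

print(fuzzy_mentions_topic('Obma speaks at rally', ['obama']))  # True despite the typo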

Conclusion

This study was an attempt at objectively evaluating bias in the media by analyzing the sentiment of video titles published on a few prominent TV channels' Youtube accounts. The approach of using sentiment analysis is certainly not perfect. It is questionable whether positive or negative sentiment on a topic necessarily represents bias toward or against that topic. Similarly, neutral sentiment may not necessarily indicate neutral political bias. My hope, however, is that this study offers some direction and inspiration for future research in this field.

Most of all, I hope you enjoyed reading this article as much as I enjoyed working on it. Any feedback is welcome and appreciated!