There is an age-old debate in the US about how the press and other media outlets may be politically biased. This has been particularly salient during the recent Clinton-Trump presidential campaign and since the election took place at the end of 2016. I have personally followed the whole campaign quite closely and have tried to regularly watch several TV news channels in order to absorb different views and perspectives. I've also developed my own subjective sense of what side of the political spectrum each channel tends to lean toward. Yet, most channels claim to be independent and unbiased (e.g. Fox News' slogan is "Fair & Balanced"). So I thought it'd be interesting to try and find objective ways to measure how biased media outlets actually are.
In this article, I describe a case study that I've recently worked on. First, I explain the study's methodology, which is based on sentiment analysis of videos published on Youtube by a number of prominent American TV channels. I then present some details about the acquired dataset before laying out the study's results. Finally, I discuss some limitations of the study and possible areas for improvement.
All the code that I've written to support this study has been published on Github (https://github.com/jphalip/media-bias). Feel free to download that code if you're interested in running similar studies yourself.
Methodology¶
For this study I've decided to analyze videos posted on Youtube by some of the most prominent American TV news channels, including the so-called "Big Three" cable news channels (Fox News, CNN & MSNBC) and CBS News. One may argue that not all content aired on a given TV channel necessarily ends up being published on that channel's Youtube account, and that this may therefore skew the results. My counter-argument is that, by curating videos on their Youtube account for online consumption, a TV channel likely exposes even greater bias, thereby making the Youtube data even more pertinent for the purpose of this study.
It was fairly easy to download the video metadata for all channels (titles, descriptions, publication dates, etc.) using the Youtube API: https://github.com/jphalip/media-bias/blob/master/code/youtube_api.py
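To give a concrete idea of what that step involves, here is a minimal sketch of fetching a channel's video metadata with the YouTube Data API v3 via the google-api-python-client library. This is not the article's youtube_api.py; the API key is a placeholder, and the channel ID used is simply the one that appears for Fox News in the dataset shown later on.

# A rough sketch (not the article's youtube_api.py) of fetching a channel's
# video metadata with the YouTube Data API v3. The API key is a placeholder.
from googleapiclient.discovery import build  # pip install google-api-python-client

youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')

# Every channel exposes an "uploads" playlist that lists all of its public videos.
channel = youtube.channels().list(
    part='contentDetails', id='UCXIJgqnII2ZOINSWNOGFThA').execute()
uploads_playlist = channel['items'][0]['contentDetails']['relatedPlaylists']['uploads']

videos, page_token = [], None
while True:
    response = youtube.playlistItems().list(
        part='snippet', playlistId=uploads_playlist,
        maxResults=50, pageToken=page_token).execute()
    for item in response['items']:
        snippet = item['snippet']
        videos.append({
            'youtube_id': snippet['resourceId']['videoId'],
            'title': snippet['title'],
            'description': snippet['description'],
            'published_at': snippet['publishedAt'],
        })
    page_token = response.get('nextPageToken')
    if not page_token:
        break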
Once all the video metadata was acquired, I selected a range of political topics ("Obama", "Clinton", "Trump", "Democrats", "Republicans", "Conservatives", "Liberals") and extracted the videos that mentioned those topics. Titles containing different variants of a topic's keyword were flagged for that topic (e.g. "Mr. Obama Goes To Washington" and "The Obamas Vacation in N.C." were both flagged for the "Obama" topic). This approach also means that, for example, videos about Melania Trump were flagged for the "Trump" topic and videos about Bill Clinton were flagged for the "Clinton" topic. People's first names were not used for filtering and flagging videos, as they are too generic and could have yielded false positives (e.g. "Donald" could have matched videos about Donald Rumsfeld or Donald Sterling).
I then analyzed the sentiment of all relevant video titles. Oxford Dictionaries defines "sentiment analysis" as follows:
The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral. (Source: Oxford Dictionaries)
The goal is to find out whether sentiment analysis can help identify differences in how channels treat contentious political topics. This isn't the first time that sentiment analysis has been used for that purpose (see, for example, this study showcased in the Washington Post). However, I think this study is somewhat unique in the way it breaks the data down into multiple topics.
When looking through the data I noticed that the video descriptions often contained generic or irrelevant text (e.g. "Follow msnbc on Tumblr" or "Check out FOX 411 for more entertainment news and gossip"). I wanted to prevent that irrelevant text from influencing the sentiment analysis results, but cleaning it up would have been extremely time-consuming and tedious. So I ignored the video descriptions altogether and only analyzed the sentiment of the video titles. I would posit that Youtube users generally pay much more attention to video titles than to their descriptions anyway.
The sentiment analysis was performed using the Google Natural Language API. It's pretty simple and straightforward: you send a piece of text to the API server, which analyzes that text and returns a sentiment score between -1.0 (negative sentiment) and 1.0 (positive sentiment), as determined by Google's machine learning algorithms. The complete code for downloading sentiment scores for all videos and for saving the data to disk can be found here: https://github.com/jphalip/media-bias/blob/master/code/language_api.py
Note that the Google Natural Language API is not free. The sentiment-fetching function linked above was called for each of the ~30K relevant videos that I had collected, which in the end cost me around US$30. If you're interested in running similar experiments, make sure to first refer to the API's pricing page for adequate budgeting.
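For reference, here is a minimal sketch of what a single sentiment request can look like when calling the API's analyzeSentiment REST endpoint directly with the requests library. The article's language_api.py may be structured quite differently, and the API key is a placeholder.

# A minimal sketch of a single request to the Google Natural Language API's
# analyzeSentiment endpoint. Not the article's language_api.py; the API key
# is a placeholder.
import requests

def get_sentiment_score(text, api_key):
    url = 'https://language.googleapis.com/v1/documents:analyzeSentiment'
    payload = {'document': {'type': 'PLAIN_TEXT', 'content': text}}
    response = requests.post(url, params={'key': api_key}, json=payload)
    response.raise_for_status()
    # The score is a float between -1.0 (negative) and 1.0 (positive).
    return response.json()['documentSentiment']['score']

print(get_sentiment_score("Trump: It's amazing that I did so well in SC", 'YOUR_API_KEY'))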
Alright, now we're ready to do some exploration. Let's jump right in!
Data exploration¶
First, let's import all the code libraries that we need in order to perform our work:
from __future__ import division
from IPython.display import display
from datetime import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from code.utils import show_videos
from code.plotting import plot_channel_stats, plot_compressed_channel_stats, plot_sentiment_series
%matplotlib inline
Here are the channels considered for this study:
channels = pd.read_csv('channels.csv')
channels[['title', 'url', 'color']]
| | title | url | color |
|---|---|---|---|
| 0 | Fox News | https://www.youtube.com/user/FoxNewsChannel | #5975a4 |
| 1 | CNN | https://www.youtube.com/user/CNN | #b55d60 |
| 2 | MSNBC | https://www.youtube.com/user/msnbcleanforward | #5f9e6e |
| 3 | CBS News | https://www.youtube.com/user/CBSNewsOnline | #666666 |
(Note: The color attributes are just arbitrary colors that will be used to visually differentiate the channels in all the graphs featured later in this article.)
Here are the topics that I've chosen to focus on:
topics = pd.read_csv('topics.csv')
topics[['title', 'slug']]
| | title | slug |
|---|---|---|
| 0 | Obama | obama |
| 1 | Clinton | clinton |
| 2 | Trump | trump |
| 3 | Democrats | democrats |
| 4 | Republicans | republicans |
| 5 | Liberals | liberals |
| 6 | Conservatives | conservatives |
(Note: The slug attributes are the names used for the columns in the video dataset that flag videos relating to specific topics; see the video dataset sample right below.)
Now let's take a look at the video data. Here's a small sample below:
all_videos = pd.read_csv('videos.csv', parse_dates=['published_at'])
all_videos.head(3)
| | channel_youtube_id | description | published_at | title | youtube_id | channel | obama | clinton | trump | democrats | conservatives | relevant | republicans | liberals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UCXIJgqnII2ZOINSWNOGFThA | Fox News contributor explains | 2017-02-03 16:59:45 | Turner: New sanctions show Trump's change of h... | mMwcBRjOhTE | Fox News | False | False | True | False | False | True | False | False |
| 1 | UCXIJgqnII2ZOINSWNOGFThA | Police arrested protesters during the event | 2017-02-03 16:59:19 | Violent protests erupt over conservative speak... | RJCvUuoyJvE | Fox News | False | False | False | False | True | True | False | False |
| 2 | UCXIJgqnII2ZOINSWNOGFThA | Controversy over president's call with Austral... | 2017-02-03 16:00:55 | Schlapp, Williams debate Trump's tone with for... | v9cx1l6GJdI | Fox News | False | False | True | False | False | True | False | False |
The youtube_id is the unique ID assigned by Youtube to each video (Tip: you may watch any video by visiting the URL https://www.youtube.com/watch?v=[insert the video youtube_id here]). Note that there is a separate column for each topic, named after the topic's slug (e.g. obama, democrats, conservatives), which flags videos as being relevant to the corresponding topic (i.e. the value is True if the topic is mentioned in the video title, or False otherwise). The relevant column is a flag that is True only if at least one of the topics is mentioned in the video title; that column allows us to easily extract videos that are directly relevant to this study. You can see how all those columns were calculated and pre-processed by referring to the create_topic_columns() function published here: https://github.com/jphalip/media-bias/blob/master/code/utils.py
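As an illustration, the gist of that pre-processing could look something like the following simplified sketch. The actual create_topic_columns() implementation in code/utils.py may differ, and the keyword variants shown here are only examples.

# A simplified sketch of how the per-topic flags and the relevant column could
# be derived. The actual implementation is create_topic_columns() in
# code/utils.py; the variant lists below are illustrative only.
import pandas as pd

topic_variants = {
    'obama': ['obama', 'obamas'],
    'clinton': ['clinton', 'clintons'],
    'trump': ['trump', 'trumps'],
}

def flag_topics(df, topic_variants):
    for slug, variants in topic_variants.items():
        # Match any of the variants as whole words, case-insensitively.
        pattern = r'\b(%s)\b' % '|'.join(variants)
        df[slug] = df.title.str.contains(pattern, case=False, regex=True)
    # A video is relevant if its title matches at least one topic.
    df['relevant'] = df[list(topic_variants)].any(axis=1)
    return df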
Now let's look at some general statistics about our dataset:
num_relevant = all_videos.relevant.sum()
num_total = all_videos.shape[0]
print 'Number of relevant videos: %s' % num_relevant
print 'Total number of videos: %s' % num_total
print 'Percentage of relevant videos: %0.2f%%' % (100*num_relevant/num_total)
Number of relevant videos: 33710
Total number of videos: 186571
Percentage of relevant videos: 18.07%
So the chosen topics are covered in about 18% of all videos ever published by the selected channels, which I'd argue is sufficiently significant for the purposes of this study. Let's now drill down a bit further and see to what extent those topics are covered overall by each channel:
channel_stats = pd.DataFrame({
'relevant': all_videos.groupby('channel').relevant.sum().astype(int),
'total': all_videos.groupby('channel').size()
})
channel_stats['percentage_relevant'] = (100*channel_stats.relevant/channel_stats.total).round(2)
channel_stats.sort_values('percentage_relevant', ascending=False)
| channel | relevant | total | percentage_relevant |
|---|---|---|---|
| MSNBC | 2606 | 6120 | 42.58 |
| Fox News | 10753 | 28231 | 38.09 |
| CNN | 14855 | 100000 | 14.86 |
| CBS News | 5496 | 52220 | 10.52 |
Fox News and MSNBC both cover those topics quite extensively (in about 40% of all their published videos), whereas CBS News and CNN seem to cover many other kinds of topics as well (probably sports, science, entertainment, etc.). This indicates that Fox News and MSNBC are both quite focused on politics.
We can now see quantitatively how much (in absolute numbers) each individual topic is covered by those channels:
absolutes = all_videos.groupby('channel')[topics.slug].sum().astype(int)
display(absolutes)
| channel | obama | clinton | trump | democrats | republicans | liberals | conservatives |
|---|---|---|---|---|---|---|---|
| CBS News | 3141 | 913 | 1545 | 101 | 175 | 15 | 34 |
| CNN | 8235 | 2799 | 4123 | 149 | 216 | 19 | 72 |
| Fox News | 987 | 3304 | 6754 | 273 | 526 | 73 | 129 |
| MSNBC | 250 | 663 | 1773 | 74 | 82 | 1 | 16 |
Some initial observations can be made from the above table:
- CNN talks about Obama a lot more than the other channels do.
- The term "Liberals" in video titles seems to be mostly used by Fox News. MSNBC almost never mentions it.
- Trump has been covered about twice as much as Clinton.
One way to evaluate how important each topic is to each channel is to calculate the percentage of the channel's videos covering that topic relative to the total number of videos that the channel has published:
totals = all_videos.groupby('channel').size()
relatives = 100 * absolutes.divide(totals, axis=0)
display(relatives)
| channel | obama | clinton | trump | democrats | republicans | liberals | conservatives |
|---|---|---|---|---|---|---|---|
| CBS News | 6.014937 | 1.748372 | 2.958637 | 0.193412 | 0.335121 | 0.028725 | 0.065109 |
| CNN | 8.235000 | 2.799000 | 4.123000 | 0.149000 | 0.216000 | 0.019000 | 0.072000 |
| Fox News | 3.496157 | 11.703447 | 23.924055 | 0.967022 | 1.863200 | 0.258581 | 0.456944 |
| MSNBC | 4.084967 | 10.833333 | 28.970588 | 1.209150 | 1.339869 | 0.016340 | 0.261438 |
Those percentages can be illustrated as follows:
plot_channel_stats(relatives, topics, channels, title='Relative topic coverage\n(% of total # of each channel\'s videos)')
Based on those graphs we can draw a couple more observations:
- Obama is mentioned in 3-8% of videos from all channels. That is not too surprising given that he's been president for 8 years.
- Fox News and MSNBC have mentioned Trump in about a quarter of their videos; they've mentioned Clinton about half as often.
(Note: If you're interested in checking out the code written to generate those graphs, please refer to: https://github.com/jphalip/media-bias/blob/master/code/plotting.py)
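If you just want a quick-and-dirty version of this kind of breakdown without the repository's helper, pandas' built-in plotting gets most of the way there. This is only a rough sketch, not the plot_channel_stats() implementation used for the graphs above.

# A quick-and-dirty alternative to plot_channel_stats(), using pandas' built-in
# plotting. This is only a sketch; the article's graphs come from code/plotting.py.
ax = relatives.plot(kind='bar', figsize=(12, 5))
ax.set_ylabel("% of each channel's videos")
ax.set_title('Relative topic coverage')
plt.show()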
Sentiment analysis¶
Overall sentiments¶
Alright, this is where things get much more interesting. As mentioned earlier, the sentiment of all relevant videos was calculated using the Google Natural Language API. The sentiment scores were stored in a separate CSV file. Here is a small sample:
sentiments = pd.read_csv('sentiments.csv')
sentiments[['youtube_id', 'sentiment_score']].head()
| | youtube_id | sentiment_score |
|---|---|---|
| 0 | rkLZEHl6gtc | -0.7 |
| 1 | v9zqWRzaE0c | 0.2 |
| 2 | Yv2OzJoZtzw | -0.6 |
| 3 | d9CdoVvG72U | 0.4 |
| 4 | fcCZunx-Ayw | 0.3 |
The youtube_id column contains the unique video IDs. The sentiment_score column contains the scores (between -1 and 1) for all relevant videos. A score of 0 corresponds to neutral sentiment, a score of -1 to extremely negative sentiment, and a score of 1 to extremely positive sentiment.
Let's merge the sentiment scores into the main video dataset and then look at a small sample of videos with positive and negative sentiments:
videos = all_videos[all_videos.relevant].merge(sentiments, on='youtube_id')
# Some videos with negative sentiment:
videos.sort_values('sentiment_score')[['channel', 'title', 'sentiment_score', 'youtube_id']].head(4)
| | channel | title | sentiment_score | youtube_id |
|---|---|---|---|---|
| 15112 | CNN | OBAMA STATEMENT ON FORCED BUDGET CUTS- WALK UP | -0.9 | oy1NAgE4j2U |
| 11081 | CNN | Spicer: Russia-Trump report is disgraceful | -0.9 | -3xtbsxOnRo |
| 32072 | CBS News | Huntsman: Obama has weakened the United States | -0.9 | LNIYsmIoYh8 |
| 16873 | CNN | WH BRIEFING-OBAMA WON'T HOLD US HOSTAGE | -0.9 | Q_1_l8IAO1I |
# Some videos with positive sentiment:
videos.sort_values('sentiment_score', ascending=False)[['channel', 'title', 'sentiment_score', 'youtube_id']].head(4)
| | channel | title | sentiment_score | youtube_id |
|---|---|---|---|---|
| 20521 | CNN | OBAMA CABINET MTG - BUDGET DEAL-VERY PLEASED | 0.9 | casIztwAQCc |
| 20645 | CNN | OBAMA W COLOMBIAN PRES- MOVING BEYOND SECURITY | 0.9 | sr1tlS8pltY |
| 11527 | CNN | Trump's Supreme Court pick coming right after ... | 0.9 | MSeQieB5PnM |
| 8165 | Fox News | Trump: It's amazing that I did so well in SC | 0.9 | 1f8PKgMCFbQ |
Here is how the sentiment scores are distributed across the dataset:
sns.distplot(videos.sentiment_score, axlabel=False, ax=plt.gca())
plt.title('Sentiment scores distribution')
plt.gca().get_yaxis().set_visible(False)
plt.xlim(-1,1)
plt.show()
Note that a majority of video titles tend to have a positive sentiment, in the 0 to 0.5 range.
We're now in a position to visualize the average sentiment of each channel toward each topic:
scores = pd.DataFrame(index=channels.sort_values('title').title, columns=topics.slug, )
for channel, group in videos.groupby('channel'):
for topic in topics.slug:
scores.loc[channel, topic] = group[group[topic]].sentiment_score.mean()
scores = scores.rename_axis('Topic', axis=1)
scores = scores.rename_axis('Channel', axis=0)
display(scores)
plot_channel_stats(scores, topics, channels, fig_height=10, y_center=True, title='Average sentiment by topic')
| Channel | obama | clinton | trump | democrats | republicans | liberals | conservatives |
|---|---|---|---|---|---|---|---|
| CBS News | 0.0765998 | -0.0346112 | 0.181424 | -0.00594059 | -0.0194286 | -0.1 | 0.185294 |
| CNN | 0.0554827 | -0.0035727 | 0.177813 | 0 | -0.0180556 | -0.173684 | 0.0347222 |
| Fox News | -0.0420466 | -0.0848063 | 0.173912 | -0.0355311 | -0.0524715 | -0.216438 | 0.0806202 |
| MSNBC | 0.1216 | 0.0307692 | 0.156458 | 0.108108 | 0.1 | 0 | 0.125 |
For a different perspective, the same data can be compressed into the following graph:
plot_compressed_channel_stats(scores, y_center=True, title='Average sentiment by topic')
The graphs above allow us to make a few interesting observations:
- MSNBC seems to have a positive tone overall.
- Obama has been generally spoken of in positive terms everywhere except on Fox News.
- Clinton has been spoken of in negative terms everywhere except on MSNBC.
- Trump has clearly been spoken of in positive terms across the board.
- Conservatives have generally been spoken of in positive terms.
- Each channel seems to cover Democrats with roughly the same tone as it does Republicans.
- Liberals have been covered in negative terms (except by MSNBC, which, as mentioned before, barely uses that word at all).
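If you'd like another compact view of the same score matrix without using the repository's plotting helpers, a simple seaborn heatmap works as well. This is only a sketch and is not the plot_compressed_channel_stats() graph shown above.

# A sketch of a heatmap view of the average sentiment scores per channel and topic.
# This is not the plot_compressed_channel_stats() graph from code/plotting.py.
sns.heatmap(scores.astype(float), center=0, cmap='RdBu_r', annot=True, fmt='.2f')
plt.title('Average sentiment by topic')
plt.show()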
Evolution through time¶
The statistics presented so far were all averages. While those give an idea of the overall sentiment on each topic, it would also be interesting to see how the sentiment has evolved over time, in particular over the past two years during the presidential campaign. Let's take a look at this evolution with the time series graphs below:
plot_sentiment_series(videos, topics, channels, start_date=datetime(2015, 1, 1), title='Sentiment evolution during the presidential campaign')
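In case you want to reproduce this kind of time series without the repository's helper, here is a rough sketch of one way to compute a monthly average sentiment per channel for a single topic. The actual plot_sentiment_series() implementation in code/plotting.py may aggregate, smooth and style the data differently; the "trump" topic is used here purely as an example.

# A rough sketch of a monthly average sentiment time series for one topic.
# Not the plot_sentiment_series() implementation from code/plotting.py.
trump_videos = videos[videos.trump].set_index('published_at')
monthly = (trump_videos
           .groupby([pd.Grouper(freq='M'), 'channel'])
           .sentiment_score.mean()
           .unstack('channel'))
ax = monthly[monthly.index >= '2015-01-01'].plot(figsize=(12, 5))
ax.set_ylabel('Average sentiment score')
ax.set_title('Monthly average sentiment for the "trump" topic')
plt.show()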