PBS Newshour - Data Collection and Analysis

Written: July 2018 (GitHub)


Here is a summary of my work on PBS Newshour transcripts. If you would like to see more analysis, along with the data collection and preprocessing steps, please see my repo on GitHub.

Data Collection

The transcripts are taken from PBS Newshour's website. There is no centralized location for the transcripts, but PBS organizes every clip in its database, and a news story clip tends to have a transcript accompanying it.

The clips are organized very nicely under the following URL pattern: "https://www.pbs.org/newshour/video/page/PAGE_NUMBER"
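
As a minimal sketch of how that pattern could be used, the snippet below builds the list of listing-page URLs for a range of page numbers. The function name listing_urls is hypothetical, and it is an assumption that PAGE_NUMBER is a plain integer; the actual collection code lives in the repo and may differ.

```python
# URL pattern quoted above, with PAGE_NUMBER turned into a format field.
BASE_URL = "https://www.pbs.org/newshour/video/page/{page}"


def listing_urls(first_page, last_page):
    """Return the listing-page URLs for first_page..last_page (inclusive).

    Assumption: PAGE_NUMBER is a plain integer page index.
    """
    return [BASE_URL.format(page=n) for n in range(first_page, last_page + 1)]


if __name__ == "__main__":
    for url in listing_urls(1, 3):
        print(url)
```

Each of these pages would then be fetched and parsed for links to individual clips, which is where the transcripts are found.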

Data overview

As shown in Figure 1, there aren't many transcripts available before 2011. From 2011 onward, there are around 180 transcripts a month.

Figure 1: The number of transcripts available on PBS.

Total words spoken by person

Before doing any analysis, it's important to check that we have enough data to form meaningful conclusions. Figure 2 shows the number of words spoken by each famous person. Besides Angela Merkel, it looks like we have plenty of data to work with!

Figure 2: The number of articles each individual appears in and the number of words they speak.
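
A word count per speaker can be sketched roughly as follows. This assumes each transcript has already been parsed into (speaker, text) pairs, which is a hypothetical format chosen for illustration, and it counts words by splitting on whitespace.

```python
from collections import Counter


def words_per_speaker(turns):
    """Sum whitespace-separated word counts for each speaker.

    `turns` is an iterable of (speaker, text) pairs; this layout is an
    assumption made for the example, not the repo's actual data format.
    """
    totals = Counter()
    for speaker, text in turns:
        totals[speaker] += len(text.split())
    return totals


if __name__ == "__main__":
    sample = [
        ("SPEAKER A", "Good evening and welcome."),
        ("SPEAKER B", "Thank you."),
        ("SPEAKER A", "Let us begin."),
    ]
    print(words_per_speaker(sample).most_common())
```

Sorting the totals with most_common() makes it easy to spot speakers with too little data to analyze.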

Topic Popularity

It's interesting to look at how the focus of politics changes over time. For each available month, we can look through the transcripts and count mentions of a topic. Plotting these counts shows which topics are becoming increasingly important. Figure 3 suggests that both racism and immigration are gaining a lot of air time. However, they're nowhere close to the meteoric rise of Trump.

Figure 3: The popularity of a topic over time.
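
The monthly counting step could look roughly like the sketch below. It assumes each transcript is available as a (date, text) pair with the date in "YYYY-MM-DD" form, and it treats a "mention" as a case-insensitive whole-word match; both are assumptions for illustration, and the function name monthly_mentions is hypothetical.

```python
import re
from collections import defaultdict


def monthly_mentions(transcripts, topic):
    """Count case-insensitive whole-word mentions of `topic` per month.

    `transcripts` is an iterable of (date, text) pairs, where date is a
    "YYYY-MM-DD" string (an assumed format for this example).
    """
    pattern = re.compile(r"\b" + re.escape(topic) + r"\b", re.IGNORECASE)
    counts = defaultdict(int)
    for date, text in transcripts:
        month = date[:7]  # keep only "YYYY-MM"
        counts[month] += len(pattern.findall(text))
    # Return months in chronological order, ready for plotting.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = [
        ("2016-01-05", "Trump spoke today. Later, trump spoke again."),
        ("2016-02-01", "Immigration and Trump were discussed."),
    ]
    print(monthly_mentions(sample, "Trump"))
```

Because raw counts grow with the number of transcripts in a month, dividing by the monthly transcript count may give a fairer comparison across years.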


Thank you for viewing this summary. If you would like to see the complete, more extensive version, please see my repo on GitHub.