Scraping YouTube data with and without Google's API

Trying different methods to retrieve metadata and stats of online videos

Posted by Dani Madrid-Morales on January 28, 2017

Lately I have been working on a side project that looks at one of CGTN-Africa's TV shows, Faces of Africa, and the way it portrays African countries (and Sino-African relations), to investigate whether it does so differently from or similarly to US and European mainstream media. Since most of the videos from the show are available on YouTube, I wanted to quickly retrieve basic information about each of them (length, date, description...) to get a rough idea of the main themes. I started looking at possible ways to extract the data using Google's API and R, and found a neat package called "tuber". Tuber comes with the same limitations as the API: the number of queries is limited and there are quotas per query. It offers simple-to-use functions to retrieve the number of times a video has been watched, the number of positive and negative votes, the title and description of a video... However, there is no function to retrieve more technical data, such as the format and duration of the videos. Since I wanted both types of information, I decided to test two ways of customizing the scraper to get everything I needed.

Using YouTube's API

To use tuber, the first thing needed is OAuth authorization (note that tuber does not use a plain API key, but an OAuth app ID and app secret), which can be obtained at the Google Developers Console; a minimal authentication sketch follows the list below. With this, anybody gets a daily quota of 1 million "points", which is enough in most cases. I wanted to get all the videos uploaded to a given channel. For this, the process involves three steps:

  1. Get the ID for the "uploads" playlist, which basically means all videos uploaded to a channel
  2. Use the ID to generate a list of all the video IDs in the channel
  3. Loop through the list to extract the data for each video
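
Before any of these calls can run, tuber has to be authorized against the API. A minimal sketch, assuming you have already created OAuth credentials in the Developers Console (the ID and secret below are placeholders):


library(tuber)
#Opens the browser once to grant access; the token is then cached locally in .httr-oauth
yt_oauth(app_id = "YOUR_APP_ID", app_secret = "YOUR_APP_SECRET")
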

This is what the code looks like using tuber, curl and jsonlite (the last two are used to fetch and parse the JSON returned by the direct API calls made without tuber):


library(tuber)      #assumes yt_oauth() has already been run, as above
library(jsonlite)   #to parse the JSON returned by the direct API calls further down

#Get all the playlists in a channel
f = list_channel_resources(filter = c(channel_id = "UCHBDXQDmqnaqIEPdEapEFVQ"), part = "contentDetails")
#Extract the ID of the uploads playlist (i.e. all the videos in a channel)
playlist_id = f$items[[1]]$contentDetails$relatedPlaylists$uploads
#Loop through the playlist to grab all the video IDs
nextoken = ""
ids = c()
#First call is only needed to learn how many pages of results there are
vids = get_playlist_items(filter = c(playlist_id = playlist_id), page_token = nextoken)
for(i in 1:as.integer((vids$pageInfo$totalResults / 50) + 1)){
  vids = get_playlist_items(filter = c(playlist_id = playlist_id), page_token = nextoken)
  nextoken = vids$nextPageToken
  vid_ids = as.vector(unlist(sapply(vids$items, "[", "contentDetails")))
  ids = append(ids, vid_ids)
}
#Clean the list: drop the datestamps and keep unique IDs, as some are repeated
allids = ids[c(TRUE, FALSE)]
allids = unique(allids)

#Build a URL to call the API
URL_base = 'https://www.googleapis.com/youtube/v3/videos?id='   #this is the base URL
URL_details = '&part=contentDetails&key='                       #getting contentDetails for the technical metadata
URL_key = '{Use your own key}'

#Loop through the IDs to retrieve technical info (duration, format) plus stats and details
alldata = data.frame()
df = data.frame()
ptm <- proc.time()                                               #I like to time responses to the server
for(i in 1:length(allids)){
  url = paste(URL_base, allids[[i]], URL_details, URL_key, sep = "")
  result <- fromJSON(txt = url)
  id = result$items$id
  duration = result$items$contentDetails$duration
  caption = result$items$contentDetails$caption
  definition = result$items$contentDetails$definition
  alldata = rbind(alldata, data.frame(id, duration, caption, definition))
  d = get_stats(video_id = allids[[i]])
  a = get_video_details(allids[[i]])
  e = data.frame(t(do.call(rbind.data.frame, d)))
  e$date = a$publishedAt
  e$title = a$title
  e$describe = a$description
  e$source = a$channelTitle
  e$sourceID = a$channelId
  df = rbind(df, e)
  rownames(df) = NULL
}
s = merge(df, alldata)
proc.time() - ptm
                    

           id viewCount likeCount dislikeCount favoriteCount commentCount                     date
1 gpa3YN9OC2s       765         6            0             0            0 2017-01-25T08:47:20.000Z
2 y0Ev9LBD6Tc       686         6            0             0            0 2017-01-25T08:17:02.000Z
3 rog8bizCwvc        99         3            1             0            0 2017-01-25T07:39:41.000Z
4 gpa3YN9OC2s       803         6            0             0            0 2017-01-25T08:47:20.000Z
5 y0Ev9LBD6Tc       686         6            0             0            0 2017-01-25T08:17:02.000Z
6 rog8bizCwvc        99         3            1             0            0 2017-01-25T07:39:41.000Z
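
Note that contentDetails$duration comes back as an ISO 8601 string (e.g. "PT52M31S") rather than as seconds. A minimal sketch of a helper to convert it, reusing stringr (the function name is just illustrative):


library(stringr)
iso_to_seconds <- function(x){
  h <- as.numeric(str_extract(str_extract(x, "\\d+H"), "\\d+"))   #hours, if any
  m <- as.numeric(str_extract(str_extract(x, "\\d+M"), "\\d+"))   #minutes
  s <- as.numeric(str_extract(str_extract(x, "\\d+S"), "\\d+"))   #seconds
  sum(c(h * 3600, m * 60, s), na.rm = TRUE)                       #missing parts count as zero
}
iso_to_seconds("PT52M31S")   #3151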

Tuber makes the whole process fairly easy, and in most cases it should be enough to get all the information one needs. There are also some good pieces of freeware that do the trick, but I wanted to write my own script that would bypass the API altogether. Scraping YouTube directly is not the fastest way to get data, but it gets the job done fairly simply. For my research I deal with channels that have thousands, even tens of thousands, of videos, and for such big projects the API quotas run out fairly quickly.

Scraping YouTube without the API

I built a very simple script around rvest to scrape the elements and xml2 to read the HTML. It simply loops over a list of links, extracts the values from either the metadata or the body of each page, and stores them in a data frame. It is nothing fancy, but it does the trick, particularly if we already have a list of all the URLs we want to loop through.


library(rvest)      #scraping functions and the %>% pipe
library(xml2)       #to read the HTML pages
library(stringr)    #to extract minutes and seconds from the duration string

#Read a single URL
youtube_url = read_html("https://www.youtube.com/watch?v=sb-NRYmm79g")
#Manually build a list of videos
youtube_list = data.frame(url = c("https://www.youtube.com/watch?v=sb-NRYmm79g",
                                  "https://www.youtube.com/watch?v=EA9_oNGsW9k",
                                  "https://www.youtube.com/watch?v=S6m8oSYjvfs",
                                  "https://www.youtube.com/watch?v=Mkn9AbISAb8"))
#Alternatively, read the links from a CSV file with a column named url listing all the URLs to mine
youtube_list = read.csv("FoA_Links.csv", header = TRUE, sep = ";")

#Set up an empty df to store the data
temp.df = data.frame(id="", date="", title="", duration="", mins="", secs="",
                     description="", views= "", pos="", neg="", fullurl="")
youtube.df = data.frame() #Will hold the final outcome

#Loop through the list of links and extract some general metadata
for(i in 1:length(youtube_list$url)){
    youtube_url = read_html(as.character(youtube_list$url[[i]]))
    id = as.character(html_nodes(youtube_url, 'meta[itemprop="videoId"]') %>% 
                    html_attr("content"))
    date = as.character(html_nodes(youtube_url, 'meta[itemprop="datePublished"]') %>% 
                    html_attr("content"))
    title = as.character(html_nodes(youtube_url, 'meta[itemprop="name"]') %>% 
                    html_attr("content"))
    mins = as.numeric(gsub("M","",str_extract(as.character(html_nodes(youtube_url, 'meta[itemprop="duration"]') %>% 
                    html_attr("content")), "\\d*M")))
    secs = as.numeric(gsub("S","",str_extract(as.character(html_nodes(youtube_url, 'meta[itemprop="duration"]') %>% 
                    html_attr("content")), "\\d*S")))
    duration = (mins*60) + secs
    description = as.character(html_node(youtube_url, '#eow-description') %>% 
                                  html_text())
    views = as.numeric(html_nodes(youtube_url, 'meta[itemprop="interactionCount"]') %>% 
                    html_attr("content"))  
    #Likes and dislikes do not have their own meta tag, so they are pulled from the
    #page's button labels; the script assumes the like count is the 15th such label
    #and the dislike count the 18th (positions in the 2017 page layout)
    try({
      pos = html_nodes(youtube_url, 'span.yt-uix-button-content') %>% 
        html_text()
      pos = as.numeric(gsub(",", "", pos[15]))}, silent = TRUE)
    if(length(pos)==0){
      pos = NA
    }
    try({
      neg = html_nodes(youtube_url, 'span.yt-uix-button-content') %>% 
        html_text()
      neg = as.numeric(gsub(",", "", neg[18]))}, silent = TRUE)
    if(length(neg)==0){
      neg = NA
    }
    fullurl = paste("https://www.youtube.com/watch?v=",id, sep="")
#Saves output into a df and appends the data to the final df
    temp.df = data.frame(id, date, title, duration, description, mins, secs, views, pos, neg, fullurl)
    youtube.df = rbind(youtube.df, temp.df)

#Empties all the fields before creating a new entry 
    temp.df = data.frame(id="", date="", title="", duration="", mins="", secs="", description="",  
                        views= "", pos="", neg="", fullurl="")
#Clear all temp variables    
    remove(id, date, title, duration, description,views, pos, neg, fullurl, mins, secs)
}
#Delete the temporary objects
remove(temp.df, youtube_url, i, youtube_list)
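
Once the loop finishes, youtube.df holds one row per video. A quick way to keep the scraped metadata around for later analysis (the file name is just an example):


#Save the scraped metadata to disk
write.csv(youtube.df, "FoA_YouTube_metadata.csv", row.names = FALSE)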