To build low-cost, well-performing servers, many companies use cloud services such as Amazon AWS and Google Cloud Platform (GCP). I have used AWS EC2 with GPUs and S3 storage for my deep learning research at Soundcorset.

AWS and GCP offer many cloud platform services, and to build data pipelines and manage data effectively, you need to learn the command-line tools and APIs. In this post, I will discuss the Google YouTube Data API, because I studied it recently. Amazon has similar services and leads the cloud business with good products, but their guides still have room for improvement; Google’s service is more Pythonista-friendly, and I like it.

Our starting point is Google’s sample code for a YouTube search by keyword.

#!/usr/bin/python

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from oauth2client.tools import argparser


# Set DEVELOPER_KEY to the API key value from the APIs & auth > Registered apps
# tab of
#   https://cloud.google.com/console
# Please ensure that you have enabled the YouTube Data API for your project.
DEVELOPER_KEY = "REPLACE_ME"
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

def youtube_search(options):
  youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
    developerKey=DEVELOPER_KEY)

  # Call the search.list method to retrieve results matching the specified
  # query term.
  search_response = youtube.search().list(
    q=options.q,
    part="id,snippet",
    maxResults=options.max_results
  ).execute()

  videos = []
  channels = []
  playlists = []

  # Add each result to the appropriate list, and then display the lists of
  # matching videos, channels, and playlists.
  for search_result in search_response.get("items", []):
    if search_result["id"]["kind"] == "youtube#video":
      videos.append("%s (%s)" % (search_result["snippet"]["title"],
                                 search_result["id"]["videoId"]))
    elif search_result["id"]["kind"] == "youtube#channel":
      channels.append("%s (%s)" % (search_result["snippet"]["title"],
                                   search_result["id"]["channelId"]))
    elif search_result["id"]["kind"] == "youtube#playlist":
      playlists.append("%s (%s)" % (search_result["snippet"]["title"],
                                    search_result["id"]["playlistId"]))

  print("Videos:\n", "\n".join(videos), "\n")
  print("Channels:\n", "\n".join(channels), "\n")
  print("Playlists:\n", "\n".join(playlists), "\n")


if __name__ == "__main__":
  argparser.add_argument("--q", help="Search term", default="Google")
  argparser.add_argument("--max-results", help="Max results", default=25)
  args = argparser.parse_args()

  try:
    youtube_search(args)
  except HttpError as e:
    print("An HTTP error %d occurred:\n%s" % (e.resp.status, e.content))

We will modify this youtube_search function a bit to suit our purposes. Let us break it down and understand it piece by piece.

First of all, remove the following __main__ block. We do not need it.

if __name__ == "__main__":
  argparser.add_argument("--q", help="Search term", default="Google")
  argparser.add_argument("--max-results", help="Max results", default=25)
  args = argparser.parse_args()

  try:
    youtube_search(args)
  except HttpError as e:
    print("An HTTP error %d occurred:\n%s" % (e.resp.status, e.content))

WARNING: You need to put your own key in place of “REPLACE_ME”.1

DEVELOPER_KEY = "REPLACE_ME"
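One common way to avoid hard-coding the key into your script is to read it from an environment variable. A small sketch — the variable name YOUTUBE_API_KEY is my own choice here, not something the API requires:

```python
import os

# Read the key from the environment instead of hard-coding it.
# "YOUTUBE_API_KEY" is just a name I picked for this sketch.
DEVELOPER_KEY = os.environ.get("YOUTUBE_API_KEY", "REPLACE_ME")
```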

Next, change the function signature from youtube_search(options) to youtube_search(token), and change the search_response call as follows.

search_response = youtube.search().list(
    #q=options.q,
    q="Jacob Satorius",
    pageToken=token,
    part="id,snippet",
    maxResults=50
  ).execute()

  nextoken = search_response.get("nextPageToken")

q is the keyword, and you can put in your own; I chose the name of a musician. Do you know how many videos turn up for the keyword “Jacob Satorius”? About 25,000, and I know this through the API. maxResults is the maximum number of items to return in the result set, and the API service caps it at 50. I hate that cap, and it is why I chose one more argument of list(), pageToken, which is a kind of identifier for a search page. (Note that we read nextPageToken with .get(), since the last page of results has no such key.) list() has more possible parameters, which can be seen here.
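How do I know there are about 25,000 results? The count is in the response itself, under pageInfo. A minimal sketch with a hand-made response dict — a real response carries many more fields:

```python
# A trimmed, hand-written stand-in for a real search().list() response.
search_response = {
    "pageInfo": {"totalResults": 25000, "resultsPerPage": 50},
    "nextPageToken": "CDIQAA",
    "items": [],
}

total = search_response["pageInfo"]["totalResults"]
per_page = search_response["pageInfo"]["resultsPerPage"]
print("%d results, %d per page" % (total, per_page))
```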

Now let’s add a return line to the youtube_search function; we no longer need the three print lines.

  #print("Videos:\n", "\n".join(videos), "\n")
  #print("Channels:\n", "\n".join(channels), "\n")
  #print("Playlists:\n", "\n".join(playlists), "\n")
  return nextoken, videos

and also define a new function

def multipage(page):
  videos = []
  token = None
  for _ in range(page):
    # one API call per page; youtube_search returns (nextoken, videos)
    token, page_videos = youtube_search(token)
    videos += page_videos
    if token is None:  # no nextPageToken means we ran out of pages
      break
  return videos

The first page has a None token. The youtube_search function returns the nextPageToken, and we pass that token back in to fetch the next page. The multipage function runs this loop and returns the scraped titles of the videos.
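The token chaining is easier to see against a stand-in for youtube_search. Here fake_search serves three fixed pages; everything in this sketch is made up for illustration:

```python
def fake_search(token):
    # Stand-in for youtube_search: returns (next_token, titles).
    pages = {
        None: ("T1", ["a", "b"]),
        "T1": ("T2", ["c", "d"]),
        "T2": (None, ["e", "f"]),   # last page: no next token
    }
    return pages[token]

def multipage_demo(n_pages):
    videos = []
    token = None
    for _ in range(n_pages):
        token, page_videos = fake_search(token)
        videos += page_videos
        if token is None:  # ran out of pages
            break
    return videos

print(multipage_demo(2))  # ['a', 'b', 'c', 'd']
```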

Did I miss something? Oh, yeah.


  for search_result in search_response.get("items", []):
    if search_result["id"]["kind"] == "youtube#video":
      videos.append("%s (%s)" % (search_result["snippet"]["title"],
                                 search_result["id"]["videoId"]))
    elif search_result["id"]["kind"] == "youtube#channel":
      channels.append("%s (%s)" % (search_result["snippet"]["title"],
                                   search_result["id"]["channelId"]))
    elif search_result["id"]["kind"] == "youtube#playlist":
      playlists.append("%s (%s)" % (search_result["snippet"]["title"],
                                    search_result["id"]["playlistId"]))

This for loop collects the videos, channels, and playlists; I did not use the channels and playlists above. The search_response is a dictionary, and I visualized its structure from one of the results. I just used an online website for JSON visualization; you could also install an open-source tool for it.
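The same loop can also be packaged as a small helper. extract_titles is a name I made up, and the sample response below is hand-written just to exercise it:

```python
def extract_titles(search_response):
    """Collect 'Title (id)' strings for videos, channels, and playlists."""
    videos, channels, playlists = [], [], []
    for item in search_response.get("items", []):
        kind = item["id"]["kind"]
        title = item["snippet"]["title"]
        if kind == "youtube#video":
            videos.append("%s (%s)" % (title, item["id"]["videoId"]))
        elif kind == "youtube#channel":
            channels.append("%s (%s)" % (title, item["id"]["channelId"]))
        elif kind == "youtube#playlist":
            playlists.append("%s (%s)" % (title, item["id"]["playlistId"]))
    return videos, channels, playlists

# A tiny hand-made response to exercise the function:
sample = {
    "items": [
        {"id": {"kind": "youtube#video", "videoId": "abc123"},
         "snippet": {"title": "Sweatshirt"}},
        {"id": {"kind": "youtube#channel", "channelId": "UCxyz"},
         "snippet": {"title": "Jacob Satorius"}},
    ]
}
videos, channels, playlists = extract_titles(sample)
print(videos)  # ['Sweatshirt (abc123)']
```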

[figure: JSON visualization of the search_response keys]

You can see the keys of the dictionary in the figure. items and pageInfo are dictionaries themselves. I will pick one of the items (the information for one video clip); it looks as follows.

[figure: JSON visualization of a single item in the response]

Most of this information is useless for us. I would take only the titles, but to reach the title information you need the loop and the if/else statements.

I said there are 25k videos, so we would need 500 pages. Fetching them all will trigger a quota error, since the number of API requests is limited by the service. I do not yet know how to scrape all 500 pages at once. Try 2 or 3 pages first; it will print the first 100 or 150 video titles.

videos = multipage(2)
print(videos)
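A quick sanity check on the page count: at 50 results per page, 25,000 results take 500 pages.

```python
import math

total_results = 25000   # from pageInfo["totalResults"]
per_page = 50           # the cap on maxResults for search.list
pages = math.ceil(total_results / per_page)
print(pages)  # 500
```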

Other important information about a video, such as its view count and ratings, is missing here. For that task you need the “videos API” instead of the “search API”. I will leave it to you as an exercise.
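As a pointer for the exercise: the videos endpoint takes a comma-separated list of video IDs and can return a statistics part. A hedged sketch — get_view_counts is my own construction on top of the videos().list call, and you need a real client and key to run it:

```python
def get_view_counts(youtube, video_ids):
    # youtube is a client from googleapiclient.discovery.build(...).
    # videos().list accepts up to 50 comma-separated IDs per call.
    response = youtube.videos().list(
        part="statistics",
        id=",".join(video_ids)
    ).execute()
    # Map each video ID to its view count (returned as a string).
    return {item["id"]: item["statistics"].get("viewCount")
            for item in response.get("items", [])}
```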


  1. I have one, and I do not remember how I obtained it. I got it only one or two weeks ago and have already forgotten, which means it is not going to be difficult at all. Moreover, we want to focus on topics more interesting than how to get a key. ↩︎