Google Search Console screenshot and LOCOMOTIVE Agency logo
Technical SEO

A Simple Tool For Saving Google Search Console Data To BigQuery

  • Written by JR Oakes

For a while now we have been wanting to find an easy way to log Google Search Console (GSC) Search Analytics data for managed websites. Google has mentioned several times that more data is coming to GSC but has been elusive when pinning down a date. There are many reasons to want to collect GSC data for yourself including:

  • Google Search Console only returns 1,000 rows and has a 90-day limit on historical data.
  • It will make the data available to other tools, so you can manipulate the data in those other tools.
  • Who knows what projects / data Google will decide to sunset or reduce access to?
  • It will mean you get to be a cool Technical SEO who has what other people don’t.

In reviewing options, it looked like it would be fairly easy to just move the GSC data to BigQuery. The obvious advantages of BigQuery are:

  • It uses the same Python API library as GSC API.
  • BigQuery data is available to Google Data Studio.
  • The pricing of BigQuery storage is ~ $24/mo for 1TB of data.
  • It is easy to interact with BigQuery via their user interface with SQL-style queries, CSV, and Google Sheets integration.

We began by researching the various Google Cloud tools (we are much more familiar with AWS) and quickly landed on using Google App Engine (GAE) along with their very sweet integration of cron actions.  In addition, GAE has credential-based access via service accounts, which means that we were able to build a tool without the need for browser-based authentication.

We are giving away the solution we came up with for free to the SEO community. It is located here: https://github.com/jroakes/gsc-logger.

From the Readme file, the script:

This script runs daily and pulls data as specified in config.py file to BigQuery. There is little to configure without some programming experience.

Generally, this script is designed to be a set-it-and-forget-it in that once deployed to app engine, you should be able to add your service account email as a full user to any GSC project and the data will be logged daily to BigQuery. By default the data is set to pull from GSC 7 days earlier every day to ensure the data is available.

The tool stores the following fields (currently restricted to US traffic, but easy to update in config.py) daily:

  • date
  • query
  • page
  • device
  • clicks
  • impressions
  • ctr
  • position

It will also try to grab 5,000 rows in each gulp and keep pulling and saving until less than 5,000 rows are returned, signaling an end to all the data.

With all that said, let’s get started showing you how to implement it.  As a warning, to follow the info below, you should have some development experience.

Setting Up Google App Engine Project

Screenshot of Google Cloud Platform - Google Search Console Logger

  1. Navigate to Google Cloud Platform Console and Create a project.
  2. In this example, we named it gsc-logger-demo.
  3. At this point, go ahead and link a billing account with your project. (You can find Billing by using the hamburger menu on the top left.)
  4. Click on APIs & Services from the same hamburger menu. Search for and enable BigQuery API and Google Search Console API.
  5. Then create a Service Account by going to API’ & Services > Credentials and clicking on Create Credentials > Service Account Key. Select New Service Account and give it a succinct name (we used GSC Logger Demo) in this demo. Select the role Project owner, and leave JSON selected as key type.  Then click Create.  A JSON file will be downloaded to your browser, save this for later.

 

Getting into the code

Google Cloud Shell for GSC logger

Most of the below steps can be done from within Google Cloud Platform using their built-in Cloud Shell.  After launching Google Cloud Shell, follow the following steps:

  1. Download the repo:
    git clone https://github.com/jroakes/gsc-logger.git
  2. Upload your credentials file (the one you downloaded earlier when creating a service account):Upload file to Google Cloud Shell
  3. Move this file into your credentials directory:
    mv gsc-logger-demo-e2e0d97384ap.json gsc-logger/appengine/credentials/
  4. Move to the appengine directory:
    cd gsc-logger/appengine
  5. Open the config file:
    nano config.py
  6. Edit the CREDENTIAL_SERVICE file name to match the file you just uploaded.
  7. Update the DATASET_ID to something you like.  Only use letters and underscores.  No spaces.
  8. Edit GSC_TIMEZONE to match your current timezone.
  9. There are two other editable items here, ALLOW_CRON_OPEN and HIDE_HOMEPAGE. These are commented for what they do, but this should ideally be adjusted after testing.
  10. After editing, hit CTRL+x, y to save modified, and enter to keep the same file name.
  11. While still in the appengine directory type the below to initialize your project.  Use the project name you selected earlier (ours was gcs-logger-demo):
    gcloud config set project <your-project-id>
  12. Then type the below to install requirements:
    pip install -t lib -r requirements.txt
  13. Then create a new Google App Engine App:
    gcloud app create

    Select the region of your choice. We chose us-east-4.

  14. Finally, you are ready to deploy your app:
    gcloud app deploy app.yaml \cron.yaml \index.yaml
  15. Answer Y to continue.
  16. The app should take a minute or so to deploy and should output a url to where your app is deployed:
    Updating service [default]...done.
    Waiting for operation [apps/gsc-logger-demo/operations/9310c527-b744-4b7c-b6b6-00a79b6c28de] to complete...done.
    Updating service [default]...done.
    Deployed service [default] to [https://gsc-logger-demo.appspot.com]
    Updating config [cron]...done.
    Updating config [index]...done.
  17. You should now be able to navigate to the Deployed service url in your browser (ours in this demo is: https://gsc-logger-demo.appspot.com).
  18. Try going to your apps homepage (image below) and /cron/ (ours in this demo is: https://gsc-logger-demo.appspot.com/cron/) page once.  The /cron/ page should return:
    {"status": "200", "data": "[]"}

    It is important to hit the /cron/ page once so that your service email can be initialized with your Google account.

Adding Sites

Google Search Console Logger Main Screen

 

If all went well, you should see the screen above when navigating to your deployed App Engine URL. You will notice there are no sites active. To add sites to pull GSC data for, simply add your service account email as a full user in GSC.  For convenience, the email is listed on your app’s homepage.

To add a user, navigate to your Dashboard for a GSC account that you have ownership access to, click on the gear icon in the upper right, and click, Users and Property Owners.   Then add a new user according to the image below.

Google Search Console Add User

Once the user is connected, you should see the site listed when you refresh your app page.

Site added to GSC Logger

 

Next Steps

Now that the app has been deployed, it should download your GSC data to BigQuery every 24 hours based on cron functionality in Google App Engine. Two things to explore next are:

  • Explore your data in BigQuery: https://bigquery.cloud.google.com. From BigQuery, you can run database queries and save to CSV, or save to Google Sheets. You can also access the historical data in your own platforms via the API.
  • Try hooking up your BigQuery data to Google Data Studio. Google provides easy integration with BigQuery from their data sources. Simply add a BigQuery data source and make it available in your reports.
  • Verify your cron jobs in GAE: https://console.cloud.google.com/appengine/taskqueues/cron. You can run your cron job from this link or you can manually go to /cron/ from your browser.

For security, you want to go back and edit the config.py file using the steps above and adjust the settings for ALLOW_CRON_OPEN and HIDE_HOMEPAGE. Primarily the ALLOW_CRON_OPEN. Setting this to False means that only Google App Engine will be able to execute the cron function and direct calls to your /cron/ endpoint should result in an Access Denied response. robots.txt is set to disallow: / for this repo, so it should not be findable in Google, but you want to be careful with exposing your managed sites, so the homepage visibility is up to you.

If you want to say thank you, please share on Twitter, follow LOCOMOTIVE Agency on Twitter, and/or suggest improvements via Github.

Thanks to Russ Jones and Derk Seymour for giving great feedback on the repo.

Update: I emailed John Mueller (John is amazing BTW) to ask if there were anything to be concerned with from Google’s standpoint from running this tool.  He said, “Go for it” and “For the next version, you might want to grab the crawl errors & sitemaps too (with their indexed URL counts). I think that’s about it with regards to what you can pull though. ”

Are you in need of expert SEO services? Contact LOCOMOTIVE Agency today.