Engineer's Blog

Engineer's Blog

Refine search

2016.9.16 - 

Getting started with GCP #3: Using BigQuery through BLOCKS

Last time, we introduced some basics regarding BigQuery and how to operate its web UI.
In this post, we'll show you how to start using BigQuery in MAGELLAN BLOCKS.
It's simple enough to use the web UI or bq Command-Line Tool to execute a single query once.
However, in real work situations, it will become necessary to do things like execute multiple queries or set queries to execute on a regular schedule as batches.
For that, you would usually need to write a program that utilizes the BigQuery API.
Or, by using BigQuery BLOCKS in MAGELLAN BLOCKS, you can accomplish these things without needing to program anything.
With that, let's try using MAGELLAN BLOCKS to make a query.

1. Make a Flow that executes regularly timed BigQuery queries

1-1. Connect BLOCKS

Open the Board editor and drag BLOCKS from the left menu to start connecting them.
We'll be using the following three BLOCKS:
1) Start of Flow (Basic category)
2) Execute query (BigQuery category)
3) End of Flow (Basic category)
*All Flows must start with a Start of Flow BLOCK and end with an End of Flow BLOCK.
Board editor screen (making a Flow)

1-2. Setting up timed Flow execution

Next, we'll set up our Flow to execute on a timed schedule.
By using crontab format commands, you can flexibly schedule your Flows to execute by the minute/hour/date/month/day, according to your work needs.
This time, we'll make our Flow execute every three minutes.
From the Properties menu of the Start of Flow BLOCK, set Start time to */3**** and check the box labelled Activating start times.
Board editor screen (Properties - timed execution of Flows)

1-3. Write a query

Next, we'll write the main part of our Flow, the query itself.
We'll use the scores table from the my_school dataset we created last time.
Contents of scores
From the data in this table, we want to calculate the average scores in each subject for male and female students separately.
Our query will be like the following:

SELECT
        sex,
        AVG( language_score ) as avg_language_score,
        AVG( math_score ) as avg_math_score,
        AVG( social_score ) as avg_social_score,
        AVG( sciencr_score ) as avg_sciencr_score
FROM
        my_school.scores
GROUP BY
        sex

Now, let’s add this into the Properties menu of our Execute query BLOCK.
Set Result storage data set to my_school, and Result storage table to avg_by_sex. For the In cases of non-empty tables setting, choose Overwrite.
The avg_by_sex table will be created automatically by BigQuery based on the results of our query, so there's no need to create it or set its schema beforehand.
*Don't forget to select the GCP service account that we set up in Part 1.
Board editor screen (Properties - Query)
There's one point to keep in mind here:
Within the extra settings found by clicking Advanced settings, you should find an item called Query priority level.
This is set to Batch by default. However, as explained in the Properties menu, this can cause the query to execute slowly since it will wait until BigQuery resources are available. Switching this setting to Interactive will make the executions run faster.
Board editor screen (Properties - Query priority level)
Click the save button in the upper right once you've finished configuring the settings.
That's it! The Flow is done and should be executing!

2. Executing the Flow

Since we set the Flow to execute on a timed schedule, it should now be executing every three minutes. Let's check to make sure.
Open up the Log display at the bottom of the Board editor. Logs of each Flow execution will be output here.
Board editor screen (confirm logs)
Is your Flow executing every three minutes?

3. Confirm results

Next, let's go to the BigQuery console to make sure the results are being output into the table properly.
Contents of avg_by_sex
Looks like the table we set, avg_by_sex, was created, and the averages in each subject for male and female students are there!

4. Stopping Flow execution

Once you've finished, open the Start of Flow BLOCK's properties and uncheck Activating start times. Then, click Save and the Flow will stop executing.
Board editor screen (Properties - disable timed execution of Flows)
How was that? This was just a simple example, but we hope you can see how BLOCKS allows for creating and executing batches for BigQuery without the need for complicated programming.
There are still lots of other BLOCKS available for use, so try testing some out yourself by incorporating them into your Flows!