Model Generator How-To: Regression

Demand forecasting with the regression model

Introduction

With the Model Generator, you can use past data like the weather to train a regression model to predict values like sales, attendee numbers, and more.

Things to know before starting

Input variables

In this guide, we’ll use data like the weather and day of the week to train a model to predict electricity demand. We refer to the various items of data used to train the model as input variables, or predictors. It’s important to select input variables that are relevant to whatever you are trying to predict. Good inputs will result in accurate predictions, while poor inputs can lead to a completely inaccurate model.

So, if you were going to train a regression model to predict future sales, you should first consider things like why do people buy this product? You may find it necessary to analyze your data to look for relevant input variables. If you can find good input variables, the Model Generator should be able to find connections between the input data and sales numbers, and create an accurate model for predicting results.
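
If you already have candidate data gathered in a file, a quick correlation check can hint at which inputs relate to the value you want to predict. The following is a minimal Python sketch using pandas; it assumes a hypothetical candidate_inputs.csv file with one column per candidate input plus a sales column, so adjust the names to match your own data.

import pandas as pd

# Hypothetical file: one column per candidate input variable, plus "sales".
df = pd.read_csv("candidate_inputs.csv")

# Correlation of each numeric candidate input with the value we want to predict.
print(df.corr(numeric_only=True)["sales"].sort_values(ascending=False))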

Benchmarks

When selecting your input variables, it’s also important to set benchmarks, or expected accuracy levels. By doing this, you can better evaluate your model’s prediction results and determine how much its accuracy is improving.

The Model Generator evaluates a regression model’s accuracy, or error rate, by comparing the predicted results against the actual results using the Root Mean Square Error (RMSE) method.

The RMSE formula is as follows:

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}

where y_i is an actual value, \hat{y}_i is the corresponding predicted value, and n is the number of predictions.

The RMSE is scale-dependent, meaning that a model that predicts for larger values will likely have a higher error rate, and vice versa.
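
For example, if the actual values for two days are 700 and 900 and the model predicts 750 and 880, the errors are 50 and -20, so RMSE = √((50² + 20²) / 2) = √1450 ≈ 38.1. A model predicting values in the hundreds of millions could have a far larger RMSE than this while still being just as accurate in relative terms.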

About this guide’s scenario

In this guide, we’ll use actual data from an electric company to train a model for predicting electricity demand.

So what input variables might be important for predicting electricity usage?

Since we often use electricity for things like air-conditioning, heating, and dehumidifying, we can guess that temperature and humidity might be relevant. We also tend to use more lights at night, so the number of daylight hours is probably important, too.

With that in mind, we’ll try out training a model using temperature, humidity, and daylight hours as input variables.

Later, as we try to improve the model, we could consider adding input variables for things like holidays or big events that might have an effect on electricity demand. Or, we could focus our model on a smaller geographical region and add more data specific to that place.

Making a predictive model with the Model Generator

This guide will follow these basic steps to train a model and make predictions:

  1. Select a model: We’ll choose the regression model since we want to predict future demand levels. We will continue to add more model types to the Model Generator in future releases.

  2. Consider input variables: Selecting good input variables is the core of creating a good predictive model. We’ll use temperature, humidity, and daylight hours for this guide.

  3. Create the training data: Once we’ve selected our input variables, we’ll prepare the training data. Our training data will include our input variables as well as actual electricity usage numbers. The Model Generator will analyze the training data to determine the relationship between the input variables and the results value (electricity usage).

  4. Train the model: With the training data prepared, we can use the Model Generator to train a predictive model. During a training, the Model Generator can run multiple trials and automatically adjust its parameters to improve its accuracy.

  5. Evaluate the model: Once we train a model, we can use it from the Flow Designer service to make predictions. We’ll evaluate our model by predicting for data that we didn’t use during the training. If the model isn’t accurate or doesn’t meet our benchmarks, we can reevaluate our input variables, how we present them to the Model Generator, and the like.

  6. Use the model: Once you have an accurate trained model, you can do all kinds of things with it. You can use it to make predictions from a Flow Designer, or use the Data Bucket service to continuously collect data for your input variables and retrain a more accurate model. With BLOCKS, you can create a complete, powerful, and cost-efficient Machine Learning system without the need for programming a complicated infrastructure.

Preparing the data

About the data

We’ll use publicly available weather, daylight, and electricity usage data to create our training data. This data is available from the following sources:

  1. Weather data: The Japan Meteorological Agency
  2. Daylight hours data: The National Astronomical Observatory of Japan
  3. Electricity usage: TEPCO Power Grid, Inc.

We’ll train the model using data from 2016, then evaluate its accuracy by making predictions for actual 2017 data. There may be times when accurate predictor data is not available and you will need to create a model to estimate that data, but we don’t need to worry about that this time.

We’ll use the following five types of data for our input variables:

  • High temperature
  • Low temperature
  • Sunlight hours
  • Average humidity
  • Daytime hours (number of hours between sunrise and sunset)

Our results value, or the value we will predict for, will be as follows:

  • Electricity usage (daily)

The Model Generator will look for connections between the input variables and the results value to create a predictive model. If this goes well, we can then give just weather, sunlight, humidity, and daytime hours data to the model to predict for electricity usage.

Getting the input variable data

Weather data

First, we’ll get the weather data from the Japan Meteorological Agency. We’ve prepared this data for you, which you can download below:


weather_tokyo_utf8.csv

A CSV file containing the date, day of the week, daily high temperature, low temperature, sunlight hours, and average humidity for the Tokyo region from April 1st, 2016 through January 4th, 2017.

Click the link to the left to download the data. If clicking the link opens the data in a web page rather than downloading it, try right clicking the link and selecting Save Link As...

We’ll use weather data for the Tokyo region in this sample. However, this may lead to some discrepancies in our model, since our electricity data covers TEPCO’s entire power grid. While making the areas consistent across all of our data could lead to a more accurate model, we’ll just use these data sets to get started.

Sunrise and sunset data

We’ll use data from the National Astronomical Observatory of Japan. We’ve prepared this data for you to download below. We’ll use this data later to calculate the number of daytime hours each day.


sunrise_sunset_2016.csv

A CSV file of Tokyo-area sunrise and sunset times from Apr. 1, 2016 through Dec. 31, 2016.

Click the link to the left to download the file. If clicking the link just brings up a web page of the sunrise/sunset data instead of downloading it, try right-clicking the link and clicking Save Link As…


sunrise_sunset_2017.csv

A CSV file of Tokyo-area sunrise and sunset times from Jan. 1, 2017 through Jan. 4, 2017.

Click the link to the left to download the file. If clicking the link just brings up a web page of the sunrise/sunset data instead of downloading it, try right-clicking the link and clicking Save Link As…

The sunrise and sunset data are Copyrighted Works of the National Astronomical Observatory of Japan and subject to their Terms of Use.

Electricity usage data

You can download the electricity usage data from the TEPCO Power Grid website. We’ll use the data from 2016 and 2017 for this guide.

To download CSV files of the data, right-click the 2016 and 2017 links and click Save Link As….

About training data

Once you’ve downloaded all of the initial data, we’ll use it to prepare the training data for the Model Generator. The Model Generator needs the following two data files to train and optimize a model:

  1. Training set: The data used to train the model.
  2. Validation set: Data not used to train the model. Rather, the Model Generator uses this data to verify the model’s accuracy and determine how well it can predict for unknown data.

Both sets need to be CSV files formatted with columns for each input variable and the right-most column containing the results variable. In this case, our results variable is the electricity usage data. The results variable must be a numerical value.

We’ll arrange our training data in the following order, with the electricity data in the last column:

Daily high temperature, low temperature, sunlight hours, average humidity, daytime hours, electricity usage
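
For example, a single row of the finished training data might look like the following (the values here are purely illustrative):

9.9,2.3,8.9,41,648,84237

That is, the five input variables in the order given above, with the day’s electricity usage in the final column.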

For Full Service Plan users

You will need to upload your two training data files to the Google Cloud Storage (GCS) location prepared by BLOCKS. You’ll need a Google account to access GCS.

If you have a Google account, you’ll need to register it into the GCP access section of the project settings menu.

If you don’t have a Google account yet, you can refer to Creating a Google account, then register your new account into the GCP access section of the project settings menu.

For Self-Service Plan users

Create a bucket in GCS and store the two files into it. We used a bucket named gs://magellan-blocks-doc-train/ to store our files when making this guide, but you will need to choose your own bucket name if following along.

GCS buckets used with the Model Generator should have the default storage class set to Regional and the location set to us-central1. We cannot guarantee the Model Generator will function properly with buckets with different settings.

How to create the training data

The simplest way to prepare the training data is to use a spreadsheet program like Google Sheets or Excel to align each input variable into a column with the results variable in the final column. Then you can save the spreadsheet as a CSV file.

However, this method may not be practical when you’re using large amounts of data or if you want to automate things more efficiently.

Even though the data used in this guide isn’t all that large, we’ll introduce how to use the Flow Designer to process and prepare our training data. With the Flow Designer, you can create automated systems for processing big data efficiently.

If you’d prefer to just use a spreadsheet for now, you can feel free to do so!

Uploading the initial data

First, we need to upload the initial data into GCS so we can use it in the Flow Designer. We mentioned earlier that the training data needs to be uploaded into GCS, but we’ll also upload the initial data to GCS so we can process it into our training data.

We’ve used the following bucket and file names in this guide:

  • GCS bucket: magellan-blocks-doc-data
  • Data files:
    • Weather: weather_tokyo_utf8.csv
    • Daytime hours (2016): sunrise_sunset_2016.csv
    • Daytime hours (2017): sunrise_sunset_2017.csv
    • Electricity usage (2016): juyo-2016.csv
    • Electricity usage (2017): juyo-2017.csv

You can change the file names if you like, but make sure to replace our bucket name with your own bucket name when following along with this guide.

The data needs to be encoded as UTF-8. We’ve already prepared the data this way, but you’ll need to convert weather data from the Japan Meteorological Agency if you download it directly from their website.
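
If you do download the weather data directly from the JMA website, the following minimal Python sketch shows one way to re-encode it. It assumes the downloaded file is Shift_JIS (CP932) encoded, which is typical for JMA CSV downloads, and that it is named weather_tokyo.csv; adjust the file names and source encoding to match your situation.

# Re-encode a CSV downloaded from the JMA website as UTF-8.
# Assumption: the source file is Shift_JIS (CP932); check your file and adjust if needed.
with open("weather_tokyo.csv", encoding="cp932") as src:
    text = src.read()

with open("weather_tokyo_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)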

We recommend using the GCS Explorer to upload the data. Please refer to Uploading files to GCS from the Basic Guide and upload each of the previously mentioned data files to GCS.

When doing so, follow these rules for selecting the bucket you will upload the data into:

  • Full Service Plan users:

    Select the bucket that ends with -data.

  • Self-Service Plan and Free Trial users:
    • If you created the default buckets along with your project, select the bucket that ends with -data.
    • If you already have an appropriate bucket ready, select that bucket.
    • If you don’t have an appropriate bucket ready, create a new one.

Once you’ve uploaded all the data into GCS, it should be available for BLOCKS to access and use. Next, we’ll move on to processing the data.

Sending the data to the Flow Designer

We’ll use the Flow Designer to process our data with Google’s BigQuery service. BigQuery can quickly process very large amounts of data at a reasonable cost. With BLOCKS, using BigQuery is simple. You can do all sorts of data analysis with BigQuery besides just preparing data for the Model Generator, so we suggest trying out some of the other BigQuery-related features in the Flow Designer when you have the chance.

We’ve uploaded our initial data into GCS, so now we’ll use the Flow Designer to import the data into BigQuery tables for processing.

In the Flow Designer, you can connect various BLOCKs into processing units called Flows.

A Flow always starts with a Start of Flow BLOCK and ends with an End of Flow BLOCK. These two BLOCKs can be found in the Basic section of the BLOCK list.

Within the Start of Flow BLOCK’s properties, you can set an auto-execution schedule or press a button to manually execute the Flow. You can also assign a Flow ID, then use this ID to execute the Flow from an external application, or connect it with your company’s systems.

Importing the weather data into BigQuery

We’ll upload the weather data into BigQuery with the Load to single table from GCS BLOCK from the BigQuery section of the BLOCK list.

Create a Flow like the one shown below with a Start of Flow BLOCK, Load to single table from GCS BLOCK, and an End of Flow BLOCK.

Importing the weather data to BigQuery

If you wish, you can change the name displayed on any BLOCK by changing the BLOCK name setting in its properties.

Configure the properties of the Load to single table from GCS BLOCK as shown in the following chart:

Property Value
Source file GCS URL

gs://magellan-blocks-doc-data/weather_tokyo_utf8.csv

Replace the bucket name portion (magellan-blocks-doc-data) with your own bucket’s name.

Destination dataset blocks_ml_juyo
Destination table tokyo_city_20160401_20170104_raw
Schema settings
date STRING NULLABLE
weekday STRING NULLABLE
max_temp STRING NULLABLE
min_temp STRING NULLABLE
hours_of_sunshine STRING NULLABLE
other STRING NULLABLE
humidity_avg STRING NULLABLE

You can enter the schema settings more quickly by clicking Edit as JSON and copying in the following JSON text:

[
 {
  "name": "date",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "weekday",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "max_temp",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "min_temp",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "hours_of_sunshine",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "other",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "humidity_avg",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]
In cases of non-empty tables Overwrite
File format CSV
CSV delimiter character Comma
Number of skipped rows

6

The weather data CSV file contains 6 header rows (rows at the top that don’t contain data). Set this to 0 for files that don’t have any header rows.

Click Save at the top of the Flow Designer. You can’t execute Flows until they have been saved.

Open the Start of Flow BLOCK’s properties and click the button to manually execute the Flow.

The Flow Designer outputs Flow execution logs to the Logs section at the bottom of the screen. Refer to the Logs section of the Basic Guide for more details.

If you run into an error, review the properties for each BLOCK and try saving and executing the Flow again.

We’ll follow the same process to import the rest of our data with the following Flows.

Importing the sunrise/sunset data into BigQuery

Next, we’ll import the sunrise and sunset data into BigQuery. Create a Flow with a Start of Flow BLOCK, two Load to single table from GCS BLOCKS, and an End of Flow BLOCK.

Flow for importing sunrise/sunset data into BigQuery

Configure the properties for the first Load to single table from GCS BLOCK as follows to import the 2016 data:

Property Value
Source file GCS URL

gs://magellan-blocks-doc-data/sunrise_sunset_2016.csv

Replace the bucket name portion (magellan-blocks-doc-data) with your own bucket’s name.

Destination dataset blocks_ml_juyo
Destination table tokyo_city_sunrise_sunset_2016_raw
Schema settings
date STRING NULLABLE
sunrise_time STRING NULLABLE
sunrise_orientation STRING NULLABLE
meridian_passage STRING NULLABLE
altitude STRING NULLABLE
sunset_time STRING NULLABLE
sunset_orientation STRING NULLABLE

You can enter the schema settings more quickly by clicking Edit as JSON and copying in the following JSON text:

[
 {
  "name": "date",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "sunrise_time",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "sunrise_orientation",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "meridian_passage",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "altitude",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "sunset_time",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "sunset_orientation",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]
In cases of non-empty tables Overwrite
File format CSV
CSV delimiter character Comma
Number of skipped rows

1

The sunrise/sunset data CSV file contains 1 header row (rows at the top that don’t contain data). Set this to 0 for files that don’t have any header rows.

Configure the next Load to single table from GCS BLOCK with the same settings, but change the 2016 in Source file GCS URL and Destination table properties to 2017.

Once finished, click Save and execute the Flow.

Importing the electricity data into BigQuery

We’ll create another similar Flow to import the electricity usage data into BigQuery:

Importing the electricity data into BigQuery

Configure the first Load to single table from GCS BLOCK as follows:

Property Value
Source file GCS URL

gs://magellan-blocks-doc-data/juyo-2016.csv

Replace the bucket name portion (magellan-blocks-doc-data) with your own bucket’s name.

Destination dataset blocks_ml_juyo
Destination table juyou_2016_raw
Schema settings
date STRING NULLABLE
time STRING NULLABLE
used STRING NULLABLE

You can enter the schema settings more quickly by clicking Edit as JSON and copying in the following JSON text:

[
 {
  "name": "date",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "time",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "used",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]
In cases of non-empty tables Overwrite
File format CSV
CSV delimiter character Comma
Number of skipped rows

3

The electricity data CSV file contains 3 header rows (rows at the top that don’t contain data). Set this to 0 for files that don’t have any header rows.

Configure the next Load to single table from GCS BLOCK with the same settings, but change the 2016 in Source file GCS URL and Destination table properties to 2017.

Once finished, click Save and execute the Flow.

Creating the training data

Now that all of the initial data is loaded into BigQuery, we’ll use it to prepare the training data for the Model Generator.

However, since each set of data is formatted differently, we need to standardize the data by doing the following:

  1. Standardize the date formatting: We’ll zero-pad single-digit months and days so that, for example, 2016/4/1 becomes 2016/04/01.

  2. Standardize the scope: The electricity usage CSV contains hourly data, while each of the other CSV files contains daily records. We’ll convert the hourly electricity usage data into daily totals.

  3. Join together the data by date: We’ll combine all of the data and arrange it by date.

The Flow to create the training data is a bit longer than the previous Flows, but still just involves connecting each BLOCK.

As before, start by placing a Start of Flow BLOCK. Then, place a Parallel branch BLOCK so we can process two things simultaneously.

Joining the data Flow
Standardizing the data formatting
The weather data

First, we’ll use an Execute query BLOCK to standardize the way the date is written in the weather data. We named this BLOCK Execute query (unify weather data key) in the image above. Set its properties as follows:

Property Value
SQL syntax Standard SQL
Query
SELECT  REGEXP_REPLACE(REGEXP_REPLACE(date,"(/)([0-9]{1})($)","\\10\\2\\3"),"(/)([0-9]{1})(/)","\\10\\2\\3") AS date,
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg
FROM
  blocks_ml_juyo.tokyo_city_20160401_20170104_raw
ORDER BY
  date
Dataset for storing results blocks_ml_juyo
Table for storing results weather_2016_2017
In cases of non-empty tables Overwrite
Standardizing the electricity data

The electricity usage data is a bit more complicated since we’ll also need to convert the hourly data into daily data.

Doing everything at once could be difficult, so we’ll simplify things by breaking it down into two steps with two separate BLOCKS.

Standardizing the date format

First, we’ll standardize the electricity usage data’s date formatting with an Execute query BLOCK. We’ve named this BLOCK Execute query (unify elec. data key) in the image above.

Set its properties as follows:

Property Value
SQL syntax Standard SQL
Query
SELECT  REGEXP_REPLACE(REGEXP_REPLACE(date,"(/)([0-9]{1})($)","\\10\\2\\3"),"(/)([0-9]{1})(/)","\\10\\2\\3") AS date,
  time,
  used
FROM
  blocks_ml_juyo.juyou_2016_raw
ORDER BY
  date,
  time
Dataset for storing results blocks_ml_juyo
Table for storing results juyou_2016_tmp
In cases of non-empty tables Overwrite
Converting hourly data to daily data

Next, we’ll convert the hourly data into daily data using another Execute query BLOCK. We named this BLOCK Execute query (convert to daily) in the image above.

Configure its properties as follows:

Property Value
SQL syntax Standard SQL
Query
SELECT
  date,
  SUM(cast(used as int64)) as used_sum
FROM
  blocks_ml_juyo.juyou_2016_tmp
GROUP BY
  date
ORDER BY
  date
Dataset for storing results blocks_ml_juyo
Table for storing results juyou_2016
In cases of non-empty tables Overwrite

Joining the data together by date

Lastly, we’ll use another Execute query BLOCK to join all of our data together and arrange it by date. We named this BLOCK Execute query (join data together) in the image above.

Configure its properties as follows:

Property Value
SQL syntax Standard SQL
Query
SELECT
  juyo.date,
  weather.max_temp,
  weather.min_temp,
  weather.hours_of_sunshine,
  weather.humidity_avg,
  daytime.sunrise_time,
  daytime.sunset_time,
  juyo.used_sum
FROM
  blocks_ml_juyo.juyou_2016 AS juyo
LEFT JOIN
  blocks_ml_juyo.weather_2016_2017 AS weather
ON
  juyo.date = weather.date
LEFT JOIN
  blocks_ml_juyo.tokyo_city_sunrise_sunset_2016_raw AS daytime
ON
  weather.date = daytime.date
ORDER BY
  date
Dataset for storing results blocks_ml_juyo
Table for storing results juyo_all_data_2016
In cases of non-empty tables Overwrite

Once finished, click Save and execute the Flow.

Cleansing the data

Next we need to cleanse the data and remove any irregularities. For example, the weather data contains “null” values where the machinery might have been broken or otherwise unable to make measurements. If we leave these values in the data, it might affect the accuracy of the Model Generator as it looks for patterns in the data.

We’ll cleanse the data with the following Flow:

Data cleansing Flow

Some of the average humidity data contains the value “null”. We’ll use an Execute query BLOCK to remove any null values from the data. Configure its properties as follows:

Property Value
SQL syntax Standard SQL
Query
SELECT
  *
FROM
  blocks_ml_juyo.juyo_all_data_2016
WHERE
  date != "null"
  AND used_sum IS NOT NULL
  AND max_temp != "null"
  AND min_temp != "null"
  AND hours_of_sunshine != "null"
  AND humidity_avg != "null"
  AND sunrise_time != "null"
  AND sunset_time != "null"
ORDER BY
  date
Dataset for storing results blocks_ml_juyo
Table for storing results juyo_all_data_2016_no_error
In cases of non-empty tables Overwrite
Standardizing data types

Next we’ll standardize the data types using another Execute query BLOCK. This will make it simple to compile our data. If an error occurs during conversion, there might still be irregular values in the data. In that case, please take another look through the data.

Property Value
SQL syntax Standard SQL
Query
SELECT
  date,
  used_sum,
  CAST(max_temp AS float64) AS max_temp,
  CAST(min_temp AS float64) AS min_temp,
  CAST(hours_of_sunshine AS float64) AS hours_of_sunshine,
  CAST(humidity_avg AS float64 ) AS humidity_avg,
  PARSE_TIMESTAMP("%%R", sunrise_time ) AS sunrise_time,
  PARSE_TIMESTAMP("%%R", sunset_time ) AS sunset_time
FROM
  blocks_ml_juyo.juyo_all_data_2016_no_error
Dataset for storing results blocks_ml_juyo
Table for storing results juyo_all_data_2016_cast
In cases of non-empty tables Overwrite
Calculating daytime hours

Next, we’ll calculate the length of daytime (in minutes) from the sunrise and sunset data using another Execute query BLOCK.

Property Value
SQL syntax Standard SQL
Query
SELECT
  date,
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg,
  TIMESTAMP_DIFF(sunset_time,sunrise_time,minute) AS daytime_time,
  used_sum
FROM
  blocks_ml_juyo.juyo_all_data_2016_cast
ORDER BY
  date
Dataset for storing results blocks_ml_juyo
Table for storing results juyo_all_data_2016_daytime
In cases of non-empty tables Overwrite

Once finished, click Save and execute the Flow.

Creating the training and validation sets

With our initial data now ready, we’ll create the training data for the Model Generator.

The training data consists of two parts: the training set and the validation set. The Model Generator analyzes the training set to create a predictive model. Then, it evaluates its model with the validation set. With these two sets of data, the Model Generator can “learn” to accurately predict for the results value.

One thing we need to be careful about is that our data is currently arranged in chronological order. If we were to only use data from part of the year in the training set, the Model Generator might find patterns specific to that season. This might result in a model that isn't good at predicting values throughout the year.

To avoid this, we need to randomize the order of our data before we split it into the training set and validation set. To do this, we’ll assign a random value to each record, then order the data according to those values. Once we reorder the data, we’ll split it into the two sets at an 8:2 ratio.

Flow for creating the training data

We’ll assign the random values using an Execute query BLOCK.

Property Value
SQL syntax Standard SQL
Query
WITH
  tmp AS (
  SELECT
    date,
    used_sum,
    max_temp,
    min_temp,
    hours_of_sunshine,
    humidity_avg,
    daytime_time,
    RAND() AS rand
  FROM
    blocks_ml_juyo.juyo_all_data_2016_daytime)
SELECT
  date,
  used_sum,
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg,
  daytime_time,
  ROW_NUMBER() OVER (ORDER BY rand) AS row_num
FROM
  tmp
ORDER BY
  row_num
Dataset for storing results blocks_ml_juyo
Table for storing results juyo_all_data_2016_no
In cases of non-empty tables Overwrite

Next, we’ll use another Execute query BLOCK to create the training set as a BigQuery table. Configure its properties as follows:

Property Value
SQL syntax Standard SQL
Query
SELECT
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg,
  daytime_time,
  used_sum
FROM
  blocks_ml_juyo.juyo_all_data_2016_no
WHERE
  row_num <= 219

row_num <= 219 will differ depending on the number of rows in your data. Our data has 274 rows, and 274 × 0.8 ≈ 219, so rows 1 through 219 become the training set in this 8:2 split.

Dataset for storing results blocks_ml_juyo
Table for storing results train_data
In cases of non-empty tables Overwrite

We’ll use an Export single table to GCS BLOCK to export the training set to GCS as a CSV file.

Property Value
Destination file GCS URL

gs://magellan-blocks-doc-train/train_data.csv

Replace the bucket name portion (magellan-blocks-doc-train) with your own bucket’s name.

Source dataset blocks_ml_juyo
Source table train_data
Output header line Remove the check

Next, we’ll use another Execute query BLOCK to create the validation set as a BigQuery table. Configure it as follows:

Property Value
SQL syntax Standard SQL
Query
SELECT
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg,
  daytime_time,
  used_sum
FROM
  blocks_ml_juyo.juyo_all_data_2016_no
WHERE
  row_num > 219

row_num > 219 will differ depending on the number of rows in your data. Our data has 274 rows, and 274 × 0.8 ≈ 219, so the remaining 55 rows become the validation set in this 8:2 split.

Dataset for storing results blocks_ml_juyo
Table for storing results eval_data
In cases of non-empty tables Overwrite

We’ll use an Export single table to GCS BLOCK to export the validation set to GCS as a CSV file.

Property Value
Destination file GCS URL

gs://magellan-blocks-doc-train/eval_data.csv

Replace the bucket name portion (magellan-blocks-doc-train) with your own bucket’s name.

Source dataset blocks_ml_juyo
Source table eval_data
Output header line Remove the check

Once finished, click Save and execute the Flow.

Training the model

With the data prepared, we’re ready to start training the model. With the Model Generator, this is actually incredibly simple. This means you can focus more of your efforts on selecting relevant input variables and preparing good data.

Creating the Model Generator

If you haven’t made any Model Generators yet, click Start on the What is the Model Generator? screen.

What is the Model Generator?

If you already have at least one Model Generator, click Add from the top of the Model Generator list.

The Model Generator list

A message will appear if your organization does not have sufficient licenses to create the Model Generator. If you are an admin for your organization, you will be prompted with instructions on how to purchase an additional license. If you are not an admin, you will be prompted to contact your admins.

Select the Regression type.

Selecting the model type

Enter a name for your Model Generator.

Naming the Model Generator (regression)

The following step is for Self-Service Plan users only

BLOCKS uses Google’s Cloud Machine Learning Engine, so you will need to have registered a GCP service account to use the Model Generator. Select your GCP account and enable any required APIs that don’t have a checkmark.

GCP service account settings (regression model)

The following step is for Self-Service Plan users only

Once the training completes, the trained model will be stored into GCS. As such, you’ll need to configure the GCS destination that will store the results.

This bucket must be set to the default storage class Regional and the location us-central1. We cannot guarantee that the Model Generator will function properly for buckets with other settings.

Model Generator storage settings (regression model)

Next, you need to configure settings for your training data. Enter the name, type, and dimensions for each item in the data.

Make sure that the results value (electricity usage in this case) is always the last column in your actual training data. When you configure your training data during this step of the Model Generator setup, only enter information for the input variables. Do not enter information for the results variable.

Model Generator training data settings (regression model)

Configure your training data settings as follows:

Item name Type Dimensions Explanation
max_temp Numerical value 1 Daily high temperature
min_temp Numerical value 1 Daily low temperature
hours_of_sunshine Numerical value 1 Number of hours the sun was shining each day
humidity_avg Numerical value 1 Daily average humidity
daytime_time Numerical value 1 Length of time between sunrise and sunset each day (in minutes)

Set Output dimensions to 1.

Review all of your settings on the confirmation screen, then click Finish to create the Model Generator.

Model Generator confirmation screen

Starting a training

Once the Model Generator is created, we can start training the regression model.

Model Generator details screen

Click Start Training.

Starting a training
  • Enter a name for the training.
  • For the Training data upload method, select Prepare separate training set and validation set files.
  • Enter the GCS URLs for your training set and validation set files:
    • Example training set URL: magellan-blocks-doc-train/train_data.csv
    • Example validation set URL: magellan-blocks-doc-train/eval_data.csv

    GCS URLs can only contain letters, numbers, underscores (_), hyphens (-), or slashes (/).

    Replace the bucket name portion (magellan-blocks-doc-train) with your own bucket name.

  • Set Number of CSV header rows to skip to 0.
  • If desired, you can enter a time for the Max. time until timeout (minutes).
  • Set the Max. number of trials. By setting more than one trial, the Model Generator can automatically adjust its parameters until it finds the most accurate tuning.
  • Select whether to Enable or Disable automatic early stopping. Enable this option to have the training stop before the end of the specified training time if the Model Generator determines that accuracy is unlikely to improve. While this can reduce unnecessary training time, please be aware that accurately determining early stopping cannot be guaranteed.
  • Set the Machine type as either Single node, Single node (GPU), or Distributed nodes.
    • Single node: Uses the standard machine to run the training.
    • Single node (GPU): Runs the training using a GPU (Graphics Processing Unit) for generally faster results than the standard single node type. However, GCP fees will cost approximately three times as much. Depending on the training data, the speed may not be significantly faster, or may even be slower in some cases.
    • Distributed nodes: Runs the training using multiple machines.

Click Start to start training the model. You can check on its progress from the training details list.

Training details

Applying the trained model

Once the training finishes, its status should change to Succeeded and an Apply button should appear. Select whether to apply the model to a Testing or Production setting and click Apply.

Please wait a short while after clicking the apply button before running any predictions, as running them immediately may cause errors in the current version. If an error does occur, wait another few minutes and try running your prediction again.

Making predictions

Once the training is complete and the model has been applied, we can use it to make predictions.

Making a predictive Flow

We can use our model to make predictions in the Flow Designer with the following Flow:

Predictive Flow (regression model)

After the Start of Flow BLOCK, place a Construct object BLOCK. We’ll enter the input variable data for the days we want to predict electricity usage for into this BLOCK.

Property Value
Results variable _ (The default setting)
Data
Construct object BLOCK data property example (regression model)

In the above image, we’ve entered data for 4 days that we will predict electricity usage for. data.0, data.1, data.2, and data.3 each contain the input variable data for a day. To enter multiple sets of data like this into the Construct object BLOCK, use the Array type and click the + button next to Array to create elements for data.0, data.1, data.2, and data.3.

Set each of these elements to the Object type. Use the + button next to Object to add a set of data that includes an identifier (key), the high temperature (max_temp), the low temperature (min_temp), the hours of sunshine (hours_of_sunshine), the average humidity (humidity_avg), and the daytime length (daytime_time): the same input variables we used to train the model.
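
As a rough sketch, the object this BLOCK stores into the _ variable ends up with a structure like the following (the values shown are illustrative; in the Flow Designer you enter them through the BLOCK’s properties rather than as raw JSON):

{
 "data": [
  {
   "key": "20170101",
   "max_temp": 10.6,
   "min_temp": 2.5,
   "hours_of_sunshine": 8.7,
   "humidity_avg": 45,
   "daytime_time": 584
  },
  {
   "key": "20170102",
   "max_temp": 12.0,
   "min_temp": 3.8,
   "hours_of_sunshine": 7.9,
   "humidity_avg": 50,
   "daytime_time": 585
  }
 ]
}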

Connect a Model Generator prediction (online) BLOCK (from the Machine Learning category) to the Flow and configure it as follows:

Property Value
Model used for predictions Select the model you trained in the Model Generator
Input variable _.data

There are many different ways to output the prediction results, but we’ll just load them into a BigQuery table using the Load to table from variable BLOCK from the BigQuery category of the BLOCK List. Set its properties as follows:

Property Value
Source data variable _.predictions
Destination dataset blocks_ml_juyo
Destination table predict_result
Schema settings
output FLOAT NULLABLE
key STRING NULLABLE

You can enter the schema settings more quickly by clicking Edit as JSON and copying in the following JSON text:

[
 {
  "name": "output",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "key",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]
In cases of non-empty tables Overwrite
File format NEWLINE_DELIMITED_JSON

If you also want to get a file of the results, you can use the Export single table to GCS BLOCK from the BigQuery section of the BLOCK list.

Property Value
Destination file GCS URL

gs://magellan-blocks-doc-data/predict_result.json

Replace the bucket name portion (magellan-blocks-doc-data) with your own bucket’s name.

Source dataset blocks_ml_juyo
Source table predict_result
File format NEWLINE_DELIMITED_JSON
Running the prediction

Once the Flow is ready, click Save then execute the Flow by clicking the button within the Start of Flow BLOCK’s properties.

The results will be saved into GCS as a JSON file like the following:

Filename: gs://magellan-blocks-doc-data/predict_result.json

Replace the bucket name portion (magellan-blocks-doc-data) with your bucket’s name.

{"output":84442.0390625,"key":"20170101"}
{"output":81434.53125,"key":"20170102"}
{"output":78059.7421875,"key":"20170103"}
{"output":75184.3125,"key":"20170104"}

"output" contains the predicted electricity usage level.

Refer to Predicting with the Model Generator prediction (online) BLOCK for other ways to make predictions with this BLOCK beyond the method shown in this guide.

Evaluating the model

Once you can make predictions with a model, you can evaluate its results. If you set benchmarks beforehand, you can compare your results against your benchmarks.

We didn’t set any benchmarks in this guide, but we’ll still think about ways we could improve the model in the following sections.
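
If you’d like to check the error rate yourself, here is a minimal Python sketch for computing the RMSE from the predict_result.json file exported earlier. It assumes a hypothetical actuals.csv file that you prepare separately (no header; each line is a date key followed by the actual daily usage for that date), so adjust the file names and format to match your own data.

import json
import math

# Load the actual usage values, keyed by date.
# Assumption: actuals.csv has no header and lines look like "20170101,84237".
actuals = {}
with open("actuals.csv") as f:
    for line in f:
        key, used = line.strip().split(",")
        actuals[key] = float(used)

# Compare each prediction against the actual value for the same date key.
squared_errors = []
with open("predict_result.json") as f:
    for line in f:
        record = json.loads(line)
        error = record["output"] - actuals[record["key"]]
        squared_errors.append(error ** 2)

rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print("RMSE:", rmse)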

Increasing the amount of data

We only used data from 2016 to train the model in this example, so we might be able to achieve more accurate results if we increased the number of years of data in our training set.

Unfortunately, only one year’s worth of data was publicly available from TEPCO when we created this guide. If possible, we think having at least 3 years of data would help improve the accuracy of this kind of model.

Increasing the length and number of trials

Increasing the number and length of trials during the training can also affect accuracy.

By increasing the number of trials the Model Generator runs, we can give it more opportunities to adjust its parameters and find ones that give more accurate results.

Similarly, increasing the limit on how long each trial can run can give the Model Generator more time for its calculations, which could lead to higher accuracy.

Running more trials and allowing more time per trial can be helpful, but it also makes the training take that much longer. You'll have to decide on the balance that works best for your situation.

Reviewing the input variables

What else might affect electricity usage? Maybe using data about the type of day could have an influence. For example, perhaps more electricity is used on weekdays than weekends, since that’s when most business is done. Along the same lines, maybe people use a different amount of electricity on public holidays.

As an example, we’ve tried adding these new input variables and training a new model.

The error rate (RMSE) of the original model we trained in this guide was about 80,000,000 kW.

This may sound like a large error, but as explained in the benchmarks section, RMSE is scale-dependent. Electricity usage in the data ranges between 600,000,000 kW and 900,000,000 kW, so predictions being off by about 80,000,000 kW on average doesn’t seem too bad.

When we added the new input variables for the day of the week and holidays, the CSV file looked like the following:

Revised training data

The sixth column shows the day of the week starting from Monday as 0 through Sunday as 6. The seventh column shows a 0 for a non-holiday and a 1 for holidays.
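
For example, a single row of the revised training data might look like this (values are illustrative):

9.9,2.3,8.9,41,648,6,1,84237

Here, 6 is the day of the week (a Sunday) and 1 marks a holiday, with the electricity usage still in the last column.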

Next, we created a Model Generator with the revised training data settings.

For the day of the week data, we selected the Days data type. For the holidays, we selected Strings (enumerated) for the data type, then entered 0,1 into the Keyword list setting.

With the newly configured Model Generator, we could follow the same steps as before to train the model and make predictions.

So how did the new model perform?

By adding the new input variables, the RMSE dropped to approximately 45,000,000 kW, a little over half that of the original model, making the new model nearly twice as accurate as before!

Summary

Finding input variables that will result in an accurate model often comes down to asking yourself very basic questions, such as: when do people use electricity? Or why do people buy this product?

As such, you'll need to be well versed in the workings of your business to find the best input variables.

This ties into one of the reasons we’re trying at Groovenauts to make Machine Learning accessible to everyone: so that business leaders—and not just programmers—can take a leading role in using Machine Learning.

We’re excited to see the kinds of models you make with the Model Generator.