ML Board How-To Guide: Regression Type

Forecast demand with the regression model

Introduction

The ML Board’s regression model predicts values based on existing data. For example, weather data for various dates could be used to predict how many people will attend an event, sales numbers, or public transportation usage.

Things to know before starting

Input variables

We’ll refer to the data used as the basis for making a prediction as the input variables. If weather data was used to predict future sales, the weather data would be an input variable. It’s crucial to select inputs that are relevant to whatever you are trying to predict. Good inputs will result in accurate predictions, while poor inputs can lead to a completely inaccurate model.

So, if you were going to use the regression model to predict future sales, you should first carefully consider things like why people would buy the product. If necessary, take time to analyze your data to find relevant input variables. If you can do that, the ML Board should be able to find connections between the input data and sales numbers, improving the accuracy of its predictions.

Benchmarks

When selecting input variables, it’s also important to set benchmarks. By doing so, we can evaluate the accuracy of our predictions and determine whether or not they are improving.

A prediction’s accuracy is evaluated by comparing its results against a set of actual results. This is often done using the Root Mean Square Error (RMSE) method.

The RMSE formula is shown below. Smaller RMSE values indicate a smaller deviation between predictions and actual results. Note that RMSE is scale-dependent, so it should only be compared between models that predict values on the same scale.

RMSE = √( (1/n) × Σ (predicted value − actual value)² ), where n is the number of predictions.
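As a concrete illustration, here is a minimal Python sketch of the RMSE calculation. The function and the sample values are our own hypothetical example, not part of the ML Board or the data used in this guide:

import math

def rmse(predicted, actual):
    # Root Mean Square Error between two equal-length lists of numbers
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predicted vs. actual values for three days (not from this guide's data)
predicted = [82000, 79500, 76800]
actual = [84000, 78000, 77500]
print(rmse(predicted, actual))  # the smaller the result, the closer the predictions are to reality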

About this sample's situation

In this sample, we’ll use actual data from an electric company to create a predictive model for future electricity usage.

So what input variables might be relevant for estimating electricity usage?

Since we often use electricity for things like air-conditioning, heating, and dehumidifying, we can guess that temperature and humidity might be relevant.

We also tend to use more lights at night, so the number of daylight hours is probably important, too.

Continuing from our thinking above, we’ll try using temperature, humidity, and the number of hours between sunrise and sunset as our input variables.

Later, as we try to improve our model, we might consider adding data about things like holidays, or big events like the Olympics that might have an effect. Or, we could focus our prediction on a smaller region and add more data specific to that place.

Making a predictive model with the ML Board

Before we start actually making an ML Board, we’ll explain the basic process below.

  1. Choosing a model: We’ll choose the regression model since we want to predict a level of future demand. Additional models will also be added to the ML Board in the future.

  2. Considering inputs: Deciding what to use as input variables is the core of creating a good predictive model. We’ll use temperature, humidity, and daylight hours for our inputs in this sample.

  3. Creating the training data: Once we’ve selected our inputs, we turn these into training data. The ML Board will analyze this data for patterns that lead to accurate predictions. Our training data will contain information for temperature, humidity, and daylight hours, as well as actual electricity usage numbers. By giving it this data, the ML Board will calculate a relationship between the input variables and electricity usage. With that, we can give it future temperature, humidity, and daylight hours and it can predict electricity usage.

  4. Training: With the data prepared, we’ll start training the ML Board. It performs automatic optimization, running through several parameter patterns to produce a more accurate trained model.

  5. Evaluation: Once the training is complete, we can make a prediction using our model in a BLOCKS Big Data Board. By using our model on data that wasn’t used to train it, we can evaluate how accurately it predicts against actual results. If it’s not accurate enough, we might need to reconsider our input variables, how we’re giving them to the ML Board, and the like.

  6. Utilization: Now that we’ve finished creating a trained model, we can use it to make predictions on a BLOCKS Big Data Board. Or, we can use a BLOCKS IoT Board to gather continuous data for our input variables and improve the accuracy of our predictions. By using BLOCKS, the need to program system infrastructure is removed, allowing us to do these things and more in a cost-effective and efficient manner.

Creating data for machine learning

About the data

For our predictive model of electricity usage, we’ve decided to use weather and daylight data for our input variables. Fortunately, this data is available publicly online.

  1. Weather data: Japan Meteorological Agency
  2. Daylight hours: National Astronomical Observatory of Japan
  3. Electricity usage: TEPCO Power Grid, Inc.

We’ll use the 2016 data for each to train our model, and then verify its accuracy against 2017 data. There may be times when accurate predictor data is not available and you will need to create a model to estimate that data, but we’ll leave that topic for another time.

The following five items will make up our input variables:

  • High temperature
  • Low temperature
  • Sunlight hours
  • Average humidity
  • Daytime hours (number of hours between sunrise and sunset)

The target of our prediction is the amount of electricity used each day.

  • Electricity usage (daily)

In other words, by providing it with data for temperature, sunlight, humidity, daytime hours, and electricity usage, the ML Board will look for connections between the variables. If it can do this well, it will be able to accurately predict electricity usage when shown only the other types of data.

Getting the input data

Weather data

First we’ll get our weather data from the Japan Meteorological Agency (JMA) website.

We’ve downloaded the weather data for this guide from the JMA website and made it available below.


weather_tokyo_utf8.csv

A CSV file of Tokyo-area data for daily high temp., low temp., sunlight hours, and average humidity from Apr. 1, 2016 through Jan. 4, 2017.

Click the link to the left to download the file.

If clicking the link just displays the weather data on the screen, try right-clicking the link and clicking the “save link as” option. This should allow you to download it as a CSV file.

We’ll focus on weather data for the Tokyo region in this sample. Since our electricity data covers TEPCO’s entire power grid, there may be some discrepancy in our model. While making the areas consistent across all of our data could lead to a more accurate model, we’ll just use these data sets to get started.

If you want to test things with the newest weather data, you can reference the search criteria below to download it directly from the JMA website. Their data search menus are currently available in Japanese only, but the GIFs below show the process we used to get the data available above.

Search step Conditions
Select location
  • Region: Tokyo (東京)
  • Site: Tokyo (東京) only
Select items
  • Data types: Daily values
  • Options for comparing with past averages: Do not select any
  • Items
    • Temperature: Daily high temp and daily low temp
    • Sunlight/solar radiation: Hourly sunlight
    • Humidity/atmospheric pressure: Daily average relative humidity
Select period
  • Display as continuous period:
    Period: From 2016/4/1 to 2017/1/4
Select display options
  • Data requiring special permission: Do not display
  • Irregular data due to changes in observation environment: Do not display
  • CSV file data specifications: Store all as numerical values:
    • Date format: Store as literal dates
  • Other: Show days of the week
Selecting additional display options

Sunrise and sunset data

This data can be downloaded from the National Astronomical Observatory of Japan website. We’ve downloaded the data used in this guide from their homepage and prepared it for use below. We’ll use it later to calculate daytime hours.


sunrise_sunset_2016.csv

A CSV file of Tokyo-area sunrise and sunset times from Apr. 1, 2016 through Dec. 31, 2016.

Click the link to the left to download the file.

If clicking the link just displays the sunrise/sunset data on the screen, try right-clicking the link and clicking the “save link as” option. This should allow you to download it as a CSV file.


sunrise_sunset_2017.csv

A CSV file of Tokyo-area sunrise and sunset times from Jan. 1, 2017 through Jan. 4, 2017.

Click the link to the left to download the file.

If clicking the link just displays the sunrise/sunset data on the screen, try right-clicking the link and clicking the “save link as” option. This should allow you to download it as a CSV file.

The sunrise and sunset data are Copyrighted Works of the National Astronomical Observatory of Japan and are subject to their Terms of Use.

Electricity usage data

This data can be downloaded from the TEPCO Power Grid website. We’ll get data for both 2016 and 2017 here.

To download the files, simply right-click the 2016 and 2017 links and select “Save link as…” This should allow you to download CSV files for each year.

About the training data

Once all of the necessary data has been downloaded, we’ll prepare the training data for the ML Board. In order for the ML Board to be able to train properly, we need to split the data into two types.

  1. Training set: The data used to train our model.
  2. Validation set: This data is not used to train the model, but rather to evaluate it. With it, the ML Board can evaluate how accurately the model predicts unknown data (the generalization error).

Both data types should be prepared as CSV files. There is only one important rule to remember when formatting your data files: order the columns with your input variables first and place the value to be predicted (which must be numerical) last.

With that in mind, we’ll arrange our data in the following order (with electricity usage last):

Daily high temp, low temp, sunlight hours, average humidity, daytime hours, electricity usage

For Full Service Plan users

Please upload the two files into the Google Cloud Storage (GCS) location prepared by BLOCKS.

As a Google account is required to use this GCS location, you must first register an account into the GCP access section of the Project settings menu.

If you do not have a Google account, refer to Creating a Google account and register your new account into the GCP access section of the Project settings menu.

For Self-Service Plan users

Please create a bucket in GCS and store the two files into it. We save our files into a bucket named gs://magellan-blocks-doc-train/ in this guide, but feel free to choose your own bucket name.

GCS buckets used here should have default storage class set to Regional and location set to us-central1. Operations cannot be guaranteed for buckets with different settings.

How to create the training data

The simplest method for creating the training data is to use a spreadsheet program like Google Sheets or Excel. Insert the data into the program with each input variable aligned by column. Make sure the final column has the data for the value that will be predicted, then save as a CSV file.

While this may be the simplest method, it might not be practical or possible when dealing with large amounts of data. In this case, or if you want to automate things, it would be more practical to use a BLOCKS Big Data Board, which can be automated and process big data quickly.

The data in this example isn’t all that large, but we’ll briefly introduce how to create a Big Data Board. That way, those interested can know how to use it to automate and simplify things in the future.

If you’d rather just use a spreadsheet program for now, feel free!

Uploading the data

Before we can create our training data with the Big Data Board, the preliminary data that will be used to create it first needs to be uploaded into a GCS bucket.

In this tutorial, we’ve used the following bucket and file names. However, you are free to use whatever names you prefer.

  • GCS bucket: gs://magellan-blocks-doc-data/
  • Data file names:
    • Weather data: weather_tokyo_utf8.csv
    • Daytime hours data (2016): sunrise_sunset_2016.csv
    • Daytime hours data (2017): sunrise_sunset_2017.csv
    • Electricity usage (2016): juyo-2016.csv
    • Electricity usage (2017): juyo-2017.csv

Make sure each file is encoded as UTF-8. In particular, make sure to do this if you downloaded the weather data from the Japan Meteorological Agency website, as it is not UTF-8 encoded when downloaded. (The file hosted on this page is already UTF-8 encoded.)

Converting files to UTF-8 can be done easily using tools like nkf.
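If you’d rather use a script than nkf, a minimal Python sketch like the one below can also handle the conversion. It assumes the downloaded JMA file is Shift_JIS encoded and uses hypothetical file names, so adjust both to match your environment:

# Convert a Shift_JIS CSV file (a typical encoding for JMA downloads) to UTF-8
with open("weather_tokyo.csv", encoding="shift_jis") as src, \
     open("weather_tokyo_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(src.read())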

For Full Service Plan users

We will use the Google Cloud Console to upload the data.

  1. Open the GCP service accounts section of the Project settings menu.

    Project settings (GCP service account)
  2. Click the data upload link (a URL that starts with gs://) for a registered account. This opens the Google Cloud Console in a new tab.

    The **** portion of the “gs://****” URL is the GCS bucket name. Read any instances of “bucket name” in the rest of this document as the bucket name displayed in this URL.

  3. Click the “Upload File” button at the top of the Google Cloud Console and upload the data files.

    Google Cloud Console GCS screen (upload data)

Once you’ve done this, the data should all be uploaded and accessible to BLOCKS. Now we’re ready to get started with our big data processing.

For Self-Service Plan users

We’ll use the gsutil command shown below to upload our data.

gsutil mb -c regional -l us-central1 gs://BUCKETNAME
gsutil cp FILENAME gs://BUCKETNAME

Example:

gsutil mb -c regional -l us-central1 gs://magellan-blocks-doc-data
gsutil cp weather_tokyo_utf8.csv gs://magellan-blocks-doc-data

See this page for more information about installing gsutil.

Once you’ve done this, the data should all be uploaded and accessible to BLOCKS. Now we’re ready to get started with our big data processing.

Sending data to the Big Data Board

The Big Data Board uses Google’s BigQuery service to process data. BigQuery not only handles virtually unlimited amounts of data, it also does so at world-class speeds and an extremely affordable price. With BLOCKS, using BigQuery for big data analysis is simple. Those interested in doing data analysis beyond the simple data creation in this tutorial should try using BigQuery with a Big Data Board.

Our data has been uploaded into GCS, but now we’ll need to process it and load it into BigQuery.

To do this, we connect on-screen BLOCKS, each with its own function, into Flows: the basic units of processing in MAGELLAN BLOCKS.

A Flow always starts with a Start of Flow BLOCK and ends with an End of Flow BLOCK. These two BLOCKS are found in the Basic section of the BLOCK list.

Within the Start of Flow BLOCK’s properties, you can set a schedule for timed execution or run the Flow immediately with the click of a button. You can also set a Flow ID, which can be used to execute the Flow from external programs or connect it to your company's systems.

Reading in the weather data

First, we’ll read in the weather data using the Load to single table from GCS BLOCK from the BigQuery section of the BLOCK list.

Connect this BLOCK to a Start of Flow BLOCK, then finish making the Flow with an End of Flow BLOCK.

Read in weather data

If the default names are hard to understand, you can change the name displayed on any BLOCK within its properties.

Set the properties of the Load to single table from GCS BLOCK as shown in the chart below.

Property Value
Source data file URL in GCS gs://magellan-blocks-doc-data/weather_tokyo_utf8.csv
Destination data set blocks_ml_juyo
Destination table tokyo_city_20160401_20170104_raw
Schema settings
date STRING NULLABLE
weekday STRING NULLABLE
max_temp STRING NULLABLE
min_temp STRING NULLABLE
hours_of_sunshine STRING NULLABLE
other STRING NULLABLE
humidity_avg STRING NULLABLE

To quickly configure your schema settings to those shown above, simply click the “Edit as JSON” link and copy/paste in the following code. (To copy the code, click the "copy to clipboard" button that appears in its upper-right corner)

[
 {
  "name": "date",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "weekday",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "max_temp",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "min_temp",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "hours_of_sunshine",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "other",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "humidity_avg",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]
In cases of non-empty tables Overwrite
File format CSV
CSV delimitation character Comma
Number of skipped rows 6

Note: This is based on the CSV file containing the weather data having a 6-row header (rows at the top that don’t contain data). Set this to 0 for files with no header rows.

By doing this, the data from the file in GCS will be loaded into a BigQuery table. We’ve set the schema settings required for the table here.

After configuring everything, click the save button in the upper-right. Flows cannot be executed until they have been saved.

After saving, we’ll execute the Flow by selecting the Start of Flow BLOCK and clicking the button within its properties.

Flow execution results are output to the logs section at the bottom of the screen. For more details, refer to the Logs section of the Basic Guide.

If a Flow execution fails, review each BLOCK’s properties to check for mistakes. Fix anything you might find, save, and try executing it again.

We’ll follow this same process for executing each of the following Flows, too.

Reading in the sunrise/sunset data

We’ll read in the sunrise and sunset data next. Create a Flow like the one before, this time with two Load to single table from GCS BLOCKS (one each for the 2016 and 2017 data) between the Start of Flow BLOCK and the End of Flow BLOCK.

read in sun data

Configure the properties of the first Load to single table from GCS BLOCK (2016 data) as shown below:

Property Value
Source data file URL in GCS gs://magellan-blocks-doc-data/sunrise_sunset_2016.csv
Destination data set blocks_ml_juyo
Destination table tokyo_city_sunrise_sunset_2016_raw
Schema settings
date STRING NULLABLE
sunrise_time STRING NULLABLE
sunrise_orientation STRING NULLABLE
meridian_passage STRING NULLABLE
altitude STRING NULLABLE
sunset_time STRING NULLABLE
sunset_orientation STRING NULLABLE

To quickly configure your schema settings to those shown above, simply click the “Edit as JSON” link and copy/paste in the following code. (To copy the code, click the "copy to clipboard" button that appears in its upper-right corner)

[
 {
  "name": "date",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "sunrise_time",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "sunrise_orientation",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "meridian_passage",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "altitude",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "sunset_time",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "sunset_orientation",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]
In cases of non-empty tables Overwrite
File format CSV
CSV delimitation character Comma
Number of skipped rows 1

Note: This is based on the CSV file containing the sunrise and sunset data having a 1-row header (rows at the top that don’t contain data). Set this to 0 for files with no header rows.

Then, configure the properties for the second Load to single table from GCS BLOCK identically, except for changing the 2016 in the “Source data file URL in GCS” and “Destination table” settings to 2017.

After configuring the properties, click save and execute the Flow.

Reading in the electricity data

Finally, we’ll read in the electricity usage data. Create the same kind of Flow as before, again with two Load to single table from GCS BLOCKS (one for each year).

Read in electricity data

Configure the first Load to single table from GCS BLOCK as below:

Property Value
Source data file URL in GCS gs://magellan-blocks-doc-data/juyo-2016.csv
Destination data set blocks_ml_juyo
Destination table juyou_2016_raw
Schema settings
date STRING NULLABLE
time STRING NULLABLE
used STRING NULLABLE

To quickly configure your schema settings to those shown above, simply click the “Edit as JSON” link and copy/paste in the following code. (To copy the code, click the "copy to clipboard" button that appears in its upper-right corner)

[
 {
  "name": "date",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "time",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "used",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]
In cases of non-empty tables Overwrite
File format CSV
CSV delimitation character Comma
Number of skipped rows 3

Note: This is based on the CSV file containing the electricity usage data having a 3-row header (rows at the top that don’t contain data). Set this to 0 for files with no header rows.

Use the same settings for the 2017 BLOCK, again changing 2016 to 2017 in the “Source data file URL in GCS” and “Destination table” settings.

After configuring the properties, click save and execute the Flow.

Creating the data

Now that all of the initial data has been read into BigQuery, we can make our training data.

However, since each set of data is formatted differently, we’ll first standardize them. To do so, we’ll follow these steps:

  1. Standardizing the date format: We’ll zero-pad all single-digit months and days (for example, 2016/4/1 becomes 2016/04/01).

  2. Unifying the data’s granularity: There is electricity usage data for each hour, while the other data sets have daily records. We’ll convert the electricity usage data into daily values.

  3. Joining the data together by date: We’ll join the data together and arrange all records by date.

The Flow used to create the training data is a bit long, but it won't be too complicated since it mostly involves connecting BLOCKS together one after another.

As before, start with a Start of Flow BLOCK, then place a Parallel branch BLOCK so we can process two things simultaneously.

Merge data

Standardizing the data's format

Standardizing the weather data

First, we’ll start by explaining the BLOCK that will standardize the way the date is written in the weather data. We’ve named this Execute query BLOCK as “Execute query (unify weather data key)” in the image above. Set this BLOCK’s properties as follows:

Property Value
SQL syntax Standard SQL
Query
SELECT  REGEXP_REPLACE(REGEXP_REPLACE(date,"(/)([0-9]{1})($)","\\10\\2\\3"),"(/)([0-9]{1})(/)","\\10\\2\\3") AS date,
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg
FROM
  blocks_ml_juyo.tokyo_city_20160401_20170104_raw
ORDER BY
  date
Result storage data set blocks_ml_juyo
Result storage table weather_2016_2017
In cases of non-empty tables Overwrite

Standardizing the electricity data

The electricity usage data is a bit more complicated. We’ll need to convert the hourly data into daily data.

Doing this all at once is a bit difficult, but it’s actually pretty simple if we do it in two steps. Using more than one BLOCK also makes things easier to maintain.

Standardizing the date format

First, we’ll standardize the electricity usage data’s date formatting. We’ve named the Execute query BLOCK used for this as “Execute query (unify elec. data key)” in the image above.

Set its properties as shown below:

Property Value
SQL syntax Standard SQL
Query
SELECT  REGEXP_REPLACE(REGEXP_REPLACE(date,"(/)([0-9]{1})($)","\\10\\2\\3"),"(/)([0-9]{1})(/)","\\10\\2\\3") AS date,
  time,
  used
FROM
  blocks_ml_juyo.juyou_2016_raw
ORDER BY
  date,
  time
Result storage data set blocks_ml_juyo
Result storage table juyou_2016_tmp
In cases of non-empty tables Overwrite

Converting hourly data to daily data

Next, we’ll convert the hourly data into daily data using another Execute query BLOCK. We’ve renamed this BLOCK as “Execute query (convert to daily)” in the image above.

Configure this BLOCK as below:

Property Value
SQL syntax Standard SQL
Query
SELECT
  date,
  SUM(cast(used as int64)) as used_sum
FROM
  blocks_ml_juyo.juyou_2016_tmp
GROUP BY
  date
ORDER BY
  date
Destination dataset blocks_ml_juyo
Destination table juyou_2016
In cases of non-empty tables Overwrite

Joining data together by date

Lastly, we’ll join all of our data together and arrange it by date. This is done with another Execute query BLOCK, which we’ve renamed as “Execute query (join data together)” in the image above.

Set this BLOCK's properties as below:

Property Value
SQL syntax Standard SQL
Query
SELECT
  juyo.date,
  weather.max_temp,
  weather.min_temp,
  weather.hours_of_sunshine,
  weather.humidity_avg,
  daytime.sunrise_time,
  daytime.sunset_time,
  juyo.used_sum
FROM
  blocks_ml_juyo.juyou_2016 AS juyo
LEFT JOIN
  blocks_ml_juyo.weather_2016_2017 AS weather
ON
  juyo.date = weather.date
LEFT JOIN
  blocks_ml_juyo.tokyo_city_sunrise_sunset_2016_raw AS daytime
ON
  weather.date = daytime.date
ORDER BY
  date
Result storage data set blocks_ml_juyo
Result storage table juyo_all_data_2016
In cases of non-empty tables Overwrite

After configuring the properties, click save and execute the Flow.

Removing irregular data

Next we need to remove any irregularities that might exist in the data. For example, the public weather data contains values recorded when the measuring equipment was broken or otherwise unable to take measurements. If we leave these sorts of values in our data, the ML Board might make mistakes as it looks for patterns and features in the data.

So, let's get rid of any strange values in our data.

Aggregate data

In the weather data we’re using this time, some of the average humidity data contains the value “null”. We’ll use another Execute query BLOCK to remove these.

Property Value
SQL syntax Standard SQL
Query
SELECT
  *
FROM
  blocks_ml_juyo.juyo_all_data_2016
WHERE
  date != "null"
  AND used_sum IS NOT NULL
  AND max_temp != "null"
  AND min_temp != "null"
  AND hours_of_sunshine != "null"
  AND humidity_avg != "null"
  AND sunrise_time != "null"
  AND sunset_time != "null"
ORDER BY
  date
Destination dataset blocks_ml_juyo
Destination table juyo_all_data_2016_no_error
In cases of non-empty tables Overwrite

Standardizing the data type

Next we’ll standardize the data types using another Execute query BLOCK. This will make it simple to compile our data. If an error occurs during conversion, there might still be irregular values in the data. In that case, please take another look through the data.

Property Value
SQL syntax Standard SQL
Query
SELECT
  date,
  used_sum,
  CAST(max_temp AS float64) AS max_temp,
  CAST(min_temp AS float64) AS min_temp,
  CAST(hours_of_sunshine AS float64) AS hours_of_sunshine,
  CAST(humidity_avg AS float64 ) AS humidity_avg,
  PARSE_TIMESTAMP("%%R", sunrise_time ) AS sunrise_time,
  PARSE_TIMESTAMP("%%R", sunset_time ) AS sunset_time
FROM
  blocks_ml_juyo.juyo_all_data_2016_no_error
Destination dataset blocks_ml_juyo
Destination table juyo_all_data_2016_cast
In cases of non-empty tables Overwrite

Calculating daytime hours

Next we’ll calculate the length of daytime (in minutes) between sunrise and sunset using another Execute query BLOCK.

Property Value
SQL syntax Standard SQL
Query
SELECT
  date,
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg,
  TIMESTAMP_DIFF(sunset_time,sunrise_time,minute) AS daytime_time,
  used_sum
FROM
  blocks_ml_juyo.juyo_all_data_2016_cast
ORDER BY
  date
Destination dataset blocks_ml_juyo
Destination table juyo_all_data_2016_daytime
In cases of non-empty tables Overwrite

After configuring the properties, click save and execute the Flow.

Creating the training and validation sets

We’ve now finished preparing our initial data. Now we’ll use it to create the training data for our ML Board.

There are actually two parts to this data: a training set and a validation set. The ML Board will analyze the training set for patterns and features to create the predictive model. Then, it uses the validation set to test its accuracy. By using these two types of data, the ML Board can “learn” correctly.

There’s one thing to be careful about when preparing the training data. Right now our data is arranged in chronological order. If data is given from just one period of time, the ML Board might find patterns specific to that period. For example, if it is given only data from the summer, it might not be good at finding features in data from the winter.

In order to avoid this affecting the predictive model, we need to randomize the order of the data. Also, in order to ensure that we do enough training, we’ll split the data in an 8:2 ratio for our training and validation sets.

To accomplish this, we’ll assign a random number to each record. Then, we’ll arrange the data according to these numbers. Finally, we’ll separate the top 80% from the rest.

Create training data Flow

We'll use Execute query BLOCKS to do this. The first will assign the random numbers and arrange the data.

Property Value
SQL syntax Standard SQL
Query
WITH
  tmp AS (
  SELECT
    date,
    used_sum,
    max_temp,
    min_temp,
    hours_of_sunshine,
    humidity_avg,
    daytime_time,
    RAND() AS rand
  FROM
    blocks_ml_juyo.juyo_all_data_2016_daytime)
SELECT
  date,
  used_sum,
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg,
  daytime_time,
  ROW_NUMBER() OVER (ORDER BY rand) AS row_num
FROM
  tmp
ORDER BY row_num
Destination dataset blocks_ml_juyo
Destination table juyo_all_data_2016_no
In cases of non-empty tables Overwrite

Now we'll use another Execute query BLOCK to create the training set as a BigQuery table.

Property Value
SQL syntax Standard SQL
Query
SELECT
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg,
  daytime_time,
  used_sum
FROM
  blocks_ml_juyo.juyo_all_data_2016_no
WHERE
  row_num <= 219

The value in row_num <= 219 will differ depending on the number of rows in your data. Since our example data has 274 rows, an 8:2 split puts the top 80% (274 × 0.8 ≈ 219 rows) into the training set.

Destination dataset blocks_ml_juyo
Destination table train_data
In cases of non-empty tables Overwrite

We’ll now export the data in the table as a CSV file saved to GCS. We’ll do this with an Export single table to GCS BLOCK.

Property Value
Destination GCS URL gs://magellan-blocks-doc-train/train_data.csv
Source dataset blocks_ml_juyo
Source table train_data

Next we'll create the validation set as a BigQuery table with another Execute query BLOCK.

Property Value
SQL syntax Standard SQL
Query
SELECT
  max_temp,
  min_temp,
  hours_of_sunshine,
  humidity_avg,
  daytime_time,
   used_sum
FROM
  blocks_ml_juyo.juyo_all_data_2016_no
WHERE
  row_num > 219 
Destination dataset blocks_ml_juyo
Destination table val_data
In cases of non-empty tables Overwrite

Now we'll export the data from the table to GCS as a CSV file using an Export single table to GCS BLOCK, just as we did for the training set.

Property Value
Destination GCS URL gs://magellan-blocks-doc-train/val_data.csv
Source dataset blocks_ml_juyo
Source table val_data

After configuring the properties, click save and execute the Flow.

Training

With our data prepared, we’re ready to start training. With the BLOCKS ML Board, training and prediction are actually incredibly simple. Thinking of relevant input variables and preparing the data may take a while, but good data is everything when it comes to accurate machine learning.

Creating the ML Board

We’ll create a new ML Board by clicking the “Create Board” button at the top of the Blocks list and doing the following:

Create a new Board

Select ML Board from the list of Board types.

Select ML Board

Select the regression type.

Select type

Enter a name for the new Board.

Set the Board name

↓ The following steps only appear when using the Self-Service Plan. ↓

Since BLOCKS uses Google Cloud Platform internally (Cloud Machine Learning Engine and other platforms), we’ll need to upload a GCP service account file and make sure all required APIs are enabled.

GCP Service Account Settings

The training results will be uploaded to GCS, so select the bucket and directory where you’d like them to be saved.

Set GCS buckets used with ML Boards to the Regional default storage class and the us-central1 location. We cannot guarantee that buckets with different settings will operate properly.

Storage Settings

↑ The above steps only appear for Self-Service Plan users. ↑

Now we need to enter the training data’s item names, types, and dimensions into the ML Board.

Be sure to only enter information for the input variables at this time. Do not enter information for the results value (in this case, the electricity usage data) that the ML Board will try to predict.

Training data settings screen (regression type)

Set each field as follows:

  • Output dimensions: 1
  • Items:
    Item name Type Dimensions Explanation
    max_temp Numerical value 1 Daily high temperature
    min_temp Numerical value 1 Daily low temperature
    hours_of_sunshine Numerical value 1 Number of hours the sun was shining each day
    humidity_avg Numerical value 1 Daily average humidity
    daytime_time Numerical value 1 Time between sunrise and sunset each day (in minutes)

Review and confirm all of the settings and click “Finish”.

Confirm the settings

Starting the training

Now that the ML Board is created, we can start training our model.

ML Board details

Click the Start training button.

Start training
  • Enter a name for the training.
  • Enter the GCS URLs for the training set and validation sets.
    • Example training set URL: magellan-blocks-doc-train/train_data.csv
    • Example validation set URL: magellan-blocks-doc-train/val_data.csv

    These URLs should only contain ASCII letters, numbers, underscores (_), hyphens (-), or slashes (/).

    Replace the magellan-blocks-doc-train bucket name portion with your own GCS environment's bucket name.

  • If you want the training to stop itself after a certain amount of time, configure the “Max. time until timeout (minutes)” property.
  • Set the “Max. number of trials” property. By setting more than one trial, the ML Board can adjust its parameters automatically until it finds the most accurate tuning.

Click the “Create” button to start the new training. You can check on its status in the training list.

ML Board details

Applying the training results

Once the training has completed, its status will change to succeeded and an Apply button will appear. Clicking this button applies the training results, making the trained model usable for predictions.

Please wait a short while after clicking the apply button before running any predictions, as running them immediately may cause errors in the current version. If an error does occur, wait another few minutes and try running your prediction again.

Making predictions

Now we’ll create a Flow on a Big Data Board to make predictions using our trained model.

Making a predictive Flow

We can make predictions on a Big Data Board.

To do this, we’ll make the following Flow:

Predictive Flow

First we’ll place a Construct Object BLOCK from the Basic section of the BLOCK list. We’ll enter input variable data for days we want to predict electricity usage for into this BLOCK.

Property Value
Data
Example Data property for a Construct Object BLOCK (regression)

The data.0 / data.1 / data.2 / data.3 objects each contain data for one of the four days we want to predict electricity usage for. We use an Array when sending multiple sets of data to the Construct Object BLOCK. This is done by clicking the + button next to “Array” to add data.0, data.1, data.2, and data.3

We then add data for each day as objects within the array. Each object contains an identifier for the data (key), daily high temperature data (max_temp), low temperature data (min_temp), hours of sunshine data (hours_of_sunshine), average humidity data (humidity_avg), and hours between sunrise and sunset data (daytime_time). We add each component of data by clicking the + button next to “Object”.
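For reference, the array built in this BLOCK corresponds to data shaped like the following Python-style sketch. The weather values shown are hypothetical placeholders; only the structure and the field names matter:

data = [
    # one object per day we want a prediction for
    {"key": "20170101", "max_temp": 10.5, "min_temp": 2.3,
     "hours_of_sunshine": 8.7, "humidity_avg": 45, "daytime_time": 585},
    {"key": "20170102", "max_temp": 11.2, "min_temp": 1.8,
     "hours_of_sunshine": 9.0, "humidity_avg": 40, "daytime_time": 586},
    # ... objects for 20170103 and 20170104 follow the same pattern
]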

Connect an ML Board prediction (online) BLOCK from the Machine Learning category into the Flow and set its properties as follows:

Property Value
ML Board (The ML Board’s name)
Input Variable _.data

There are lots of different ways to use prediction results, but for today we’ll load them into BigQuery. We’ll use a Load to table from variable BLOCK from the BigQuery section of the BLOCK list to do this.

Property Value
Source data variable _.predictions
Destination dataset blocks_ml_juyo
Destination table predict_result
Schema settings
output FLOAT NULLABLE
key STRING NULLABLE

To quickly configure your schema settings to those shown above, simply click the “Edit as JSON” link and copy/paste in the following code. (To copy the code, click the "copy to clipboard" button that appears in its upper-right corner)

[
 {
  "name": "output",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "key",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]
In cases of non-empty tables Overwrite
File format NEWLINE_DELIMITED_JSON

If we want to get the results as a file, we can use an Export single table to GCS BLOCK configured as below:

Property Value
Destination GCS URL gs://magellan-blocks-doc-data/predict_result.json
Source data set blocks_ml_juyo
Source table predict_result
File format NEWLINE_DELIMITED_JSON

Implementing the prediction

With all of our preparations done, let’s run the prediction.

Click the button in the Start of Flow BLOCK’s properties.

The results will be saved as a JSON file in GCS as shown below:

File name: gs://magellan-blocks-doc-data/predict_result.json

{"output":84442.0390625,"key":"20170101"}
{"output":81434.53125,"key":"20170102"}
{"output":78059.7421875,"key":"20170103"}
{"output":75184.3125,"key":"20170104"}

"output" contains the predicted electricity usage value.

Evaluating the model

Now that we have our predictions, we can evaluate our trained model. If you’ve set benchmarks beforehand, you can compare your results against them to evaluate the model.

We haven't set benchmarks in this example, so we'll think about ways we could improve our model in the following sections.
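As a rough sketch of what such an evaluation could look like, the Python below reads the predict_result.json file we exported (downloaded locally from GCS) and computes the RMSE against actual daily usage figures. The actual_usage values are placeholders only; in practice you would fill them in from the 2017 electricity usage data:

import json
import math

# Placeholder actual daily usage for the predicted dates
# (fill these in with real values taken from juyo-2017.csv)
actual_usage = {"20170101": 0, "20170102": 0, "20170103": 0, "20170104": 0}

squared_errors = []
with open("predict_result.json") as f:  # one JSON object per line
    for line in f:
        record = json.loads(line)
        squared_errors.append((record["output"] - actual_usage[record["key"]]) ** 2)

print("RMSE:", math.sqrt(sum(squared_errors) / len(squared_errors)))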

Increasing the amount of data

We only used data for one year, 2016, to train the model in this sample. So, we might be able to achieve more accurate results if we increased the number of years of data.

Unfortunately, only one year’s worth is currently available in the TEPCO public data. Does your company have data for earlier years? At the very least, we think about three years’ worth of data would help improve the model’s accuracy.

Increasing trial number and length

Increasing the number and length of the training's trials can also affect accuracy.

By configuring the trial number setting, the ML Board will adjust parameters as it runs repeated trials. Running more trials gives the ML Board more opportunities to find parameters that give accurate results.

We can also configure a limit on how many minutes each trial will run. Increasing this limit gives the ML Board more time for its calculations, and could lead to higher accuracy.

Running more trials and allowing more time per trial can be helpful, but it also makes the training take that much longer. You'll have to decide on the balance that works best for your situation.

Reviewing the input variables

What else might affect electricity usage? Maybe using data about the type of day could have an influence. For example, perhaps more electricity is used on weekdays than weekends, since that’s when most business is done. Along the same lines, maybe usage level is influenced by public holidays.

So, let’s try to incorporate these factors to improve our input variables.

The standard error (a measure of accuracy) of the model we just trained was about 80,000,000 kW.

This may sound large, but considering the electricity usage rates range between 600,000,000 kW and 900,000,000 kW, the model actually seems quite usable already. However, we’ll add our new inputs to see if they can increase our model’s accuracy even further.

Our new input variables will be the day of the week, and holidays.

First, we add two columns to our CSV files for these variables.

Training data

In the example data above, column #6 contains the day of the week, with Monday input as 0, Tuesday as 1, and so on. These are set as numerical values here, but will be treated as days of the week again later. Column #7 contains the holiday information, with 1 indicating that the day was a holiday and 0 a non-holiday.
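As one way to prepare these new columns, the Python sketch below derives the weekday number used above (Monday = 0) from a date string and flags holidays against a list you would supply yourself. The date format and the sample holiday are assumptions, so adjust them to match your data:

from datetime import datetime

holidays = {"2016/01/01"}  # replace with your own list of public holidays (hypothetical example)

def day_features(date_str):
    # Returns (weekday, holiday_flag) for a date string like "2016/04/01"
    weekday = datetime.strptime(date_str, "%Y/%m/%d").weekday()  # Monday = 0 ... Sunday = 6
    holiday_flag = 1 if date_str in holidays else 0
    return weekday, holiday_flag

print(day_features("2016/04/01"))  # (4, 0): April 1, 2016 was a Friday and not a holiday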

Next we change the ML Board’s schema.

Select “day” as the type for the day of the week data. For the holidays, choose string, then enter “0,1” into the keyword list.

With the new schema defined, we’ll run a training just like we did before.

So how does our new model’s accuracy look?

By adding the new input variables, the standard error has been cut to approximately 45,000,000 kW, about half that of the first model, making it roughly twice as accurate as before.

Summary

Finding input variables that will result in an accurate model often comes down to asking yourself very basic questions. Things like, “why do people use electricity?” or “why do people buy this product?”

As such, you'll need to be well versed in the workings of your business to find the best input variables.

This ties into one of the reasons Groovenauts is trying to make machine learning accessible to everyone: so that business leaders—and not just programmers—can take a leading role in using machine learning.

We can’t wait to see the new and original models you make with the BLOCKS ML Board.