ML Board How-To Guide: Classification Type

ML Board How-To Guide

Categorize data with the classification model

Introduction

The ML Board’s classification model sorts data into various groups based on features in the data.

Some examples of common uses are:

  • Sorting which people will probably sign up (1) / not sign up (0) for membership campaigns.
  • Sorting which people are likely to click (1) / not click (0) on a web ad.
  • Sorting customers into a number of geographical groups (0 to n) based on buying habits.
  • Predicting normal (0) / fraudulent (1) use of credit cards or ATMs.

Things to know before starting

Input variables

The most demanding aspect of classification is selecting input variables that will lead to accurate results. For example, data on the date and place of use may be an important factor for finding fraudulent use of credit cards. What was bought with the card might not be an effective input, or it could be very important in some situations.

Choosing good input variables for your situation is central to machine learning, and doing so will dramatically increase the accuracy of your model.

Benchmarks

Before running the actual classification, we highly recommend preparing benchmarks to measure your results against. These benchmarks could be the results of a trained expert sorting the data by eye.

By setting benchmarks, you can get a better idea of your classification model’s accuracy. Even if the model does not quite reach the benchmarks, you might find it practical to use from a cost and convenience perspective.

About this sample's situation

With that, let’s get started making a sample classification model.

We’ll be using Ronald Fisher’s iris dataset for this classification. Fisher was an English scientist who specialized in statistics, evolutionary biology, and genetics.

In 1936, Fisher published a paper containing classification data on three species of iris (Iris setosa, Iris virginica, and Iris versicolour). The dataset includes sepal and petal lengths and widths for 50 samples from each species.

Over time, Fisher’s irises have become a famous dataset for use in statistics and machine learning. We’ll use this dataset to train ML Board and see if we can create an effective classification model.

Making a predictive model with the ML Board

We’ll briefly explain each of the basic steps before we start actually using the ML Board.

  1. Choosing a model: Since we’ll be separating our data into groups, we’ll choose the classification model.

  2. Considering inputs: Deciding what to use as input variables is the most important part of creating a predictive model. We’ll be using Fisher’s iris data today.

  3. Creating the training data: We’ll put the input data into a CSV file for the ML Board to use as training data.

  4. Training: Next, we’ll use the data to train the ML Board. It will perform its own tuning and optimization automatically, allowing you to test various patterns and use the most accurate model.

  5. Evaluation: After training is complete, the model can be used to make a prediction on a BLOCKS Big Data Board. Here we can test our model’s prediction ability against actual results that were not used during training. In actual use, we might decide to go back and reconsider our inputs if the model’s accuracy is not high enough. By changing our input variables or how we provide these to the ML Board, we can try to create a more effective model.

  6. Utilization: Once we’ve created a practical model, there are all kinds of things we can do with it. With a BLOCKS IoT Board, we can feed in new data continuously, further training the model. We can then use it to make more accurate predictions on a BLOCKS Big Data Board.

    However you want to use machine learning, by using BLOCKS you can create a highly efficient and low-cost system that largely removes the need for programming and making complicated infrastructures.

Downloading machine learning data

The iris dataset can be downloaded from the University of California, Irvine website. We’ll be using the file named “bezdekIris.data”.

About the data

The data is organized into five columns, with the last column containing the class we will predict for:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class (the names of the correct classifications)

The three species (classes) of iris are:

  • Iris setosa
  • Iris versicolour
  • Iris virginica

Creating the machine learning data

Next we’ll use the data we've downloaded to prepare the training data for our ML Board. This training data consists of two parts: a training set and a validation set.

First, the ML Board uses the training set to build the predictive model. After that, the validation set is used to test the new model’s accuracy.

By using both, the ML Board can automatically test its accuracy while it trains.

Both types of data should be prepared as CSV files. There is just one rule to follow when formatting the data: as you arrange the input variables on each line, make sure the results variable is the last value (must be a numerical value).

With that in mind, we’ll arrange our data as follows:

Sepal length, sepal width, petal length, petal width, class

Since the class needs to be a numerical value, we’ll change the various iris names into the following numbers (0–2):

  • 0: Iris setosa (Iris-setosa)
  • 1: Iris versicolour (Iris-versicolor)
  • 2: Iris virginica (Iris-virginica)
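
The rest of this guide prepares these files with a Big Data Board, but as a reference for what the finished training data should look like, here is a minimal Python sketch of the same conversion done by hand. It is only an illustration: the file names are examples, and the shuffled 120/30 split mirrors the Flow we will build later in this guide.

import csv
import random

# Map the iris names in bezdekIris.data to the class numbers listed above.
CLASS_NUMBERS = {
    "Iris-setosa": 0,
    "Iris-versicolor": 1,
    "Iris-virginica": 2,
}

rows = []
with open("bezdekIris.data") as f:          # the file downloaded from UCI
    for record in csv.reader(f):
        if len(record) < 5:                 # skip blank lines at the end of the file
            continue
        *measurements, name = record        # four measurements, then the class name
        rows.append(measurements + [CLASS_NUMBERS[name]])

random.shuffle(rows)                        # randomize the order of the data

# Input variables first, numeric class last, no header line.
with open("iris_training.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows[:120])     # training set (120 rows)
with open("iris_test.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows[120:])     # validation set (30 rows)
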
Creating machine learning data with a Big Data Board

One way to create the machine learning data is using a spreadsheet program like Excel or Google Sheets. Since we just need it to be a CSV file in the end, it’s fine if you’d like to use this method.

However, if you want to set an automated schedule or are working with large amounts of data, we recommend using a BLOCKS Big Data Board instead. By doing so, you can perform data analysis, create machine learning data, and make predictions using the model from the ML Board, all within BLOCKS.

We’ll follow these steps to create our data in BLOCKS:

  1. Upload the iris data to Google Cloud Storage (GCS)

    warning Note for Self-Service Plan users:
    GCS buckets used here should have their default storage class set to Regional and their location set to us-central1. We cannot guarantee proper operation for buckets with different settings.

  2. Import the uploaded file into the Big Data Board
  3. Randomize the order of the data
  4. Create the training set
  5. Create the validation set

Refer to this page for more information about using GCP with Big Data Boards.

info Note for Full Service Plan users:
A Google account is required to use the GCS location prepared by BLOCKS.

If you already have a Google account, register it into the GCP access section of the Project settings menu.

If you do not have a Google account, refer to Creating a Google account and register your new account into the GCP access section of the Project settings menu.

Uploading the iris data to GCS
For Full Service Plan users

We’ll use the Google Cloud Console to upload our data.

  1. Open the GCP service accounts section of the Project settings menu.

    Project settings (GCP service accounts)
  2. Click the data upload link (a URL that starts with gs://) for a registered account. This opens the Google Cloud Console in a new tab.

    info The **** portion of the “gs://****” URL is the GCS bucket name. Read any instances of “bucket name” in the rest of this document as the bucket name displayed in this URL.

  3. Create a GCS folder named “init”. We will upload our data into this folder. To do this, click the “Create Folder” button (1) at the top of the Google Cloud Console and name the new folder “init”.

    Cloud Console GCS screen (step 1)
  4. Click on the newly created “init” folder (2) to open it.

    Cloud Console GCS screen (step 2)
  5. Click the “Upload File” button (3) at the top of the Google Cloud Console and upload the iris data.

    Cloud Console GCS screen (step 3)
For Self-Service Plan users

We’ll use the gsutil commands shown below to upload our data. We’re using the bucket name blocks-ml-class-demo, but you can name yours whatever you’d like.

gsutil mb -c regional -l us-central1 gs://bucketname
gsutil cp filename gs://bucketname/init/ (example: gsutil cp bezdekIris.data gs://blocks-ml-class-demo/init/)

See this page for more information about installing gsutil.

Creating data with a Flow

The Big Data Board uses Google’s BigQuery service to process data. BigQuery not only handles practically unlimited amounts of data, it also processes queries extremely quickly at an affordable price. With BLOCKS, using BigQuery for big data analysis is simple. Those interested in doing data analysis beyond the simple data creation in this tutorial should try using BigQuery with a Big Data Board.

Our data has been uploaded into GCS, but now we’ll need to process it and load it into BigQuery.

To do this, we connect on-screen BLOCKS, each with its own function, into “Flows,” the basic unit of processing in MAGELLAN BLOCKS.

A Flow always starts with a Start of Flow BLOCK and ends with an End of Flow BLOCK. These two BLOCKS are found in the Basic section of the BLOCK list.

Within the Start of Flow BLOCK’s properties, you can set a schedule for timed execution or press a button to run the Flow immediately. You can also set a Flow ID, which can be used to execute the Flow from external programs or connect it to your company's systems.

Sending the iris data to the Big Data Board

With that, let’s read in the iris data. We’ll use the Load to single table from GCS BLOCK from the BigQuery section of the BLOCK list.

First, place a Start of Flow BLOCK, then connect a Load to single table from GCS BLOCK.

If the default names are hard to understand, you can change the name displayed on any BLOCK within its properties.

Set the properties of the Load to single table from GCS BLOCK as shown in the chart below.

  • Source data file URL in GCS: gs://blocks-ml-class-demo/init/bezdekIris.data
    (blocks-ml-class-demo is the bucket name. Set this to whatever name you are using.)
  • Destination dataset: (Set to the name of the dataset that will contain your table)
  • Destination table: iris_init
  • Schema settings:
    sepal_length FLOAT NULLABLE
    sepal_width FLOAT NULLABLE
    petal_length FLOAT NULLABLE
    petal_width FLOAT NULLABLE
    class STRING NULLABLE

To quickly configure your schema settings to those shown above, simply click the “Edit as JSON” link and paste in the following code:

[
 {
  "name": "sepal_length",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "sepal_width",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "petal_length",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "petal_width",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "class",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]
  • In cases of non-empty tables: Overwrite
  • Permit rows with insufficient fields: Check the box

Randomizing the data

It’s important to avoid introducing discrepancies when making data for machine learning. For example, having less data about one type of iris might lead to ineffective training. Both the ratio of the data and the order in which it is given can affect the model, so we’ll randomize the order of our data in this example.

To do this, we’ll assign random values to the data using the Execute query BLOCK from the BigQuery section of the BLOCK list. Set its properties as shown below.

  • SQL syntax: Legacy SQL
  • Query:

    // blocks_ml_demo is our dataset's name. Replace this with your own.
    SELECT
      rand(5) as rand_id,
      sepal_length,
      sepal_width,
      petal_length,
      petal_width,
      CASE
        WHEN class = "Iris-setosa" THEN 0
        WHEN class = "Iris-versicolor" THEN 1
        WHEN class = "Iris-virginica" THEN 2
        ELSE NULL
      END as class
    FROM blocks_ml_demo.iris_init
    WHERE class IS NOT NULL
    ORDER BY rand_id

  • Result storage dataset: (Choose your dataset that will store the results)
  • Result storage table: iris_temp
  • In cases of non-empty tables: Overwrite

Once random values have been assigned, we can create the training and validation data. We’ll use parallel processing to do this since the two are not dependent on each other. For this, we’ll place a Parallel branch BLOCK from the Basic category of the BLOCK list.

Creating the training set

We’ll use an Execute query BLOCK configured as below to create our training set.

  • SQL syntax: Legacy SQL
  • Query:

    // blocks_ml_demo is our dataset's name. Replace this with your own.
    SELECT
      sepal_length,
      sepal_width,
      petal_length,
      petal_width,
      class
    FROM blocks_ml_demo.iris_temp
    ORDER BY rand_id ASC
    LIMIT 120

  • Result storage dataset: (Choose your dataset that will store the results)
  • Result storage table: iris_training
  • In cases of non-empty tables: Overwrite

Writing the training set to GCS

In order for the training set to be usable by the ML Board, it needs to be converted into a CSV file and written to GCS. We’ll use the Export single table to GCS BLOCK for this (found in the BigQuery section of the BLOCK list).

  • GCS URL for the file to which output will be delivered: gs://blocks-ml-class-demo/data/iris_training
    (blocks-ml-class-demo is our bucket name, so replace it with your own.)
  • Source dataset: (The dataset containing the data to be exported to GCS)
  • Source table: iris_training
  • Output header line: Uncheck the box

Creating the validation set

Next, we’ll create our validation set using another Execute query BLOCK.

  • SQL syntax: Legacy SQL
  • Query:

    // blocks_ml_demo is our dataset's name. Replace this with your own.
    SELECT
      sepal_length,
      sepal_width,
      petal_length,
      petal_width,
      class
    FROM blocks_ml_demo.iris_temp
    ORDER BY rand_id DESC
    LIMIT 30

  • Result storage dataset: (Choose your dataset that will store the results)
  • Result storage table: iris_test
  • In cases of non-empty tables: Overwrite

Writing the validation data to GCS

Just like the training set, we’ll write the validation set into GCS as a CSV file using another Export single table to GCS BLOCK (again with the header line output left unchecked).

  • GCS URL for the file to which output will be delivered: gs://blocks-ml-class-demo/data/iris_test
    (blocks-ml-class-demo is our bucket name. Replace this with yours.)
  • Source dataset to export: (Choose the dataset containing the data to be exported to GCS)
  • Source table to export: iris_test

Once you’ve finished making the Flow, click the save button in the upper-right corner of the screen. Flows always need to be saved before they can be executed.

Executing the Flow

Now that the Flow is complete, click the button from the Start of Flow BLOCK’s properties to execute it.

Once the Flow has finished executing, the data for our ML Board will be ready. The files will be saved to GCS as shown below:

Training set

  • gs://blocks-ml-class-demo/data/iris_training
    (replace blocks-ml-class-demo with your bucket name)

Validation set

  • gs://blocks-ml-class-demo/data/iris_test
    (replace blocks-ml-class-demo with your bucket name)

Training

Now that our data is ready, we can get started using an ML Board to train a classification model.

Creating the ML Board

To create our ML Board, we’ll do the following:

Click the “Create New Board” button at the top of the Board list.

Creating a new Board

Select ML Board from the list of Board types.

Selecting the ML Board

Choose “Classification type”.

Selecting the type

Enter a name for the Board.

Setting the Board name

↓ The following steps only appear when using the Self-Service Plan. ↓

Since BLOCKS uses Google Cloud Platform internally (Cloud Machine Learning Engine and other platforms), we’ll need to upload a GCP service account file and make sure all required APIs are enabled.

GCP service account settings

Next, we need to give the Cloud Machine Learning Engine platform access to GCS. Click the Activate Google Cloud Shell button in the upper right of the Google Cloud Platform dashboard.

This brings up the Google Cloud Shell screen. Enter the command shown below:

gcloud ml-engine init-project

Answer Y to the prompt about giving access permission.

Cloud ML Engine settings

The training results will be uploaded to GCS, so select the bucket and directory where you’d like them to be saved.

warning Set GCS buckets used with ML Boards to the Regional default storage class and the us-central1 location. We cannot guarantee that buckets with different settings will operate properly.

Storage settings

↑ The above steps only appear when using the Self-Service Plan. ↑

Now we need to enter the training data’s item names, types, and dimensions into the ML Board.

Training data settings

We’ll configure each setting as follows:

Number of classes: 3 (0: Iris setosa, 1: Iris versicolour, 2: Iris virginica)

  • sepal_length: Numerical value, 1 dimension (the sepal length)
  • sepal_width: Numerical value, 1 dimension (the sepal width)
  • petal_length: Numerical value, 1 dimension (the petal length)
  • petal_width: Numerical value, 1 dimension (the petal width)

Check to make sure everything is correct on the confirmation screen, then click “Finish”.

Confirming the settings

Starting the training

With our ML Board ready, we can now start training our model.

ML Board details

Click the “Start Training” button.

Starting the training
  • Enter a name for the training.
  • Enter the GCS URLs for the training and validation sets.
    • Example training set URL: blocks-ml-class-demo/data/iris_training
    • Example validation set URL: blocks-ml-class-demo/data/iris_test
  • If you want the training to stop itself after a certain amount of time, configure the “Max. time until timeout (minutes)” property.
  • Set the “Max. number of trials” property. By setting more than one trial, the ML Board can adjust its parameters automatically until it finds the most accurate tuning.

Click the “Create” button to start the new training. You can check on its status in the training list.

ML Board details

Applying the training results

An “Apply” button will appear once the training completes successfully. Click it to apply the training results and make the model usable for predictions.

Please wait a short while after clicking the apply button before running any predictions, as running them immediately may cause errors in the current version. If an error does occur, wait another few minutes and try running your prediction again.

Making predictions

Making a predictive Flow

We’ll create a Flow on a Big Data Board to use for making predictions with our trained model.

Prediction Flow

First, we'll place a Construct Object BLOCK from the Basic section of the BLOCK list into the Flow.

Then, we’ll enter the data for the irises we want the model to classify into this BLOCK, as shown below:

  • Data: (configured as shown in the screenshot below)

Data property of the Construct Object BLOCK (classification)

The data.0 / data.1 / data.2 objects each contain data for an iris from one of the three types the model will predict for. We use an Array when sending multiple sets of data to the Construct Object BLOCK. This is done by clicking the + button next to “Array” to add data.0, data.1, and data.2.

Data for each iris is entered as an object within the array. Each object contains an identifier for the data (key), the sepal length data (sepal_length), the sepal width data (sepal_width), the petal length data (petal_length), and the petal width data (petal_width). We add each component of data by clicking the + button next to “Object”.
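
For reference, the structure we end up building in the Construct Object BLOCK corresponds to something like the following, shown here in Python notation purely as an illustration. The measurement values below are hypothetical; enter whichever irises you want to classify.

# Hypothetical example of the Data property built in the Construct Object BLOCK:
# an array of objects, each with a key and the four measurements the model was
# trained on. These values are made-up sample irises.
data = [
    {"key": "1", "sepal_length": 5.9, "sepal_width": 3.0,
     "petal_length": 4.2, "petal_width": 1.5},
    {"key": "2", "sepal_length": 6.9, "sepal_width": 3.1,
     "petal_length": 5.4, "petal_width": 2.1},
    {"key": "3", "sepal_length": 5.1, "sepal_width": 3.3,
     "petal_length": 1.7, "petal_width": 0.5},
]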

Next, place an ML Board Predict BLOCK from the Machine Learning section of the BLOCK list into the Flow and set its properties as shown below.

  • GCP service account: (Your service account)
  • ML Board: (Your ML Board’s name)
  • Input Variable: _.data
  • Output Variable: _out

There are many different ways to use prediction results, but for today we’ll load them into BigQuery using a Load to table from variable BLOCK from the BigQuery section of the BLOCK list.

  • GCP service account: (Select your GCP service account)
  • Source data variable: _out.predictions
  • Destination dataset: (Choose the dataset you want the data to be sent to)
  • Destination table: iris_predict
  • Schema settings:
    key STRING NULLABLE
    label INTEGER NULLABLE
    score FLOAT REPEATED

To quickly configure your schema settings to those shown above, simply click the “Edit as JSON” link and paste in the following code:

[
 {
  "name": "key",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "label",
  "type": "INTEGER",
  "mode": "NULLABLE"
 },
 {
  "name": "score",
  "type": "FLOAT",
  "mode": "REPEATED"
 }
]
  • In cases of non-empty tables: Overwrite
  • File format: NEWLINE_DELIMITED_JSON

We’ll also use an Output to log BLOCK (from the Basic category) to confirm our results right on the Big Data Board’s screen.

  • Data export variable: _out.predictions

Implementing the prediction

With all of our preparations done, let’s run the prediction.

Click the button in the Start of Flow BLOCK’s properties.

The results will be output to the logs at the bottom of the screen, so open those up to confirm.

[
  {
    "score": [0.10034828633069992, 0.5175866484642029, 0.3820650279521942],
    "key": "1",
    "label": 1
  },
  {
    "score": [0.040482282638549805, 0.45981335639953613, 0.49970442056655884],
    "key": "2",
    "label": 2
  },
  {
    "score": [0.9810823202133179, 0.01643194630742073, 0.0024857944808900356],
    "key": "3",
    "label": 0
  }
]

The results are output with predicted label and score values for each key configured in the Construct Object BLOCK.

The label value refers to the class the Board predicts the key belongs to. For our set, these are defined as below:

  • 0: Iris setosa
  • 1: Iris versicolour
  • 2: Iris virginica

The score values show the model’s confidence in each class for the given key, with 1.0 meaning 100% confidence.
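
In the sample output above, the label always corresponds to the position of the highest score. As a small illustration (plain Python, not something you need to run in BLOCKS), here is the first result from the log read back that way:

# First prediction from the sample log output above.
prediction = {
    "score": [0.10034828633069992, 0.5175866484642029, 0.3820650279521942],
    "key": "1",
    "label": 1,
}

class_names = {0: "Iris setosa", 1: "Iris versicolour", 2: "Iris virginica"}

# The predicted label is the index of the largest score.
best = prediction["score"].index(max(prediction["score"]))
print(prediction["key"], class_names[best], prediction["score"][best])
# -> 1 Iris versicolour 0.5175866484642029  (matches "label": 1 in the log)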