Model Generator How-To: Classification

Categorize data with the classification model

Introduction

With the Model Generator’s classification model, you can use Machine Learning to predict the category that data belongs to.

Some example uses include:

  • Predict which people will not sign up (0) / sign up (1) for a membership campaign.
  • Predict which people will not click (0) / click (1) a web ad.
  • Sort customers into various groups (0 to n) based on their interests.
  • Predict whether credit card usage at an ATM is normal (0) / fraudulent (1).

Things to know before starting

Input variables

The most important aspect of classification, and Machine Learning generally, is preparing good input variables, or predictors. These are the various pieces of data that you use to train your model.

Selecting good input variables has a large effect on the accuracy of your model. For example, the date and place where credit cards were used might be important factors for finding fraudulent credit card usage. Likewise, what was bought with the card might not end up being an effective input variable, or perhaps it could in some situations.

Benchmarks

We recommend setting benchmarks to measure your results against before creating and using a classification model. For example, you could use the accuracy of a trained expert sorting your data by eye as your benchmark.

By setting benchmarks, you can better judge the effectiveness of your model. Even if your model doesn’t quite reach the benchmarks, you might find it practical to use from a cost or convenience perspective.
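
A benchmark of this kind is simply the fraction of samples the expert classifies correctly. As a minimal sketch (the labels below are made up purely for illustration):

```python
# Hypothetical expert classifications vs. the actual classes for ten samples.
expert = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa",
          "Iris-versicolor", "Iris-virginica", "Iris-setosa", "Iris-versicolor",
          "Iris-virginica", "Iris-setosa"]
actual = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-setosa",
          "Iris-versicolor", "Iris-virginica", "Iris-setosa", "Iris-virginica",
          "Iris-virginica", "Iris-setosa"]

# Benchmark accuracy: fraction of samples the expert got right.
accuracy = sum(e == a for e, a in zip(expert, actual)) / len(actual)
print(accuracy)  # → 0.8
```

A model that matches or approaches this figure, at lower cost or effort, may be worth using even if it doesn't exceed it.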

About this guide’s scenario

This guide shows how to create a model to classify iris species using Ronald Fisher’s dataset. Fisher was an English scientist who specialized in statistics, evolutionary biology, and genetics.

In 1936, Fisher published a paper containing classification data on three types of iris: Iris setosa, Iris virginica, and Iris versicolor. The dataset includes information like petal length and petal width for 50 samples of each species.

Overview of making predictive systems with the Model Generator

Before we start using the Model Generator, we’ll briefly introduce the basic process of making a predictive Machine Learning system with BLOCKS:

  1. Select a model: Since we will be categorizing numerical data, we’ll select the Classification model. We will continue to add more model types to the Model Generator in the future.

  2. Consider our input variables: Deciding what to use as input variables is the most important part of creating a predictive model. For this guide, we’ll simply be using Fisher’s iris data.

  3. Create the training data: The Model Generator uses CSV files for its training data. We’ll prepare our iris dataset to be used with the Model Generator.

  4. Train the model: Once the training data is ready, we’ll use a Model Generator to train a model. The Model Generator performs various tunings and optimizations automatically, allowing you to test various patterns and use the most accurate results.

  5. Evaluate the model: Once trained, we can use our model to make predictions. We can do this from a Flow Designer, and by using data not used during the training, we can evaluate the accuracy of our model.

    In an actual use situation, you might feel that the results are not accurate enough. In this case, you could reconsider your input variables and how you provide these to the Model Generator and try to train a more accurate model.

  6. Use the model: Once you have a useable model, there are all kinds of things you can do with it in BLOCKS. You can make predictions using the Flow Designer, collect input variable data continuously with a Data Bucket and further train a more accurate model, and more.

    With BLOCKS, you can create powerful and efficient Machine Learning systems and drastically reduce the need for programming your own infrastructure.

Creating data for Machine Learning

You can download the iris dataset for this guide from the University of California, Irvine website. We will use the file named bezdekIris.data.

About the data

The dataset consists of the following five columns. The rightmost column contains the class, or species, that our model will predict.

  1. Sepal length in cm
  2. Sepal width in cm
  3. Petal length in cm
  4. Petal width in cm
  5. Class (the results variable)

The three species of iris we will classify are:

  • Iris setosa
  • Iris versicolor
  • Iris virginica

Creating the training data

We’ll use the dataset we downloaded to prepare our training data for the Model Generator. Trainings require two sets of data: a training set and a validation set.

Model Generator training data: separate training and validation sets

The Model Generator uses the training set to train a model, then tests its accuracy with the validation set. With these two sets of data, the Model Generator can automatically test its model and optimize itself while it trains.

We refer to this process of training a model in the Model Generator simply as a training, and we refer to the data we send to the Model Generator collectively as the training data.

You can prepare your training data as separate files for the training set and validation set (as shown in the image above), or you can select to have the Model Generator automatically split one training data file into the two sets (shown in the image below). If you choose to have the Model Generator split the data, it will do so at an approximately 8:2 ratio of training set to validation set.

Model Generator training data: automatically splitting one file

You must prepare training data as comma-delimited CSV files (UTF-8 without BOM). Prepare the data with a column for each input variable. Always make sure that you put the results variable (the class to predict for) in the rightmost column. The results value must contain numerical values. If you prepare separate files for the training set and validation set, make sure that both are formatted identically.

For this guide, we’ll separate the training data into separate training set and validation set files. We’ll keep the data formatted in the same way as the original downloaded file, which is as follows:

sepal length, sepal width, petal length, petal width, class

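
The approximately 8:2 split described earlier can be sketched in Python. The file names below are hypothetical, and only a few sample rows are shown in place of the full 150-row dataset:

```python
import csv
import random

# Sample rows in the same format as bezdekIris.data (class in the rightmost column).
rows = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["4.9", "3.0", "1.4", "0.2", "Iris-setosa"],
    ["7.0", "3.2", "4.7", "1.4", "Iris-versicolor"],
    ["6.4", "3.2", "4.5", "1.5", "Iris-versicolor"],
    ["6.3", "3.3", "6.0", "2.5", "Iris-virginica"],
    ["5.8", "2.7", "5.1", "1.9", "Iris-virginica"],
    ["5.0", "3.6", "1.4", "0.2", "Iris-setosa"],
    ["5.5", "2.3", "4.0", "1.3", "Iris-versicolor"],
    ["6.5", "3.0", "5.8", "2.2", "Iris-virginica"],
    ["4.6", "3.1", "1.5", "0.2", "Iris-setosa"],
]

random.seed(0)                 # make the split reproducible
random.shuffle(rows)
cutoff = int(len(rows) * 0.8)  # approximately 8:2 training:validation
training_set, validation_set = rows[:cutoff], rows[cutoff:]

# Write UTF-8 CSV files (no BOM), one for each set, identically formatted.
for name, subset in [("train.csv", training_set), ("validation.csv", validation_set)]:
    with open(name, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(subset)
```

In this guide, however, we will prepare the split within BLOCKS itself, as shown in the next section.
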
Using the Data Editor to create training data

One way to prepare your training data is to use a spreadsheet program like Excel or Google Sheets. Ultimately, you just need to prepare the data as a CSV file, so it’s fine to use this method if it’s the simplest for you.

However, if you want to set up an automated system for preparing data, or if you’re working with very large datasets, we recommend using the Data Editor in BLOCKS. This way, you can prepare the data, train the model, and make predictions all from BLOCKS.

We’ll follow these steps to prepare our data in BLOCKS:

  1. Use the GCS Explorer to upload the iris data to Google Cloud Storage (GCS).
  2. Import the iris data into the Data Editor.
  3. Perform data cleansing.
  4. Split the data into a training set and a validation set.
  5. Write the training set to GCS.
  6. Write the validation set to GCS.

Uploading the iris data to GCS

We can use the GCS Explorer in BLOCKS to easily upload the iris data into GCS.

Opening the GCS Explorer

Click the menu () icon in the global navigation bar, then click GCS Explorer (beta).

The GCS Explorer will open in a new tab.

Selecting a bucket in the GCS Explorer

Select your GCP service account.

Depending on your BLOCKS payment plan, do the following to select your bucket:

  • Full Service Plan users:

    Select the bucket that ends with -data.

  • Self-Service Plan and Free Trial users:

    Select Create a new bucket.

    How to create a bucket

    You can name your bucket whatever you like, but you must set it to Regional and us-central1.

info_outline We’re using the bucket name blocks-ml-demo-data in this guide, but you should replace this with the name of your bucket when following the guide.

Next, we’ll make a folder inside the bucket that we’ll upload the iris data into.

Creating a folder in GCS (1)

Click Create Folder.

Creating a folder in GCS (2)

We’ve named the folder iris.

Selecting your folder from the list

Click on your folder from the list.

Confirming that you’re in the correct folder

Confirm that you are in the correct folder.

Uploading the iris data (1)

Click Upload File.

Uploading the iris data (2)

Select the iris data file you downloaded earlier from the dialog box (macOS version shown above).

Confirming that your file uploaded

It will take a moment for your file to upload. It should appear in the GCS Explorer once it finishes.

About using the Data Editor

info_outline The Data Editor is currently in beta, so the names of buttons and features may change in the future, or may be slightly different than those used in this guide.

The BLOCKS Data Editor makes use of BigQuery, Google’s big data processing service, to process data. BigQuery features the following:

  • Virtually no limit on data size
  • Extremely fast processing speeds
  • Very affordable pricing

With the Data Editor, you can perform powerful data processing without needing technical expertise in BigQuery. Beyond preparing training data for the Model Generator in this guide, we recommend trying out BigQuery through the Data Editor.

We uploaded the iris data into GCS in the previous section; now we need to import the data into the Data Editor so we can process it. Since the Data Editor uses BigQuery to process data, importing data into the Data Editor also enters it into a BigQuery table.

Importing the iris data into the Data Editor

Now we will import the data into the Data Editor.

Opening the Data Editor

Click the menu () icon in the global navigation bar, then click Data Editor (beta).

The Data Editor will open in a new tab.

Importing data into the Data Editor (1)

Click Import.

The Data Editor will switch to the import menu.

Importing data into the Data Editor (2)

Click Google Cloud Storage and select your GCS service account.

Importing data into the Data Editor (3)

Click the GCS URL field and select the iris data you uploaded.

Importing data into the Data Editor (4)

For the Schema section, select Edit as text. For convenience’s sake, we’ve prepared JSON text below that you can copy and paste into the schema entry field.

[
 {
  "name": "sepal_length",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "sepal_width",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "petal_length",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "petal_width",
  "type": "FLOAT",
  "mode": "NULLABLE"
 },
 {
  "name": "class",
  "type": "STRING",
  "mode": "NULLABLE"
 }
]

This JSON text expresses the formatting of the iris data as shown below:

Field contents Field name Data type
Sepal length sepal_length Float
Sepal width sepal_width Float
Petal length petal_length Float
Petal width petal_width Float
The correct class/species name class String

Importing data into the Data Editor (5)

Configure the settings as shown above: the iris data we downloaded is a CSV file with no header line, uses commas as the delimiter character, and contains some fields with missing values.

Importing data into the Data Editor (6)

Since the Data Editor stores the data into BigQuery, the last step is to configure the settings for the BigQuery destination.

Configure these settings as shown below:

Item Value
Dataset ID doc_samples
Table ID iris_init
Name Iris (initial data)

Click Import.

Data cleansing

Once the data has been imported, we need to check if there are missing or extra values in the data.

Click on your data’s name.

Click the button as shown in the image above to check if your data contains missing values.

The info_outline icon will be shown for any columns that contain missing values. Perform the steps shown in the image above for any such columns.

Since the iris data we are using has missing values in every column, repeat the above steps for each column.
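
The cleansing performed here amounts to removing rows that contain missing values. As a minimal sketch of that idea outside of BLOCKS (this only illustrates the concept, not what the Data Editor runs internally, and the rows below are made up):

```python
# Rows in the iris data format; an empty string represents a missing value.
rows = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["4.9", "", "1.4", "0.2", "Iris-setosa"],       # missing sepal width
    ["6.3", "3.3", "6.0", "2.5", "Iris-virginica"],
]

# Keep only rows in which every field has a value.
cleansed = [row for row in rows if all(field != "" for field in row)]
print(len(cleansed))  # → 2 (the row with the missing value is dropped)
```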

The iris data doesn’t contain any extra values, so we don’t need to clean up extra values from our data this time.

At this point, the missing values have only been deleted in the Data Editor, and not in the actual data in BigQuery.

Click Save in the upper-right of the Data Editor to save your changes and have them reflected in the actual BigQuery table.

Enter the following information for BigQuery:

Item Value
Name Iris (cleansed data)
Dataset ID doc_samples
Table ID iris_cleansing

With that, your training data is ready to be used for Machine Learning with a Model Generator.

Training the model

We’ll now create a Model Generator to train a classification model.

Creating a Model Generator

If you don’t have any Model Generators created, click Start on the What is the Model Generator? screen.

What is the Model Generator?

If you already have at least one Model Generator, click Add from the top of the Model Generator list.

The Model Generator list

info_outline A message will appear if your organization does not have sufficient licenses to create the Model Generator. If you are an admin for your organization, you will be prompted with instructions on how to purchase an additional license. If you are not an admin, you will be prompted to contact your admins.

Select the Classification model.

Selecting the type of model

Enter a name for your Model Generator.

Naming a Model Generator

The following step is for Self-Service Plan users only.

Select your GCP service account and enable any APIs that don’t have a checkmark.

Model Generator setup: GCP service account setting

The following step is for Self-Service Plan users only.

Once a training completes, its results will be stored into GCS. As such, you need to configure the GCS destination that will store the results.

This bucket must be set to the default storage class Regional and the location us-central1. We cannot guarantee that the Model Generator will function properly for buckets with other settings.

Model Generator setup: Storage settings

Now you need to configure settings for your training data. Enter each item’s name, type, and dimensions.

Training data settings for the classification Model Generator

Configure your training data settings as shown in the following chart:

  • Data items:
    Item name Type Dimensions Explanation
    sepal_length Numerical value 1 The sepal length
    sepal_width Numerical value 1 The sepal width
    petal_length Numerical value 1 The petal length
    petal_width Numerical value 1 The petal width
  • Class labels:

    In the section for the results value, enter the following for your class labels:

    • Iris-setosa
    • Iris-versicolor
    • Iris-virginica

Click Next to continue to the confirmation screen and create your Model Generator.

Model Generator (classification) confirmation screen.

Starting the training

Now we’ll use the Model Generator to start training a classification model.

Model Generator details screen

Click Start Training.

Start training
  • Enter a name for the training.
  • For the Training data upload method, select Automatically split one file into the training and validation sets.
  • For the Training data location, select Data Editor.
  • For the Training data table, click Select Table. Select the Iris (cleansed data) table and click Select.
    Selecting a Data Editor table for a Model Generator training
  • If desired, you can enter a time for the Max. time until timeout (minutes).
  • Set the Max. number of trials. By setting more than one trial, the Model Generator can automatically adjust its parameters until it finds the most accurate tuning.
  • Select whether to Enable or Disable automatic early stopping. Enable this option to have the training stop before the end of the specified training time if the Model Generator determines that accuracy is unlikely to improve. While this can reduce unnecessary training time, please be aware that accurately determining early stopping cannot be guaranteed.
  • Set the Machine type as either Single node, Single node (GPU), or Distributed nodes.
    • Single node: Uses the standard machine to run the training.
    • Single node (GPU): Runs the training using a GPU (Graphics Processing Unit) for generally faster results than the standard single node type. However, GCP fees will cost approximately three times as much. Depending on the training data, the speed may not be significantly faster, or may be slower in some cases.
    • Distributed nodes: Runs the training using multiple machines.

Click Start to start training the classification model.

Training details screen

Applying the trained model

Once the training finishes, its status should change to Succeeded and an Apply button should appear. Select whether to apply the model to a Testing or Production setting and click Apply.

Please wait a short while after clicking the apply button before running any predictions, as running them immediately may cause errors in the current version. If an error does occur, wait another few minutes and try running your prediction again.

Making predictions

Making a predictive Flow

We’ll use the Flow Designer service to make predictions using our trained model with the following Flow:

Predictive Flow for classification

After the Start of Flow BLOCK, place a Construct Object BLOCK from the Basic section of the BLOCK List.

Set this BLOCK as follows to enter data for the irises we will make predictions for:

Property Value
Results variable _ (Left as the default setting)
Data (configured as shown in the image below)

Construct Object BLOCK Data property settings

The data.0, data.1, and data.2 objects each contain data for one of the three types of iris the model can predict for. When using multiple sets of data, we combine them into an Array. To do this, click + next to Array to create elements for data.0, data.1, and data.2.

Set each of these elements to Object type. Use the + next to Object to add a set of data that includes an identifier (key), the sepal length (sepal_length), the sepal width (sepal_width), the petal length (petal_length), and the petal width (petal_width).
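
The finished Data property corresponds to an array of three objects like the following. The measurement values here are hypothetical placeholders; a short Python sketch that builds and prints such a payload:

```python
import json

# Hypothetical input rows for three predictions; each object pairs an
# identifier (key) with the four input variables used during training.
data = [
    {"key": "1", "sepal_length": 6.0, "sepal_width": 2.7,
     "petal_length": 4.2, "petal_width": 1.3},
    {"key": "2", "sepal_length": 6.4, "sepal_width": 3.1,
     "petal_length": 5.5, "petal_width": 1.8},
    {"key": "3", "sepal_length": 5.0, "sepal_width": 3.4,
     "petal_length": 1.5, "petal_width": 0.2},
]

# The Construct Object BLOCK stores this array under _.data for the next BLOCK.
print(json.dumps({"data": data}, indent=1))
```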

Next, add a Model Generator prediction (online) BLOCK from the Machine Learning category of the BLOCK List to the Flow. Set its properties as follows:

Property Value
GCP service account Select your GCP service account.
Model Select the model you trained with your Model Generator.
Version used for predictions Select whether to use the Production (current) or Testing (preview) of the model.
Input variable _.data
Output variable _out

There are many different ways to output the prediction results, but we’ll just load them into a BigQuery table using the Load to table from variable BLOCK from the BigQuery category of the BLOCK List. Set its properties as follows:

Property Value
GCP service account Select your GCP service account.
Source data variable _out.predictions
Destination dataset Enter the BigQuery dataset in which you want to store the results.
Destination table iris_predict
Schema settings
key STRING NULLABLE
label STRING NULLABLE
label_index INTEGER NULLABLE
score FLOAT REPEATED

You can enter the schema settings more quickly by clicking Edit as JSON and copying in the following JSON text:

[
 {
  "name": "key",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "label",
  "type": "STRING",
  "mode": "NULLABLE"
 },
 {
  "name": "label_index",
  "type": "INTEGER",
  "mode": "NULLABLE"
 },
 {
  "name": "score",
  "type": "FLOAT",
  "mode": "REPEATED"
 }
]

In cases of non-empty tables Overwrite
File format (within advanced settings) NEWLINE_DELIMITED_JSON

We’ll also add an Output to log BLOCK from the Basic category so we can quickly check our results directly from the Flow Designer.

Property Value
Variable to output _out.predictions

Making predictions

Once your Flow is ready, save your Flow Designer and click the button within your Start of Flow BLOCK’s properties.

Open the logs section to confirm the results of the prediction.

[
  {
    "label_index": 1,
    "score": [
      0.10034828633069992,
      0.5175866484642029,
      0.3820650279521942
    ],
    "key": "1",
    "label": "Iris-versicolor"
  },
  {
    "label_index": 2,
    "score": [
      0.040482282638549805,
      0.45981335639953613,
      0.49970442056655884
    ],
    "key": "2",
    "label": "Iris-virginica"
  },
  {
    "label_index": 0,
    "score": [
      0.9810823202133179,
      0.01643194630742073,
      0.0024857944808900356
    ],
    "key": "3",
    "label": "Iris-setosa"
  }
]

The outputted results contain a label, score, and label_index for each key configured in the Construct Object BLOCK.

label shows the predicted class (species of iris):

  • "Iris-setosa": Iris setosa
  • "Iris-versicolor": Iris versicolor
  • "Iris-virginica": Iris virginica

score shows the confidence level the model has for predicting the class. Scores range from 0 (no confidence) to 1.0 (100% confidence).

label_index shows which value within score corresponds to the label. The values within score are indexed starting from 0.
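
In other words, label_index is the position of the highest score, and label is the class at that position. A sketch of the relationship, using the first result from the example log output above (and assuming the class labels are ordered as entered during the Model Generator setup):

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# First prediction from the example log output above.
score = [0.10034828633069992, 0.5175866484642029, 0.3820650279521942]

label_index = score.index(max(score))  # position of the highest score
label = labels[label_index]            # class label at that position

print(label_index, label)  # → 1 Iris-versicolor
```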

Refer to Predicting with the Model Generator prediction (online) BLOCK for instructions on other methods of making predictions with that BLOCK beyond the one showcased in this guide.