Categorize data with the classification model

Introduction

This tutorial will demonstrate how to use a deep learning-based classification model in BLOCKS. The classification model tries to classify data among a set of categories. Some example use cases include the following:

  • Predicting if people will or will not register for a membership campaign.
  • Predicting if visitors to a website will or will not click on an advertisement.
  • Predicting if voters will choose candidate A, B, C, or D.
  • Predicting if credit card usage is normal or fraudulent.

For this tutorial, we’ll use data like petal length and width for examples of three types of irises to train and test a classification model in BLOCKS.

Tutorial example overview

General overview

BLOCKS contains three machine learning-capable services: the Model Generator, the DataEditor, and the Flow Designer.

This tutorial will use the DataEditor and Flow Designer. The following is a general overview of the steps:

Overview of classification using BLOCKS
  1. Import the data (CSV file) into the DataEditor and split it into training and testing data.
    The CSV file contains sepal length, sepal width, petal length, petal width, and iris species data.
  2. Run a training from the DataEditor
    BLOCKS will “learn” the relationship between the features (sepal length, sepal width, petal length, and petal width) and the label (iris species).
    The result of the training is called a model or trained model.
  3. Test the model from a Flow Designer by predicting the iris species
    from among the following:
    • Iris-versicolor
    • Iris-setosa
    • Iris-virginica

Trying out the classification model

We’ll do the following to train and test a classification model for predicting iris species:

  1. Prepare a CSV file of the iris data.
  2. Use the DataEditor to split the data into training and testing sets.
  3. Train the model in the DataEditor.
  4. Prepare a Flow Designer.
  5. Create a flow for making predictions.
  6. Execute the flow.
  7. Check the results of the prediction in the DataEditor.

Good, properly formatted data is essential to machine learning and makes it possible to train models and make predictions. As such, the first step will be to gather and process our data to prepare to use it for machine learning.

We recommend using Google Chrome for this tutorial. BLOCKS also supports Firefox, but some steps in this tutorial may be slightly different if you aren’t using Google Chrome.

Preparing the data as a CSV file

The first step is to prepare the data that we’ll use to train and test the classification model as a CSV file (UTF-8, without BOM).
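As a quick sanity check before uploading, a short Python sketch can verify that a file meets this encoding requirement. The function and file names below are our own illustration; BLOCKS itself only requires that the uploaded file be UTF-8 without a BOM.

```python
# Verify that a CSV file is UTF-8 encoded and has no byte order mark (BOM).
def is_utf8_without_bom(path):
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(b"\xef\xbb\xbf"):
        return False  # UTF-8 BOM present
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8
    return True
```

If this returns False for your file, re-save it as UTF-8 without a BOM in your text editor or spreadsheet software before importing.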

A CSV file with the iris data is available to download from the University of California Irvine website. For this tutorial, we’ll use the file titled bezdekIris.data, which you can download here by doing the following:

Downloading the initial iris data file
  1. Click bezdekIris.data to download the file.

The file contains the following five items:

  • Iris sepal length (cm)
  • Iris sepal width (cm)
  • Iris petal length (cm)
  • Iris petal width (cm)
  • Iris species (Iris-versicolor, Iris-setosa, Iris-virginica)
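For reference, the downloaded file is plain CSV with no header row, one example per line in the order listed above. A minimal sketch of parsing it with Python's csv module (the column names here are our own labels, matching the names we'll assign in the DataEditor later):

```python
import csv
import io

# bezdekIris.data has no header row; these names are our own labels
# for the five fields listed above.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

def load_iris_rows(text):
    """Parse the raw CSV text into dicts, skipping blank lines."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # the file ends with empty lines
            continue
        rows.append(dict(zip(COLUMNS, record)))
    return rows
```

Each row then maps a feature name to its value, with the species under `class`.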

Splitting the data in the DataEditor

We’ll now import the data into the DataEditor where we can split it into sets for training and testing the classification model.

During the training, BLOCKS will “learn” the relationship between things like petal length/width and the iris species.

Training data explanation

We refer to the input variables (sepal length, sepal width, petal length, and petal width) as features. We refer to the dependent variable (iris species) as the label, results variable, or class.

The training data will be organized as shown below with a column for each feature followed by the label as the right-most column. The data is already organized in this manner in the downloaded file.

Training data organization

The headers for each column can contain letters, numbers, or underscores (_).

During the testing step, we’ll use feature data that was not used to train the model to predict labels.

Testing data explanation

The testing data will be organized with a key column and the feature columns. The key column contains values to identify each example (1 row) of features. Each value within the key column must be unique. In this example, we’ll use sequential numbers as our keys.

The testing data will be organized as shown below with a key column followed by the feature columns.

Testing data organization

The headers for each column can contain letters, numbers, or underscores (_). The header for the key column must be named key.
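The two layouts can be illustrated with one example row of each; the values below are made up for illustration:

```python
# Illustration of the two layouts described above; the values are examples only.
training_row = {
    "sepal_length": 5.1, "sepal_width": 3.5,
    "petal_length": 1.4, "petal_width": 0.2,
    "class": "Iris-setosa",   # the label is the right-most column
}
testing_row = {
    "key": "1",               # unique identifier; the header must be "key"
    "sepal_length": 6.3, "sepal_width": 2.9,
    "petal_length": 5.6, "petal_width": 1.8,   # features only, no label
}
```

Note that the training data has a label but no key, while the testing data has a key but no label; the split step later in this tutorial produces exactly this arrangement.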

We’ll now import the iris data into the DataEditor and split it into the training and testing sets as shown below:

Overview of importing the iris data into the DataEditor and processing it

The iris data may contain some irregular values which we’ll cleanse using the DataEditor before splitting it into the training and testing sets.

Sign in to MAGELLAN BLOCKS if you have not done so already, then do the following:

Switching to the DataEditor
  1. Click the menu icon () in the global navigation bar.
  2. Click DataEditor.

Import the iris data into the DataEditor by doing the following:

Selecting to import data into the DataEditor
  1. Click Import.
Importing the iris data into the DataEditor
  1. Click Upload.
  2. Select the Google Cloud Storage (GCS) location to upload the file into. We’ll use a bucket that ends with -data.
  3. Drag and drop the bezdekIris.data file into the field or click and select the file.
  4. Change the number of skipped rows to 0.
  5. Click Additional options.
  6. Check Permit rows with insufficient fields.
  7. Select the dataset that will store the data or click to create a new dataset. We’ll use a dataset called tutorials.
  8. Enter a name to identify this data in the DataEditor. We’ll name ours Iris Classification Data.
  9. Click Import.

The DataEditor is a tool for visualizing and processing data that is stored in BigQuery. You don’t need to have any specialized knowledge of BigQuery to use the DataEditor. However, you do need to specify the BigQuery dataset and table that will store your data. If you are familiar with spreadsheet software, a BigQuery dataset could be compared to a workbook, and a BigQuery table to a single sheet.

Opening data that’s been imported into the DataEditor

Click Open.

The DataEditor summary page

The summary page for the data will open in the DataEditor.

Since the DataEditor automatically assigned names to the columns, rename them by doing the following:

Renaming columns
  1. Click Table.
  2. Click Edit Column.
  3. Click Rename column.
Renaming columns (2)
  1. Enter a new name for the column (see below for a list of names).
  2. Click OK.

Starting from the left-most column, repeat the steps above and rename all of the columns with the following names:

  1. sepal_length
  2. sepal_width
  3. petal_length
  4. petal_width
  5. class

Cleanse the data of any irregular values by doing the following:

Checking for irregular values
  1. Click the icon for the sepal_length column.
  2. Click Show missing values.
Example of the symbol that indicates a column contains irregular values

A symbol will appear for any columns that contain missing values (as shown in the image above). For this data, the DataEditor will find missing values in all of the columns.

We can’t use examples (rows) that contain missing values when training the model, so delete these rows by doing the following:

Deleting rows with missing values
  1. Click Edit Column for the sepal_length column.
  2. Click Missing values.
  3. Click Delete rows with missing values.

Repeat these steps for every column.
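The cleansing step above can be sketched in plain Python. This is not the DataEditor's actual implementation, and treating None and empty strings as missing is our assumption about this file:

```python
# Drop any example (row) with a missing value in any of the listed columns,
# mirroring the "Delete rows with missing values" step above.
def drop_rows_with_missing(rows, columns):
    return [
        row for row in rows
        if all(row.get(col) not in (None, "") for col in columns)
    ]
```

Only rows with a value in every column survive, which is what the model training requires.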

Next, we’ll add a key column that will be used in the testing data and when we compare the prediction results with the actual labels.

Adding a column
  1. Click Edit Table.
  2. Click Add column.
Adding a key column with sequential values
  1. Select Sequential.
  2. Enter the name key.
  3. Click OK.

This will add a column named key containing sequential numbers in the left-most position.

The key column must be STRING type when making predictions, so change its type by doing the following:

Changing a column’s type
  1. Click Edit column for the key column.
  2. Click Change type.
Changing to STRING type
  1. Select STRING.
  2. Click OK.
Confirming all changes
  1. Click Changes.
  2. Confirm that all of the listed changes are correct.

Your list should contain the following:

  • Rename double_field_0 to sepal_length
  • Rename double_field_1 to sepal_width
  • Rename double_field_2 to petal_length
  • Rename double_field_3 to petal_width
  • Rename string_field_4 to class
  • Delete missing values in sepal_length
  • Delete missing values in sepal_width
  • Delete missing values in petal_length
  • Delete missing values in petal_width
  • Delete missing values in class
  • Add key
  • Cast key to STRING

You can click the × next to an item to revert that edit.
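The key-column edits above (adding sequential values and casting them to STRING) can be sketched in plain Python; this is an illustration of the result, not the DataEditor's implementation:

```python
# Add a sequential "key" column, stored as strings (STRING type), mirroring
# the Add column and Change type edits listed above.
def add_string_keys(rows, start=1):
    keyed = []
    for i, row in enumerate(rows, start=start):
        new_row = {"key": str(i)}  # STRING type, as required for predictions
        new_row.update(row)        # key sits in the left-most position
        keyed.append(new_row)
    return keyed
```

Each row gains a unique string key ("1", "2", …) ahead of its feature columns.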

Save your changes by doing the following:

Saving the DataEditor
  1. Click the save icon.
  2. Click Overwrite.

Now that you’ve cleansed the initial iris data of irregular values, split it into the training and testing sets by doing the following:

Splitting the table
  1. Click Edit Table.
  2. Click Split table.

We’ll split the data randomly into the training and testing sets at an 8:2 ratio (the default setting in the DataEditor). Select which data columns to include in the resulting tables and split the initial data into the training and testing sets by doing the following:

Splitting the initial data into the training and testing sets
  1. Uncheck key to remove the key column from the training data.
  2. Uncheck class to remove the label column (iris species) from the testing data.
  3. Click Split.
The split confirmation dialog
  1. Click OK.

The training and testing sets are now ready.
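The random 8:2 split can be sketched in plain Python. This is similar in spirit to the DataEditor's Split table feature, not its actual implementation, and it omits the column selection (dropping key from the training set and class from the testing set) described above:

```python
import random

# Randomly split rows into training and testing sets, defaulting to the
# same 8:2 ratio the DataEditor uses.
def split_rows(rows, train_ratio=0.8, seed=42):
    shuffled = rows[:]                    # leave the caller's list intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With 150 iris examples, this yields 120 training rows and 30 testing rows, and every example lands in exactly one of the two sets.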

Training the model from the DataEditor

You can use the training data you prepared to create a classification model by doing the following:

Switching to the training data
  1. Click Iris Classification Data_train.
Configuring the training settings
  1. Click Create Model.
  2. Click Classification.
  3. Designate the GCS bucket and folder that will store your models. This step is only required for the first Model Generator you create.
  4. Click Create Folder. This step is only required for the first Model Generator you create.
  5. Click Create Model.
Closing the training started confirmation dialog
  1. Click Close.

It should take about 4–5 hours to train the model, depending on server load. You can check on the training’s progress from the model list or from the model’s details screen.

Switch to the model list by doing the following:

Returning to the DataEditor home screen
  1. Click < (the back icon).
Switching to the model list
  1. Click Models.
  2. You can view a progress bar for the training in the RMSE/Accuracy column. You can click the icon next to the progress bar to refresh it.
  3. Clicking the training’s name will open its details page.

A value will appear in the RMSE/Accuracy column when the training finishes.

Example of a completed training

In the next section, we’ll use a Flow Designer to test the model by using it to make predictions.

Making predictions from a Flow Designer

This section will explain how to use a Flow Designer to make predictions.

You can also run predictions from the DataEditor; however, unlike the Flow Designer, it does not support batch predictions or automated prediction schedules.

For more details on making predictions from the DataEditor, refer to Creating models and making predictions in the DataEditor.

Preparing a Flow Designer

In this section, we’ll use the Flow Templates feature in the Flow Designer to make predictions using the trained model. If you don’t have any Flow Designers yet, create one by doing the following:

Switching to the Flow Designer
  1. Click the menu icon () in the global navigation bar.
  2. Click Flow Designer.
What is the Flow Designer?
  1. Click Start.

If you’ve already created a Flow Designer in your project, you’ll see the Flow Designer list instead of the page shown above. In this case, you can either use an existing Flow Designer or create a new one by clicking Add.

A message will appear if you don’t have enough licenses to create the Flow Designer. If you are an admin user in your organization, you’ll be given the option to purchase more licenses. Otherwise, you’ll be prompted to contact your organization’s admins.

Creating a Flow Designer
  1. Enter a name for the Flow Designer. We’ll use the name Tutorials.
  2. Select the language that will be used for log messages.
  3. Select your time zone.
  4. Click Create.
Opening a Flow Designer
  1. Click the name of your Flow Designer from the list to open it in a new tab.
Creating a flow for predictions

In this section, you’ll create a processing flow that can make predictions with the trained model by doing the following:

Clicking the Flow Templates button

Click Flow Templates.

Creating a classification prediction flow
  1. Click Numerical classification.
  2. Click Next.
Entering a name for the flow
  1. Enter a name for the flow. We’ll use Iris Classification Prediction.
  2. Click Next.
Configuring the prediction BLOCK
  1. Select the model you trained in the DataEditor. Ours is named Iris Classification Data_train.
  2. Click Online prediction.
  3. Click Next.
Configuring the input data for making predictions

Configure the flow to use the testing data you prepared in the DataEditor by doing the following:

  1. Select DataEditor.
  2. Select your testing data. Ours is named Iris Classification Data_test.
  3. Click Next.
Configuring how to output the prediction results

Configure to have the results sent to the DataEditor by doing the following:

  1. Select DataEditor.
  2. Click Next.
Configuring where to place the flow in the Flow Designer

Configure which Flow Designer tab the flow will be placed into by doing the following:

  1. Click Create to create the flow in the current tab.
Saving the Flow Designer
  1. Click Save to save your Flow Designer.
Executing the flow

With the Flow ready, you can execute it to make predictions by doing the following:

Executing the flow
  1. Click the menu icon () on the right side of the Iris Classification Prediction BLOCK.
  2. Click Execute Flow.
Viewing logs
  1. Click View Logs to open the logs panel and check the status of the flow while it executes.
Example of a log for a flow that is running

You can see the status of your flow in the log list on the left side of the log panel. Its status will be Running (❶) while it executes. Wait for it to finish.

Example of a log for a flow that has finished executing successfully

The status will change to Finished (❶) if the flow finishes executing successfully.

Checking the results in the DataEditor

The results of the prediction will be sent to the DataEditor, so switch back to the BLOCKS tab in your browser and do the following:

Switching to the DataEditor
  1. Click the menu icon () in the global navigation bar.
  2. Click DataEditor.
Opening the results
  1. Click the reload icon if the results data isn’t in the list.
  2. Click on your results data. Ours is Iris Classification Data_test_result.
Viewing the data
  1. Click Table.
  2. Click View data.
Example of results data displayed in the DataEditor

The following explains the meaning of each column:

  • key: The keys you configured for the prediction data. In this example, the keys are unique, sequential numbers.
  • label: The class predicted for the features related to the corresponding key. In this example, this is the iris species predicted for the corresponding set of features (sepal length, petal width, etc.).
  • score: The confidence level for the predicted label, shown as a value between 0 and 1 (1 indicates 100% confidence).
  • score_Irissetosa, score_Irisvirginica, score_Irisversicolor: Columns showing the probability for each possible label, with values ranging from 0 to 1 (1 indicates 100% confidence).
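The relationship between these columns can be checked with a small sketch. The numbers below are made up for illustration, not actual model output:

```python
# Hypothetical per-class scores for a single prediction row.
scores = {
    "Iris-setosa": 0.02,
    "Iris-virginica": 0.13,
    "Iris-versicolor": 0.85,
}

# The per-class scores form a probability distribution (they sum to 1).
# The predicted label is the class with the highest score, and that
# highest score is the confidence value reported in the score column.
predicted_label = max(scores, key=scores.get)
confidence = scores[predicted_label]
```

Here the predicted label would be Iris-versicolor with a score of 0.85.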

We can now compare the predicted labels with the actual iris species for the same examples.

To do this, we’ll merge our data into a table that contains both the predicted labels and the actual labels from the initial dataset by doing the following:

Joining tables in the DataEditor
  1. Click Edit Table.
  2. Click Merge table.
Configuring the join

Select the table that contains the actual labels by doing the following:

  1. Click Select join table.
  2. Click Iris Classification Data.

Configure settings for how to join the data between the two tables by doing the following:

Configuring the join conditions
  1. Check the boxes for the key and label columns.
  2. Check the box for the class column.
  3. Click Add.
  4. Select key.
  5. Select key.
  6. Click Join.

Configure settings for the table that will contain the merged data by doing the following:

Configuring the table that will store the merged data
  1. Enter a name that will identify the merged data in the DataEditor. We’ll use Iris Classification Compare Results.
  2. Enter the dataset that will store the new table. We’ll use tutorials.
  3. Enter the table ID for the new table. We’ll use bezdekIris_compare.
  4. Click OK.
The table join confirmation dialog
  1. Click OK.
Opening the iris classification comparison data

Click Iris Classification Compare Results.

Viewing the comparison data
  1. Click Table.
  2. Click View data.
Comparing the predicted and actual data

You can now easily compare the predicted species (the label column) with the actual species (the class column) and evaluate the accuracy of your model.
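Beyond eyeballing the table, the comparison can be summarized as a single accuracy number. A minimal sketch, assuming each merged row holds the predicted species under "label" and the actual species under "class" as in the table above:

```python
# Compute the fraction of examples whose predicted species matches the
# actual species in the merged comparison data.
def accuracy(rows):
    if not rows:
        return 0.0
    correct = sum(1 for row in rows if row["label"] == row["class"])
    return correct / len(rows)
```

An accuracy near 1.0 means the model predicted the species correctly for almost all of the testing examples.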

The DataEditor can export your data as a CSV file which you can use with other spreadsheet and data analysis tools. To export your data, do the following:

Selecting to export a table
  1. Click Edit Table.
  2. Click Export Table.
Configuring the export settings
  1. Click Google Cloud Storage.
  2. Designate the GCS location that will store the exported file. We’ll use the bucket ending with -data.
  3. Click Export.
Downloading the exported data
  1. Click the file name to download the file.
  2. Click OK.

The DataEditor’s export feature saves the data into Google Cloud Storage, but you can also download a copy of the file to your PC by clicking the link in the confirmation dialog box.

In case of an error

If an error occurs during a training on the DataEditor, you can find the error logs by doing the following:

Clicking a training that has failed from the model list
  1. Click the name of the model whose RMSE/Accuracy column shows Failed.
Clicking on the training to open up its details screen
  1. Click the name of the training.
Finding and copying the error logs
  1. Click Error logs.
  2. Click Copy error logs to clipboard.

If an error occurs on the Flow Designer, you can find the error logs in the logs panel. If you need to copy the error logs to contact BLOCKS Support, do the following:

Checking errors in a Flow Designer
  1. Select the logs with the status Failed.
  2. Click Show error log details.
  3. Click the button to copy the logs.

Error messages in the Flow Designer are shown in red, but it’s often helpful to read the logs before and after them.

If you encounter an error that you cannot solve after several attempts, you can contact the BLOCKS Support by clicking your user icon in the global navigation bar and selecting Contact Us. Please copy the entire contents of your error logs—not just the red lines—and include these either as a text file or within your message. For errors in a Flow Designer, you should also export your Flow as a JSON file and include this as an attachment in your message to BLOCKS support.

For more details on contacting BLOCKS Support, refer to the Basic Guide: Contact Us page.

Summary

With BLOCKS, training a classification model and making predictions doesn’t require specialized machine learning knowledge. All you need to do is prepare your data. However, there are a few points to keep in mind when doing this. In order to be usable by BLOCKS, your data should be a CSV file with UTF-8 encoding (without BOM).

As with this tutorial, using machine learning to solve real business problems starts with gathering data. You may be able to use data your company already has, or you may need to take steps to gather or purchase new data. You then need to examine your data and determine which features to use, cleanse it of irregular values, and prepare it in the correct format for training a model. It's not an exaggeration to say that collecting and processing data makes up the better part of doing machine learning.