Model Generator How-To: Text Classification

Introduction

This tutorial explains how to use the Text Classification Model Generator in MAGELLAN BLOCKS to create a model that predicts which of three authors wrote given sentences.

Text Classification model example

This service is currently in beta. As such, some of the features and steps described in this guide may change in the full release. We appreciate feedback from users regarding bugs or ways to improve BLOCKS.

General overview of steps

To perform text classification in MAGELLAN BLOCKS, we will use a Model Generator and a Flow Designer.

Text classification in BLOCKS overview

First, we’ll train a model using a Model Generator. For this training, we upload example passages from each author into the Model Generator. Then, it “learns” the identifying features within each author’s text. We refer to the result of a successful training as a model or trained model.

Next, we’ll use a Flow Designer to predict which author wrote sentences that weren’t used in the training data.

The training and prediction data must be prepared in advance.

Trying out text classification

Before starting

We recommend using Google Chrome for this tutorial. You can also use Firefox, but one feature used in this tutorial (explained in more detail later) is only available in Google Chrome.

You’ll need two sets of text data: one for training the text classification model in the Model Generator, and a separate set, not used during the training, for making predictions and testing the model. We’ve prepared data that you can use to try out this tutorial.

Data | Explanation
Sample Data | A set of text files to use for text classification machine learning. It contains separate folders with files to be used as training data and prediction data.

  1. Download the sample data

    Click the link above to download the sample data. The data folders are contained in a ZIP file.

  2. Extract the files

    Extract the data from the ZIP file. The extracted data should be organized as shown in the image below:

    Organization of sample data folder

    The training data text files are organized into separate folders by author. These folder names (the author names) will be used as the classification labels by the Model Generator. The text files must have the .txt extension.

    Since this tutorial will create a model that classifies texts by Herman Melville, Jane Austen, and Mark Twain, text files with examples of their writing are placed into folders named as follows (note that the sample data spells the second folder jane_austin):

    • herman_melville
    • jane_austin
    • mark_twain
  3. Upload the data to Google Cloud Storage (GCS)

    During the training, the Model Generator will read the training data from GCS. We’ll use the GCS Explorer tool in BLOCKS to upload the extracted folder (text_classification_sample) to GCS. Do the following to upload the folder:

    Sign in to BLOCKS.

    Opening the GCS Explorer
    1. Click the menu icon in the global navigation bar.
    2. Click GCS Explorer (beta).
    Selecting the GCP service account and bucket
    1. Select a GCP service account.
    2. Select the bucket to upload the data into. We are using a bucket that ends with -data.

    Users on the Self-Service Plan (including the free trial) can choose to automatically create default buckets, including a -data bucket, when creating a BLOCKS project.

    If you do not have any buckets, you can create one from the GCP service accounts section of the project settings menu.

    Creating a bucket from the project settings menu

    You can select another bucket if you have already created one. However, its Storage Class must be Regional and its Location must be us-central1.

    Uploading a folder to GCS
    1. Click Upload Folder.

    The upload folder function of the GCS Explorer is not available when using Firefox. If you are using Firefox, you can use the create folder function of the GCS Explorer to create the same folders as the downloaded data in GCS, then upload the text files into those folders. For more information on uploading files in the GCS Explorer, refer to Uploading files to GCS.

    Selecting the folder to upload
    1. Select the text_classification_sample folder you extracted.
    2. Click Upload.
    Clicking upload on the confirmation dialog
    1. Click Upload.
    The data uploaded into GCS

    It will take a bit of time for the text_classification_sample folder to finish uploading.

    Once finished, your training and prediction data will be ready to use.
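
Before uploading, it can help to confirm that the training folders on your PC follow the layout described above: one folder per label, containing only .txt files. The following is a minimal sketch using only the Python standard library. The local path text_classification_sample/training is an assumption based on the folder names used in this tutorial, and the function name is hypothetical, not part of BLOCKS.

```python
from pathlib import Path


def summarize_training_folder(training_dir):
    """Return ({label: number of .txt files}, [paths that are not usable])."""
    counts = {}
    skipped = []
    for label_dir in sorted(Path(training_dir).iterdir()):
        if not label_dir.is_dir():
            # Files directly under the training folder belong to no label.
            skipped.append(label_dir)
            continue
        counts[label_dir.name] = 0
        for path in sorted(label_dir.iterdir()):
            if path.is_file() and path.suffix == ".txt":
                counts[label_dir.name] += 1
            else:
                skipped.append(path)
    return counts, skipped


if __name__ == "__main__":
    # Assumed local location of the extracted sample data; adjust as needed.
    training = Path("text_classification_sample") / "training"
    if training.is_dir():
        counts, skipped = summarize_training_folder(training)
        for label, n in counts.items():
            print(f"{label}: {n} .txt files")
        for path in skipped:
            print(f"not a .txt file inside a label folder: {path}")
```

The folder names it reports are the labels the Model Generator would see, so they should match the labels you enter later.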

Create a Model Generator

In order to train our text classification model, we’ll need to create a Model Generator designed for use with our training data.

The steps for creating this Model Generator are as follows:

Opening the Model Generator
  1. Click the menu icon in the global navigation bar.
  2. Click Model Generator.

A screen titled What is the Model Generator? will appear if you haven’t created any Model Generators.

The What is the Model Generator screen
  1. Click Start.

A message will appear if you do not have enough licenses to create the Model Generator. If you are an admin for your organization, you will see the license purchase screen, where you can purchase an additional Model Generator license to continue. If you are not an admin, you will need to contact your organization’s admins to request that they purchase a license.

The list of Model Generators in your project will appear if any have already been created.

The Model Generator list
  1. Click Add.

A message will appear if you do not have enough licenses to create the Model Generator. If you are an admin for your organization, you will see the license purchase screen, where you can purchase an additional Model Generator license to continue. If you are not an admin, you will need to contact your organization’s admins to request that they purchase a license.

Selecting the text classification model
  1. Click Text classification model (beta).
  2. Click Next.
Entering a name for the Model Generator
  1. Enter a name for the Model Generator.
  2. Click Next.

Free Trial and Self-Service Plan users should follow the directions on the screen to complete the following two steps:

  1. GCP service account settings
  2. Storage settings
Label settings
  1. Enter the labels for your data. For the data in this tutorial, we can enter herman_melville,jane_austin,mark_twain.
  2. Click Add.
Confirming the labels for your data
  1. Confirm that the labels for your data were added correctly.
  2. Click Next.
Confirming the settings for your Model Generator
  1. Confirm that your settings are correct and click Finish.
Final confirmation dialog
  1. Click OK to finish creating the Model Generator.

Training a model

Now that the Model Generator is ready, we’ll use our training data to train a model.

Starting a new training
  1. Click Start Training.
Entering a name for a training
  1. Enter a name for the training.
  2. Click the folder icon.
Selecting the training data folder in GCS
  1. Click the arrow icon next to the bucket that contains your training data. We’ve used a bucket that ends in -data.
  2. Click the arrow icon next to text_classification_sample.
  3. Click training/.
  4. Click Select.
Starting the training
  1. Click Start.
Checking on the status of a training

You can check on the status of a training while it is running by looking at the training list.

The training for this tutorial should take about four hours, although the exact time depends on server load.

Confirming that a training was successful

The status for the training will change to Successful if it finishes successfully.

Now we need to make the model we just trained available for predictions. To do this, we will Apply the training.

Applying a training
  1. Click the drop-down arrow.
  2. Click Production.
  3. Click Apply.

For more details about applying to production or testing, refer to the Model Generator Help page’s Training list section.

If your training fails, please try running it again. For help determining the reason for a training’s failure, refer to In case of an error.

Creating a Flow Designer

With the trained model ready, we can now use it to make predictions in a Flow Designer. We can use the Flow Templates feature of the Flow Designer to quickly create a Flow for text classification predictions.

Creating a Flow Designer
  1. Click the menu icon in the global navigation bar.
  2. Click Flow Designer.
Creating a Flow Designer
  1. Click Start.

If you have already created a Flow Designer in your project, you will see the Flow Designer list instead of the “What is a Flow Designer?” screen. In this case, you can click on the name of an existing Flow Designer and use it for the rest of this tutorial. If you have enough licenses and want to use a new Flow Designer, you can click Add in the upper-left corner of the Flow Designer list.

Creating a Flow Designer
  1. Enter a name for the Flow Designer.
  2. Configure the language (for log messages) and time zone settings as necessary.
  3. Click Create.

Creating a Flow for making predictions

With the Flow Designer ready, we’ll use the Flow Templates menu to create a Flow for making predictions with the model we trained.

Opening a Flow Designer
  1. Click the name of the Flow Designer you will use.

Your Flow Designer will open in a new tab.

Click the Flow Templates button
  1. Click Flow Templates.
Selecting to create a text classification Flow
  1. Click Text classification prediction.
  2. Click Next.
Naming the Flow
  1. Enter a name for the Flow.
  2. Click Next.
Configuring the prediction BLOCK settings for the Flow
  1. Click on the Model Generator that you created for this tutorial. We used the name Text Classification Demo.
  2. Click Online prediction.
  3. Click Next.
Clicking the icon to select data from GCS
  1. Click the folder icon.
Selecting the prediction data from GCS
  1. Click the arrow icon for the bucket that contains your prediction data. We used a bucket that ends with -data.
  2. Click the arrow icon for text_classification_sample.
  3. Click prediction.
  4. Click Select.
Configuring the setting to use all the prediction data
  1. Add a * to the end of the GCS URL so that all the prediction data in the folder is used. For example: .../text_classification_sample/prediction/*
  2. Click Next.
Setting the output to the Data Editor
  1. Select Data Editor for the storage location.
  2. Enter a name for identifying the data in the Data Editor. (Example shown above: Author Prediction Results)
  3. Enter the dataset that will store the results. (Example shown above: tutorials_en)
  4. Enter the table that will store the results. (Example shown above: text_classification_tutorial_results)
  5. Click Next.
Placing the Flow onto the Flow Designer
  1. Click Create.
Saving the Flow
  1. Click Save.

Make sure to click Save after creating the Flow. You won’t be able to execute the Flow to make predictions unless you save. If you close the Flow Designer tab or your web browser without saving, the Flow will be lost.
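
To make the wildcard step in the Flow settings more concrete (adding a * to the end of the GCS URL), the sketch below builds such a URL and shows which object URLs a trailing * would select. This is only an illustration: the bucket name and file names are hypothetical, and Python’s fnmatchcase is used as a stand-in for however BLOCKS actually expands the wildcard, which this tutorial does not specify.

```python
from fnmatch import fnmatchcase

# Hypothetical bucket name; the tutorial only says the bucket ends with "-data".
BUCKET = "example-project-data"


def prediction_pattern(bucket, folder="text_classification_sample/prediction"):
    """Build a GCS URL that ends with a * wildcard for the given folder."""
    return f"gs://{bucket}/{folder}/*"


def matching_objects(pattern, object_urls):
    """Return the URLs selected by the pattern (fnmatch is an approximation)."""
    return [url for url in object_urls if fnmatchcase(url, pattern)]


if __name__ == "__main__":
    pattern = prediction_pattern(BUCKET)
    # Made-up object names, used only to show what the pattern selects.
    objects = [
        f"gs://{BUCKET}/text_classification_sample/prediction/sample_1.txt",
        f"gs://{BUCKET}/text_classification_sample/training/mark_twain/a.txt",
    ]
    print(pattern)
    for url in matching_objects(pattern, objects):
        print("selected:", url)
```

Only the object under prediction/ is selected, which is why the training data is not used when the Flow runs with this URL.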

Making predictions

We can now use the Flow to make predictions.

Executing the Flow
  1. Click the menu icon on your Start of Flow BLOCK. (We named this BLOCK Predict Author.)
  2. Click Execute Flow.

We can view the Logs section to check on the status of our Flow as it executes.

Checking the logs from the menu
  1. Click View Logs.
Checking on the status of a Flow while it executes
  1. Confirm that the Flow’s status is Running.

The Flow will take a bit of time to run.

Confirming that a Flow executed successfully
  1. Wait until the status changes to Finished.

Once this happens, the Flow has successfully executed.

If the Flow fails to execute successfully, refer to In case of an error for help determining the cause of the error.

Checking the prediction results

When we used the Flow Template to create the Flow, we configured the prediction results to be sent to the Data Editor. To check the results in the Data Editor, switch from the Flow Designer tab back to the BLOCKS tab, which we left on the Flow Designer list page.

Opening the Data Editor
  1. Click the menu icon in the global navigation bar.
  2. Click Data Editor (beta).
Opening the prediction results
  1. Click the name you configured for the results. We used Author Prediction Results.
Viewing the data
  1. Click View data.
The results of the prediction in the Data Editor

The following chart explains the meaning of each column:

Name | Explanation
key | The GCS URL for a prediction text file.
label | The predicted label. In this example, we configured our labels as herman_melville, jane_austin, and mark_twain.
score | The confidence level for the predicted label. This is shown as a number between 0 and 1, with 1 signifying 100% confidence.
score_herman_melville, score_jane_austin, score_mark_twain | The confidence level for each possible label. These are shown as numbers between 0 and 1, with 1 signifying 100% confidence.
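
As a rough illustration of how these columns relate, the sketch below takes one result row and picks the label with the highest per-label score. It assumes that the label column corresponds to the highest score_ column and that score equals that value; this tutorial does not state this explicitly. The row shown is made up for illustration, and the function name is hypothetical.

```python
def top_label(row, prefix="score_"):
    """Return (label, score) for the highest per-label score in one result row."""
    # The plain "score" column has no underscore suffix, so it is not included.
    scores = {
        name[len(prefix):]: float(value)
        for name, value in row.items()
        if name.startswith(prefix)
    }
    label = max(scores, key=scores.get)
    return label, scores[label]


if __name__ == "__main__":
    # Illustrative row only; the key and all numbers are invented.
    example_row = {
        "key": "gs://example-project-data/text_classification_sample/prediction/sample.txt",
        "label": "mark_twain",
        "score": "0.8",
        "score_herman_melville": "0.1",
        "score_jane_austin": "0.1",
        "score_mark_twain": "0.8",
    }
    print(top_label(example_row))
```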

You can export data from the Data Editor as a CSV file by doing the following:

Exporting results from the Data Editor
  1. Click the Table menu.
  2. Click Export table.
Configuring the export settings
  1. Select Google Cloud Storage for the export destination.
  2. Select the GCS folder that will store the CSV file. We used the one that ends with -data.
  3. Click Export.
Downloading the exported file
  1. Click on the file name to download the CSV to your PC.
  2. Click OK.
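
Once the CSV is on your PC, a short script can summarize it. The following is a minimal sketch that counts predictions per label and lists the files whose confidence falls below a threshold. It assumes the exported CSV has a header row with the column names described above (key, label, score); the function name and the 0.5 threshold are arbitrary choices for illustration.

```python
import csv
from collections import Counter


def summarize_predictions(csv_path, threshold=0.5):
    """Count predictions per label and collect keys whose score is below threshold."""
    counts = Counter()
    low_confidence = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        # DictReader uses the first row as column names (assumed to be present).
        for row in csv.DictReader(f):
            counts[row["label"]] += 1
            if float(row["score"]) < threshold:
                low_confidence.append(row["key"])
    return counts, low_confidence
```

For example, calling summarize_predictions with the path of the downloaded file returns the per-label counts and the GCS URLs of the low-confidence predictions, which are good candidates for a manual check.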

In case of an error

If an error occurs during the Model Generator’s training, you can find the error logs by doing the following:

Opening the training details screen
  1. Click on the name of the training whose status is Failed.
Checking the error logs
  1. Click Error logs.
  2. Click Copy error logs to clipboard (if you will contact BLOCKS Support).

If an error occurs while using the Flow Designer, you can find the error logs by checking the Logs panel.

Checking the log panel of a Flow Designer

Error messages are shown in red.

To determine the cause of an error, it’s often helpful to read the logs before and after the red error message.

If you encounter an error that you cannot solve after several attempts, you can contact the BLOCKS support team by clicking your user icon in the right side of the global navigation bar and selecting Contact Us. For errors in a Model Generator, please copy the entire contents of the error logs—not just the red lines—and include these as a text file when you send your message.

For errors in a Flow Designer, click the Show error log details checkbox in the Logs panel of the Flow Designer, then copy the logs. You should also export your Flow as a JSON file and include this as an attachment in your message to BLOCKS support.

For more details on contacting BLOCKS support, refer to the Basic Guide: Contact Us page.

Summary

With BLOCKS, you just need to organize text files into labeled folders to get started with text classification machine learning.

As a final note, the following are some things to keep in mind regarding text files you can use with BLOCKS:

  • Place text files for training into separate folders for each label (category) that the model will classify.
  • Only the first 2,000 characters of each file are used during the training. You can use files with over 2,000 characters, but any characters past the first 2,000 will be ignored when training the model. If you want to use characters after the first 2,000 in the training, you will need to split the text into multiple files with no more than 2,000 characters each.
  • The text files should be UTF-8 without BOM.
  • The text files should have the .txt extension.
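
The points above about the 2,000-character limit, UTF-8 without BOM, and the .txt extension can be handled with a small script when preparing your own data. The following is a minimal sketch, not an official BLOCKS tool. The function names and the output file naming are hypothetical, the split is a plain fixed-size cut that may fall in the middle of a word, and it counts characters as Python does, which may differ from how BLOCKS counts them.

```python
from pathlib import Path

MAX_CHARS = 2000  # limit described in the list above


def split_text(text, limit=MAX_CHARS):
    """Split text into consecutive pieces of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def write_chunks(text, out_dir, stem, limit=MAX_CHARS):
    """Write each piece to <stem>_<number>.txt as UTF-8 without a BOM."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # Drop a leading BOM character if the source text carried one.
    pieces = split_text(text.lstrip("\ufeff"), limit)
    for n, chunk in enumerate(pieces, start=1):
        path = out_dir / f"{stem}_{n:03d}.txt"
        # The "utf-8" codec does not write a BOM (unlike "utf-8-sig").
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```

To follow the folder convention, out_dir would be the label folder for the author, for example training/mark_twain.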