Model Generator How-To: Text Classification

The service featured on this page is currently in beta. As such, some of the features and steps described in this guide may change in the full release. We appreciate feedback from users regarding bugs or ways to improve BLOCKS.

This guide explains how to use the BLOCKS text classification Model Generator to predict which of three authors wrote pieces of text.

Text classification example

We’ll use a Model Generator and a Flow Designer to accomplish this.

Text classification system overview
  1. With the Model Generator, we will train a model using example text from each author (training data).
  2. Then, we will use a Flow Designer to predict which author wrote text samples that weren’t used during the training (prediction data).

You need to prepare your training data and prediction data in advance.

Getting started

If you don’t have a BLOCKS account yet, refer to the Trial Guide to set up a free trial account.

We’ve prepared sample training and prediction data that you can download and use with this guide. Follow the steps outlined in the following chart:

Data Explanation
Sample data

Text data formatted for use with the text classification Model Generator.

  1. Download the sample data

    Click the link on the left to download the sample data. The sample data is a ZIP file containing multiple folders and text files.

  2. Extract the ZIP file

    Extract the ZIP file. The extracted folders and files will be organized as shown in the image below.

    Extracted ZIP file organization

    The text files for the training data are placed into separate folders by author. The names of these folders (each author’s name) are used as the classes that the model predicts.

    The sample data for this guide contains folders named herman_melville, jane_austen, and mark_twain. Each folder contains text files of an author’s writing.

  3. Upload to Google Cloud Storage (GCS)

    During the training step, the Model Generator reads files stored in GCS. Refer to Uploading files to GCS and upload the text_classification_sample folder to GCS.
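Before uploading, it can help to confirm that the extracted data has the expected one-folder-per-class layout. The following is a minimal sketch, not part of BLOCKS: the function name `summarize_training_folder` is hypothetical, and it assumes the text files use a `.txt` extension and that the class folders sit inside a local `training` folder.

```python
from pathlib import Path


def summarize_training_folder(training_dir):
    """Return {class_name: number_of_text_files} for each class folder.

    Assumptions (not stated in the guide): text files end in .txt, and
    every subfolder of training_dir is a class folder.
    """
    root = Path(training_dir)
    summary = {}
    # Each subfolder name is treated as one class; loose files are ignored.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        summary[class_dir.name] = sum(
            1 for f in class_dir.glob("*.txt") if f.is_file()
        )
    return summary
```

For example, calling `summarize_training_folder("text_classification_sample/training")` on the extracted sample data should return one entry each for herman_melville, jane_austen, and mark_twain. A count of 0 suggests a misplaced or empty folder.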

Sample Flow

This sample Flow uses the Model Generator prediction (online) BLOCK to make predictions.

  1. Download the sample Flow

    Click the link to the left to download the sample Flow.

  2. Import the Flow into a Flow Designer

    Refer to Importing and exporting Flows and import the sample Flow into a Flow Designer.

Training

We’ll use the Model Generator to train our model based on ten example text passages from each author.

  1. If you aren’t currently using any Model Generators, click Start.

    What is the Model Generator?

    A message will appear if your organization does not have sufficient licenses to create the Model Generator. If you are an admin for your organization, you will be prompted with instructions on how to purchase an additional license. If you are not an admin, you will be prompted to contact your admins.

  2. If you are already using Model Generators, click Add at the top of the Model Generator list.

    The Model Generator list

    A message will appear if your organization does not have sufficient licenses to create the Model Generator. If you are an admin for your organization, you will be prompted with instructions on how to purchase an additional license. If you are not an admin, you will be prompted to contact your admins.

  3. Select Text classification.

    Selecting the type of Model Generator
  4. Enter a name for the Model Generator and click Next.

    Entering a name for a Model Generator
  5. Those using the free trial or the Self-Service Plan will see the following screens. Follow the instructions on each.

    1. GCP service account settings
    2. Storage settings
  6. Enter labels for each class.

    Model Generator text classification type label setting

    In this example, we will classify texts from three authors, and we have sorted the training files into folders named herman_melville, jane_austen, and mark_twain.

    To set labels for these classes, enter herman_melville,jane_austen,mark_twain into the labels field and click .

    Confirm that your labels were registered correctly and click Next.
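If you have many classes, you may prefer to build the labels string from the folder names rather than typing it by hand. The helper below is a hypothetical sketch, not a BLOCKS feature. Sorting the names alphabetically is our own assumption, since the guide does not say whether the order of labels matters.

```python
def labels_field_value(folder_names):
    """Join class folder names into a comma-separated labels string."""
    names = sorted(set(folder_names))
    for name in names:
        # A comma or surrounding whitespace would make the string ambiguous.
        if not name or "," in name or name != name.strip():
            raise ValueError(f"unsuitable label name: {name!r}")
    return ",".join(names)


print(labels_field_value(["mark_twain", "herman_melville", "jane_austen"]))
# prints: herman_melville,jane_austen,mark_twain
```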

Review your settings and click Finish to create your Model Generator.

Once finished, click Start Training.

Starting a training with the text classification model

Fill out the settings for the training as shown in the following chart:

Item Details
Training name

Enter a name for the training.

For example, First training.

Training data upload
(Full Service Plan users only)

For Full Service Plan users, this shows the location in Google Cloud Storage (GCS) that the training data will be uploaded to. It is shown with the format gs://BUCKETNAME.

Clicking the link will open the Google Cloud Console in another tab where you can access GCS using the Google account you’ve registered in the GCP access section of your project settings.

You should have already uploaded the training data in the Getting started section, so you won’t need to use this link.

Document folder

Designate the folder that contains the training text documents.

For the sample data in this guide, we’ll use gs://my-bucket/text_classification_sample/training/.

Replace the my-bucket portion of the URL with the bucket name from your own GCS environment.

Max. minutes for training

Set the maximum amount of time that the training is allowed to run.

We’ll leave this at the default value of 180 minutes (3 hours).

Explanation (optional)

If desired, enter an explanation for the training.

We’ll leave this setting blank this time.

Click Start to start the training. You can view the training’s status from the training list.

Text classification Model Generator details screen

Once complete, the training’s status will change to “Succeeded” and an Apply button will appear on the right side. Select Testing under Apply to, then click Apply. This allows you to make predictions with this model from a Flow Designer.

You can apply a trained model to either Production or Testing. Production is meant for models used for your actual business purposes, while Testing is for trying out models before using them in a production environment. Since we’re only testing a model in this guide, we’ll select Testing.

Predictions

We’ll use a Flow Designer to make predictions with the following Flow:

Sample Flow overview

The Model Generator prediction (online) BLOCK will perform the actual prediction. This BLOCK takes in prediction input data that has been stored into a variable and uses a Model Generator’s training results to perform predictions. In this guide, we’ll prepare the prediction input data with a Construct object BLOCK.

There are several ways to prepare input data besides using a Construct object BLOCK. Refer to Predicting with the Model Generator prediction (online) BLOCK for more details.

The following chart shows the property settings for each BLOCK in the Flow. Only important properties or those that have been changed from the default are shown.

BLOCK (Category), followed by each Property and its Value:

Construct object (Basic)

  • Results variable: _
  • Data: (image) Construct object BLOCK data property settings. Be sure to replace the my-bucket portion of the URL with the bucket name from your own GCS environment.

Model Generator prediction (online) (Machine Learning)

  • GCP service account: If you have multiple GCP service accounts, select the service account you would like to use with this BLOCK here.
  • Model: Select the Model Generator that you just created.
  • Version used for predictions: Preview (testing)
  • Input variable: _.data
  • Output variable: _

Output to log (Basic)

  • Variable to output: _

Save the Flow, then execute it by clicking the execute button within the Start of Flow BLOCK’s properties.

If successful, the Flow will output a log like the one shown below. You can view logs by clicking the Log bar at the bottom of the Flow Designer and then clicking on the relevant execution log.

{
  "predictions": [
    {
      "label_index": 2,
      "score": [
        0.09977061301469803,
        0.1609141081571579,
        0.7393152713775635
      ],
      "key": "Mark Twain",
      "label": "mark_twain"
    },
    {
      "label_index": 0,
      "score": [
        0.40951022505760193,
        0.3273720443248749,
        0.2631177306175232
      ],
      "key": "Herman Melville",
      "label": "herman_melville"
    },
    {
      "label_index": 1,
      "score": [
        0.06779564917087555,
        0.8964423537254333,
        0.0357620008289814
      ],
      "key": "Jane Austen",
      "label": "jane_austen"
    }
  ]
}

The prediction result for each text is a set that includes "label_index", "score", "key", and "label". In the example above, the three sets are found on lines 3–12, 13–22, and 23–32 of the log.

See the following chart for explanations of "label_index", "score", "key", and "label".

Name Explanation
"key"

If you assigned a key to a set of input data, it will be shown here.

If you didn’t assign a key, this will show the GCS URL for the text file.

"label"

The class that the model predicted (a folder name from the training data).

For this guide’s model, the three possible labels are:

  • "herman_melville"
  • "jane_austen"
  • "mark_twain"
"score"

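The model’s certainty for each label, given as a comma-separated list with one score per label. In the example above, the scores are listed in the order herman_melville, jane_austen, mark_twain.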

"label_index"

The position within the "score" list of the score that belongs to the predicted label.

The list of scores starts at index 0.

We’ve also rewritten the results into the following chart:

Input text file Prediction results
The Adventures of Huckleberry Finn (chapter 11)

About 73.93% certain that the author is Mark Twain.

Moby Dick (chapter 11)

About 40.95% certain that the author is Herman Melville.

Pride and Prejudice (chapter 11)

About 89.64% certain that the author is Jane Austen.
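The percentages in the chart above can be derived from the log output by reading the entry of "score" at position "label_index". Here is a minimal sketch, assuming you have copied the log’s JSON into a string. The names `LOG_OUTPUT` and `summarize_predictions` are our own and are not part of BLOCKS.

```python
import json

# The log output from this guide, copied into a string.
LOG_OUTPUT = """
{"predictions": [
  {"label_index": 2,
   "score": [0.09977061301469803, 0.1609141081571579, 0.7393152713775635],
   "key": "Mark Twain", "label": "mark_twain"},
  {"label_index": 0,
   "score": [0.40951022505760193, 0.3273720443248749, 0.2631177306175232],
   "key": "Herman Melville", "label": "herman_melville"},
  {"label_index": 1,
   "score": [0.06779564917087555, 0.8964423537254333, 0.0357620008289814],
   "key": "Jane Austen", "label": "jane_austen"}
]}
"""


def summarize_predictions(result):
    """Return (key, label, certainty in percent) for each prediction."""
    rows = []
    for prediction in result["predictions"]:
        # "label_index" points at the predicted label's entry in "score".
        certainty = prediction["score"][prediction["label_index"]]
        rows.append(
            (prediction["key"], prediction["label"], round(certainty * 100, 2))
        )
    return rows


for key, label, percent in summarize_predictions(json.loads(LOG_OUTPUT)):
    print(f"{key}: {label} ({percent:.2f}%)")
# prints:
# Mark Twain: mark_twain (73.93%)
# Herman Melville: herman_melville (40.95%)
# Jane Austen: jane_austen (89.64%)
```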

Summary

Training a text classification model with BLOCKS is as simple as separating text files into folders.

As a final note, here are some considerations to keep in mind when preparing your text files:

  • Separate text files for training into separate folders for each class.
  • During training, the Model Generator will only use the first 2,000 characters of each text file. You can use files with over 2,000 characters, but any characters after 2,000 will not be used in the training. If you want to use all of the characters, split the text into multiple files with no more than 2,000 characters each.
  • Text files should be UTF-8 without BOM.
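
The last two considerations can be handled with a short script run before uploading. This is a sketch under stated assumptions, not an official tool. The function names and the output file naming (`_001`, `_002`, and so on) are hypothetical, and it counts characters the way Python does, which may differ slightly from how the Model Generator counts them.

```python
from pathlib import Path

MAX_CHARS = 2000  # per-file limit stated in this guide


def split_text(text, max_chars=MAX_CHARS):
    """Split text into consecutive chunks of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def write_chunks(source_path, out_dir, max_chars=MAX_CHARS):
    """Write each chunk of source_path into out_dir as UTF-8 without BOM."""
    source = Path(source_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # "utf-8-sig" drops a leading BOM if the source file has one.
    text = source.read_text(encoding="utf-8-sig")
    paths = []
    for number, chunk in enumerate(split_text(text, max_chars), start=1):
        path = out / f"{source.stem}_{number:03d}.txt"
        # Plain "utf-8" never writes a BOM.
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```

This simple slicing can cut a chunk in the middle of a word or sentence. If that matters for your data, split at paragraph or sentence boundaries instead, keeping each chunk within the limit.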