Model Generator Help

Model Generator Help

Introduction

This page explains each of the steps involved in creating Model Generators and running trainings. Each Model Generator setup screen contains a link to a section of this page.

What is the Model Generator?

The Model Generator is a MAGELLAN BLOCKS service designed to make Machine Learning simple and accessible for everyone.

You can divide Machine Learning with BLOCK into two main steps: training and prediction.

  1. For the training step, you can use the Model Generator. It “learns” from past or known data to create a trained model.

  2. For the prediction step, you use the trained model is to make predictions based on new data.

    You can do this on a Flow Designer using either the BLOCK or the Model Generator prediction (batch) BLOCK.

Within the Model Generator, we refer to the training step simply as a training. During a training, the Model Generator uses data split into a training set and a validation set to train and optimize its model. You can create training and validation sets by splitting up your data beforehand, or have the Model Generator do this for you automatically.

The Model Generator uses the Google Cloud Machine Learning Engine to implement its Machine Learning functions.

Types of Model Generator

There are several types of Model Generator for various Machine Learning models. Each type currently available is explained below:

Classification

The classification model predicts which category data belongs to. Some example use cases include:

  • Categorizing which people are likely to not register (0) or register (1) for a membership campaign.
  • Categorizing which people are likely to not click (0) or click (1) an ad on a website.
  • Categorizing users into various groups (0 to n) based on their interests.
  • Categorizing whether credit card or ATM usage is normal (0) or fraudulent (1).

Regression

The regression model predicts a numerical value, often based on input data like weather, day of the week, and the like. Some example use cases include:

  • Predicting the number of attendees to an event.
  • Predicting the number of sales of a product.
  • Predicting the number of passengers.

Feature vector extractor (alpha)

This type is similar to the classification model, but it outputs vectors expressing the data’s features.

This service is currently in alpha and models made during the alpha period may become unsuable after the official version is released. Please use the official version once it is released.

As an alpha release, there’s the chance that some things may not work as intended. We appreciate your feedback regarding bugs or ways to improve MAGELLAN BLOCKS.

If you would like to request access to the limited alpha for this service, please contact us.

Image classification

The image classification model categorizes images based on their content. Some example use cases include:

  • Categorizing images of dogs as bulldogs, poodles, or dachshunds.
  • Categorizing images of cats as Persians, Siamese, or Scottish folds.
  • Categorizing images of sushi as tuna, salmon, or shrimp.

Object detection: manual setup

This model detects what objects are contained within images. It can detect multiple objects at the same time.

Object Detect example

For this type of model, you apply labels for objects contained in your training images.

GCP service account settings

This section only applies to Self-Service Plan users.

In this section, you will select your GCP service account and enable APIs required to use Google Cloud Machine Learning.

GCP service account settings screen

Select a GCP service account

The Model Generator creates and uses an environment within your GCP project. As such, it needs to be given permission to access your project. This is done using a GCP service account.

Enable APIs

Do the following if there are any APIs that do not have a checkmark () next to their Check button:

  1. Click Enable All APIs.
  2. The GCP console will open.
  3. Click Continue in the GCP console.
  4. Once the message The APIs are enabled appears, close the GCP console and return to BLOCKS.

Click Check for any API that does not have a checkmark (). Confirm that a checkmark appears for the API. Repeat this process for any APIs without a checkmark.

If the checkmark does not appear, try clicking Check again and waiting for a moment. This can sometimes take a bit of time. If the checkmark still does not appear, repeat the following:

  1. Wait a moment.
  2. Click Check.

If you still see a , the issue could be one of the following:

  • Your GCP service account is not set to the Editor role:

    Open the menu () in the upper-left corner of the GCP console. Hover your cursor over IAM and admin and click IAM. Confirm that your service account’s role is set to Editor. If not, change it to editor.

  • Billing is not enabled for your GCP project:

    Open the menu () in the upper-left corner of the GCP console. Click Billing. If you haven’t already done so, enable billing for your project.

Storage settings

This section applies to Self-Service Plan users only.

In this section, you will configure settings for the Google Cloud Storage (GCS) bucket and directory that will contain your training results.

Storage settings

Select GCS bucket

Prepare a bucket for the Model Generator to use, then select this bucket from the list. Please use the following settings when creating buckets to use with Model Generators:

Option Value
Default storage class Regional
Location us-central1

Setting the directory within the GCS bucket

Each Model Generator you create will use a different GCS directory within your bucket. You don’t need to create these directories in advance. The directory will be created with the name designated in this setting.

Training data settings

You will configure settings for the training data on this screen.

Training data settings

Prepare your training data as CSV files (UTF-8 with no BOM ) with commas as the field delimiter character.

  • Training data should consist of a set of at least one input variable and a results value. In the classification model’s case, the results value refers to the “answer value”. It refers to the “actual result” for the regression model.
  • Align each row as a set of input variables and the results value.
  • Order each row with the input variables first, followed by the results value.
  • The results value is a string for the classification model, and a numerical value for the regression model.
  • If you prepare separate files for the training set and validation set, both files must be formatted in the same way.
  • If you prepare your training data as one file, BLOCKS can automatically separate it into the training set and validation set using an approximately 8:2 ratio.

Input variable settings

Click Add another item and enter information such as name and type for each input variable in your training data.

When using training data created in the DataEditor, you can click Auto Set From Table and select your data to have BLOCKS automatically set your input data settings. It can also automatically set the keyword list for string (enum) type data. It will use the “Approx. number” setting if the data contains over 10,000 different types of strings.

Setting Explanation
Order

You can reorder a data item you have entered by clicking or .

Name

Enter a name for the data item. You can use letters, numbers, or underscores (_).

If you will use BigQuery for your input data or plan to use Flow Templates, there are certain reserved words in BigQuery that you should avoid using here. Refer to the BigQuery documentation’s Standard SQL Lexical Structure: Reserved Keywords page for a list of the reserved words.

Type Select the data type as numerical value, month, day of the week, Boolean, string (enum), or sequence. See the table below for more information on each data type.
Type-specific settings

There are additional settings required depending on the selected data type.

  • Numerical value: Dimensions
  • String (enum): Keyword list, approx. number
  • Sequence: Number of elements (length), dimensions per element (channels)

See the table below for a more detailed explanation of each setting.

Delete

Delete the data item by clicking .

Data types:

Type Explanation
Numerical value Integers or decimal numbers. You can also configure the number of dimensions when this type is selected. Dimensions refer to the number of numerical values within an item. For example, if you had the following data: 98,1.3,0,"A" and wanted to treat the 98,1.3 portion as one item, you set the dimensions setting to 2.
Month Integers indicating the month. The range can be either 0–11 or 1–12.
Day Integers indicating the day of the week. The range is 0–6.
Boolean

Select this type for data with two possibilities, such as true or false. Data in this type must be entered as the numbers 0 or 1.

This type can be used in situations like the following:

  • Is it a holiday or not? (holiday = 1, non-holiday = 0)
  • Is there an event? (yes = 1, no = 0)
  • Is the person single or not? (single = 1, not single = 0)

The way the above use 0 and 1 is just an example. You could also switch the values (ex. holiday = 0, non-holiday = 1).

Strings (enumerated)

String data. You can further select between a Keyword list or Approximate number.

Strings (enumerated) type settings Explanation
Keyword list Select this option when you know and can list the strings in the item. Enter the strings as a list separated by commas.
Approximate number Select this option when you don’t know all of the of the strings in the item. Enter the approximate number of different strings you expect there to be.

Space characters ( ) and blanks (cells not containing any characters or spaces) are considered as different data. Also, A , A, A , and A and similar cases with spaces included before or after characters are also each considered as different. Make sure to remove any such space characters if you want all of these to be handled in the same manner.

Sequence

Select this type for a list of numerical values whose order has significance.

Only one sequence can be set per training data.

Assume that we want to train a model to make predictions based on the previous week’s climate data (minimum temperature, maximum temperature, precipitation) as shown in the image below.

Sequence type overview

In this case, we use the sequence type because the ordering of the climate data sets (7 days ago, 6 days ago… 1 day ago) has significance.

Next, we set the “Number of elements (length)” and the “Dimensions per element (channels)” for the sequence.

The “Number of elements” is the number of items that have significance in their order. This would be set to 7 for our example, since there are 7 days of climate data in our sequence.

Number of elements example

“Dimensions per element” refers to the number of types of data within each element. In our example, we have 3 types of climate data in each element.

Dimensions per element example

This example uses a chronological sequence, but that is not a requirement. Any sequence of numbers for which the order has significance can be used.

The data used in this example would be written in a CSV file as shown in the image below.

CSV example

Class labels setting

When creating a classification model, you must enter a list of the results values, or class labels.

Enter a label into the field and click Add. You can enter multiple labels at once as a comma-separated list (ex: Iris-setosa, Iris-versicolor, Iris-virginica).

If you used Auto set from table to set your data items, you can click the menu icon () then click Auto set from table to automatically set your class labels. If you had already entered labels, this will overwrite them.

Click next to a label to delete it.

Output dimensions

When creating a regression model, you must configure the output dimensions of the results value.

As an example, if your model predicts for geographical location data with latitude, longitude, and altitude, the output dimensions would be 3.

Feature vector dimensions (alpha)

When creating a feature vector extractor model, you must configure the number of dimensions (1–100) for the vectors of the data’s features that will be output.

This service is currently in alpha and models made during the alpha period may become unsuable after the official version is released. Please use the official version once it is released.

As an alpha release, there’s the chance that some things may not work as intended. We appreciate your feedback regarding bugs or ways to improve MAGELLAN BLOCKS.

If you would like to request access to the limited alpha for this service, please contact us.

Using JSON text to enter training data settings

You can enter your training data information directly as JSON text.

Under Data items, click Edit directly by JSON and enter your JSON text into the field that appears.

Click Edit by form to switch back to the GUI for entering data items.

JSON text for training data should follow these specifications:

  • Enter training data information as an array ([item 1, item 2, ..., item n]). Order each item in the array in the same order as the columns in the training data CSV file.
  • Enter each data item as an object ({...}). Follow the guidelines below for setting each member of the object:
    • Name: Set as "name": "ITEM NAME".
    • Type: Set as "type": "TYPE". The "TYPE" can be one of the following:
      Type Value set for "TYPE" Type-specific additional data
      Numerical value

      "number"

      Set dimensions as "count": DIMENSIONS

      Month

      "month"

      None

      Day of the week

      "weekday"

      None

      Boolean

      "boolean"

      None

      String (enum)

      "enum"

      Set a keyword list as "keys": "KEYWORD LIST".

      Set an approximate number as"size": APPROX. NUMBER.

      Sequence

      "conv1d"

      Set the number of elements (length) as "length": NUMBER OF ELEMENTS (LENGTH).

      Set the dimensions per element (channels) as "channel": DIMENSIONS PER ELEMENT (CHANNELS).

The following is an example of JSON text for training data settings:

[
  {
    "type": "number",
    "name": "item1",
    "count": 1
  },
  {
    "type": "month",
    "name": "item2"
  },
  {
    "type": "weekday",
    "name": "item3"
  },
  {
    "type": "boolean",
    "name": "item4"
  },
  {
    "type": "enum",
    "name": "item5",
    "keys": "A, B, O, AB"
  },
  {
    "type": "enum",
    "name": "item6",
    "size": 100
  },
  {
    "type": "conv1d",
    "name": "item7",
    "length": 3,
    "channel": 1
  }
]

Label settings

For the object detection: manual setup Model Generator, you must label your training images before running the training.

For example, the labels dog and cat have been added to the following image:

Labelled image example

In this section, you will register the labels that you will use on your training images. For the example image above, you would register the labels dog and cat.

To register labels, enter them into the field and click . You can enter a comma-separated list (example: dog, cat) to add multiple labels at the same time.

You must adhere to the following naming rules for labels:

  • Use only letters, numbers, or underscores (_)
  • Must be within 64 characters

To delete a label, click the trash icon () to its right.

You cannot add or delete labels after a Model Generator has been created.

Training details screen

The following can be performed on this screen:

The Model Generator details screen

Managing image labels

For the object detection: manual setup model, you need to apply labels to your images before starting the training.

Applying labels to an image refers to creating rectangular boundaries with names (labels) around the objects in the image. You can apply labels for multiple objects within a single image.

Example of applied labels

You should prepare the training images that you will label into a single, flat GCS folder (cannot contain subfolders).

error_outline You should use at least 100 labelled training images. Using fewer training images may cause the training to fail.

How to label images
  1. Click Add Label.
  2. Choose the GCS folder that contains your images and click Select.

    Selecting the GCS image folder
  3. Next, the label application screen will appear.

    The label application screen
  4. On the label application screen, select an image and label, then drag your mouse to create a boundary around the corresponding object in the image.

    Labelling an object
    • Delete a boundary by clicking the × in its upper-right.
    • Resize a boundary by dragging the square marker in its bottom-right.
    • Move a boundary by dragging from within it.

    info_outline You can also add more labels to the label list by clicking the button. Enter the new label into the field that appears and click Add.

    Editing label boundaries
  5. Click Save to save the applied labels.

    Files you have labelled will have a checkmark (check_circle) in the file list.

  6. Repeat this process for all of the training images in the list.

  7. Once finished, click Close.

Start a training

Click Start Training to create a new training.

You can configure the required settings for a new training and start it from this screen.

Training list

A list will be displayed here when you have created at least one training. The following information is contained in the training list:

Information Explanation
Training name Configured in the Start Training settings. You can view the details screen for a training by clicking on its name.
Started Shows the date and time when the training began.
Finished Shows the date and time when the training finished.
Status Shows the training’s status. Possible statuses are: Preparing, Running, Succeeded, Failed, and Stopped.
RMSE/Accuracy Shows the results of evaluating the training. For the classification model, this refers to its accuracy in selecting the correct class. For the regression model, it refers to its RMSE (Root Mean Square Error), which is a measure of the difference between the actual results and the predicted results.
Actions

Stop a training that is running or apply a completed training to use for making predictions.

When you apply a training, you can use it for making predictions in a Flow Designer with the Model Generator prediction (online) BLOCK or the Model Generator prediction (batch) BLOCK.

You can select to apply your training to Production or Testing. With this option, you can apply the Testing version to verify your model, then apply the Production version for your actual implementation.

You can only apply one Production or Testing version at a time. You can also switch between which version is applied (Production to Testing and vice versa).

Click the icon on the right-side of a training to open an options menu.

Opening the training options menu

You can do the following from this menu:

  • Start duplicate training:

    Start a new training using the same settings as the selected training. You can edit these settings before starting the duplicate training.

  • Extend the training time:

    Takes the best results from the selected training and restarts training from there.

    This copies the selected training and creates a new training that starts by using the first training’s best results. You can configure all settings for the new training except for the Max. number of trials which is set to 1.

    info_outline This feature can only be used for trainings that ended due to reaching their maximum training time alloted or from the early stopping setting. You cannot click this option for trainings that do not have this option available.

  • Delete training:

    Deletes the selected training.

Show training details

You can view the details screen for a training by clicking on its name.

Item Explanation
Training results

View the training start time, end time, status, and accuracy/RMSE.

The image classification type may also show Incorrectly classified images.

When training, the image classification model automatically splits the data in the image folders with an approximate ratio of 8:2 for training and validation data.

It uses the training data to learn, then tests itself using the validation data. If its classifications are incorrect during this validation phase, the mistakes will be shown under Incorrectly classified images.

If Incorrectly classified images is displayed, please check that there are no problems in the images themselves. If there are, make sure the images are in the correct folder or delete the images with errors and try running a new training. If the incorrectly classified images do not contain any problems, you can leave them as they are.

Settings

Shows the settings you configured in when starting your training as well as the training’s runtime version. Please be aware that runtime versions have an expiration date. Refer to usage limits for more details.

Error logs Shows the error logs when a training fails. Refer to these if you want to examine why the training failed.

Confirm the Model Generator’s settings

You can view various information about a Model Generator under the Settings section of its details screen.

Item Explanation
Service name The name you configured on the Create a new Model Generator screen.
Type The model type you selected on the Create a new Model Generator screen.
GCP service account The GCP service account used by the Model Generator.
Model name The model name for the training results.
Training data settings The information configured on the Training Data Settings screen.

Delete a Model Generator

You can delete a Model Generator by clicking Delete at the bottom of its details screen.

Start training screen help

You can configure settings necessary to create and start a new training on this screen.

Classification, regression, and feature vector extractor (alpha) models

Start training screen

Each setting is explained below:

Setting Explanation
Training name Assign a name to the new training
Upload training data

This shows the GCS location where your training data will be saved as gs://BUCKETNAME.

Click the link to open the Google Cloud Console in another tab where you can access this GCS location. You will need to sign in using a Google account registered into the GCP access section of the project settings screen.

This is displayed for Full Service Plan users only.

Training data upload method

Select which method you will use for uploading the training data.

  • Automatically split one file into the training and validation sets:

    Select this method if you have prepared your training data as one file, or if you prepared it using the DataEditor in BLOCKS. BLOCKS will automatically separate it into the training set and validation set using an approximately 8:2 ratio.

  • Prepare separate training set and validation set files:

    Select this method if you have prepared separate CSV files for the training set and the validation set.

Training data location

Designate whether your data is stored as CSV files in GCS or as a table in the DataEditor.

Training set URL/Training data URL

Designate the GCS URL for the training set or training data (gs://bucketname/filename.csv).

These URLs should only contain ASCII letters, numbers, underscores (_), hyphens (-), or slashes (/).

Clicking opens a GCS file menu. Select your file and its URL will be entered in the correct format.

Validation set URL

Designate the GCS URL for the validation set (Example: gs://bucketname/filename.csv).

These URLs should only contain ASCII letters, numbers, underscores (_), hyphens (-), or slashes (/).

Clicking opens a GCS file menu. Select your file and its URL will be entered in the correct format.

Training table

Select the DataEditor table that contains your training data.

If your training table contains a column named key, BLOCKS will omit this column when running the training.

Number of CSV header rows to skip

Use this option to skip header rows in the CSV files used in your training data. If there are header rows, enter the number of header rows to skip. Enter 0 if there are no header rows in your training data files.

Max. time until timeout

In order to get the most accurate possible results, the Model Generator can run multiple training trials.

This property configures the maximum amount of time the Model Generator will spend on each trial.

Set this to 0 if you do not want to set a limit.

If the training’s results (accuracy/RMSE) start to deteriorate, trials will be stopped prior to reaching the “Max. time until timeout”.

Max. number of trials

Set the number of trials that the Model Generator will run as a number 1 or greater.

The approximate training time can be calculated as (Max. time until timeout) × (Max. number of trials). The actual time may be a bit longer due to additional/indirect processing times.

Enable early stopping

Select whether to Enable or Disable detection for early stopping.

With this option enabled, BLOCKS will stop the training before the end of the specified training time if it determines that accuracy is unlikely to improve. While this can reduce unnecessary training time, please be aware that accurately determining early stopping cannot be guaranteed.

Machine type

Select the type of machine to use for the training.

  • Single node

    The single node type uses a single worker instance with the base processing speed.

  • Single node (GPU)

    Runs the training using a GPU (Graphic Processing Unit) for generally faster results than the standard single node type. However, GCP fees will cost approximately three times as much.

    Depending on the training data, the speed may not be significantly faster, or may be slower in some cases, compared to the standard single node type.

  • Distributed nodes

    Uses multiple machines to run the training.

Explanation

Write an explanation for this training. (optional)

If you are using data from the DataEditor, the name, dataset ID, and table ID will be automatically entered.

Image classification model

Image classification Model Generator: Start training screen

Each setting is explained below:

Setting Explanation
Training name Designate a name for the training.
Upload training data

A link to a Google Cloud Storage (GCS) location to use for uploading training data is displayed here. The URL's format is gs://BUCKETNAME.

Clicking the link will open the Google Cloud Console in another tab. From there you can access this GCS location. You will need to log in using the Google account registered in the GCP access section of the BLOCKS project settings screen.

This is displayed for Full Service Plan users only.

Image folder

Designate the URL of the GCS folder that contains your training images (gs://BUCKETNAME/FOLDER/). The URL must have a / as its last character.

Clicking opens a GCS file menu. Select your folder and its URL will be entered with the format explained above.

The image folder should follow these specifications:

Image folder specifications
  • Prepare a folder in GCS that will contain your training images. We will refer to this as the image folder.
  • Directly in the image folder, prepare folders for each category. These folder names are used as the names for each class.
  • Store your image files into these folders based on their class.
  • Supported image file types depends on the training type:
    • Prioritize speed: JPEG only
    • Prioritize accuracy: JPEG, PNG, GIF, or BMP
  • Images with extreme aspect ratios may not be classified properly.
  • Each class folder should only contain images. Do not use subfolders.

For example, if classifying images of dogs and cats, put pictures of dogs into one folder and pictures of cats into another.

Training type

Select the type of training.

  • Prioritize speed

    This type takes less time to train the model, but may not get the most accurate results.

  • Prioritize accuracy

    This type takes a longer time to train the model, but usually gets more accurate results.

Max. time until timeout (minutes)

Configure the maximum amount of time that the training will take.

Max. number of trials

Set the number of trials that the Model Generator will run as a number 1 or greater.

The approximate training time can be calculated as (Max. time until timeout) × (Max. number of trials). The actual time may be a bit longer due to additional/indirect processing times.

Enable early stopping

Select to either Enable or Disable the early stopping option.

Enable this option to have the training stop before the end of the specified training time if it determines that accuracy is unlikely to improve. While this can reduce unnecessary training time, please be aware that accurately determining early stopping cannot be guaranteed.

This option is only available if you have selected Prioritize accuracy for your Training type.

Machine type

Select the type of machine that will run the training.

  • Single node (GPU)

    Runs the training using a single machine with a GPU.

  • Distributed nodes (GPU)

    Runs the training using multiple machines with GPUs. This type usually has faster processing speed that the Single node (GPU) type, but GCP usage fees will also increase.

    The Distributed nodes (GPU) machine type is only available if you have selected Prioritize accuracy for your Training type.

Explanation (optional) Enter an explanation for the training.

Object detection: manual setup

object detection: start training screen

The settings are as follows:

Setting Explanation
Training name Designate a name for the training.
Image folder

Designate the GCS folder that containg the labelled training images.

error_outline You should use at least 100 labelled training images. Using fewer training images may cause the training to fail.

Explanation (optional) Enter an explanation for the training.

GCP service charges

The Model Generator creates an environment in the user’s GCP project that uses various GCP services.

As such, GCP service charges will apply separately from MAGELLAN BLOCKS fees. Applicable charges will vary depending on the service. For details, refer to the pricing page for each of the services used by the Model Generator.

この情報は役に立ちましたか?