
Datasets



Datasets are the lifeblood of classification. A well-trained classifier is only possible if the datasets used to train it contain enough data to proportionally represent all the variations in the classification problem you want to solve. Investing time into building your datasets, which includes thorough data collection, makes your classifier more robust and, in general, saves you time in training.

Note, having a proper set of labeled data is a prerequisite for using the MIL Classification module (and, in general, for using supervised classification technologies).

Steps to build the datasets

The following steps provide a basic methodology for building the datasets:

  1. Allocate a dataset context to hold all of your data (source dataset), using MclassAlloc() with M_DATASET_IMAGES or M_DATASET_FEATURES.

  2. Allocate a dataset context for training the classifier (training dataset), using MclassAlloc() with M_DATASET_IMAGES or M_DATASET_FEATURES.

  3. Allocate a dataset context for evaluating the training's performance (development dataset), using MclassAlloc() with M_DATASET_IMAGES or M_DATASET_FEATURES. The development dataset is optional when using a tree ensemble classifier.

  4. Optionally, allocate a dataset context to perform a final validation of the trained classifier (testing dataset), using MclassAlloc() with M_DATASET_IMAGES or M_DATASET_FEATURES.

  5. Populate the source dataset context.

    1. Add class definitions to the source dataset, using MclassControl() with M_CLASS_ADD. Note, the number of classes with which to categorize your data is a key decision to make when using the MIL Classification module.

      Optionally, you can specify settings to help manage class definitions, using MclassControl(). For example, you can assign a color (M_CLASS_DRAW_COLOR) and an icon image (M_CLASS_ICON_ID) to class definitions. This allows you to draw and visually identify them with MclassDraw().

    2. Add entries to the source dataset, using MclassControl() with M_ENTRY_ADD.

      To specify the location from which to get the entry's data (images or features), use MclassControlEntry() with M_FILE_PATH (for images) or M_RAW_DATA (for features).

    3. For each entry, specify the class definition that is represented, using MclassControlEntry() with M_CLASS_INDEX_GROUND_TRUTH. This step is also known as labeling your data.

  6. Split the source dataset context to establish the training dataset context and the development dataset context, using MclassSplitDataset(). You can also use this function to establish the testing dataset context.

    If you are training a tree ensemble classifier without a development dataset context, you need not split your data; the source dataset context is your training dataset context.

    If you want to augment your data, you should first split your data into the required datasets, and then add the augmented entries to the training dataset only. For more information, see the Augmentation subsection of the Advanced techniques section later in this chapter.

Note, you can use MclassImport() to import previously defined and exported datasets (from a CSV file), or use MIL CoPilot to create, label, modify, and export datasets interactively.

Proper data collection, classes, and entries

The data (images or features) in your datasets must properly represent what you are trying to classify. The quality and quantity of your data directly impact the classifier's training. To help ensure proper data, collect it using the final imaging setup (the same camera, lens, and illumination) and use real samples (images) of the subject matter, or images as close as possible to them.

The images that you train with (the images associated to dataset entries) must all have the same size. The sizes that you can use depend on the CNN classifier that you specified. For more information, see the Input image sizes subsection of the Training: CNN section later in this chapter. If you are using a tree ensemble classifier, the number of features associated to every dataset entry must be the same.

Your dataset should include, in sufficient number, all expected variations of what you want to classify, such as variations in aspect, color, intensity, rotation, and dimension. Pay special attention to variations that are difficult to obtain and, if necessary, use augmentation to synthesize such variations (for example, rotation and flip). The following example illustrates variations (images) of a single apple object (class).

To classify multiple objects, each of which is a separate class (for example, apples, oranges, and pears), you must provide, for each class, numerous training images of every variation. MIL internally divides that data, which usually occupies a lot of memory, into several sets (for example, by using mini-batches or bagging). Gathering as much data as possible helps ensure it is evenly consumed by training.

It is recommended to have a minimum of 500 dataset entries (images or sets of features) per class, although simple applications can require fewer, and complex applications can require more. The number of entries required to properly train a classifier context depends on the complexity of the problem, the number of classes, and the number of variations within the classes. Typically, each class should have the same number of entries representing it and, within each class, you should have the same number of entries representing each variation. A balanced dataset is key to achieving a properly trained classifier.

These dataset recommendations are for a complete (default) training, which means you are training a classifier from the ground up; other types of training, such as transfer learning, require less data. For more information, see the Training modes subsection of the Training: CNN section later in this chapter.

Note, if using a coarse segmentation approach (typically, for defect detection), your dataset must specify training images that are smaller than the image on which to predict. These small images represent just the features (for example, just the defects) that you want to identify (classify) on a larger version of the image (supplied at prediction). Given this design, the results that you obtain for each target image at prediction are for each tile-sized region in that image. The size of these tile-sized regions comes from the size of your dataset images. For more information, see the Coarse segmentation subsection of the Advanced techniques section later in this chapter.

Data foresight

The data that you collect and the classes with which it is organized must represent, as much as possible, a well-posed and consistent problem that you want to train the classifier to solve. Be aware that there are usually several ways to formulate a problem and present the data to solve that problem; it is imperative that this is done in a consistent and perceptive manner. You must know the problem that you want to solve and provide the data to solve it, while at the same time being mindful of ambiguities or unwanted associations that could lead to uncertain or even mistaken classifications.

Illumination variance in your data is one of the ways a classifier can make an inadvertent association. If illumination is irrelevant to determining the class, the data with which you train should contain different illuminations to remove the bias that a specific illumination condition can introduce.

For example, if your dataset has images of good parts taken near a window on a cloudy day and images of defective parts taken near that window on a sunny day, it is likely for the classifier to learn that a dark image means a good part and a bright image means a defective part. It is vital to anticipate the important variations and provide that data; this allows the classifier to properly learn how to solve the problem without unwanted bias.

Classes and labeling (the ground truth)

As previously discussed, MIL uses supervised training and you must label your dataset entries by indicating the class they represent. This label is called the ground truth (M_CLASS_INDEX_GROUND_TRUTH). Class labels must be unique. Datasets should have 2 or more class labels, depending on the type of problem you are trying to solve; for example:

  • For a two-class problem, you could specify a Good or Bad class label for every dataset entry (every image or set of features in a dataset is identified as one of these two classes). Another set of labels to use for such problems is Defective or NotDefective.

  • For an n-class problem, you could specify one of n-labels (for example, ClassMetal, ClassCarpet, or ClassWood) for every dataset entry (every image or set of features in a dataset is identified as one of these n-classes). Another set of labels to use for such problems is ClassMetalGood, ClassMetalWithHole, or ClassMetalWithScratch.

To label images (indicate the ground truth class that they represent, such as Good or Bad), you can use MIL CoPilot. Alternatively, you can build a simple utility that calls MIL.

If you have a large amount of data to label, it might be possible to label some of it, train your classifier context, call the prediction operation to label the unlabeled dataset entries, and then use those newly labeled entries in your training dataset to continue the training process. For more information, see the Assisted labeling subsection of the Predicting section later in this chapter.

Organized data

To simplify the labeling process, it is highly recommended to acquire your training data in an organized way, such as having one destination folder and, within it, a folder for each class; in this case, the folder name is the label (the ground truth class). To help you do this, you can call MclassControl() with M_MOVE_ENTRIES and a destination folder.

For example, if you call M_MOVE_ENTRIES, and your dataset has the classes Good and Bad, they are added as subfolders to the specified destination folder: \Dest\Good and \Dest\Bad. The image file referenced in every entry is copied, from the path specified in the entry, to the corresponding subfolder. All entries labeled Good are copied to \Dest\Good and all entries labeled Bad are copied to \Dest\Bad. The referenced path in every dataset entry is modified to reflect the new path; from the dataset's perspective, the data is moved and organized. Note, the original source files are not deleted and remain unaffected; the data now exists both in the new location (which the entries now reference) and in the original location (which the entries no longer reference).

If entries do not have a ground truth, they are placed in a subfolder called 'UnlabeledImages'.

Splitting the source dataset

Your source dataset should hold all the data with which to train your classifier. Given enough time, the training accuracy of a sufficiently large CNN classifier trained on your source dataset eventually converges to 100% (0% error). However, this does not mean that the training is successful.

Several sources of error can limit and improperly bias the performance of a CNN classifier trained using a single dataset, to the point where, if you use the trained classifier with similar but different data (images), it will almost surely fail. You must always use 2 datasets, the training dataset and the development dataset, to train a CNN classifier so that it not only learns to classify a specific set of images (the training dataset), but also learns the general principles of classifying those images, allowing it to successfully classify images with which it did not train (the development dataset).

As previously discussed, you do not need 2 datasets to train a tree ensemble classifier. MIL uses bootstrap aggregating to train such classifiers; inherent to this process is a bagging technique, which randomly selects the dataset entries with which to train (in-the-bag) and the dataset entries with which to regulate the training (out-of-bag). For more information, see the Training: tree ensemble section later in this chapter.

Note, you can also have a testing dataset and use it with MclassPredict() to serve as a quarantined final check for any trained classifier. For more information, see the Predicting section later in this chapter.

To establish the training dataset context and the development dataset context from the source dataset context, call MclassSplitDataset(). You can also use this function to establish the testing dataset context. To create these contexts, it is recommended to:

  1. Call MclassSplitDataset() to split the Source dataset context into a SourceTemp dataset context and the Testing dataset context.

  2. Call MclassSplitDataset() again to split the SourceTemp dataset context into the Training dataset context and the Development dataset context.

Keep in mind that, regardless of how you create your dataset contexts, the data in each of them must come from the final imaging setup (for example, you should capture all training images in all datasets using the same resolution and illumination conditions, which should in turn match the conditions used at prediction time).

Only the training dataset can contain augmented data; the development dataset and the testing dataset must not contain any augmentations. For more information, see the Augmentation subsection of the Advanced techniques section later in this chapter.

Training dataset

The training dataset context holds the entries that update the internal parameters (weights) of the classifier. This is done differently, depending on the classifier (CNN or tree ensemble).

In general, it is recommended that the training dataset holds about 70% of your source data. If your source dataset is significantly large, you can increase this percentage to 80% or 90%. As previously discussed, when using tree ensemble classifiers, your training dataset typically holds 100% of your data (unless you want to use some for the testing dataset).

Training dataset for a CNN

When training a CNN, MIL divides the training dataset into mini-batches on which the training process iterates, adjusting the values of the classifier's internal parameters (weights). The size of the mini-batches is typically limited by the amount of memory available on the system used to perform the calculations (for example, the limits of your GPU memory). To adjust the size of the mini-batches, call MclassControl() with M_MINI_BATCH_SIZE.

The training process performs multiple epochs or cycles over the complete training dataset, until reaching the maximum number of epochs. MIL determines the number of iterations per epoch according to the size of the training dataset divided by the size of the mini-batches. For more information, see the Training: CNN section later in this chapter.

Training dataset for a tree ensemble

When training a tree ensemble, MIL randomly chooses the dataset entries with which to train the classifier and the entries with which to regulate the training. Entries used to train are considered in-the-bag and entries used to regulate are considered out-of-bag. This is done on a tree by tree basis. For more information, see the Training: tree ensemble section later in this chapter.

Development dataset

MIL uses the development dataset to evaluate the performance of the classifier on data that is not involved in establishing the classifier's internal parameters (weights). This is done at the end of each epoch. Once training is complete, an analysis of the training metrics might indicate several actions to take before training once more to obtain better results. In some cases, you can analyze training metrics at the end of an epoch to decide whether to continue or abort training. For more information, see the Training: analyze and adjust section later in this chapter.

In general, it is recommended that the development dataset holds about 10% to 30% of your source data. The development dataset and the testing dataset, if you have one, typically have the same percentage. For example, if your training dataset has 70% of your data, then the development dataset and testing dataset would each have 15%.

Testing dataset

The testing dataset is completely unseen by the training process. You typically use it to help ensure that your classifier is free from any bias it might have inadvertently learned, and as a way to test the final performance of the trained classifier.

To make use of the testing dataset, you must pass it to MclassPredict() and compare the predicted class to the ground truth class. For a properly trained classifier, the accuracy of the prediction results should closely match the accuracy of the training results.

In general, the testing dataset is about the same size as the development dataset.

An example of distributing data among datasets

The following is an example of how data is typically distributed among datasets. Note, 30% of the data in classes A, B, and C is split among the development and testing datasets (this amounts to 15% for each class in each of these datasets).

All images in a dataset should come from original sources; do not include images (ROIs) from the same source in both the training dataset and the development dataset, otherwise an undetected overfit issue can occur.

Importing data from a CSV file to a dataset context

You can use a CSV (comma-separated values) file to define data for a dataset, such as authors, class definitions, and entries. You can then add that data to a dataset context, using MclassImport() with M_ENTRIES, M_AUTHORS, and M_CLASS_DEFINITIONS. Using a CSV is an alternative to using MclassControl() and MclassControlEntry() to populate a dataset.

When importing data from a CSV, note the following:

  • Import authors and class definitions first, and then import entries, as entries require authors and class definitions.

  • If you only import entries, MIL automatically creates class definitions and authors according to the information in the entries. If you import class definitions or authors afterward, MIL appends them to the already existing ones.

  • If MIL encounters non-existing class definitions or authors when importing entries, they are automatically added to the dataset using default values.

Note, to export a dataset (for example, to a CSV file), call MclassExport().

General CSV file format

To import data from a CSV file, it must adhere to formatting requirements and contain the necessary headers and corresponding data (this can depend on what you are importing). The following is an example of valid content in a CSV file from which you can import entries (M_ENTRIES):

Key,FilePath,AuthorName,AugmentationSource,RegionType,ClassIdxGroundTruth,UserString
0,E:\Images\Class1\0001.mim,Matrox,NOT_AUGMENTED,WholeImage,0,Any useful
1,E:\Images\Class1\0002.mim,Matrox,NOT_AUGMENTED,WholeImage,0,meta information
2,E:\Images\Class1\0003.mim,Matrox,NOT_AUGMENTED,WholeImage,0,can be written here
3,E:\Images\Class1\0004.mim,Matrox,NOT_AUGMENTED,WholeImage,0,in the UserString field.
4,E:\Images\Class2\0001.mim,Matrox,NOT_AUGMENTED,WholeImage,1,Lines that end with a comma
5,E:\Images\Class2\0002.mim,Matrox,NOT_AUGMENTED,WholeImage,1,indicate that there is no
6,E:\Images\Class2\0003.mim,Matrox,NOT_AUGMENTED,WholeImage,1,meta information for that entry.
7,E:\Images\Class2\0004.mim,Matrox,NOT_AUGMENTED,WholeImage,1,
8,E:\Images\Class2\Augmented\0001a.mim,Matrox,4,WholeImage,1,augmented version of the image from the entry index 4.

Regardless of the data you import, CSV files must contain:

  • Header fields (cells) on the first line that are separated by a comma. You do not need to list the headers in any particular order.

  • One or more lines below the header containing the information fields corresponding to the headers. You must separate the information fields with a comma and order them according to the order of the header fields. You can leave information fields empty but cannot omit the comma between fields. Each line below the header is one entry in a dataset.

    Note, if you have N header fields, you should have N fields on every line below the header, and every line should have N-1 commas.

  • A header called 'Key'. You can leave the corresponding key information in the lines below this header empty, or you can specify a unique number. In either case, MIL generates a UUID for each key field.

    Note, key information depends on the data you are importing. When importing entries, the key corresponds to the entry key; when importing authors or class definitions, it corresponds to the author key and the class key, respectively.

Typically, CSV headers are PascalCase terms that correspond to an entry setting in MclassControlEntry(). For example, the header 'AugmentationSource' corresponds to the M_AUGMENTATION_SOURCE setting.

Headers for authors

To import authors (M_AUTHORS), use the headers below (and provide the corresponding lines of information after the header) in the CSV file:

Header name | Required or optional | Related MIL setting | Images or features dataset
Key | Required | M_AUTHOR_KEY (MclassInquire()) | Both
AuthorName | Required | M_AUTHOR_NAME (MclassControlEntry()) | Both

The following is an example of the content in a CSV file that you can use to import authors:

Key,AuthorName
0,SM
1,FX
2,VC
,IR
4,

Headers for class definitions

To import class definitions (M_CLASS_DEFINITIONS), use the headers below (and provide the corresponding lines of information after the header) in the CSV file.

Header name | Required or optional | Related MIL setting | Images or features dataset
Key | Required | M_CLASS_KEY (MclassInquire()) | Both
Name | Required | M_CLASS_NAME (MclassControl()) | Both
Color_R | Optional | The Red parameter of the M_RGB888() macro (MclassControl() with M_CLASS_DRAW_COLOR) | Both
Color_G | Optional | The Green parameter of the M_RGB888() macro (MclassControl() with M_CLASS_DRAW_COLOR) | Both
Color_B | Optional | The Blue parameter of the M_RGB888() macro (MclassControl() with M_CLASS_DRAW_COLOR) | Both
Weight | Optional | M_CLASS_WEIGHT (MclassControl()) | Both

The following is an example of the content in a CSV file that you can use to import class definitions:

Key,Name,Color_R,Color_G,Color_B,Weight
0,Apples,,,,
1,Oranges,,,,

If you do not specify the optional headers, you need not specify the empty fields below; for example:

Key,Name
0,Apples
1,Oranges

Headers for entries

To import entries (M_ENTRIES), use the headers below (and provide the corresponding lines of information after the headers) in the CSV file.

Header name | Required or optional | Related MIL setting | Images or features dataset
Key | Required | M_ENTRY_KEY (MclassInquireEntry()) | Both
FilePath | Required | M_FILE_PATH (MclassControlEntry()) | Both
AuthorName | Optional | M_AUTHOR_NAME (MclassControlEntry()) | Both
AugmentationSource¹ | Optional | M_AUGMENTATION_SOURCE (MclassControlEntry()) | Both
RegionType² | Optional | M_REGION_TYPE (MclassInquireEntry()) | Images dataset only
Data_...³ | Optional | M_RAW_DATA (MclassControlEntry()) | Features dataset only
ClassIdxGroundTruth_...³ | Required | M_CLASS_INDEX_GROUND_TRUTH (MclassControlEntry()) | Both
ClassIdxPredicted_...³ | Optional | M_CLASS_INDEX_PREDICTED (MclassInquireEntry()) | Both
ClassScorePredicted_...³ | Optional | M_PREDICTED_CLASS_SCORES (MclassInquireEntry()) | Both
UserString | Optional | M_ENTRY_USER_STRING (MclassControlEntry()) | Both
EntryWeight | Optional | M_ENTRY_WEIGHT (MclassControlEntry()) | Both
UserConfidence | Optional | M_USER_CONFIDENCE (MclassControlEntry()) | Both

¹If an entry is not augmented, set the corresponding information field to 'NOT_AUGMENTED'.

²If an entry uses the whole image, set the corresponding information field to 'WholeImage'.

³This header represents an array of values. You must replace the ellipsis with an integer, starting at 0 for the first array value, and increasing by 1 for each subsequent array value, such as: 'Data_0,Data_1,Data_2'.

The following is an example of the content in a CSV file that you can use to import class entries that have 3 values for the header field named 'Data_...':

Key,FilePath,AuthorName,AugmentationSource,Data_0,Data_1,Data_2,ClassIdxGroundTruth
0,E:\Images\Class1\0000.mim,SC,NOT_AUGMENTED,99,66,87,0
1,E:\Images\Class1\0001.mim,SM,NOT_AUGMENTED,33,29,30,1
2,E:\Images\Class1\0002.mim,FX,NOT_AUGMENTED,4,19,77,2

When listing the header fields for an array of values, ensure that you separate them with commas and that you do not skip a number in the sequence of integer suffixes; for example, listing 'Data_0' and 'Data_2' without 'Data_1' causes a "missing field" error.

UUID

A UUID refers to a universally unique identifier that is used for identification purposes across unrelated systems and applications. The following is an example of an automatically generated key (UUID): 87fbb05a-b078-4389-8ad2-2a2f9707c8f4.

Every dataset entry has a UUID value that uniquely identifies it; this is also known as the entry's key. Since datasets can contain numerous entries from multiple datasets constructed on unrelated systems, this universally unique key helps ensure that the entries from all the datasets are not confused with each other.

MIL automatically generates a UUID when required, such as when you add an entry to a dataset (M_ENTRY_ADD). Although you do not create your own UUIDs, you can use them to access an entry, for example, to inquire about it or modify it.

The MIL custom data type for UUIDs is MIL_UUID. When using a MIL_UUID variable, MIL defines the M_DEFAULT_UUID and M_NULL_UUID constants, which allow you to specify the corresponding default UUID value or a null UUID value.

MIL_UUID utility macros for C users

To perform comparisons with MIL_UUID values in C, use the following macros:

  • M_COMPARE_MIL_UUID(VarA, VarB).

    This macro evaluates an equal (==) comparison operation between 2 variables.

  • M_IS_DEFAULT_UUID(VarA) or M_IS_DEFAULT_KEY(VarA).

    These macros evaluate whether a variable is equal to the default UUID. These macros are equivalent.

  • M_IS_NULL_UUID(VarB) or M_IS_NULL_KEY(VarB).

    These macros evaluate whether a variable is equal to a null UUID. These macros are equivalent.

The following are examples of how to use these macros in C:

MIL_UUID VarA = M_DEFAULT_UUID;
MIL_UUID VarB = M_NULL_UUID;
if(M_COMPARE_MIL_UUID(VarA, VarB))
{ /* Do something if VarA == VarB; */ }
if(M_IS_DEFAULT_UUID(VarA))
{ /* Do something if VarA == M_DEFAULT_UUID; */ }
if(M_IS_NULL_UUID(VarB))
{ /* Do something if VarB == M_NULL_UUID; */ }

Note, you can use MIL_UUID variables normally in a C++ program, since the MIL_UUID data type, in C++, implements the equal (==) and not equal (!=) comparison operators.