Training: tree ensemble

When you allocate a tree ensemble classifier context, it is essentially empty and ready for training.

By default, the tree ensemble uses 10 trees and has no maximum depth (there is no limit to the number of levels a tree can have). To modify these training settings, and others that affect the internal architecture of the tree ensemble, call MclassControl() and specify the tree ensemble training context.
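For example, the following sketch allocates a tree ensemble training context and adjusts these settings. The M_TRAIN_TREE_ENSEMBLE context type and the M_NUMBER_OF_TREES and M_MAXIMUM_TREE_DEPTH control type names are assumptions used for illustration; verify them against the MclassControl() reference. MilSystem is assumed to be an already-allocated MIL system.

   MIL_ID TrainCtx = M_NULL;

   /* Allocate a tree ensemble training context (context type name is an assumption). */
   MclassAlloc(MilSystem, M_TRAIN_TREE_ENSEMBLE, M_DEFAULT, &TrainCtx);

   /* Grow 50 trees instead of the default 10 (control type name is an assumption). */
   MclassControl(TrainCtx, M_DEFAULT, M_NUMBER_OF_TREES, 50);

   /* Limit each tree to 8 levels instead of leaving the depth unlimited. */
   MclassControl(TrainCtx, M_DEFAULT, M_MAXIMUM_TREE_DEPTH, 8);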

By default, MIL uses a bootstrap aggregating (bagging) process to train the tree ensemble. You can call MclassControl() and specify the tree ensemble training context to adjust that process. For example, you can decide whether randomly selected entries are available for reselection (whether to bootstrap with or without replacement) or whether to use out-of-bag dataset entries to estimate the generalization accuracy.

Note that bagging information is typically unreliable if your training dataset has augmented entries.
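For example, the following sketch adjusts the bagging process. M_COMPUTE_OUT_OF_BAG_RESULTS appears in this section, but M_ENABLE and M_DISABLE as its values, and the M_BOOTSTRAP_REPLACEMENT control type, are assumptions used for illustration:

   /* Use out-of-bag entries to estimate the generalization accuracy. */
   MclassControl(TrainCtx, M_DEFAULT, M_COMPUTE_OUT_OF_BAG_RESULTS, M_ENABLE);

   /* Bootstrap without replacement (hypothetical control type name). */
   MclassControl(TrainCtx, M_DEFAULT, M_BOOTSTRAP_REPLACEMENT, M_DISABLE);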

Also note that you can either train a tree ensemble classifier from the ground up (using an empty tree ensemble classifier context), or continue training a previously trained tree ensemble classifier (this is referred to as a warm start). To continue training a previously trained tree ensemble classifier, you must copy the classification result buffer that MclassTrain() produced into a classifier context, using MclassCopyResult(), and then pass that trained tree ensemble context back to MclassTrain(), as shown in the sketch below.
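The following sketch of a warm start assumes MilSystem, TrainCtx, TrainDataset, and the training results (FirstTrainRes, SecondTrainRes) were allocated beforehand; the M_CLASSIFIER_TREE_ENSEMBLE context type and M_TRAINED_CLASSIFIER_CONTEXT copy type names are assumptions used for illustration.

   /* Copy the trained classifier out of the first training result. */
   MIL_ID TrainedCtx = M_NULL;
   MclassAlloc(MilSystem, M_CLASSIFIER_TREE_ENSEMBLE, M_DEFAULT, &TrainedCtx);
   MclassCopyResult(FirstTrainRes, M_DEFAULT, TrainedCtx, M_DEFAULT,
                    M_TRAINED_CLASSIFIER_CONTEXT, M_DEFAULT);

   /* Warm start: pass the trained context back to MclassTrain() to continue training. */
   MclassTrain(TrainCtx, TrainedCtx, TrainDataset, M_NULL, SecondTrainRes, M_DEFAULT);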

Feature importance

Every entry in a features dataset refers to a set of features. For example, every entry can refer to a set of blob features, such as area, perimeter, and Feret diameter. You specify the values of these features, for each entry, using MclassControlEntry() with M_RAW_DATA.
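For example, the following sketch writes one entry's feature values. The exact parameter order of MclassControlEntry() shown here is an assumption (consult its reference), and Area, Perimeter, and FeretDiameter are hypothetical variables holding previously computed blob results:

   /* Three feature values for one dataset entry. */
   MIL_DOUBLE Features[3];
   Features[0] = Area;           /* blob area           */
   Features[1] = Perimeter;      /* blob perimeter      */
   Features[2] = FeretDiameter;  /* blob Feret diameter */

   /* Write the raw feature values to the entry (parameter order is an assumption). */
   MclassControlEntry(FeaturesDataset, EntryIndex, M_DEFAULT_KEY, M_DEFAULT,
                      M_RAW_DATA, M_DEFAULT, Features, M_DEFAULT);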

When you specify a feature importance mode, MIL establishes, during training, whether certain features are more important than others in determining the class to which the input data belongs. For example, the perimeter feature can end up being very important to establishing a successful classification, while the area feature can end up being almost irrelevant.

To specify (or disable) the feature importance mode, call MclassControl() with M_FEATURE_IMPORTANCE_MODE. To retrieve the resulting importance of features after training, call MclassGetResult() with M_FEATURE_IMPORTANCE. Based on this result information, you can modify your set of features and retrain, which can help improve the tree ensemble classifier's performance and accuracy.

M_FEATURE_IMPORTANCE_MODE does not directly affect training. It allows you to retrieve which features are more important; you can then use this information to modify your dataset (for example, you can specify only the important features) and retrain.
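For example, the following sketch enables a feature importance mode, trains, and retrieves the per-feature importance. The control and result type names come from this section; the surrounding identifiers (MilSystem, TrainDataset, DevDataset, TrainRes) are assumed to be allocated beforehand, and NUM_FEATURES must match the number of features per entry in your dataset.

   #define NUM_FEATURES 3  /* must match the number of features per entry */

   /* Choose a feature importance mode before training. */
   MclassControl(TrainCtx, M_DEFAULT, M_FEATURE_IMPORTANCE_MODE, M_MEAN_DECREASE_IMPURITY);

   /* Train the tree ensemble from scratch (no previously trained classifier). */
   MclassTrain(TrainCtx, M_NULL, TrainDataset, DevDataset, TrainRes, M_DEFAULT);

   /* Retrieve one importance value per feature from the training result. */
   MIL_DOUBLE Importance[NUM_FEATURES];
   MclassGetResult(TrainRes, M_DEFAULT, M_FEATURE_IMPORTANCE, Importance);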

Modes

By default, MIL uses a decreasing impurity process (M_MEAN_DECREASE_IMPURITY) to establish the feature importance. In this case, the more a feature affects a proper node splitting, the more important it is. Proper splitting means that the two output sets resulting from splitting the node are (on average) significantly purer (closer to agreeing on the final class) than the node's input set.

Alternatively, you can use a drop column or permutation process to establish the feature importance. With a drop column process, the more the elimination of a feature affects accuracy, the more importance that feature is given. With a permutation process, the more the shuffling of a feature's values affects accuracy, the more importance that feature is given.

To specify a drop column or permutation importance, you must train with a development dataset or compute out-of-bag results (that is, enable M_COMPUTE_OUT_OF_BAG_RESULTS). To specify the set with which to calculate a drop column or permutation importance, use M_FEATURE_IMPORTANCE_SET.
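For example, the following sketch selects permutation importance computed from out-of-bag entries. M_OUT_OF_BAG as the value of M_FEATURE_IMPORTANCE_SET is a hypothetical value used for illustration; check the MclassControl() reference for the actual values.

   /* Out-of-bag results are required when no development dataset is used. */
   MclassControl(TrainCtx, M_DEFAULT, M_COMPUTE_OUT_OF_BAG_RESULTS, M_ENABLE);

   /* Establish importance by shuffling each feature's values (permutation). */
   MclassControl(TrainCtx, M_DEFAULT, M_FEATURE_IMPORTANCE_MODE, M_PERMUTATION);

   /* Compute the importance on the out-of-bag set (hypothetical value). */
   MclassControl(TrainCtx, M_DEFAULT, M_FEATURE_IMPORTANCE_SET, M_OUT_OF_BAG);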

In general, ranked from fastest to slowest, the feature importance modes are M_MEAN_DECREASE_IMPURITY, M_PERMUTATION, and M_DROP_COLUMN; ranked from most to least accurate, the order is reversed: M_DROP_COLUMN, M_PERMUTATION, and M_MEAN_DECREASE_IMPURITY. That is, the faster a mode establishes the feature importance, the less accurately it does so.

Note that if you disable the feature importance mode, you cannot retrieve any information about the feature importance.