MLPACK  1.0.4
Density Estimation Tree (DET) tutorial

Introduction

DETs perform the unsupervised task of density estimation using decision trees.

The details of this work is presented in the following paper:

@inproceedings{ram2011density,
  title={Density estimation trees},
  author={Ram, P. and Gray, A.G.},
  booktitle={Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  pages={627--635},
  year={2011},
  organization={ACM}
}

mlpack provides:

Table of Contents

A list of all the sections this tutorial contains.

Command-Line 'dt_main'

The command line arguments of this program can be viewed using the '-h' option:

$ ./dt_main -h

Density estimation with DET

  This program provides an example use of the Density Estimation Tree for
  density estimation. For more details, please look at the paper titled 'Density
  Estimation Trees'.

Required options:

  --input/training_set (-S) [string]
                                The data set on which to perform density
                                estimation.

Options:

  --DET/max_leaf_size (-M) [int]
                                The maximum size of a leaf in the unpruned fully
                                grown DET.  Default value 10.
  --DET/min_leaf_size (-N) [int]
                                The minimum size of a leaf in the unpruned fully
                                grown DET.  Default value 5.
  --DET/use_volume_reg (-R)     This flag gives the used the option to use a
                                form of regularization similar to the usual
                                alpha-pruning in decision tree. But instead of
                                regularizing on the number of leaves, you
                                regularize on the sum of the inverse of the
                                volume of the leaves (meaning you penalize  low
                                volume leaves.
  --flag/print_tree (-P)        If you just wish to print the tree out on the
                                command line.
  --flag/print_vi (-I)          If you just wish to print the variable
                                importance of each feature out on the command
                                line.
  --help (-h)                   Default help info.
  --info [string]               Get help on a specific module or option.
                                Default value ''.
  --input/labels (-L) [string]  The labels for the given training data to
                                generate the class membership of each leaf (as
                                an extra statistic)  Default value ''.
  --input/test_set (-T) [string]
                                An extra set of test points on which to estimate
                                the density given the estimator.  Default value
                                ''.
  --output/leaf_class_table (-l) [string]
                                The file in which to output the leaf class
                                membership table.  Default value
                                'leaf_class_membership.txt'.
  --output/test_set_estimates (-t) [string]
                                The file in which to output the estimates on the
                                test set from the final optimally pruned tree.
                                Default value ''.
  --output/training_set_estimates (-s) [string]
                                The file in which to output the estimates on the
                                training set from the final optimally pruned
                                tree.  Default value ''.
  --output/tree (-p) [string]   The file in which to print the final optimally
                                pruned tree.  Default value ''.
  --output/unpruned_tree_estimates (-u) [string]
                                The file in which to output the estimates on the
                                training set from the large unpruned tree.
                                Default value ''.
  --output/vi (-i) [string]     The file to output the variable importance
                                values for each feature.  Default value ''.
  --param/folds (-F) [int]      The number of folds of cross-validation to
                                performed for the estimation (enter 0 for LOOCV)
                                 Default value 10.
  --param/number_of_classes (-C) [int]
                                The number of classes present in the 'labels'
                                set provided  Default value 0.
  --verbose (-v)                Display informational messages and the full list
                                of parameters and timers at the end of
                                execution.

For further information, including relevant papers, citations, and theory,
consult the documentation found at http://www.mlpack.org or included with your
distribution of MLPACK.

Plain-vanilla density estimation

We can just train a DET on the provided data set S. Like all datasets mlpack uses, the data should be row-major (mlpack transposes data when it is loaded; internally, the data is column-major -- see this page for more information).

$ ./dt_main -S dataset.csv -v

By default, dt_main performs 10-fold cross-validation (using the $\alpha$-pruning regularization for decision trees). To perform LOOCV (leave-one-out cross-validation), use the following command:

$ ./dt_main -S dataset.csv -F 0 -v

To perform k-fold crossvalidation, use -F k. There are certain other options available for training. For example, for the construction of the initial tree, you can specify the maximum and minimum leaf sizes. By default, they are 10 and 5 respectively, you can set them using the -M (--maximum_leaf_size) and the -N (--minimum_leaf_size) options.

$ ./dt_main -S dataset.csv -M 20 -N 10

In case you want to output the density estimates at the points in the training set, use the -s option to specify the output file.

$ ./dt_main -S dataset.csv -s density_estimates.txt -v

Alternate DET regularization

The usual regularized error $R_\alpha(t)$ of a node $t$ is given by: $R_\alpha(t) = R(t) + \alpha |\tilde{t}|$ where $R(t) = -\frac{|t|^2}{N^2 V(t)}$. $V(t)$ is the volume of the node $t$ and $\tilde{t}$ is the set of leaves in the subtree rooted at $t$.

For the purposes of density estimation, I have developed a different form of regularization -- instead of penalizing the number of leaves in the subtree, we penalize the sum of the inverse of the volumes of the leaves. Here really small volume nodes are discouraged unless the data actually warrants it. Thus, $R_\alpha'(t) = R(t) + \alpha I_v(\tilde{t})$ where $I_v(\tilde{t}) = \sum_{l \in \tilde{t}} \frac{1}{V(l)}.$ To use this form of regularization, use the -R flag.

$ ./dt_utils -S dataset.csv -R -v

Estimation on a test set

There is the option of training the DET on a certain set and obtaining the density from the learned estimator at some out of sample (new) test points. The -T option is the set of test points and the -t is the file in which the estimates are to be output.

$ ./dt_main -S dataset.csv -T test_points.csv -t test_density_estimates.txt -v

Printing a trained DET

A depth-first visualization of the DET can be obtained by using the -P flag.

$ ./dt_main -S dataset.csv -P -v

To print this tree in a file, use the -p option to specify the output file along with the -P flag.

$ ./dt_main -S dataset.csv -P -p tree.txt -v

Computing the variable importance

The variable importance (with respect to density estimation) of the different features in the data set can be obtained by using the -I option. This outputs the (absolute as opposed to relative) variable importance of the all the features.

$ ./dt_main -S dataset.csv -I -v

To print this in a file, use the -i option

$ ./dt_main -S dataset.csv -I -i variable_importance.txt -v

Leaf Membership

In case the dataset is labeled and you are curious to find the class membership of the leaves of the DET, there is an option of view the class membership. The training data has to still be input in an unlabeled format, but an additional label file containing the corresponding labels of each point has to be input using the -L option. You are required to specify the number of classes present in this set using the -C option.

$ ./dt_main -S dataset.csv -L labels.csv -C <number-of-classes> -v

The leaf membership matrix is output into a file called 'leaf_class_membership.txt' by default. An user-specified file can be used by utilizing the -l option.

$ ./dt_main -S dataset.csv -L labels.csv -C <number-of-classes> -l leaf_class_membership_file.txt -v

The 'DTree' class

This class implements the DET.

#include <mlpack/methods/det/dtree.hpp>

using namespace mlpack::det;

// The dataset matrix, on which to learn the DET
extern arma::Mat<float> data;

// Initializing the class
// This function creates and saves the bounding box of the data.
DTree<>* det = new DTree<>(&data);

Public Functions

Growing the tree to the full size:

// This keeps track of the data during the shuffle
// that occurs while growing the tree.
arma::Col<size_t>* old_from_new = new arma::Col<size_t>(data.n_cols);
for (size_t i = 0; i < data.n_cols; i++) {
  (*old_from_new)[i] = i;
}

// This function grows the tree down to the leaf
// any regularization. It returns the current minimum
// value of the regularization parameter 'alpha'.
bool use_volume_reg = false;
size_t max_leaf_size = 10, min_leaf_size = 5;

long double alpha
  = det->Grow(&data, old_from_new, use_volume_reg,
        max_leaf_size, min_leaf_size);

One step of pruning the tree back up:

// This function performs a single pruning of the
// decision tree for the given value of alpha
// and returns the next minimum value of alpha
// that would induce a pruning
alpha = det->PruneAndUpdate(alpha, use_volume_reg);

Estimating the density at a given query point:

// for a given query, you can obtain the density estimate
extern arma::Col<float> query;
long double estimate = det->Compute(&query);

Computing the variable importance of each feature for the given DET.

// Initialize the importance of every dimension to zero.
arma::Col<double> v_imps = arma::zeros<arma::Col<double> >(data.n_rows);

// You can obtain the variable importance from the current tree.
det->ComputeVariableImportance(&v_imps);

'namespace mlpack::det'

The functions in this namespace allows the user to perform certain tasks with the 'DTree' class.

Utility Functions

Training a DET (with cross-validation)

#include <mlpack/methods/det/dt_utils.hpp>

using namespace mlpack::det;

// The dataset matrix, on which to learn the DET
extern arma::Mat<float> data;

// the number of folds in the cross-validation
size_t folds = 10; // set folds = 0 for LOOCV

bool use_volume_reg = false;
size_t max_leaf_size = 10, min_leaf_size = 5;

// obtain the trained DET
DTree<>* dtree_opt = Trainer(&data, folds, use_volume_reg,
           max_leaf_size, min_leaf_size);

Generating leaf-class membership

extern arma::Mat<int> labels;
size_t number_of_classes = 3; // this is required

extern string leaf_class_membership_file;

PrintLeafMembership(dtree_opt, data, labels, number_of_classes,
        leaf_class_membership_file);

Generating variable

extern string variable_importance_file;
size_t number_of_features = data.n_rows;

PrintVariableImportance(dtree_opt, nunmber_of_features,
      variable_importance_file);

Further Documentation

For further documentation on the DTree class, consult the complete API documentation.