--- title: "eforest(): Random Forests With Energy Trees as Base Learners" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{eforest(): Random Forests With Energy Trees as Base Learners} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction etree is the R package where Energy Trees (Giubilei et al., 2022) are implemented. The model allows to perform classification and regression with covariates that are possibly structured and of different types. Energy Trees are used as base learners in ensemble methods that are the equivalent of bagging and Random Forests with traditional decision trees. The two resulting models are bagging of Energy Trees and Random Energy Forests, respectively. This vignette explains how these two models can be used in R for classification and regression tasks. Both models are implemented using the function eforest(). The variables' types included in the examples are those currently available in the package: numeric, nominal, functional, and in the form of graphs. The first thing to do to run the examples is loading the package etree. ```{r setup} library(etree) ``` # Data generation We need data for our examples. We can use the same covariates for regression and classification, and then change only the response. Let us consider $100$ observations divided into four equal-sized and disjoint groups $G_1, G_2, G_3, G_4$. The response variable for regression is defined as: $$Y^R_i \sim \begin{cases} \mathcal{N}(0, 1) & \text{for} \; i \in G_1\\ \mathcal{N}(1, 1) & \text{for} \; i \in G_2\\ \mathcal{N}(2, 1) & \text{for} \; i \in G_3\\ \mathcal{N}(3, 1) & \text{for} \; i \in G_4 \end{cases}.$$ The corresponding R object must be of class "numeric" to be correctly recognized by eforest() as a response variable for regression. ```{r} # Set seed set.seed(123) # Number of observation n_obs <- 100 # Response variable for regression resp_reg <- c(rnorm(n_obs / 4, mean = 0, sd = 1), rnorm(n_obs / 4, mean = 2, sd = 1), rnorm(n_obs / 4, mean = 4, sd = 1), rnorm(n_obs / 4, mean = 6, sd = 1)) ``` The response for classification is a nominal variable measured at four different levels: $1, 2, 3, 4$. For each group, observations are randomly assigned one of the four levels. The probability distribution is different for each group, and it gives the largest probability to the level equal to the index of the group, uniformly allocating the rest to the other three levels. For example, the response variable for the first group is the following: $$Y^C_i \sim \begin{cases} 1 & p = 0.85 \\ 2 & p = 0.05 \\ 3 & p = 0.05 \\ 4 & p = 0.05 \end{cases}, \quad \text{for} \; i \in G_1.$$ The same holds for $G_2, G_3, G_4$, except for switching each time the probabilities in such a way that the most probable level matches the index of the group. The response variable for classification must be defined as a factor to be correctly identified by eforest(). ```{r} # Response variable for classification resp_cls <- factor(c(sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.85, 0.05, 0.05, 0.05)), sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.05, 0.85, 0.05, 0.05)), sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.05, 0.05, 0.85, 0.05)), sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.05, 0.05, 0.05, 0.85)))) ``` For both tasks, covariates are defined as follows: 1. 
For both tasks, covariates are defined as follows:

1. Numeric:
$$X_{1i} \sim \begin{cases} \mathcal{U}[0, 0.55] & \text{for} \; i \in G_1\\ \mathcal{U}[0.45, 1] & \text{for} \; i \in G_2 \cup G_3 \cup G_4 \end{cases};$$

2. Nominal:
$$X_{2i} \sim \begin{cases} \text{Bern}(0.9) & \text{for} \; i \in G_2\\ \text{Bern}(0.1) & \text{for} \; i \in G_1 \cup G_3 \cup G_4 \end{cases};$$

3. Functions:
$$X_{3i} \sim \begin{cases} \mathcal{GP}(1, I_{100}) & \text{for} \; i \in G_3\\ \mathcal{GP}(0, I_{100}) & \text{for} \; i \in G_1 \cup G_2 \cup G_4 \end{cases},$$
where $\mathcal{GP}$ is a Gaussian random process over $100$ evaluation points from $0$ to $1$;

4. Graphs:
$$X_{4i} \sim \begin{cases} \text{ER}(100, 0.9) & \text{for} \; i \in G_4\\ \text{ER}(100, 0.1) & \text{for} \; i \in G_1 \cup G_2 \cup G_3 \end{cases},$$
where $\text{ER}(n, p)$ is an Erdős–Rényi random graph with $n$ vertices and connection probability $p$.

The structure of the generative process is such that we need at least three splits, on three different covariates, to correctly rebuild the four groups.

The set of covariates must be generated as a list, where each element is a different variable. Additionally, numeric variables must be numeric vectors, nominal variables must be factors, functional variables must be objects of class "fdata", and variables in the form of graphs must be lists of objects having class "igraph".

```{r}
# Numeric
x1 <- c(runif(n_obs / 4, min = 0, max = 0.55),
        runif(n_obs / 4 * 3, min = 0.45, max = 1))

# Nominal
x2 <- factor(c(rbinom(n_obs / 4, 1, 0.1),
               rbinom(n_obs / 4, 1, 0.9),
               rbinom(n_obs / 2, 1, 0.1)))

# Functions
x3 <- c(fda.usc::rproc2fdata(n_obs / 2, seq(0, 1, len = 100), sigma = 1),
        fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1,
                             mu = rep(1, 100)),
        fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1))

# Graphs
x4 <- c(lapply(1:(n_obs / 4 * 3), function(j) igraph::sample_gnp(100, 0.1)),
        lapply(1:(n_obs / 4), function(j) igraph::sample_gnp(100, 0.9)))

# Covariates list
cov_list <- list(X1 = x1, X2 = x2, X3 = x3, X4 = x4)
```

# Function eforest()

Function eforest() allows implementing bagging of Energy Trees (BET) and Random Energy Forests (REF), depending on the value of the random_covs argument: if it is a positive integer smaller than the total number of covariates, REF is performed; if it is NULL (default), or equal to the number of covariates, BET is used. A minimal usage sketch of the two settings is given at the end of this section.

As with etree(), eforest() is specified in the same way for regression and classification tasks, and it automatically distinguishes between the two based on the type of the response. Many arguments are inherited from etree(). The additional ones are:

* ntrees, i.e., the number of Energy Trees to grow in the ensemble;
* ncores, i.e., the number of cores to use;
* perf_metric, i.e., the performance metric used to compute the Out-Of-Bag error;
* random_covs, i.e., the size of the random subset of covariates to choose from at each split;
* verbose, to visually keep track of the number of trees during the fitting process.

The structure of eforest() is such that it first generates ntrees bootstrap samples and calls etree() on each of them; it then computes the Out-Of-Bag (OOB) score using the performance metric defined through perf_metric.

The object returned by eforest() is of class "eforest" and it is composed of three elements:

* ensemble, i.e., a list containing all the fitted trees;
* oob_score, i.e., the Out-Of-Bag score;
* perf_metric, i.e., the performance metric used for computing the OOB score.
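To make the BET/REF distinction concrete, here is a minimal sketch (not evaluated here, and with ref_fit and bet_fit as purely illustrative names): with the four covariates generated above, setting random_covs = 2 yields REF, while setting it equal to the number of covariates yields BET.

```{r, eval = FALSE}
# Random Energy Forest: each split chooses among 2 randomly selected covariates
ref_fit <- eforest(response = resp_reg, covariates = cov_list,
                   ntrees = 12, random_covs = 2)

# Bagging of Energy Trees: all 4 covariates are candidates at every split
bet_fit <- eforest(response = resp_reg, covariates = cov_list,
                   ntrees = 12, random_covs = 4)
```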
# Regression

The first example is regression. The set of covariates is cov_list and the response variable is resp_reg.

The most immediate way to grow an ensemble using eforest() is by specifying the response and the set of covariates only. This corresponds to using the default values for the optional arguments. Here, we add ntrees = 12 to reduce the computational burden and change the performance metric using perf_metric = 'MSE' (the default is RMSPE, i.e., Root Mean Square Percentage Error).

```{r}
# Quick fit
ef_fit <- eforest(response = resp_reg, covariates = cov_list, ntrees = 12,
                  perf_metric = 'MSE')
```

By default, eforest() uses minbucket = 1 and alpha = 1 to grow larger and less correlated trees. It employs "cluster" as split_type because it is usually faster. For regression, random_covs is set to one third (rounded down) of the total number of covariates.

The returned object, ef_fit, stores all the trees in its element ensemble, which is a list. Each tree in the list can be inspected, plotted, subsetted, and used for predictions just as any other tree produced with etree() (in fact, these trees are themselves produced with etree()!). As an example, we can plot a tree from the ensemble, say the fourth one.

```{r, fig.dim = c(25, 11)}
# Plot of a tree from the ensemble
plot(ef_fit$ensemble[[4]])
```

The Out-Of-Bag score for the whole ensemble is stored in ef_fit's element oob_score. The performance metric used to compute the OOB score can be retrieved from the element perf_metric.

```{r}
# Retrieve performance metric
ef_fit$perf_metric

# OOB MSE for this ensemble
ef_fit$oob_score
```

Finally, the ensemble can be used to make predictions by calling predict() on ef_fit, possibly specifying a new set of covariates through newdata. Any other information, such as the splitting strategy or the type of task, is automatically retrieved from the fitted object. Predictions are based on the fitted values if newdata is not provided, or on the new set of covariates if newdata is specified. In both cases, each tree in ef_fit$ensemble is used to make predictions by calling predict() on it (with the same specification of newdata). Then, for regression, the individual trees' predictions for each observation are combined by arithmetic mean.

Let's start with the case where newdata is not provided.

```{r}
# Predictions from the fitted object
pred <- predict(ef_fit)
print(pred)
```

However, this case is rarely of practical utility because the model is usually evaluated using the OOB score.
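As a rough illustration of why the OOB score is preferred, we can compare the in-sample MSE of these fitted predictions with the OOB MSE retrieved above. This comparison is not something eforest() performs itself, and the in-sample figure is usually the more optimistic of the two, since each observation contributes to fitting most of the trees that predict it.

```{r}
# In-sample MSE of the fitted values, to be compared with the OOB MSE above
mean((resp_reg - pred) ^ 2)
```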
Let's see an example where predictions are calculated for a new set of covariates.

```{r}
# New set of covariates
n_obs <- 40
x1n <- c(runif(n_obs / 4, min = 0, max = 0.55),
         runif(n_obs / 4 * 3, min = 0.45, max = 1))
x2n <- factor(c(rbinom(n_obs / 4, 1, 0.1),
                rbinom(n_obs / 4, 1, 0.9),
                rbinom(n_obs / 2, 1, 0.1)))
x3n <- c(fda.usc::rproc2fdata(n_obs / 2, seq(0, 1, len = 100), sigma = 1),
         fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1,
                              mu = rep(1, 100)),
         fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1))
x4n <- c(lapply(1:(n_obs / 4 * 3), function(j) igraph::sample_gnp(100, 0.1)),
         lapply(1:(n_obs / 4), function(j) igraph::sample_gnp(100, 0.9)))
new_cov_list <- list(X1 = x1n, X2 = x2n, X3 = x3n, X4 = x4n)

# New response
new_resp_reg <- c(rnorm(n_obs / 4, mean = 0, sd = 1),
                  rnorm(n_obs / 4, mean = 2, sd = 1),
                  rnorm(n_obs / 4, mean = 4, sd = 1),
                  rnorm(n_obs / 4, mean = 6, sd = 1))

# Predictions
new_pred <- predict(ef_fit, newdata = new_cov_list)
print(new_pred)
```

We can compare the Mean Square Error between the new response and the predicted values with that between the new response and its average, to check that the predictions are on average closer to the actual values than a constant prediction given by the mean.

```{r}
# MSE between the new response and its average
mean((new_resp_reg - mean(new_resp_reg)) ^ 2)

# MSE between the new response and predictions with the new set of covariates
mean((new_resp_reg - new_pred) ^ 2)
```

# Classification

The second example is classification. The set of covariates is the same as before, i.e., cov_list, while the response variable is resp_cls.

Also in this case, the most immediate fit can be obtained by calling eforest() and specifying the response and the set of covariates only. Here, we also set ntrees = 12 to reduce the computational burden and split_type = "coeff" to employ the other splitting strategy.

```{r, fig.dim = c(7, 6)}
# Quick fit
ef_fit <- eforest(response = resp_cls, covariates = cov_list, ntrees = 12,
                  split_type = 'coeff')
```

Everything else works as for regression, but for classification tasks the default performance metric is Balanced Accuracy and random_covs is set to the square root (rounded down) of the number of covariates.

Predictions are obtained using the same commands as for regression. However, internally, the individual trees' predictions for each observation are aggregated using a majority voting rule.

```{r}
# Predictions from the fitted object
pred <- predict(ef_fit)
print(pred)
```

To exemplify the use of predict() with newdata, we can use the new set of covariates generated for regression, coupling it with a new response variable for classification.

```{r}
# New response
new_resp_cls <- factor(c(sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.85, 0.05, 0.05, 0.05)),
                         sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.05, 0.85, 0.05, 0.05)),
                         sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.05, 0.05, 0.85, 0.05)),
                         sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.05, 0.05, 0.05, 0.85))))

# Predictions
new_pred <- predict(ef_fit, newdata = new_cov_list)
print(new_pred)

# Confusion matrix between the new response and predictions from the fitted ensemble
table(new_pred, new_resp_cls, dnn = c('Predicted', 'True'))

# Misclassification error for predictions on the new set of covariates
sum(new_pred != new_resp_cls) / length(new_resp_cls)
```

In this case, we obtain a misclassification error of 0.225 on test data, which should however be taken with a grain of salt, given the very small number of trees used in this toy example.
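Since Balanced Accuracy is the default performance metric for classification, we can also compute it by hand on the new data as the average of the per-class recalls. This is only an illustrative calculation using one common multiclass definition; the exact formula that eforest() applies internally when computing the OOB score may differ.

```{r}
# Balanced accuracy on the new data: mean of the per-class recalls
# (forcing common levels keeps the confusion matrix square)
cm <- table(factor(new_pred, levels = levels(new_resp_cls)), new_resp_cls)
mean(diag(cm) / colSums(cm))
```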
# References

R. Giubilei, T. Padellini, P. Brutti (2022). Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates. arXiv preprint. https://arxiv.org/pdf/2207.04430.pdf.