etree is the R package implementing Energy Trees (Giubilei et al., 2022), a model for classification and regression with covariates that are possibly structured and of different types.
Energy Trees are used as base learners in ensemble methods that are the equivalent of bagging and Random Forests with traditional decision trees. The two resulting models are bagging of Energy Trees and Random Energy Forests, respectively.
This vignette explains how these two models can be used in R for classification and regression tasks. Both models are implemented through the function eforest(). The variable types included in the examples are those currently available in the package: numeric, nominal, functional, and in the form of graphs.
The first step in running the examples is loading the package etree.
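# Load the package
library(etree)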
We need data for our examples. We can use the same covariates for regression and classification, and then change only the response. Let us consider 100 observations divided into four equal-sized and disjoint groups G1, G2, G3, G4. The response variable for regression is defined as:
$$Y^R_i \sim \begin{cases} \mathcal{N}(0, 1) & \text{for} \; i \in G_1\\ \mathcal{N}(2, 1) & \text{for} \; i \in G_2\\ \mathcal{N}(4, 1) & \text{for} \; i \in G_3\\ \mathcal{N}(6, 1) & \text{for} \; i \in G_4 \end{cases}.$$
The corresponding R object must be of class “numeric” to be correctly recognized by eforest() as a response variable for regression.
# Set seed
set.seed(123)
# Number of observations
n_obs <- 100
# Response variable for regression
resp_reg <- c(rnorm(n_obs / 4, mean = 0, sd = 1),
              rnorm(n_obs / 4, mean = 2, sd = 1),
              rnorm(n_obs / 4, mean = 4, sd = 1),
              rnorm(n_obs / 4, mean = 6, sd = 1))
The response for classification is a nominal variable measured at four different levels: 1, 2, 3, 4. For each group, observations are randomly assigned one of the four levels. The probability distribution is different for each group, and it gives the largest probability to the level equal to the index of the group, uniformly allocating the rest to the other three levels. For example, the response variable for the first group is the following:
$$Y^C_i \sim \begin{cases} 1 & p = 0.85 \\ 2 & p = 0.05 \\ 3 & p = 0.05 \\ 4 & p = 0.05 \end{cases}, \quad \text{for} \; i \in G_1.$$
The same holds for G2, G3, and G4, with the probabilities permuted each time so that the most probable level matches the index of the group.
The response variable for classification must be defined as a factor to be correctly identified by eforest().
# Response variable for classification
resp_cls <- factor(c(sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                            prob = c(0.85, 0.05, 0.05, 0.05)),
                     sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                            prob = c(0.05, 0.85, 0.05, 0.05)),
                     sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                            prob = c(0.05, 0.05, 0.85, 0.05)),
                     sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                            prob = c(0.05, 0.05, 0.05, 0.85))))
For both tasks, covariates are defined as follows:
Numeric: $$X_{1i} \sim \begin{cases} \mathcal{U}[0, 0.55] & \text{for} \; i \in G_1\\ \mathcal{U}[0.45, 1] & \text{for} \; i \in G_2 \cup G_3 \cup G_4 \end{cases};$$
Nominal: $$X_{2i} \sim \begin{cases} \text{Bern}(0.9) & \text{for} \; i \in G_2\\ \text{Bern}(0.1) & \text{for} \; i \in G_1 \cup G_3 \cup G_4 \end{cases};$$
Functions: $$X_{3i} \sim \begin{cases} \mathcal{GP}(1, I_{100}) & \text{for} \; i \in G_3\\ \mathcal{GP}(0, I_{100}) & \text{for} \; i \in G_1 \cup G_2 \cup G_4 \end{cases},$$
where $\mathcal{GP}$ is a Gaussian random process evaluated over 100 points from 0 to 1;
Graphs: $$X_{4i} \sim \begin{cases} \text{ER}(100, 0.9) & \text{for} \; i \in G_4\\ \text{ER}(100, 0.1) & \text{for} \; i \in G_1 \cup G_2 \cup G_3 \end{cases},$$
where ER(n, p) is an Erdős–Rényi random graph with n vertices and connection probability p.
The generative process is designed so that at least three splits, on three different covariates, are needed to correctly recover the four groups.
The set of covariates must be generated as a list, where each element is a different variable. Additionally, numeric variables must be numeric vectors, nominal variables must be factors, functional variables must be objects of class “fdata”, and variables in the form of graphs must be lists of objects of class “igraph”.
# Numeric
x1 <- c(runif(n_obs / 4, min = 0, max = 0.55),
        runif(n_obs / 4 * 3, min = 0.45, max = 1))
# Nominal
x2 <- factor(c(rbinom(n_obs / 4, 1, 0.1),
               rbinom(n_obs / 4, 1, 0.9),
               rbinom(n_obs / 2, 1, 0.1)))
# Functions
x3 <- c(fda.usc::rproc2fdata(n_obs / 2, seq(0, 1, len = 100), sigma = 1),
        fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1,
                             mu = rep(1, 100)),
        fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1))
# Graphs
x4 <- c(lapply(1:(n_obs / 4 * 3), function(j) igraph::sample_gnp(100, 0.1)),
        lapply(1:(n_obs / 4), function(j) igraph::sample_gnp(100, 0.9)))
# Covariates list
cov_list <- list(X1 = x1, X2 = x2, X3 = x3, X4 = x4)
The function eforest() implements both bagging of Energy Trees (BET) and Random Energy Forests (REF), depending on the value of the random_covs argument: if it is a positive integer smaller than the total number of covariates, the function fits a REF; if it is NULL (default) or equal to the number of covariates, a BET is grown.
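For illustration, the two variants could be requested as follows (a sketch with hypothetical object names ef_ref and ef_bet; the calls differ only in random_covs):

# Random Energy Forest: covariates are randomly subsampled at each split
ef_ref <- eforest(response = resp_reg, covariates = cov_list, random_covs = 2)
# Bagging of Energy Trees: all four covariates are considered at each split
ef_bet <- eforest(response = resp_reg, covariates = cov_list, random_covs = NULL)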
Like etree(), eforest() is specified in the same way for regression and classification tasks, and it automatically distinguishes between the two based on the class of the response.
Many arguments are inherited from etree(). The additional ones include ntrees, the number of trees in the ensemble; perf_metric, the performance metric used to compute the Out-Of-Bag score; and random_covs, the size of the random subset of covariates considered at each split.
Internally, eforest() first generates ntrees bootstrap samples and calls etree() on each of them. It then computes the Out-Of-Bag (OOB) score using the performance metric specified through perf_metric.
The object returned by eforest() is of class “eforest” and is composed of three elements: ensemble, the list of fitted trees; oob_score, the OOB score of the ensemble; and perf_metric, the performance metric used to compute it.
The first example is regression. The set of covariates is cov_list and the response variable is resp_reg. The most immediate way to grow an ensemble with eforest() is to specify only the response and the set of covariates, which corresponds to using the default values for the optional arguments. Here, we also set ntrees = 12 to reduce the computational burden and change the performance metric with perf_metric = 'MSE' (the default is 'RMSPE', for Root Mean Square Percentage Error).
# Quick fit
ef_fit <- eforest(response = resp_reg,
                  covariates = cov_list,
                  ntrees = 12,
                  perf_metric = 'MSE')
#> Warning: executing %dopar% sequentially: no parallel backend registered
By default, eforest() uses minbucket = 1 and alpha = 1 to grow larger and less correlated trees. It employs “cluster” as split_type because it is usually faster. For regression, random_covs is set to one third (rounded down) of the total number of covariates.
The returned object, ef_fit, stores all the trees in its element ensemble, which is a list. Each tree in the list can be inspected, plotted, subsetted, and used for predictions just like any other tree produced with etree() (indeed, these trees are themselves produced with etree()). As an example, we can plot the first tree.
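# Plot the first tree of the ensemble
plot(ef_fit$ensemble[[1]])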
The Out-Of-Bag score for the whole ensemble is stored in ef_fit’s element oob_score. The performance metric used to compute the OOB score can be retrieved from element perf_metric.
# Retrieve performance metric
ef_fit$perf_metric
#> [1] "MSE"
# OOB MSE for this ensemble
ef_fit$oob_score
#> [1] 1.289393
Finally, the ensemble can be used to make predictions by calling predict() on ef_fit, possibly specifying a new set of covariates through newdata. Any other information, such as the splitting strategy or the type of task, is automatically retrieved from the fitted object.
Predictions are based on the fitted values if newdata is not provided, or on the new set of covariates if newdata is specified.
In both cases, each tree in ef_fit$ensemble is used to make predictions by calling predict() on it (with the same specification of newdata). For regression, the individual trees' predictions for each observation are then combined by arithmetic mean.
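Conceptually, the aggregation for regression amounts to the following sketch (an illustration rather than the package internals; it assumes each tree's predict() accepts newdata and returns a numeric vector):

# Collect each tree's predictions as a column, then average across trees
tree_preds <- sapply(ef_fit$ensemble, predict, newdata = cov_list)
pred_manual <- rowMeans(tree_preds)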
Let’s start with the case where newdata is not provided.
# Predictions from the fitted object
pred <- predict(ef_fit)
print(pred)
#> 10 48 57 85 52 16 34 48
#> 2.679902 4.135738 3.360645 3.106418 2.978768 4.542121 2.108123 2.074140
#> 59 67 107 10 62 25 17 12
#> 3.403327 3.188509 3.822816 2.831832 2.872119 3.641068 3.316160 2.588191
#> 16 17 42 36 7 52 7 44
#> 3.208347 2.627438 2.753697 2.849644 3.053536 1.750992 3.288001 2.565320
#> 14 69 98 42 106 43 64 50
#> 3.494627 4.665498 3.289205 1.789143 2.636787 1.205594 3.731749 3.301720
#> 107 11 101 28 73 99 88 63
#> 2.392789 3.779408 3.386390 2.012728 3.844547 1.957984 3.167867 4.129345
#> 69 34 36 98 86 35 42 29
#> 3.612086 2.768118 3.676978 3.340629 3.766601 2.573598 2.680614 2.176216
#> 85 69 69 38 8 12 34 78
#> 2.571079 3.473253 2.737670 3.507288 2.696437 3.004540 3.006698 3.761995
#> 67 29 42 77 91 14 12 69
#> 4.045500 2.792889 2.320264 4.485125 2.324540 3.569736 4.016510 3.456564
#> 59 51 62 40 100 105 52 109
#> 3.580458 2.103570 2.742141 3.051180 2.831785 3.206117 3.302879 3.207901
#> 27 88 103 77 58 7 108 38
#> 2.385346 2.090764 3.679654 2.736382 2.914928 1.201987 2.023332 2.324558
#> 12 64 77 39 93 65 69 65
#> 3.760490 2.528069 3.394093 3.203227 3.022865 3.431177 3.248708 2.412103
#> 78 92 48 36 75 7 34 50
#> 3.858941 3.241836 3.920557 3.271207 2.604875 3.395582 2.321145 2.585861
#> 67 74 43 44
#> 3.805606 4.150609 2.177344 3.618990
However, this case is rarely of practical utility because the model is usually evaluated using the OOB score.
Let’s see an example where predictions are calculated for a new set of covariates.
# New set of covariates
n_obs <- 40
x1n <- c(runif(n_obs / 4, min = 0, max = 0.55),
         runif(n_obs / 4 * 3, min = 0.45, max = 1))
x2n <- factor(c(rbinom(n_obs / 4, 1, 0.1),
                rbinom(n_obs / 4, 1, 0.9),
                rbinom(n_obs / 2, 1, 0.1)))
x3n <- c(fda.usc::rproc2fdata(n_obs / 2, seq(0, 1, len = 100), sigma = 1),
         fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1,
                              mu = rep(1, 100)),
         fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1))
x4n <- c(lapply(1:(n_obs / 4 * 3), function(j) igraph::sample_gnp(100, 0.1)),
         lapply(1:(n_obs / 4), function(j) igraph::sample_gnp(100, 0.9)))
new_cov_list <- list(X1 = x1n, X2 = x2n, X3 = x3n, X4 = x4n)
# New response
new_resp_reg <- c(rnorm(n_obs / 4, mean = 0, sd = 1),
                  rnorm(n_obs / 4, mean = 2, sd = 1),
                  rnorm(n_obs / 4, mean = 4, sd = 1),
                  rnorm(n_obs / 4, mean = 6, sd = 1))
# Predictions
new_pred <- predict(ef_fit,
                    newdata = new_cov_list)
print(new_pred)
#> 28 50 42 36 38 7 7 7
#> 0.1861834 2.6370162 0.2083508 0.5578421 0.3257152 0.3634092 0.4702349 0.1322439
#> 11 42 52 12 52 11 57 11
#> 2.6203437 0.6641498 2.2038720 2.4452980 1.9911947 2.2209439 2.1411308 2.3513359
#> 57 50 52 62 100 99 92 100
#> 1.9500233 2.3971748 1.8652271 2.2018243 4.2073993 4.5920318 3.7338271 4.2410766
#> 109 105 88 88 99 99 17 17
#> 3.6221229 4.1800621 4.4216832 4.3556880 3.9694942 4.0655768 5.8279225 5.8652925
#> 16 67 78 69 77 69 16 67
#> 6.0119947 5.8300567 5.9093864 6.1107621 5.9784577 6.2000173 6.1465376 6.0543855
We can compare the mean squared error between the response and the predicted values with the one between the response and its average, to verify that the predictions are, on average, closer to the actual values than the mean of the response is.
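For instance, the two quantities can be computed as follows (a minimal check using base R only):

# MSE of the ensemble's predictions vs. MSE of a constant mean predictor
mean((new_resp_reg - new_pred)^2)
mean((new_resp_reg - mean(new_resp_reg))^2)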
The second example is classification. The set of covariates is the same as before, i.e., cov_list, while the response variable is resp_cls. Also in this case, the most immediate fit is obtained by calling eforest() with only the response and the set of covariates. Here, we again set ntrees = 12 to reduce the computational burden, and we use split_type = 'coeff' to employ the other splitting strategy.
# Quick fit
ef_fit <- eforest(response = resp_cls,
                  covariates = cov_list,
                  ntrees = 12,
                  split_type = 'coeff')
Everything else works as for regression, except that for classification the default performance metric is Balanced Accuracy and random_covs defaults to the square root (rounded down) of the number of covariates.
Predictions are obtained using the same commands as for regression. Internally, however, individual trees' predictions for single observations are aggregated by majority vote.
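The voting step can be sketched as follows (illustrative only, not the package internals; it assumes each tree's predict() accepts newdata and returns the predicted classes):

# Majority vote across the trees' predicted classes
votes <- sapply(ef_fit$ensemble,
                function(t) as.character(predict(t, newdata = cov_list)))
pred_vote <- factor(apply(votes, 1, function(v) names(which.max(table(v)))))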
# Predictions from the fitted object
pred <- predict(ef_fit)
print(pred)
#> 34 18 33 7 18 5 19 21 25 18 21 14 7 26 24 14 7 24 31 7 7 10 31 35 35 26
#> 1 2 1 1 4 4 4 2 3 4 3 4 4 2 4 2 1 1 2 2 1 1 1 2 1 1
#> 7 18 14 7 7 7 31 21 5 18 25 31 31 33 34 21 21 7 14 6 7 6 21 34 21 21
#> 1 4 4 1 3 1 1 2 2 4 1 4 3 2 3 2 1 4 4 2 1 2 2 3 2 4
#> 31 7 26 31 21 18 5 21 21 7 24 21 21 21 14 7 7 18 16 18 7 16 34 18 7 7
#> 2 4 1 1 1 1 3 2 2 1 2 1 1 3 1 1 3 4 4 1 4 4 2 2 2 1
#> 29 7 14 33 17 26 18 18 6 19 18 7 18 7 24 18 7 18 34 7 21 21
#> 1 2 1 2 1 4 3 4 1 4 4 2 2 1 2 4 2 4 2 4 2 2
#> Levels: 1 2 3 4
To exemplify the use of predict() with newdata, we can use the new set of covariates generated for regression, coupling it with a new response variable for classification.
# New response
new_resp_cls <- factor(c(sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.85, 0.05, 0.05, 0.05)),
                         sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.05, 0.85, 0.05, 0.05)),
                         sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.05, 0.05, 0.85, 0.05)),
                         sample(x = 1:4, size = n_obs / 4, replace = TRUE,
                                prob = c(0.05, 0.05, 0.05, 0.85))))
# Predictions
new_pred <- predict(ef_fit,
                    newdata = new_cov_list)
print(new_pred)
#> 7 16 7 7 7 7 7 5 25 7 21 21 21 21 21 21 21 21 21 21 33 29 33 34 29 34
#> 1 3 1 1 1 1 1 1 2 1 2 4 2 2 2 2 2 2 2 2 2 3 2 3 2 3
#> 35 35 31 33 19 18 18 18 18 18 18 14 18 17
#> 3 3 3 2 4 4 4 4 4 4 4 4 4 4
#> Levels: 1 2 3 4
# Confusion matrix between the new response and the ensemble's predictions
table(new_pred, new_resp_cls, dnn = c('Predicted', 'True'))
#> True
#> Predicted 1 2 3 4
#> 1 6 1 0 1
#> 2 1 7 4 2
#> 3 1 1 5 0
#> 4 0 3 0 8
# Misclassification error for predictions on the new set of covariates
sum(new_pred != new_resp_cls) / length(new_resp_cls)
#> [1] 0.35
In this case, we obtain a misclassification error of 0.35 on the test data, which should however be taken with a grain of salt given the very small number of trees used in this toy example.
Giubilei, R., Padellini, T., and Brutti, P. (2022). Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates. arXiv preprint. https://arxiv.org/pdf/2207.04430.pdf