--- title: "etree: Classification and Regression With Structured and Mixed-Type Data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{etree: Classification and Regression With Structured and Mixed-Type Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction etree is the R package where Energy Trees (Giubilei et al., 2022) are implemented. The model allows to perform classification and regression with covariates that are possibly structured and of different types. This vignette includes two examples, one for classification and one for regression, where all the package's basic functionalities are explained. The variables' types included in the examples are those currently available in the package: numeric, nominal, functional, and in the form of graphs. The first thing to do to run the examples is loading the package etree. ```{r setup} library(etree) ``` # Data generation We need data for our examples. We can use the same covariates for regression and classification, and then change only the response. Let us consider $100$ observations divided into four equal-sized and disjoint groups $G_1, G_2, G_3, G_4$. The response variable for regression is defined as: $$Y^R_i \sim \begin{cases} \mathcal{N}(0, 1) & \text{for} \; i \in G_1\\ \mathcal{N}(1, 1) & \text{for} \; i \in G_2\\ \mathcal{N}(2, 1) & \text{for} \; i \in G_3\\ \mathcal{N}(3, 1) & \text{for} \; i \in G_4 \end{cases}.$$ The corresponding R object must be of class "numeric" to be correctly recognized by etree() as a response variable for regression. ```{r} # Set seed set.seed(123) # Number of observation n_obs <- 100 # Response variable for regression resp_reg <- c(rnorm(n_obs / 4, mean = 0, sd = 1), rnorm(n_obs / 4, mean = 2, sd = 1), rnorm(n_obs / 4, mean = 4, sd = 1), rnorm(n_obs / 4, mean = 6, sd = 1)) ``` The response for classification is a nominal variable measured at four different levels: $1, 2, 3, 4$. For each group, observations are randomly assigned one of the four levels. The probability distribution is different for each group, and it gives the largest probability to the level equal to the index of the group, uniformly allocating the rest to the other three levels. For example, the response variable for the first group is the following: $$Y^C_i \sim \begin{cases} 1 & p = 0.85 \\ 2 & p = 0.05 \\ 3 & p = 0.05 \\ 4 & p = 0.05 \end{cases}, \quad \text{for} \; i \in G_1.$$ The same holds for $G_2, G_3, G_4$, except for switching each time the probabilities in such a way that the most probable level matches the index of the group. The response variable for classification must be defined as a factor to be correctly identified by etree(). ```{r} # Response variable for classification resp_cls <- factor(c(sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.85, 0.05, 0.05, 0.05)), sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.05, 0.85, 0.05, 0.05)), sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.05, 0.05, 0.85, 0.05)), sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.05, 0.05, 0.05, 0.85)))) ``` For both tasks, covariates are defined as follows: 1. Numeric: $$X_{1i} \sim \begin{cases} \mathcal{U}[0, 0.55] & \text{for} \; i \in G_1\\ \mathcal{U}[0.45, 1] & \text{for} \; i \in G_2 \cup G_3 \cup G_4 \end{cases};$$ 2. Nominal: $$X_{2i} \sim \begin{cases} \text{Bern}(0.1) & \text{for} \; i \in G_2\\ \text{Bern}(0.9) & \text{for} \; i \in G_1 \cup G_3 \cup G_4 \end{cases};$$ 3. Functions: $$X_{3i} \sim \begin{cases} \mathcal{GP}(0, I_{100}) & \text{for} \; \in G_3\\ \mathcal{GP}(1, I_{100}) & \text{for} \; \in G_1 \cup G_2 \cup G_4 \end{cases},$$ where $\mathcal{GP}$ is a Gaussian random process over $100$ evaluation points from $0$ to $1$; 4. Graphs: $$X_{4i} \sim \begin{cases} \text{ER}(100, 0.1) & \text{for} \; \in G_4\\ \text{ER}(100, 0.9) & \text{for} \; \in G_1 \cup G_2 \cup G_3 \end{cases},$$ where $\text{ER}(n, p)$ is an Erdős–Rényi random graph with $n$ vertices and $p$ as connection probability. The structure of the generative process is such that we need at least three splits with respect to three different covariates to correctly rebuild the four groups. The set of covariates must be generated as a list, where each element is a different variable. Additionally, numeric variables must be numeric vectors, nominal variable must be factors, functional variables must be objects of class "fdata", and variables in the forms of graphs must be lists of objects having class "igraph". ```{r} # Numeric x1 <- c(runif(n_obs / 4, min = 0, max = 0.55), runif(n_obs / 4 * 3, min = 0.45, max = 1)) # Nominal x2 <- factor(c(rbinom(n_obs / 4, 1, 0.1), rbinom(n_obs / 4, 1, 0.9), rbinom(n_obs / 2, 1, 0.1))) # Functions x3 <- c(fda.usc::rproc2fdata(n_obs / 2, seq(0, 1, len = 100), sigma = 1), fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1, mu = rep(1, 100)), fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1)) # Graphs x4 <- c(lapply(1:(n_obs / 4 * 3), function(j) igraph::sample_gnp(100, 0.1)), lapply(1:(n_obs / 4), function(j) igraph::sample_gnp(100, 0.9))) # Covariates list cov_list <- list(X1 = x1, X2 = x2, X3 = x3, X4 = x4) ``` # Regression The first example is regression. The set of covariates is cov_list and the response variable is resp_reg. We start with a "quick fit", which implies using etree()'s default parameters. Then, we explore how changing the most important ones affects the size and the structure of the resulting trees. Finally, we use a fitted tree to explore other fundamental utilities such as printing and predicting. ## Quick fit The quickest way to grow an Energy Tree is as simple as using etree() and specifying the response and the set of covariates only. ```{r} # Quick fit etree_fit <- etree(response = resp_reg, covariates = cov_list) ``` Calling plot() on the fitted object etree_fit allows visualizing the tree. ```{r, fig.dim = c(7.5, 6)} # Plot plot(etree_fit) ``` This is a good place to stop and comment the main features of etree's plots. Many of them descend from the partykit (Hothorn and Zeileis, 2015) version: * the number in the square box is the id of the node; * the text below the square box is the name of the variable selected for splitting; * the number right below is the p-value for the Energy test of independence between the splitting covariate and the response variable; * when the splitting covariate is traditional, values on the edges represent the split point for that variable; * terminal nodes have a header showing the id and the size of the node; * terminal nodes contain different plot types, which depend on the nature of the response: bar plots for classification, box plots for regression. In the etree version, some novelties have been introduced: * inner nodes are colored differently to distinguish the various types of covariates: salmon for numeric variables, light blue for nominal variables, yellow for functions, light green for variables in the form of graphs; * when the splitting covariate is structured and split_type = "coeff", the text below the square box is the name of the splitting covariate followed by a dot and the id of the selected component; * when the splitting covariate is structured, values on the edges represent the split point with respect to the selected component of the splitting covariate when split_type = "coeff", or alternatively the size of the corresponding kid node when split_type = "cluster". In the specific fit, not only the partitioning into four groups is correctly retrieved, but also the average response variable in each group is consistent with the mean of the distributions used to generate data. Now we explore how the main optional parameters work and their effects on the structure and size of the resulting trees. ## Changing parameters The two parameters directly influencing the tree size are minbucket and alpha. The first one, minbucket, sets the minimum number of observations each node must contain. When it grows, the tree gets smaller; and viceversa. Let's see what happens if we set it equal to its minimum, i.e., 1. ```{r, fig.dim = c(9, 6)} # Fit with smaller minbucket etree_fit1 <- etree(response = resp_reg, covariates = cov_list, minbucket = 1) # Plot plot(etree_fit1) ``` We actually get a larger tree with five terminal nodes. The additional one derives from the split of node 2 into two nodes with size 26 and 1, respectively. Increasing minbucket to any value would result in avoiding that split. The other parameter controlling the tree size is alpha, which is the significance level in the Energy tests of independence used for variable selection. When alpha decreases, we can expect that the tree gets smaller, because we would only consider stronger associations between the covariates and the response; and viceversa. We can increase alpha to its maximum, which is 1, to see what happens. ```{r, fig.dim = c(12, 7)} # Fit with larger alpha etree_fit2 <- etree(response = resp_reg, covariates = cov_list, alpha = 1) # Plot plot(etree_fit2) ``` We obtain a much larger tree with $7$ terminal nodes. Splits are made with respect to all four covariates, hence showing the full spectrum of colors for the variables' types that are currently accepted by etree(). Another important argument is split_type, which defines the splitting method. By default it is equal to "coeff", so we can now try to fit a tree using "cluster". ```{r, fig.dim = c(7.5, 5)} # Fit with 'cluster' etree_fit3 <- etree(response = resp_reg, covariates = cov_list, split_type = 'cluster') # Plot plot(etree_fit3) ``` In this case, we obtain a tree that is structurally equivalent and has the same hierarchy among covariates as the one produced with split_type = "coeff". The tiny differences in the terminal nodes' size are due to the change in the splitting method for covariates $X_3$ and $X_4$. This suggests that the two splitting strategies produce consistent and sensible results, which are nonetheless different, so it does make sense to grow trees with both of them and compare the output. The plot allows visualizing two differences that are specific to the case of split_type = "cluster": the text below the square box is always the name of the splitting covariate, regardless of its nature; when the splitting covariate is structured, values on the edges represent the corresponding kid node's size. ## Print and other utilities Fitted trees can be printed by using print(). The output produced for the quick fit is the following. ```{r} # Print the fitted tree print(etree_fit) ``` Generally speaking, the first part of the output is given by the Model formula. The second one is the textual tree structure provided in Fitted party. The ids of the nodes are between square brackets, and they are followed, with the only exception of the root, by the name of the splitting covariate (with the addition of the id of the selected component if the covariate is structured and split_type = "coeff"). What follows, ending before the semicolon, is the information on the split that is provided on the edges in plots. After that, the estimated value of the response in the terminal node is provided together with the node size, plus an error for the estimate (weighted MSE for regression, weighted misclassification error for classification). At the end of the output, both the number of inner and terminal nodes is returned. In the specific case, there are three inner nodes and four terminal nodes. Here, we can see more precisely that estimated values for the response in the terminal nodes are very close ($\pm 0.2$) to the theoretical values. The tree can be subsetted using the [ and [[ operators, which yield equivalent results. The integer between parentheses denotes the id of the node from which the subtree is considered. ```{r} # Print a subtree print(etree_fit[5]) ``` ```{r} # Equivalent method to print a subtree print(etree_fit[[5]]) ``` The subtree can be also plotted. ```{r, fig.dim = c(7, 4)} # Plot a subtree plot(etree_fit[5]) ``` Other useful functions are nodeids() and nodeapply(), which are kept unchanged from their partykit version. The first one allows retrieving the ids of the nodes. By default, it returns all the ids. ```{r} # Get all nodes' ids nodeids(etree_fit) ``` We can also use it to get the terminal nodes' ids only. ```{r} # Get all terminal nodes' ids nodeids(etree_fit, terminal = TRUE) ``` We can retrieve the ids of the nodes in the subtree that starts from a given node. ```{r} # Get nodes' ids of a subtree nodeids(etree_fit, from = 5) ``` Function nodeapply() applies a function FUN to the nodes specified via the ids argument. For example, this can be useful for extracting the info possibly contained in the nodes, such as the pvalues for the splitting covariate in the inner nodes. ```{r} # Select inner nodes' ids tnodes <- nodeids(etree_fit, terminal = TRUE) nodes <- 1:max(tnodes) inodes <- nodes[-tnodes] # Retrieve pvalues nodeapply(etree_fit, ids = inodes, FUN = function (n) n$info$pvalue) ``` The dimensions of the tree can be inspected using three functions giving the total number of nodes (length()), the depth of the tree (depth()), and the width of the tree (width()). ```{r} # Total number of nodes length(etree_fit) # Depth depth(etree_fit) # Width width(etree_fit) ``` ## Predict Predictions are obtained in a very classical way: by calling predict() on the fitted Energy Tree. They are based either on the fitted values - if newdata is not provided - or on the new set of covariates - if newdata is specified. All the necessary information, such as the splitting strategy or the type of the task, is automatically retrieved from the fitted object. ```{r} # Predictions from the fitted object pred <- predict(etree_fit) print(pred) ``` The division into nodes obtained partitioning with Energy Trees reduce the response's variability with respect to the original data. Indeed, the Mean Square Error between the response and the predicted values is more than $80\%$ smaller than that between the response and its average. ```{r} # MSE between the response and its average mean((resp_reg - mean(resp_reg)) ^ 2) # MSE between the response and predictions from the fitted tree mean((resp_reg - pred) ^ 2) ``` To understand if this behavior is robust, we can perform the same comparison using a new set of data. We generate $40$ observations using the same mechanism as before, and then we use the new set of covariates to get predictions using predict() with newdata. ```{r} # New set of covariates n_obs <- 40 x1n <- c(runif(n_obs / 4, min = 0, max = 0.55), runif(n_obs / 4 * 3, min = 0.45, max = 1)) x2n <- factor(c(rbinom(n_obs / 4, 1, 0.1), rbinom(n_obs / 4, 1, 0.9), rbinom(n_obs / 2, 1, 0.1))) x3n <- c(fda.usc::rproc2fdata(n_obs / 2, seq(0, 1, len = 100), sigma = 1), fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1, mu = rep(1, 100)), fda.usc::rproc2fdata(n_obs / 4, seq(0, 1, len = 100), sigma = 1)) x4n <- c(lapply(1:(n_obs / 4 * 3), function(j) igraph::sample_gnp(100, 0.1)), lapply(1:(n_obs / 4), function(j) igraph::sample_gnp(100, 0.9))) new_cov_list <- list(X1 = x1n, X2 = x2n, X3 = x3n, X4 = x4n) # New response new_resp_reg <- c(rnorm(n_obs / 4, mean = 0, sd = 1), rnorm(n_obs / 4, mean = 2, sd = 1), rnorm(n_obs / 4, mean = 4, sd = 1), rnorm(n_obs / 4, mean = 6, sd = 1)) # Predictions new_pred <- predict(etree_fit, newdata = new_cov_list) print(new_pred) ``` The comparison in terms of MSE is as follows. ```{r} # MSE between the new response and its average mean((new_resp_reg - mean(new_resp_reg)) ^ 2) # MSE between the new response and predictions with the new set of covariates mean((new_resp_reg - new_pred) ^ 2) ``` Hence, the behavior is confirmed also with a new set of data. # Classification The second example is classification. The set of covariates is the same as before, i.e., cov_list, while the response variable is resp_cls. Also in this case, a quick fit can be obtained by calling etree() and specifying the response and the set of covariates. ```{r, fig.dim = c(7, 6)} # Quick fit etree_fit <- etree(response = resp_cls, covariates = cov_list) # Plot plot(etree_fit) ``` The only difference with respect to plots for regression tasks is the content of the terminal panels: in the case of classification, they show the distribution of the response variable for each node using bar plots. In the specific example, the structure and size of the tree is coherent with the true data partition made of four groups, and the levels are almost perfectly recognized. For classification, we will not see any change for the default values of the optional parameters since they work in the same way as for regression. Also printing and the other utilities work in the same way we saw before. Predictions are obtained as for regression, but we cover this part also for classification for the sake of completeness. ```{r} # Predictions from the fitted object pred <- predict(etree_fit) print(pred) ``` We can use the confusion matrix between the response and the predicted values, and a scalar measure such as the misclassification error, to evaluate the predictive ability of the fitted tree. ```{r} # Confusion matrix between the response and predictions from the fitted tree table(pred, resp_cls, dnn = c('Predicted', 'True')) # Misclassification error for predictions from the fitted tree sum(pred != resp_cls) / length(resp_cls) ``` The tree produces sensible predictions for fitted data. We can verify how it generalizes to a new set of covariates. We can use the one generated for regression, but we still need to draw a new response variable for classification. ```{r} # New response new_resp_cls <- factor(c(sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.85, 0.05, 0.05, 0.05)), sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.05, 0.85, 0.05, 0.05)), sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.05, 0.05, 0.85, 0.05)), sample(x = 1:4, size = n_obs / 4, replace = TRUE, prob = c(0.05, 0.05, 0.05, 0.85)))) # Predictions new_pred <- predict(etree_fit, newdata = new_cov_list) print(new_pred) # Confusion matrix between the new response and predictions from the fitted tree table(new_pred, new_resp_cls, dnn = c('Predicted', 'True')) # Misclassification error for predictions on the new set of covariates sum(new_pred != new_resp_cls) / length(new_resp_cls) ``` Taking into account that the set of covariates and the response variable have been generated using a fuzzy scheme both for fitting and testing data, predictions using the new set can be considered satisfactory. # References R. Giubilei, T. Padellini, P. Brutti (2022). Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates. arXiv preprint. https://arxiv.org/pdf/2207.04430.pdf. T. Hothorn, A. Zeileis (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. *Journal of Machine Learning Research*, 16(1):3905--3909.