Title: | Classification and Regression with Structured and Mixed-Type Data |
---|---|
Description: | Implementation of Energy Trees, a statistical model to perform classification and regression with structured and mixed-type data. The model has a similar structure to Conditional Trees, but brings in Energy Statistics to test independence between variables that are possibly structured and of different nature. Currently, the package covers functions and graphs as structured covariates. It builds upon 'partykit' to provide functionalities for fitting, printing, plotting, and predicting with Energy Trees. Energy Trees are described in Giubilei et al. (2022) <arXiv:2207.04430>. |
Authors: | Riccardo Giubilei [aut, cre] , Tullia Padellini [aut], Pierpaolo Brutti [aut], Marco Brandi [ctb], Gabriel Nespoli [ctb], Torsten Hothorn [ctb] (<https://orcid.org/0000-0001-8301-0471>, (partykit author)), Achim Zeileis [ctb] (<https://orcid.org/0000-0003-0918-3766>, (partykit author)) |
Maintainer: | Riccardo Giubilei <[email protected]> |
License: | GPL-3 |
Version: | 0.1.0 |
Built: | 2024-11-13 03:15:36 UTC |
Source: | https://github.com/ricgbl/etree |
Implementation of Energy Trees, a statistical model to perform classification and regression with structured and mixed-type data. The model has a similar structure to Conditional Trees, but brings in Energy Statistics to test independence between variables that are possibly structured and of different nature. Currently, the package covers functions and graphs as structured covariates. It builds upon 'partykit' to provide functionalities for fitting, printing, plotting, and predicting with Energy Trees.
A simple dataset containing simulated values for a nominal response variable and four covariates of both mixed and partially structured type. The data generation process is based on Example 4.7 (”Signal shape classification”, pages 73-77) from Saito (1994).
data_cls
data_cls
List with two elements: covs
, which is a list containing the
covariates, and resp
, which is a factor of length 150 representing
the response variable. The response variable is divided into three classes
whose labels are cylinder (Cyl
), bell (Bel
) and funnel
(Fun
). The four covariates in covs
all have length 150 and
are characterized as follows:
Nominal: Cyl
observations are given level 1 with probability 0.8
and levels 2 and 3 with probability 0.1 each, Bel
observations are
given level 2 with probability 0.8 and levels 1 and 3 with probability 0.1
each, Fun
observations are given level 3 with probability 0.8 and
levels 1 and 2 with probability 0.1 each;
Numeric: coefficients for one of the basis used to perform the B-splines expansion of the curves that are in turn specified as in Saito (1994);
Functional: curves as specified in Saito (1994);
Graphs: Erd\"os-R\'enyi graphs with connection probability 0.10 for
Cyl
observations, 0.125 for Bel
observations, 0.15 for
Fun
observations.
Saito, N. (1994). Local feature extraction and its applications using a library of bases (Doctoral dissertation, Yale University).
A simple dataset containing simulated values for a numeric response variable and four covariates of both mixed and partially structured type. The data generation process is based on Section 5 (”Example: synthetic data”) from Serban and Wasserman (2005).
data_reg
data_reg
List with two elements: covs
, which is a list containing the
covariates, and resp
, which is a numeric vector of length 200
representing the response variable. The response variable is specified as
in Serban and Wasserman (2005). The four covariates in covs
all have
length 200 and are characterized as follows:
Nominal: level 0 for observations having negative response variable, level 1 otherwise;
Numeric: coefficients for one of the basis used to perform the B-splines expansion of the curves that are in turn specified as in Serban and Wasserman (2005);
Functional: curves as specified in Serban and Wasserman (2005), with 50 observations coming from each of the four curve shapes;
Graphs: Erd\"os-R\'enyi graphs with connection probability given by a transformation of the response variable obtained standardizing between 0.2 and 0.8 its value after adding a normally distributed noise with mean 0 and standard deviation 7.
Serban, N., and Wasserman, L. (2005). CATS: clustering after transformation and smoothing. Journal of the American Statistical Association, 100(471), 990-999.
Compute pairwise distances starting from single objects containing the original univariate observations.
dist_comp(x, lp = 2)
dist_comp(x, lp = 2)
x |
Object containing the original univariate observations. Currently available types and the form they need to have to be correctly recognized are the following:
See Details to find out which distance is used in each case. |
lp |
Integer specifying which norm should be used to compute the distances for functional data. |
The distances used in each case are the following:
Logical: Euclidean distance, implemented via dist()
;
Numeric: Euclidean distance, implemented via dist()
;
Nominal: Gower's distance, implemented via daisy()
;
Functions: -norm, implemented via
metric.lp()
with default options;
Graphs: Edge Difference distance (Hammond et al., 2013), implemented via
nd.edd()
;
Persistence diagrams: Wasserstein distance, implemented via
wasserstein()
with default options;
Object of class "dist"
containing the pairwise distances.
D. K. Hammond, Y. Gur, and C. R. Johnson (2013). Graph diffusion distance: A difference measure for weighted graphs based on the graph laplacian exponential kernel. In 2013 IEEE Global Conference on Signal and Information Processing, pages 419-422.
# Number of observations nobs <- 10 ## Logical obj <- as.logical(rbinom(nobs, 1, 0.5)) d <- dist_comp(obj) ## Integer obj <- rpois(nobs, 5) d <- dist_comp(obj) ## Numeric obj <- rnorm(nobs) d <- dist_comp(obj) ## Factors obj <- factor(letters[1:nobs]) d <- dist_comp(obj) ## Functional data obj <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) d <- dist_comp(obj) ## Graphs obj <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) d <- dist_comp(obj) ## Persistence diagrams x <- lapply(rep(100, nobs), function(np) TDA::circleUnif(np)) obj <- lapply(x, TDA::ripsDiag, maxdimension = 1, maxscale = 3) d <- dist_comp(obj)
# Number of observations nobs <- 10 ## Logical obj <- as.logical(rbinom(nobs, 1, 0.5)) d <- dist_comp(obj) ## Integer obj <- rpois(nobs, 5) d <- dist_comp(obj) ## Numeric obj <- rnorm(nobs) d <- dist_comp(obj) ## Factors obj <- factor(letters[1:nobs]) d <- dist_comp(obj) ## Functional data obj <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) d <- dist_comp(obj) ## Graphs obj <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) d <- dist_comp(obj) ## Persistence diagrams x <- lapply(rep(100, nobs), function(np) TDA::circleUnif(np)) obj <- lapply(x, TDA::ripsDiag, maxdimension = 1, maxscale = 3) d <- dist_comp(obj)
Fits an Energy Forest, in the form of either a bagging of Energy Trees or a
Random Energy Forest, depending on the value of the random_covs
parameter.
eforest( response, covariates, weights = NULL, ntrees = 100, ncores = 1L, minbucket = 1, alpha = 1, R = 500, split_type = "cluster", coeff_split_type = "test", p_adjust_method = "fdr", perf_metric = NULL, random_covs = "auto", verbose = FALSE )
eforest( response, covariates, weights = NULL, ntrees = 100, ncores = 1L, minbucket = 1, alpha = 1, R = 500, split_type = "cluster", coeff_split_type = "test", p_adjust_method = "fdr", perf_metric = NULL, random_covs = "auto", verbose = FALSE )
response |
Response variable, an object of class either
|
covariates |
Set of covariates. Must be provided as a list, where each element is a different variable. Currently available types and the form they need to have to be correctly recognized are the following:
Each element (i.e., variable) in the covariates list must have the same
|
weights |
Optional vector of non-negative integer-valued weights to be used in the fitting process. If not provided, all observations are assumed to have weight equal to 1. |
ntrees |
Number of Energy Trees to grow, i.e., the number of bootstrap samples to be generated and used for fitting. |
ncores |
Number of cores to use, i.e., at most how many child processes
will be run simultaneously. Must be exactly 1 on Windows (which uses the
master process). |
minbucket |
Positive integer specifying the minimum number of observations that each terminal node must contain. Default is 5. |
alpha |
Nominal level controlling the probability of type I error in the Energy tests of independence used for variable selection. Default is 0.05. |
R |
Number of replicates employed to approximate the sampling distribution of the test statistic in every Energy test of independence. Default is 1000. |
split_type |
Splitting method used when the selected covariate is
structured. It has two possible values: |
coeff_split_type |
Method to select the split point for the chosen
component when the selected covariate is structured and |
p_adjust_method |
Multiple-testing adjustment method for P-values,
which can be set to any of the values provided by
|
perf_metric |
Performance metric that is used to compute the Out-Of-Bag
score. If |
random_covs |
Size of the random subset of covariates to choose from at
each split. If set to |
verbose |
Logical indicating whether to print a one-line notification for the conclusion of each tree's fitting process. |
eforest()
generates ntrees
bootstrap samples and then calls
etree()
on each of them. Then, it computes the Out-Of-Bag (OOB)
score using the performance metric defined through perf_metric
.
For classification, possible values of perf_metric
are "BAcc"
and "WBAcc"
. Both are general enough to be used in multiclass
classification problems, still producing sensible results in the case of
binary classification. The two options are based on the calculation of a
ground performance metric, the Balanced Accuracy, which is defined as the
arithmetic mean between Sensitivity and Specificity. In this framework,
Balanced Accuracy is computed using a "One vs. All" approach, i.e.,
considering one class at a time: positive instances are those belonging to
that class, and negatives are the ones belonging to any other class. Then,
the "One vs. All" Balanced Accuracies obtained by considering each class must
be averaged. When perf_metric = "BAcc"
(default for classification
tasks), the average is arithmetic. When perf_metric = "WBAcc"
, the
average is weighted using class sizes, hence giving more importance to the
"One vs. All" Balanced Accuracy of larger classes.
For regression, the default value of perf_metric
is "RMSPE"
,
namely, Root Mean Square Percentage Error. Other available options are
c("MAE", "MAPE", "MedianAE", "MedianAPE", "MSE", "NRMSE", "RAE",
"RMSE", "RMLSE")
. Each of these name points to the corresponding homonym
function from the package MLmetrics
, whose
documentation provides more information about their definition.
Object of class "eforest"
with three elements: 1) ensemble
,
which is a list gathering all the fitted trees; 2) oob_score
,
an object of class "numeric"
representing the OOB score computed using
the performance metric defined through perf_metric
; 3)
perf_metric
, an object of class "character"
returning the
performance metric used for computations.
## Covariates set.seed(123) nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable(s) resp_reg <- cov_num ^ 2 y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0) resp_cls <- factor(y) ## Regression ## eforest_fit <- eforest(response = resp_reg, covariates = cov_list, ntrees = 12) print(eforest_fit$ensemble[[1]]) plot(eforest_fit$ensemble[[1]]) mean((resp_reg - predict(eforest_fit)) ^ 2) ## Classification ## eforest_fit <- eforest(response = resp_cls, covariates = cov_list, ntrees = 12) print(eforest_fit$ensemble[[12]]) plot(eforest_fit$ensemble[[12]]) table(resp_cls, predict(eforest_fit))
## Covariates set.seed(123) nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable(s) resp_reg <- cov_num ^ 2 y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0) resp_cls <- factor(y) ## Regression ## eforest_fit <- eforest(response = resp_reg, covariates = cov_list, ntrees = 12) print(eforest_fit$ensemble[[1]]) plot(eforest_fit$ensemble[[1]]) mean((resp_reg - predict(eforest_fit)) ^ 2) ## Classification ## eforest_fit <- eforest(response = resp_cls, covariates = cov_list, ntrees = 12) print(eforest_fit$ensemble[[12]]) plot(eforest_fit$ensemble[[12]]) table(resp_cls, predict(eforest_fit))
Fits an Energy Tree for classification or regression.
etree( response, covariates, weights = NULL, minbucket = 5, alpha = 0.05, R = 1000, split_type = "coeff", coeff_split_type = "test", p_adjust_method = "fdr", random_covs = NULL )
etree( response, covariates, weights = NULL, minbucket = 5, alpha = 0.05, R = 1000, split_type = "coeff", coeff_split_type = "test", p_adjust_method = "fdr", random_covs = NULL )
response |
Response variable, an object of class either
|
covariates |
Set of covariates. Must be provided as a list, where each element is a different variable. Currently available types and the form they need to have to be correctly recognized are the following:
Each element (i.e., variable) in the covariates list must have the same
|
weights |
Optional vector of non-negative integer-valued weights to be used in the fitting process. If not provided, all observations are assumed to have weight equal to 1. |
minbucket |
Positive integer specifying the minimum number of observations that each terminal node must contain. Default is 5. |
alpha |
Nominal level controlling the probability of type I error in the Energy tests of independence used for variable selection. Default is 0.05. |
R |
Number of replicates employed to approximate the sampling distribution of the test statistic in every Energy test of independence. Default is 1000. |
split_type |
Splitting method used when the selected covariate is
structured. It has two possible values: |
coeff_split_type |
Method to select the split point for the chosen
component when the selected covariate is structured and |
p_adjust_method |
Multiple-testing adjustment method for P-values,
which can be set to any of the values provided by
|
random_covs |
Size of the random subset of covariates to choose from
at each split. If set to |
etree()
is the main function of the homonym package. It allows
implementing Energy Trees by simply specifying the response variable, the set
of covariates, and possibly some other parameters. The function is specified
in the same way regardless of the task type: the choice between
classification and regression is automatically made depending on the nature
of the response variable.
Energy Trees (Giubilei et al., 2022) are a recursive partitioning tree-based
model built upon
Conditional Trees (Hothorn et al., 2006). At each step of Energy Trees'
iterative procedure, an Energy test of independence (Szekely et al., 2007) is
performed between the response variable and each of the J covariates. If the
test of global independence (defined as the intersection of the J tests of
partial independence) is not rejected at the significance level set by
alpha
, the recursion is stopped; otherwise, the covariate most
associated with the response in terms of P-value is selected for splitting.
When the covariate is traditional (i.e, numeric or nominal), an Energy test
of independence is performed for each possible split point, and the one
yielding the strongest association with the response is chosen. When the
selected covariate is structured, the split procedure is defined by the value
of split_type
, and possibly by that of coeff_split_type
.
split_type
specifies the splitting method for structured covariates.
It has two possible values:
"coeff"
: in this case, feature vector extraction is used to
transform the structured selected covariate into a set of numeric components
using a representation that is specific to its type. Available
transformations of such a kind are cubic B-spline expansions for functional
data and shell distributions (Carmi et al., 2007) for graphs - obtained
through k-cores (Seidman, 1983), s-cores (Eidsaa and Almaas, 2013), and
d-cores (Giatsidis et al., 2013), for binary, weighted, and directed graphs,
respectively. Then, the component most associated with the response is
selected using Energy tests of independence (Szekely et al., 2007), and the
split point for that component is chosen using the method defined by
coeff_split_type
;
"cluster"
: in this case, the observed values for the structured
selected covariate are used within a Partitioning Around Medoids (Kaufmann
and Rousseeuw, 1987) step to split observations into the two kid nodes.
Medoids calculation and units assignment are performed using
pam()
. Distances are specific to each type of
variable (see dist_comp()
for details).
coeff_split_type
defines the method to select the split point for the
chosen component of the selected structured covariate if and only if
split_type = "coeff"
. It has two possible values:
"test"
: an Energy test of independence (Szekely et al., 2007) is
performed for each possible split point of the chosen component, and the one
yielding the strongest association with the response is selected;
"traditional"
: the split point for the chosen component is
selected as the one minimizing the Gini index (for classification) or the RSS
(for regression) in the two kid nodes.
An object of class "etree"
, "constparty"
, and "party"
.
It stores all the information about the fitted tree. Its elements can be
individually accessed using the $
operator. Their names and content
are the following:
node
: a partynode
object representing
the basic structure of the tree;
data
: a list
containing the data used for the fitting
process. Traditional covariates are included in their original form, while
structured covariates are stored in the form of components if
split_type = "coeff"
or as a factor
whose levels go from 1 to
the total number of observations if split_type = "cluster"
;
fitted
: a data.frame
whose number of rows coincides with
the sample size. It includes the fitted terminal node identifiers (in
"(fitted)"
) and the response values of all observations (in
"(response)"
);
terms
: a terms
object;
names
(optional): names of the nodes in the tree. They can be
set using a character
vector: if its length is smaller than the number
of nodes, the remaining nodes have missing names; if its length is larger,
exceeding names are ignored.
R. Giubilei, T. Padellini, P. Brutti (2022). Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates. arXiv preprint. https://arxiv.org/pdf/2207.04430.pdf.
S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, and E. Shir (2007). A model of internet topology using k-shell decomposition. Proceedings of the National Academy of Sciences, 104(27):11150-11154.
M. Eidsaa and E. Almaas (2013). S-core network decomposition: A generalization of k-core analysis to weighted networks. Physical Review E, 88(6):062819.
C. Giatsidis, D. M. Thilikos, and M. Vazirgiannis (2013). D-cores: measuring collaboration of directed graphs based on degeneracy. Knowledge and information systems, 35(2):311-343.
T. Hothorn, K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674.
L. Kaufmann and P. Rousseeuw (1987). Clustering by means of medoids. Data Analysis based on the L1-Norm and Related Methods, pages 405-416.
S. B. Seidman (1983). Network structure and minimum degree. Social networks, 5(3):269-287.
G. J. Szekely, M. L. Rizzo, and N. K. Bakirov (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769-2794.
ctree()
for the partykit
implementation of
Conditional Trees (Hothorn et al., 2006).
## Covariates nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable(s) resp_reg <- cov_num ^ 2 y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0) resp_cls <- factor(y) ## Regression ## etree_fit <- etree(response = resp_reg, covariates = cov_list) print(etree_fit) plot(etree_fit) mean((resp_reg - predict(etree_fit)) ^ 2) ## Classification ## etree_fit <- etree(response = resp_cls, covariates = cov_list) print(etree_fit) plot(etree_fit) table(resp_cls, predict(etree_fit))
## Covariates nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable(s) resp_reg <- cov_num ^ 2 y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0) resp_cls <- factor(y) ## Regression ## etree_fit <- etree(response = resp_reg, covariates = cov_list) print(etree_fit) plot(etree_fit) mean((resp_reg - predict(etree_fit)) ^ 2) ## Classification ## etree_fit <- etree(response = resp_cls, covariates = cov_list) print(etree_fit) plot(etree_fit) table(resp_cls, predict(etree_fit))
Methods for objects of class "etree"
.
## S3 method for class 'etree' print( x, FUN = NULL, digits = getOption("digits") - 4, header = NULL, footer = TRUE, ... ) ## S3 method for class 'etree' length(x) ## S3 method for class 'etree' depth(x, root = FALSE, ...) ## S3 method for class 'etree' width(x, ...) ## S3 method for class 'etree' x[i, ...] ## S3 method for class 'etree' x[[i, ...]]
## S3 method for class 'etree' print( x, FUN = NULL, digits = getOption("digits") - 4, header = NULL, footer = TRUE, ... ) ## S3 method for class 'etree' length(x) ## S3 method for class 'etree' depth(x, root = FALSE, ...) ## S3 method for class 'etree' width(x, ...) ## S3 method for class 'etree' x[i, ...] ## S3 method for class 'etree' x[[i, ...]]
x |
Object of class |
FUN |
Function to be applied to nodes. |
digits |
Number of digits to be printed. |
header |
Header to be printed. |
footer |
Footer to be printed. |
... |
Additional arguments. |
root |
Logical indicating whether the root node should be counted in
|
i |
Integer specifying the root of the subtree to extract. |
The print()
method generates a textual representation of the tree.
length()
returns the number of nodes in the tree, depth()
the
depth of the tree and width()
the number of terminal nodes. The subset
methods extract subtrees starting from a given node.
print.etree
: Generates textual representation of the tree.
length.party
: Number of nodes in the tree.
depth.party
: Depth of the three.
width.party
: Number of terminal nodes.
[.etree
: Extract subtrees.
[[.etree
: Extract subtrees.
## Covariates nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable resp_reg <- cov_num ^ 2 ## Fit etree_fit <- etree(response = resp_reg, covariates = cov_list) ## Print print(etree_fit) ## Number of nodes in the tree length(etree_fit) ## Depth of the tree depth(etree_fit) ## Number of terminal nodes in the tree width(etree_fit) ## Extract subtrees etree_fit[2] etree_fit[[2]]
## Covariates nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable resp_reg <- cov_num ^ 2 ## Fit etree_fit <- etree(response = resp_reg, covariates = cov_list) ## Print print(etree_fit) ## Number of nodes in the tree length(etree_fit) ## Depth of the tree depth(etree_fit) ## Number of terminal nodes in the tree width(etree_fit) ## Extract subtrees etree_fit[2] etree_fit[[2]]
Depth and width of an Energy Tree.
depth(x, ...) width(x, ...)
depth(x, ...) width(x, ...)
x |
An object of class |
... |
Additional arguments. |
depth()
returns the depth of the tree and width()
gives the
number of terminal nodes.
depth
: Depth of the three.
width
: Number of terminal nodes in the tree.
Returns a list of values obtained by applying a function to "etree"
or
"partynode"
objects.
nodeapply(obj, ids = 1, FUN = NULL, ...) ## S3 method for class 'partynode' nodeapply(obj, ids = 1, FUN = NULL, ...) ## S3 method for class 'etree' nodeapply(obj, ids = 1, FUN = NULL, by_node = TRUE, ...)
nodeapply(obj, ids = 1, FUN = NULL, ...) ## S3 method for class 'partynode' nodeapply(obj, ids = 1, FUN = NULL, ...) ## S3 method for class 'etree' nodeapply(obj, ids = 1, FUN = NULL, by_node = TRUE, ...)
obj |
Object of class |
ids |
Integer vector of node identifiers to apply over. |
FUN |
Function to be applied to nodes. By default, the node itself is returned. |
... |
Additional arguments. |
by_node |
Logical indicating whether FUN should be applied to subsets of
|
The method for "partynode"
objects apply function FUN
to all
nodes with node identifiers in ids
. The method for "etree"
objects by default calls the nodeapply
method on the corresponding
node slot. If by_node
is FALSE
, it is applied to the
"etree"
object with root node ids
.
A list of results whose length is given by length(ids)
.
partynode
: nodeapply() method for objects of class "partynode".
etree
: nodeapply() method for objects of class "etree".
## Covariates nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable resp_reg <- cov_num ^ 2 ## Fit etree_fit <- etree(response = resp_reg, covariates = cov_list) ## Get pvalues of inner nodes tnodes <- nodeids(etree_fit, terminal = TRUE) nodes <- 1:max(tnodes) inodes <- nodes[-tnodes] nodeapply(etree_fit, ids = inodes, FUN = function(n) n$info$pvalue)
## Covariates nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable resp_reg <- cov_num ^ 2 ## Fit etree_fit <- etree(response = resp_reg, covariates = cov_list) ## Get pvalues of inner nodes tnodes <- nodeids(etree_fit, terminal = TRUE) nodes <- 1:max(tnodes) inodes <- nodes[-tnodes] nodeapply(etree_fit, ids = inodes, FUN = function(n) n$info$pvalue)
Extract unique identifiers from inner and terminals nodes of "etree"
or "partynode"
objects.
nodeids(obj, ...) ## S3 method for class 'partynode' nodeids(obj, from = NULL, terminal = FALSE, ...) ## S3 method for class 'etree' nodeids(obj, from = NULL, terminal = FALSE, ...)
nodeids(obj, ...) ## S3 method for class 'partynode' nodeids(obj, from = NULL, terminal = FALSE, ...) ## S3 method for class 'etree' nodeids(obj, from = NULL, terminal = FALSE, ...)
obj |
Object of class |
... |
Additional arguments. |
from |
Integer specifying the node to start from. |
terminal |
Logical indicating whether only identifiers of terminal nodes
should be returned ( |
An integer vector of node identifiers.
partynode
: nodeids() method for objects of class "partynode".
etree
: nodeids() method for objects of class "etree".
## Covariates nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable resp_reg <- cov_num ^ 2 ## Fit etree_fit <- etree(response = resp_reg, covariates = cov_list) ## Get all nodes identifiers nodes_ids <- nodeids(etree_fit) ## Get terminal nodes identifiers tnodes_ids <- nodeids(etree_fit, terminal = TRUE) ## Get all nodes identifiers starting from 2 nodes_ids2 <- nodeids(etree_fit, from = 2)
## Covariates nobs <- 100 cov_num <- rnorm(nobs) cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5)) cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2)) cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1) cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun) ## Response variable resp_reg <- cov_num ^ 2 ## Fit etree_fit <- etree(response = resp_reg, covariates = cov_list) ## Get all nodes identifiers nodes_ids <- nodeids(etree_fit) ## Get terminal nodes identifiers tnodes_ids <- nodeids(etree_fit, terminal = TRUE) ## Get all nodes identifiers starting from 2 nodes_ids2 <- nodeids(etree_fit, from = 2)
Returns the plot of an object of class "etree"
.
## S3 method for class 'etree' plot( x, main = NULL, terminal_panel = NULL, tp_args = list(), inner_panel = node_inner, ip_args = list(), edge_panel = edge_simple, ep_args = list(), type = c("extended", "simple"), drop_terminal = NULL, tnex = NULL, newpage = TRUE, pop = TRUE, gp = gpar(), ... )
## S3 method for class 'etree' plot( x, main = NULL, terminal_panel = NULL, tp_args = list(), inner_panel = node_inner, ip_args = list(), edge_panel = edge_simple, ep_args = list(), type = c("extended", "simple"), drop_terminal = NULL, tnex = NULL, newpage = TRUE, pop = TRUE, gp = gpar(), ... )
x |
An object of class |
main |
Optional title for the plot. |
terminal_panel |
Optional panel function of the form
|
tp_args |
List of arguments passed to |
inner_panel |
Optional panel function of the form
|
ip_args |
List of arguments passed to |
edge_panel |
Optional panel function of the form
|
ep_args |
List of arguments passed to |
type |
Character specifying the complexity of the plot:
|
drop_terminal |
Logical indicating whether all terminal nodes should be plotted at the bottom. |
tnex |
Numeric value giving the terminal node extension in relation to the inner nodes. |
newpage |
Logical indicating whether |
pop |
Logical indicating whether the viewport tree should be popped before return. |
gp |
Graphical parameters. |
... |
Additional arguments. |
The plot()
method for "etree"
objects allows for the
visualization of fitted Energy Trees, as returned by
etree()
or as contained in the ensemble
element of a fitted Random Energy Forest.
No return value, called for side effects (plotting the tree).
Compute predictions for objects of class "eforest"
(i.e., as returned
by eforest()
).
## S3 method for class 'eforest' predict(object, newdata = NULL, ...)
## S3 method for class 'eforest' predict(object, newdata = NULL, ...)
object |
A fitted Energy Forest of class |
newdata |
Optional set of new covariates used to make predictions. Must be provided as a list, where each element is a different variable. Currently available types and the form they need to have to be correctly recognized are the following:
Each element (i.e., variable) in the covariates list must have the same
|
... |
Additional arguments. |
The predict()
method for "eforest"
objects computes predictions
for Energy Forests as returned by eforest()
.
Predictions are based either on the fitted values (if newdata
is
NULL
) or on the new set of covariates (when newdata
is
provided). In both cases, each tree in object$ensemble
is used to make
predictions by calling predict()
on it
(with the same specification of newdata
). Then, individual trees'
predictions for any single observation are combined by majority voting rule
for classification or by arithmetic mean for regression.
Predictions, in the form of a factor for classification or as a numeric vector for regression.
Compute predictions for objects of class "etree"
(i.e., fitted Energy
Trees as returned by etree()
, or as contained in
the ensemble
element of a fitted Random Energy Forest).
## S3 method for class 'etree' predict(object, newdata = NULL, perm = NULL, ...)
## S3 method for class 'etree' predict(object, newdata = NULL, perm = NULL, ...)
object |
A fitted Energy Tree of class |
newdata |
Optional set of new covariates used to make predictions. Must be provided as a list, where each element is a different variable. Currently available types and the form they need to have to be correctly recognized are the following:
Each element (i.e., variable) in the covariates list must have the same
|
perm |
Optional character vector of variable names. Splits of nodes
with a primary split in any of these variables will be permuted (after
dealing with surrogates). Note that surrogate split in the |
... |
Additional arguments. |
The predict()
method for "etree"
objects yields predictions for
fitted Energy Trees as returned by etree()
or as
contained in the ensemble
element of a fitted Random Energy Forest.
Predictions are based either on fitted values (if newdata
is
NULL
) or on the new set of covariates (if newdata
is provided).
The values of split_type
and coeff_split_type
, as well as the
number of components for each structured covariate (needed to compute an
equivalent representation for the covariates in newdata
when
split_type = "coeff"
), are automatically retrieved from the object of
class "etree"
.
Predictions, in the form of a factor for classification or as a numeric vector for regression.