[2017-02-01] Regression is now supported in RRF.

[2014-07-08] Introducing "inTrees" (interpretable trees), a framework and an R package for extracting, measuring, pruning, selecting and summarizing rules from a tree ensemble (so far including random forest, RRF and gbm). For LaTeX users: these rules can easily be formatted as LaTeX code.

Feedback can be sent to: hdeng3@asu.edu

Technical details:

Houtao Deng, George Runger, "Gene Selection with Guided Regularized Random Forest", Pattern Recognition, 46.12 (2013): 3483-3489
>> This paper describes the algorithms used in the RRF package. The guided RRF (GRRF) is an enhanced RRF that is guided by the importance scores from an ordinary random forest.

Houtao Deng, George Runger, "Feature Selection via Regularized Trees", The 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012.
>> This paper describes regularized trees, including the original RRF.

Houtao Deng, "Guided Random Forest in the RRF Package", arXiv:1306.0237, 2013.
>> One can assign a weight in [0, 1] to each feature; features with larger weights are then preferred when building the GRF.
>> In the paper the weight is determined by the importance scores from an ordinary random forest, but it can be determined in other ways, e.g., from expert insight on which features may be more important.
>> Experiments on 10 high-dimensional gene data sets (#features >> #instances) show that, with a fixed parameter value (no parameter tuning), RF applied to the features selected by GRF outperforms RF applied to all features on 9 data sets, and 7 of the differences are significant at the 0.05 level. Thus, for these data sets, GRF improves RF's performance by removing irrelevant features.

Sample code:
Gene data sets:
Code slides on variable (feature) ranking and selection using random forests

Usage of RRF/GRRF

A minimal example (the data-generation step here is illustrative):

library(RRF); set.seed(1)  # install the RRF package before running the script
# Generate data; X1 and X21 are the truly useful features
X <- matrix(runif(200 * 30, min = -1, max = 1), ncol = 30)
y <- as.factor(X[, 1] + X[, 21] > 0)
# Feature selection via RRF
subsetRRF <- RRF(X, y, flagReg = 1)$feaSet
# Feature selection via GRRF, with penalties guided by ordinary RF importance scores
imp <- RRF(X, y, flagReg = 0)$importance[, "MeanDecreaseGini"]
subsetGRRF <- RRF(X, y, flagReg = 1, coefReg = 0.5 + 0.5 * imp / max(imp))$feaSet
print(subsetRRF)  # the RRF subset includes many more noisy variables than the GRRF subset

====== Notes ======

If the RRF and randomForest packages are loaded in the same R session, you may use "package::function" for functions that appear in both packages, for example RRF::importance() or randomForest::importance().

Feature selection + classification procedure: the OOB error of (G)RRF should not be used for performance evaluation (a sketch of a proper hold-out procedure is given below).

What is the difference between RRF and guided RRF? Guided RRF is recommended. RRF is greedy in the feature selection process: variables are evaluated on only a subsample of the data/variables at each node. Guided RRF uses the importance scores from an ordinary random forest to guide the feature selection process of RRF; the motivation is to assign smaller penalties to variables with larger importance scores. Both RRF and guided RRF are available in the 'RRF' package. If classification accuracy is the only goal, you may simply use RRF with minimal regularization, that is, coefReg = 1; the number of features used in the forest, though, can be large.

What is the difference between guided random forest (GRF) and (G)RRF? GRF selects a subset of RELEVANT features, while (G)RRF selects a subset of relevant and non-redundant features. GRF often selects many more features than GRRF (sometimes most of the features), but it may lead to better classification accuracy than GRRF (demonstrated in "Guided Random Forest in the RRF Package"). GRF can also be implemented in a distributed computing framework such as Hadoop, since each base tree is built independently.
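The guided random forest itself can be run through the same RRF() function by turning the regularization off (flagReg = 0) and passing the per-feature weights through coefReg. A minimal sketch, reusing the X, y, imp and subsetGRRF objects from the usage example above; here the weights come from ordinary RF importance scores, but as noted above they could equally come from expert insight:

# Guided random forest (GRF): per-feature weights in [0, 1], no regularization term
weights <- imp / max(imp)                         # normalize importance scores to [0, 1]
grf <- RRF(X, y, flagReg = 0, coefReg = weights)
subsetGRF <- grf$feaSet                           # typically larger than the GRRF subset
length(subsetGRF); length(subsetGRRF)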
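As noted above, the OOB error reported by (G)RRF should not be used for performance evaluation, since the same data drive both the feature selection and the OOB estimate. A minimal sketch of a hold-out procedure, reusing X and y from the usage example above (the 70/30 split is illustrative):

set.seed(2)
train <- sample(nrow(X), floor(0.7 * nrow(X)))           # keep 30% of the instances for testing
impTrain <- RRF(X[train, ], y[train], flagReg = 0)$importance[, "MeanDecreaseGini"]
fea <- RRF(X[train, ], y[train], flagReg = 1,
           coefReg = 0.5 + 0.5 * impTrain / max(impTrain))$feaSet  # GRRF on the training data only
rf <- RRF(X[train, fea, drop = FALSE], y[train], flagReg = 0)      # ordinary RF on the selected features
pred <- predict(rf, X[-train, fea, drop = FALSE])
mean(pred != y[-train])                                  # test error, not OOB error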
What is the key difference between RRF's and RF's importance scores? The relationship between RRF and RF is similar to the relationship between LASSO and ordinary regression: the regularization penalizes splitting on a feature that has not been used before, so features outside the selected subset receive zero importance in RRF, much as LASSO shrinks the coefficients of redundant predictors to exactly zero.
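The strength of the regularization is set by coefReg in (0, 1]: a smaller value penalizes splitting on new features more heavily and therefore tends to select fewer features, much as a larger LASSO penalty drives more coefficients to zero. A minimal sketch, reusing X and y from the usage example above:

# Stronger regularization (smaller coefReg) tends to leave fewer features in the forest
for (lambda in c(1, 0.9, 0.7, 0.5)) {
  fea <- RRF(X, y, flagReg = 1, coefReg = lambda)$feaSet
  cat("coefReg =", lambda, "->", length(fea), "features selected\n")
}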
For a large number of instances, how can RF/RRF/GRRF be made faster? You can increase the "nodesize" parameter. The default value of nodesize is 1, i.e., complete trees are grown, which may not be necessary when there is a large number of instances.
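A minimal sketch (the data here are illustrative): a larger nodesize stops splitting earlier, giving shallower trees and faster training, usually at little cost in accuracy when the number of instances is large.

library(RRF); set.seed(3)
Xbig <- matrix(rnorm(20000 * 20), ncol = 20)
ybig <- as.factor(Xbig[, 1] + Xbig[, 2] > 0)
system.time(RRF(Xbig, ybig, flagReg = 0, ntree = 100))                  # default nodesize = 1
system.time(RRF(Xbig, ybig, flagReg = 0, ntree = 100, nodesize = 200))  # shallower trees, noticeably faster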