--- title: "Variable Selection Methods" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Variable Selection Methods} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, message=FALSE} library(olsrr) library(ggplot2) library(gridExtra) library(nortest) library(goftest) ``` ## Introduction Variable selection refers to the process of choosing the most relevant variables to include in a regression model. They help to improve model performance and avoid over fitting. Before we explore stepwise selection methods, let us take a quick look at all/best subset regression. As they evaluate every possible variable combination, these methods are computationally intensive and may crash your system if used with a large set of variables. We have included them in the package purely for educational purpose. ### All Possible Regression All subset regression tests all possible subsets of the set of potential independent variables. If there are K potential independent variables (besides the constant), then there are $2^{k}$ distinct subsets of them to be tested. For example, if you have 10 candidate independent variables, the number of subsets to be tested is $2^{10}$, which is 1024, and if you have 20 candidate variables, the number is $2^{20}$, which is more than one million. ```{r allsub} model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars) ols_step_all_possible(model) ``` ### Best Subset Regression Select the subset of predictors that do the best at meeting some well-defined objective criterion, such as having the largest R2 value or the smallest MSE, Mallow's Cp or AIC. ```{r bestsub, size='tiny'} model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars) ols_step_best_subset(model) ``` ## Stepwise Selection Stepwise regression is a method of fitting regression models that involves the iterative selection of independent variables to use in a model. It can be achieved through forward selection, backward elimination, or a combination of both methods. The forward selection approach starts with no variables and adds each new variable incrementally, testing for statistical significance, while the backward elimination method begins with a full model and then removes the least statistically significant variables one at a time. ### Model We will use the below model throughout this article except in the case of hierarchical selection. You can learn more about the data [here](https://olsrr.rsquaredacademy.com/reference/surgical). ```{r model} model <- lm(y ~ ., data = surgical) summary(model) ``` ### Model specification Irrespective of the stepwise method being used, we have to specify the full model i.e. all the variabels/predictors under consideration as `olsrr` extracts the candidate variables for selection/elimination from the model specified. ##### Forward selection ```{r stepf1} # stepwise forward regression ols_step_forward_p(model) ``` ##### Backward elimination ```{r stepb} # stepwise backward regression ols_step_backward_p(model) ``` ### Criteria The criteria for selecting variables may be one of the following: - p value - akaike information criterion (aic) - schwarz bayesian criterion (sbc) - sawa bayesian criterion (sbic) - r-square - adjusted r-square ### Include/exclude variables We can force variables to be included or excluded from the model at all stages of variable selection. The variables may be specified either by name or position in the model specified. ##### By name ```{r include_name} ols_step_forward_p(model, include = c("age", "alc_mod")) ``` ##### By index ```{r include_index} ols_step_forward_p(model, include = c(5, 7)) ``` ### Standardized output All stepwise selection methods display standard output which includes: - selection summary - model summary - ANOVA - parameter estimates ```{r output} # adjusted r-square ols_step_forward_adj_r2(model) ``` ### Visualization Use the `plot()` method to visualize variable selection. It will display how the variable selection criteria changes at each step of the selection process along with the variable selected. ```{r visualize} # adjusted r-square k <- ols_step_forward_adj_r2(model) plot(k) ``` ### Verbose output To view the detailed regression output at each stage of variable selection/elimination, set `details` to `TRUE`. It will display the following information at each step: - step number - variable selected/eliminated - model - value of the criteria at that stage ```{r details} # adjusted r-square ols_step_forward_adj_r2(model, details = TRUE) ``` ### Progress To view the progress in the variable selection procedure, set `progress` to `TRUE`. It will display the variable being selected/eliminated at each step until there are no more candidate variables left. ```{r progress} # adjusted r-square ols_step_forward_adj_r2(model, progress = TRUE) ``` ### Hierarchical selection When using `p` values as the criterion for selecting/eliminating variables, we can enable hierarchical selection. In this method, the search for the most significant variable is restricted to the next available variable. In the below example, as `liver_test` does not meet the threshold for selection, none of the variables after `liver_test` are considered for further selection i.e. the stepwise selection ends as soon as it comes across a variable that does not meet the selection threshold. You can learn more about hierachichal selection [here](https://www.stata.com/manuals/rstepwise.pdf). ```{r hierarchical} # hierarchical selection m <- lm(y ~ bcs + alc_heavy + pindex + enzyme_test + liver_test + age + gender + alc_mod, data = surgical) ols_step_forward_p(m, 0.1, hierarchical = TRUE) ```