Machine Learning Random Forests and Boosting
Key Ideas
In developing a machine learning prediction model, we ask a question of each variable: what measurable, quantitative threshold assigns an observation to one group versus another? For example, applying this to the Obama-Clinton election data, we look at the variable that captures the demographics of a county. A decision rule for splitting could be: if a county is greater than 20% African-American, we subdivide the counties into two groups. We then advance through the remaining variables, asking, say, whether a county's high-school graduation rate is higher than 78%, and split that variable into two subgroups. This process continues until we exhaust all of our predictive variables.
Algorithm to Build Decision Tree
- Start with one root group containing all the data
- Identify the variable that best splits on the desired outcome
- Create the next level of the binary tree with two leaf nodes
- Within each of the two new nodes, recursively split on the variables as in step 2 above
- Continue until we reach the base case: groups sufficiently small and pure (read: homogeneous) to predict the outcome
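The recursive partitioning above is exactly what the rpart package implements; a minimal sketch on the iris data (the minsplit value is illustrative):

```r
# Illustrative sketch: grow a classification tree with rpart,
# which implements the recursive binary splitting described above.
library(rpart)

data(iris)
# method = "class" requests classification splits;
# minsplit controls when a node is "sufficiently small" to stop splitting.
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20))
print(fit)  # each line shows a split rule and the purity of that node
```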
Measures of Impurity
- Misclassification Error
- Gini Index
- Deviance and Information Gain
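For a node with class-proportion vector p, these impurity measures can be computed directly. A base-R sketch using the standard formulas (function names are my own):

```r
# Impurity measures for a node, given its vector of class proportions p
misclassification <- function(p) 1 - max(p)
gini              <- function(p) 1 - sum(p^2)
deviance_impurity <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))  # entropy, in bits

p <- c(0.5, 0.5)          # a perfectly mixed two-class node
misclassification(p)      # 0.5
gini(p)                   # 0.5
deviance_impurity(p)      # 1, the maximum entropy for two classes
```

A pure node (p = c(1, 0)) scores 0 on all three measures; information gain is the drop in deviance from a parent node to its children.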
Example
In RStudio:
data(iris)
library(ggplot2)
names(iris)
table(iris$Species)
# Trying to predict species
# Separate data into training and test sets
library(caret)  # provides createDataPartition
inTrain <- createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training <- iris[inTrain, ]
testing <- iris[-inTrain, ]
dim(training)
dim(testing)
# Should split 105/45 into training/testing respectively
qplot(Petal.Width, Sepal.Width, colour=Species, data = training)
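From here, the natural next step is to fit a tree on the training set and check it on the test set; a sketch using caret's train with method = "rpart" (default tuning, shown for illustration):

```r
library(caret)
library(rpart)

data(iris)
inTrain  <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# Fit a single classification tree on the training set
modFit <- train(Species ~ ., method = "rpart", data = training)
print(modFit$finalModel)

# Predict the held-out species and summarize accuracy
preds <- predict(modFit, newdata = testing)
confusionMatrix(preds, testing$Species)
```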
Any programming problem can be solved by adding a level of indirection.
– David J. Wheeler
Machine Learning Predictive Modeling Methods
Bagging
Start with bagging: fit the same model to many bootstrap resamples of the training data and average (or majority-vote) their predictions, which reduces variance without increasing bias.
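A minimal sketch of bagging by hand, resampling rows, fitting a tree on each resample, and majority-voting (the number of resamples B and the use of rpart are illustrative):

```r
library(rpart)
data(iris)

B <- 25
# Fit one tree per bootstrap resample of the rows
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Bag the predictions: majority vote across the B trees
votes  <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == iris$Species)  # training accuracy of the bagged ensemble
```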
Random Forest
Then, build the prediction model using a Random Forest: bagged trees in which each split also considers only a random subset of the variables, which de-correlates the trees.
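A sketch using the randomForest package (ntree is illustrative; mtry defaults to roughly the square root of the number of predictors for classification):

```r
library(randomForest)
data(iris)

set.seed(33)
# mtry is the number of variables randomly sampled
# as split candidates at each node
rfFit <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rfFit)       # includes the out-of-bag error estimate
importance(rfFit)  # variable importance from the forest
```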
Boosting
Contrast the Random Forest with boosting: instead of independent trees on resamples, boosting fits many weak learners sequentially, tweaking the weights so that observations the previous learners got wrong count more, then combines the learners as a weighted sum.
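A sketch of boosted trees via caret's "gbm" method on the same train/test split (default tuning grid, shown for illustration):

```r
library(caret)
data(iris)

inTrain  <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# method = "gbm" fits stochastic gradient boosted trees;
# verbose = FALSE suppresses the per-iteration training log
boostFit <- train(Species ~ ., method = "gbm", data = training, verbose = FALSE)
confusionMatrix(predict(boostFit, testing), testing$Species)
```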