
Machine Learning Random Forests and Boosting

Key Ideas

In developing a machine learning prediction model, we need to ask questions of each variable: what measurable, quantitative threshold qualifies an observation for one group rather than another? For example, applying this idea to the Obama-Clinton primary data, we look at the variable that captures a county's demographics. Our decision rule for splitting on that variable could be: if a county is more than 20% African-American, we subdivide the counties into two groups. We then advance through the remaining variables, asking, for instance, whether a county's high-school graduation rate was higher than 78%, and divide on that variable into two subgroups as well. This process continues until we exhaust all of our predictive variables.

Algorithm to Build Decision Tree

  1. Start with one root node containing all the observations
  2. Identify the variable (and cut-point) that best splits the outcome
  3. Create the next level of the binary tree with two leaf nodes
  4. Within each of the two new nodes, recursively split on the variables as in step 2
  5. Continue until we reach the base case of sufficiently small and pure (read: homogeneous) groups to predict the outcome (a minimal sketch in R follows this list)
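
To make the recursion concrete, here is a minimal sketch, assuming the rpart package is available, that grows a classification tree on the iris data used in the example further down:

library(rpart)
data(iris)

# rpart recursively picks the variable and cut-point that most improve
# node purity (Gini by default) and stops when nodes are small or pure
treeFit <- rpart(Species ~ ., data = iris, method = "class")

print(treeFit)                 # text view of the split at each node
plot(treeFit); text(treeFit)   # quick plot of the fitted tree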

Measures of Impurity

  • Misclassification Error
  • Gini Index
  • Deviance and Information Gain
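
As a rough sketch of how these measures behave, assuming a vector p of class proportions within a node (the function names below are illustrative, not from any package):

# p: vector of class proportions within a node (sums to 1)
misclassification_error <- function(p) 1 - max(p)
gini_index <- function(p) sum(p * (1 - p))
deviance_impurity <- function(p) -sum(p[p > 0] * log(p[p > 0]))

# Example: a node that is 80% one species and 20% another
p <- c(0.8, 0.2)
misclassification_error(p)   # 0.20
gini_index(p)                # 0.32
deviance_impurity(p)         # ~0.50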

Example

In RStudio:


data(iris)
library(ggplot2)
library(caret)   # createDataPartition() comes from caret
names(iris)
table(iris$Species)

# Trying to predict Species
# Separate data into training and test sets
inTrain <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing <- iris[-inTrain, ]
dim(training)
dim(testing)

# Assert: should split 105/45 into training/testing respectively
qplot(Petal.Width, Sepal.Width, colour = Species, data = training)
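
The example stops at the exploratory plot. As a minimal continuation sketch, assuming the caret and rpart packages, a classification tree could be trained on the training set and checked against the held-out test set:

# Fit a classification tree via caret and inspect the splits
modFit <- train(Species ~ ., method = "rpart", data = training)
print(modFit$finalModel)

# Predict on the held-out test set and summarize accuracy
confusionMatrix(predict(modFit, newdata = testing), testing$Species)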

 

Any programming problem can be solved by adding a level of indirection.

– David J. Wheeler

Machine Learning Predictive Modeling Methods

Bagging

Start with bagging (bootstrap aggregating): resample the training data with replacement, fit a tree to each bootstrap sample, and combine the resulting predictions by averaging or majority vote.
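
A minimal sketch, assuming the ipred package that backs caret's "treebag" method, using the same training/testing split as above:

# Bagging: many trees on bootstrap resamples, aggregated by majority vote
bagFit <- train(Species ~ ., method = "treebag", data = training)
confusionMatrix(predict(bagFit, newdata = testing), testing$Species)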

Random Forest

Then, build a prediction model using a random forest: grow many trees on bootstrap samples, but at each split consider only a random subset of the predictors, and aggregate the trees' votes.
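
A minimal sketch, assuming the randomForest package behind caret's "rf" method:

# Random forest: bootstrap samples plus a random subset of predictors at each split
rfFit <- train(Species ~ ., method = "rf", data = training)
print(rfFit)
confusionMatrix(predict(rfFit, newdata = testing), testing$Species)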

Boosting

Contrast the random forest with boosting: rather than growing independent trees in parallel, boosting fits weak learners sequentially, tweaking the weights so that observations misclassified by earlier learners count more for later ones, and then combines the learners into a weighted vote.
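
A minimal sketch, assuming the gbm package behind caret's "gbm" method:

# Boosting: trees fit sequentially, each focusing on cases the previous
# trees got wrong, combined into a weighted ensemble
boostFit <- train(Species ~ ., method = "gbm", data = training, verbose = FALSE)
print(boostFit)
confusionMatrix(predict(boostFit, newdata = testing), testing$Species)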
