18 Random Forest Modeling

In this tutorial, we’ll explore feature engineering, train/test splitting, and model selection with random forests. We’ll introduce caret, a popular R package of machine learning tools. The dataset we’ll be using, Bejaia_ForestFires.csv, contains information on forest fire conditions in the Bejaia region of Algeria. We are going to use machine learning practices to extract trends from the dataset and attempt to predict whether a forest fire will occur on a particular day given the environmental conditions. Here are the variables within the dataset:

Weather Observations

Date
Day, month (‘june’ to ‘september’) for the year 2012
Temperature
Max daily temperature in degrees Celsius: 22 to 42
RH
Relative Humidity in %: 21 to 90
Ws
Wind speed in km/h: 6 to 29
Rain
Total daily precip in mm: 0 to 16.8

Fire Weather Indices

FFMC
Fine Fuel Moisture Code index from the FWI system: 28.6 to 92.5
DMC
Duff Moisture Code index from the FWI system: 1.1 to 65.9
DC
Drought Code index from the FWI system: 7 to 220.4
ISI
Initial Spread Index from the FWI system: 0 to 18.5
BUI
Buildup Index from the FWI system: 1.1 to 68
FWI
Fire Weather Index: 0 to 31.1

Class Component

Class: Forest fire presence (fire) or absence (no fire)

18.1 Prepare the Data

The Bejaia_ForestFires.csv dataset is located in the course datasets folder. Let’s open it and plot the Forest Fire presence/absence (the Class variable).

library(ggplot2)
library(tibble)
ff_data = read.csv("/Users/james/Documents/Github/geog473-673/datasets/Bejaia_ForestFires.csv")
ff_data = as_tibble(ff_data)  # as.tibble() is deprecated in newer versions of tibble
# response variable used for classification
ggplot(ff_data, aes(x = Class, fill = Class)) +
  geom_bar()
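
The bar chart shows the split visually; tabulating the counts gives the same information numerically (a quick check):

# tabulate the class balance behind the bar chart
table(ff_data$Class)
prop.table(table(ff_data$Class))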

Now, let’s use the gather function from tidyr (loaded as part of the tidyverse) to reshape the data so we can evaluate each variable against Class (i.e., Fire or No Fire). Once we have our gathered data frame, let’s show a density plot for each variable.

library(tidyverse)
gathered = gather(ff_data, x, y, day:FWI)

ggplot(data = gathered, aes(x = y, color = Class, fill = Class)) +
    geom_density(alpha = 0.3) +
    facet_wrap( ~ x, scales = "free", ncol = 3)
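
Note that gather() has been superseded in newer versions of tidyr. If you’re on tidyr >= 1.0, an equivalent reshape (a sketch producing the same x/y columns) is:

# equivalent reshape with the newer tidyr verb
gathered2 = pivot_longer(ff_data, cols = day:FWI,
                         names_to = "x", values_to = "y")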

18.2 Dimensionality Reduction

Sometimes datasets contain too many variables, and evaluating which ones are important can be difficult. Fortunately, there are statistical methods we can use in R to reduce the number of dimensions while retaining the high-impact information.

18.2.1 Principal Component Analysis

Principal Component Analysis (PCA) is a popular statistical method that uses the correlations and covariances between variables to reduce a dataset containing many of them. PCA projects the data onto fewer dimensions (the principal components), each of which is a linear combination of the original variables. The principal components are ordered from most explained variance to least explained variance (of the original variables). PCA can also reveal important features of the data such as outliers and departures from a multivariate normal distribution. First, we’ll need to cut the Fire Weather Indices out of our dataset. These indices are calculated from environmental conditions and are used to help predict forest fires. We’re going to remove them and attempt to predict forest fire occurrence from the raw weather variables alone.

library(ellipse)

# keep only the weather variables; convert Class to a factor and then
# to numeric so we can run prcomp on an all-numeric matrix
cutdata = data.frame(Temperature = ff_data$Temperature,
                     Rain = ff_data$Rain,
                     RH = ff_data$RH,
                     Ws = ff_data$Ws,
                     Class = as.numeric(as.factor(ff_data$Class)))
head(cutdata)
##   Temperature Rain RH Ws Class
## 1          29  0.0 57 18     2
## 2          29  1.3 61 13     2
## 3          26 13.1 82 22     2
## 4          25  2.5 89 13     2
## 5          27  0.0 77 16     2
## 6          31  0.0 67 14     1
# perform pca using prcomp and extract scores 
pcaOutput <- prcomp(as.matrix(cutdata), scale = TRUE, center = TRUE)
pcaOutput
## Standard deviations (1, .., p=5):
## [1] 1.5839583 0.9756049 0.8161873 0.7594599 0.5443622
## 
## Rotation (n x k) = (5 x 5):
##                    PC1        PC2        PC3         PC4         PC5
## Temperature  0.5424264 -0.2172267 -0.1829743  0.18263364 -0.76925388
## Rain        -0.4447857 -0.3014978 -0.5223649 -0.61317843 -0.24982364
## RH          -0.4889724  0.1941107  0.6351261 -0.05239005 -0.56311381
## Ws          -0.2986886 -0.8190447  0.1429360  0.45867885  0.09557101
## Class       -0.4238282  0.3916610 -0.5194607  0.61443158 -0.14002013
# create separate dataframe
pcaOutput2 <- as.data.frame(pcaOutput$x)
  
# add Class as variable for the groups
pcaOutput2$groups <- ff_data$Class 

# calculate centroids of PC1 and PC2
centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)

# calculate 95% confidence ellipses so we can plot these polygons over ggplot
conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
  data.frame(groups = as.character(t),
             ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                   centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                   level = 0.95),
             stringsAsFactors = FALSE)))

# Plot PC1 and PC2
ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = as.factor(groups), color = as.factor(groups))) + 
    geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
    geom_point(size = 2, alpha = 0.6) + 
  labs(title = "Principal Component Analysis of Bejaia Forest Fires: PC1 and PC2")
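
How much of the total variance do the first two components actually capture? summary() on a prcomp object reports the proportion of variance explained per component, and a scree plot shows the drop-off:

# proportion of variance explained by each principal component
summary(pcaOutput)
# scree plot of the component variances
screeplot(pcaOutput, type = "lines")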

18.2.2 t-SNE Dimensionality Reduction

t-SNE (t-distributed Stochastic Neighbor Embedding) removes redundant information while preserving the high-impact structure of a dataset. t-SNE is similar to PCA, with the main difference being that t-SNE is a nonlinear method and therefore retains nonlinear structure. First, we’ll calculate a proximity matrix for our ff_data data frame using the dist function, which calculates the Euclidean distances between the rows of the dataset. Then, we’ll use the tsne function from the tsne package.

library(tsne)

# drop the non-numeric Class column before computing Euclidean distances
sne_dist = dist(ff_data[, sapply(ff_data, is.numeric)])
# t-SNE starts from a random initialization, so set a seed for reproducibility
set.seed(42)
sne_data = as.data.frame(tsne(sne_dist))
sne_data$Class = ff_data$Class

ggplot(data = sne_data, aes(x = V1, y = V2, color = Class)) +
    geom_point()
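
t-SNE results are sensitive to the perplexity parameter (roughly, the effective number of neighbors each point considers; the tsne default is 30). As a sketch, here’s a re-run with a smaller, purely illustrative value:

# re-run t-SNE with a smaller perplexity (the value here is illustrative)
set.seed(42)
sne_p10 = as.data.frame(tsne(sne_dist, perplexity = 10))
sne_p10$Class = ff_data$Class

ggplot(data = sne_p10, aes(x = V1, y = V2, color = Class)) +
    geom_point()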

18.2.3 Multidimensional Scaling

Multidimensional scaling (MDS) is a technique that creates a map displaying the relative positions of a number of objects, given only a table of the distances between them. The workflow is similar to t-SNE in R, but instead of tsne we’ll use cmdscale to perform classical multidimensional scaling on the proximity matrix.

# create proximity matrix from the numeric variables only
prox = dist(ff_data[, sapply(ff_data, is.numeric)])
# perform classical multidimensional scaling
m_scale = cmdscale(prox)
m_scale = as.data.frame(m_scale)
# add the class variable back in
m_scale$Class = ff_data$Class

# plot
ggplot(data = m_scale, aes(x = V1, y = V2, color = Class)) +
    geom_point() +
  labs(title = 'Multidimensional Scaling of Bejaia Forest Fire Data')
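
Classical MDS can also report how much of the total variation the two plotted dimensions capture. Re-running cmdscale with eig = TRUE returns a goodness-of-fit measure:

# keep the eigenvalues to check goodness of fit
mds_full = cmdscale(prox, k = 2, eig = TRUE)
# proportion of variation captured by the first two dimensions
mds_full$GOF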

18.3 Random Forests

Random Forest is a machine learning method for classification or regression predictions. These predictions are based on an ensemble of many decision trees, which serve as the building blocks for random forests. The caret package is a popular machine learning package in R that can be used to create random forest models. Our goal is to create a random forest model that predicts the presence/absence of a forest fire given environmental conditions.

18.3.1 Decision Trees

Classification in R can be done using the rpart and rpart.plot packages, which calculate and plot decision trees (aka classification trees) for a given scenario. In our case, we’ll calculate and plot the decision tree for whether or not there is a forest fire. Before we proceed: whenever random values feed into a model in R, it’s good practice to use the set.seed() function. set.seed() sets the starting number used to generate a sequence of random numbers, which ensures that you get the same result each time you run the same process with the same seed.
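
As a quick illustration of what set.seed() buys us, re-seeding reproduces the exact same random draws:

set.seed(42)
runif(3)
set.seed(42)
runif(3)  # identical to the three draws above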

library(rpart)
library(rpart.plot)
library(caret)

# cut the data, but this time preserve the 'Fire'/'No Fire' character labels in ff_data$Class
cutdata = data.frame(Temperature = ff_data$Temperature, Rain = ff_data$Rain, RH = ff_data$RH, Ws = ff_data$Ws, Class = ff_data$Class)
cutdata$Class = as.factor(cutdata$Class)

# relabel the factor levels for cleaner labels in the plots
levels(cutdata$Class) = c("Fire", "No Fire")

# set the seed
set.seed(42)
# calculate the rpart model
fit <- rpart(Class ~ .,
            data = cutdata,
            method = "class",
            control = rpart.control(xval = 10, 
                                    minbucket = 2, 
                                    cp = 0), 
             parms = list(split = "information"))

# plot the fit!
rpart.plot(fit, extra = 100)
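
With cp = 0, the tree above is grown out fully, which tends to overfit. As a sketch of a common follow-up, rpart stores a cross-validated error for each complexity parameter value, and prune() can cut the tree back to the best one:

# cross-validated error at each complexity parameter value
printcp(fit)
# prune to the cp with the lowest cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
rpart.plot(pruned, extra = 100)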

18.3.2 Data Partition

The goal of this exercise is to predict the presence/absence of a forest fire given environmental conditions. We’ll need to provide the model with a training dataset to learn from and a test dataset to test itself against. This can be done using createDataPartition from the caret package. Then, let’s compute a correlation matrix with cor and visualize it with corrplot.

# set the seed
set.seed(42)
# create test and train datasets
index <- createDataPartition(as.factor(cutdata$Class), p = 0.7, list = FALSE)
train_data <- cutdata[index, ]
test_data  <- cutdata[-index, ]

library(corrplot)

# calculate correlation matrix
num_train_data = train_data
num_train_data$Class = as.numeric(num_train_data$Class)
corMatMy <- cor(num_train_data)
corrplot(corMatMy, order = "hclust")

highlyCor <- colnames(train_data)[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## All correlations <= 0.7
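
Because createDataPartition samples within each class, the train and test splits should preserve the original Fire/No Fire proportions. We can verify that directly:

# class proportions should be similar across the two splits
prop.table(table(train_data$Class))
prop.table(table(test_data$Class))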

18.3.3 Training the Model

Now, let’s use our training dataset to train a model using random forest methods. Remember, we’re training our model to predict the Class variable (i.e., Fire vs. No Fire) based on the other variables (temperature, relative humidity, etc.).

# set the seed
set.seed(42)

# create the model
model_rf <- caret::train(Class ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 5, 
                                                  repeats = 3, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))

# print the random forest model output 
model_rf
## Random Forest 
## 
## 87 samples
##  4 predictor
##  2 classes: 'Fire', 'No Fire' 
## 
## Pre-processing: scaled (4), centered (4) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 70, 70, 69, 70, 69, 69, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8082789  0.6159185
##   3     0.8198257  0.6380856
##   4     0.8274510  0.6524226
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
# print out the confusion matrix 
model_rf$finalModel$confusion
##         Fire No Fire class.error
## Fire      34       8   0.1904762
## No Fire    8      37   0.1777778
# plot variable importance of the model
importance <- varImp(model_rf, scale = TRUE)
plot(importance)

Rain is the most important variable for the presence/absence of forest fires. Relative humidity and temperature carry importance, but not as much as rain. Wind speed doesn’t seem to be all that important.
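
If you want the numbers behind the plot, the scaled scores live in the varImp object, and the underlying randomForest fit (caret's "rf" method wraps the randomForest package) stores the unscaled Gini importance:

# scaled importance scores (0-100) behind the plot
importance$importance
# unscaled mean decrease in Gini from the underlying randomForest fit
randomForest::importance(model_rf$finalModel)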

18.3.4 Testing the Model

In order to test our random forest model, model_rf, we’ll need to use a confusion matrix. A confusion matrix is a matrix used for evaluating the performance of a classification model: it compares the actual target values (the test data’s Class) with those predicted by model_rf. This gives us a holistic view of how well our random forest model is performing and shows what kinds of errors it is making. After we get some statistics from the confusion matrix, we’ll run the model on test_data with the predict function to get class probabilities.

# Calculate the confusion matrix
confusionMatrix(predict(model_rf, test_data), as.factor(test_data$Class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fire No Fire
##    Fire      16       3
##    No Fire    1      15
##                                          
##                Accuracy : 0.8857         
##                  95% CI : (0.7326, 0.968)
##     No Information Rate : 0.5143         
##     P-Value [Acc > NIR] : 3.724e-06      
##                                          
##                   Kappa : 0.772          
##                                          
##  Mcnemar's Test P-Value : 0.6171         
##                                          
##             Sensitivity : 0.9412         
##             Specificity : 0.8333         
##          Pos Pred Value : 0.8421         
##          Neg Pred Value : 0.9375         
##              Prevalence : 0.4857         
##          Detection Rate : 0.4571         
##    Detection Prevalence : 0.5429         
##       Balanced Accuracy : 0.8873         
##                                          
##        'Positive' Class : Fire           
## 
# Predict the model 
results <- data.frame(actual = test_data$Class,
                      predict(model_rf, test_data, type = "prob"))

# rename the variables/columns
names(results) = c('actual', 'Fire', 'No.Fire')

# declare fire vs no fire based on probabilities
results$prediction <- ifelse(results$Fire > 0.5, "Fire",
                             ifelse(results$No.Fire > 0.5, "No Fire", NA))

# declare whether the model was right or wrong (true or false)
results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)

# plot the results
ggplot(results, aes(x = prediction, fill = correct)) +
  geom_bar(position = "dodge") + 
  labs(title = "Random Forest Performance", x = "Prediction")

# plot the results using geom_jitter
ggplot(results, aes(x = prediction, y = Fire, color = correct, shape = correct)) +
  geom_jitter(size = 3, alpha = 0.9) + 
  labs(title = "Random Forest Performance", x = "Prediction")

18.3.5 Hyperparameter Tuning

Models like the random forest we constructed above are built from the predictor variables: model parameters are learned during training, when we optimize a loss function. Hyperparameters, by contrast, are not model parameters and cannot be trained directly from the data. Instead, they control the construction of the model architecture (e.g., the format of a decision tree), and the default architecture may not be optimal. The act of finding the optimal model architecture is called hyperparameter tuning. We can tune our model by supplying a tuneGrid argument to caret::train(). Our tuneGrid will be defined by mtry: the number of variables randomly sampled as candidates at each split.

# plot the original model
plot(model_rf)

# create hyperparameter grid
set.seed(42)
grid <- expand.grid(mtry = c(1:10))

model_rf_tune_man <- caret::train(Class ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE),
                         tuneGrid = grid)

model_rf_tune_man
## Random Forest 
## 
## 87 samples
##  4 predictor
##  2 classes: 'Fire', 'No Fire' 
## 
## Pre-processing: scaled (4), centered (4) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 78, 78, 79, 79, 77, 79, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.8350278  0.6714680
##    2    0.8176667  0.6355919
##    3    0.8154722  0.6311586
##    4    0.8114167  0.6214337
##    5    0.8171389  0.6328296
##    6    0.8147778  0.6282390
##    7    0.8139167  0.6264313
##    8    0.8135278  0.6255191
##    9    0.8161389  0.6308268
##   10    0.8158889  0.6305575
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
plot(model_rf_tune_man)

Notice how many more values of mtry we evaluate. Our dataset is not the best example for this practice because it only has a handful of predictors. Also, the grid search above is exhaustive rather than random by default. Let’s perform a random search of the tuning parameters instead by setting search = "random" and providing a tuneLength, an integer denoting the amount of granularity in the tuning parameter grid.

model_rf_tune_auto <- caret::train(Class ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE,
                                                  search = "random"),
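                         # note: when tuneGrid is supplied, caret uses it
                         # as-is, so search = "random" and tuneLength are
                         # effectively ignored here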
                         tuneGrid = grid,
                         tuneLength = 15)
model_rf_tune_auto
## Random Forest 
## 
## 87 samples
##  4 predictor
##  2 classes: 'Fire', 'No Fire' 
## 
## Pre-processing: scaled (4), centered (4) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 78, 79, 78, 77, 79, 79, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.8441111  0.6903933
##    2    0.8262778  0.6533746
##    3    0.8163333  0.6317828
##    4    0.8173611  0.6326702
##    5    0.8152500  0.6288156
##    6    0.8196111  0.6372856
##    7    0.8127778  0.6232523
##    8    0.8197222  0.6372499
##    9    0.8161389  0.6300520
##   10    0.8137500  0.6254779
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
plot(model_rf_tune_auto)

As we see above, random forest models can quickly become complex, because there’s uncertainty in any classification prediction and a particular model can require tuning. Once again, this practice is much more useful for a dataset with more predictors, but it’s important to show how it can be done here.
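
As a final sketch, the two tuned models share the same resampling scheme (10-fold CV repeated 10 times), so caret’s resamples() can compare their cross-validated performance side by side:

# compare resampling distributions of the two tuned models
resamp <- resamples(list(grid = model_rf_tune_man,
                         random = model_rf_tune_auto))
summary(resamp)
bwplot(resamp)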

18.4 Assignment

Perform the exact same analysis for the SidiBel_ForestFires.csv dataset (this has the same variables as Bejaia_ForestFires.csv). Proceed with the same workflow and create the same plots. Add plots to a document (Word, Google Docs, or RMarkdown file, your choice).

Upload the file (Word doc, PDF, whatever works) to UD Canvas.