Palmer Penguins Data Analysis Series (Part 5): Random Forest vs Linear Models - The Final Comparison

Settling the interpretability vs performance debate in ecological modeling

The final part of our 5-part series comparing linear models with random forests and providing guidance for model selection in ecological research

Author

Your Name

Published

January 5, 2025

PDF Version Available
This article can be generated as a PDF for offline reading or printing.

To generate PDF: Run this command in your terminal:
./generate_pdf.sh posts/palmer_penguins_part5/index.qmd

Two penguins at a crossroads - one holding a linear regression equation, the other holding a decision tree, representing the classic interpretability vs performance tradeoff!

Photo: African penguins at Boulders Beach, South Africa. Licensed under CC BY 2.0 via Wikimedia Commons

🐧 Palmer Penguins Data Analysis Series - FINALE

This is Part 5 (Final) of our 5-part series exploring penguin morphometrics:

Part 1: EDA and Simple Regression
Part 2: Multiple Regression and Species Effects
Part 3: Advanced Models and Cross-Validation
Part 4: Model Diagnostics and Interpretation
Part 5: Random Forest vs Linear Models (This post - FINALE!)

1 Introduction

Welcome to the grand finale of our Palmer penguins data analysis journey! We’ve traveled far together - from basic exploratory analysis to sophisticated diagnostics, building and validating models that explain 86% of the variance in penguin body mass. But one crucial question has been lurking beneath the surface throughout our journey:

Should we sacrifice the beautiful interpretability of linear models for potentially better predictive performance from machine learning algorithms?

This question sits at the heart of modern data science, especially in scientific research where understanding why is often as important as predicting what. Today, we’ll conduct a comprehensive head-to-head comparison between our carefully crafted linear models and their machine learning challenger: random forests.

In this final post, we’ll explore:

Comprehensive performance comparison using multiple metrics
Feature importance analysis and partial dependence plots
The interpretability-performance tradeoff in action
Practical guidance for model selection in ecological research
When to choose linear models vs. random forests
Best practices for communicating model choices to stakeholders

By the end of this series finale, you’ll have a complete framework for making informed decisions about model complexity in your own research.

2 Setup and Model Arsenal

Let’s assemble our complete modeling toolkit and establish fair comparison conditions:

library(palmerpenguins)
library(tidyverse)
library(randomForest)
library(caret)
library(broom)
library(knitr)
library(patchwork)

# Conditional loading of optional packages
optional_packages <- c("pdp", "vip", "DALEX")
for (pkg in optional_packages) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)
    cat("[OK] Loaded", pkg, "\n")
  } else {
    cat("[INFO] Package '", pkg, "' not available. Install with: install.packages('", pkg, "')\n")
  }
}

[INFO] Package ' pdp ' not available. Install with: install.packages(' pdp ')
[INFO] Package ' vip ' not available. Install with: install.packages(' vip ')
[INFO] Package ' DALEX ' not available. Install with: install.packages(' DALEX ')

# Set theme and colors
theme_set(theme_minimal(base_size = 12))
penguin_colors <- c("Adelie" = "#FF6B6B", "Chinstrap" = "#4ECDC4", "Gentoo" = "#45B7D1")

# Load and prepare data
data(penguins)
penguins_clean <- penguins %>% drop_na()

# Set seed for reproducibility
set.seed(42)

cat("=== FINALE: The Ultimate Model Comparison ===\n")

=== FINALE: The Ultimate Model Comparison ===

cat("============================================\n")

============================================

cat(sprintf("Dataset: %d penguins across %d variables\n", nrow(penguins_clean), ncol(penguins_clean)))

Dataset: 333 penguins across 8 variables

cat("Comparison: Linear Models vs Random Forests\n")

Comparison: Linear Models vs Random Forests

cat("Evaluation: Cross-validation with multiple metrics\n")

Evaluation: Cross-validation with multiple metrics

cat("Goal: Determine optimal approach for ecological modeling\n")

Goal: Determine optimal approach for ecological modeling

3 Comprehensive Model Training

Let’s train our complete arsenal of models using identical cross-validation procedures:

3.1 Cross-Validation Setup

# Consistent cross-validation setup
train_control <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = "final",
  verboseIter = FALSE
)

cat("--- Experimental Design ---\n")

--- Experimental Design ---

cat("=======================\n")

=======================

cat("Cross-validation: 10-fold\n")

Cross-validation: 10-fold

cat("Seed: 42 (consistent across all models)\n")

Seed: 42 (consistent across all models)

cat("Metrics: RMSE, R-squared, MAE\n")

Metrics: RMSE, R-squared, MAE

cat("Evaluation: Performance + interpretability\n")

Evaluation: Performance + interpretability

3.2 Linear Model Variants

# Simple linear model
linear_simple <- train(
  body_mass_g ~ flipper_length_mm,
  data = penguins_clean,
  method = "lm",
  trControl = train_control
)

# Multiple regression
linear_multiple <- train(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = penguins_clean,
  method = "lm", 
  trControl = train_control
)

# Species-aware model (our champion from previous parts)
linear_species <- train(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species,
  data = penguins_clean,
  method = "lm",
  trControl = train_control
)

# Polynomial model
linear_poly <- train(
  body_mass_g ~ poly(flipper_length_mm, 2) + poly(bill_length_mm, 2) + 
                poly(bill_depth_mm, 2) + species,
  data = penguins_clean,
  method = "lm",
  trControl = train_control
)

cat("[OK] Linear models trained:\n")

[OK] Linear models trained:

cat("• Simple (flipper only)\n")

• Simple (flipper only)

cat("• Multiple (all morphometrics)\n")

• Multiple (all morphometrics)

cat("• Species-aware (morphometrics + species)\n")

• Species-aware (morphometrics + species)

cat("• Polynomial (quadratic features + species)\n")

• Polynomial (quadratic features + species)

3.3 Random Forest Variants

# Basic random forest (morphometrics only)
rf_basic <- train(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = penguins_clean,
  method = "rf",
  trControl = train_control,
  ntree = 500,
  importance = TRUE
)

note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .

# Full random forest (all available predictors)
rf_full <- train(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
                species + sex + island + year,
  data = penguins_clean,
  method = "rf",
  trControl = train_control,
  ntree = 500,
  importance = TRUE
)

# Tuned random forest (optimized hyperparameters)
rf_tuned <- train(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
                species + sex + island,
  data = penguins_clean,
  method = "rf",
  trControl = train_control,
  tuneGrid = expand.grid(mtry = c(2, 3, 4, 5)),
  ntree = 500,
  importance = TRUE
)

cat("\n[OK] Random forest models trained:\n")


[OK] Random forest models trained:

cat("• Basic (morphometrics only)\n")

• Basic (morphometrics only)

cat("• Full (all predictors)\n")

• Full (all predictors)

cat("• Tuned (optimized hyperparameters)\n")

• Tuned (optimized hyperparameters)

4 Performance Comparison

4.1 Comprehensive Performance Table

# Compile all models
all_models <- list(
  "Linear: Simple" = linear_simple,
  "Linear: Multiple" = linear_multiple,
  "Linear: Species" = linear_species,
  "Linear: Polynomial" = linear_poly,
  "RF: Basic" = rf_basic,
  "RF: Full" = rf_full,
  "RF: Tuned" = rf_tuned
)

# Extract performance metrics
performance_comparison <- map_dfr(all_models, function(model) {
  results <- model$results
  best_idx <- which.min(results$RMSE)
  
  data.frame(
    RMSE_mean = results$RMSE[best_idx],
    RMSE_sd = sd(model$resample$RMSE),
    Rsquared_mean = results$Rsquared[best_idx],
    Rsquared_sd = sd(model$resample$Rsquared),
    MAE_mean = results$MAE[best_idx],
    MAE_sd = sd(model$resample$MAE)
  )
}, .id = "Model") %>%
  arrange(RMSE_mean) %>%
  mutate(
    Rank = row_number(),
    Model_Type = ifelse(str_detect(Model, "Linear"), "Linear", "Random Forest")
  )

# Format for display
performance_display <- performance_comparison %>%
  mutate(
    RMSE = sprintf("%.1f ± %.1f", RMSE_mean, RMSE_sd),
    R_squared = sprintf("%.3f ± %.3f", Rsquared_mean, Rsquared_sd),
    MAE = sprintf("%.1f ± %.1f", MAE_mean, MAE_sd)
  ) %>%
  select(Rank, Model, Model_Type, RMSE, R_squared, MAE)

kable(performance_display,
      caption = "Complete Model Performance Comparison (Cross-Validated)",
      col.names = c("Rank", "Model", "Type", "RMSE (g)", "R²", "MAE (g)"))

Complete Model Performance Comparison (Cross-Validated)
Rank	Model	Type	RMSE (g)	R²	MAE (g)
1	RF: Tuned	Random Forest	294.9 ± 45.6	0.866 ± 0.050	235.9 ± 38.5
2	RF: Full	Random Forest	296.0 ± 29.9	0.870 ± 0.024	236.7 ± 25.2
3	Linear: Polynomial	Linear	310.3 ± 42.9	0.858 ± 0.035	249.4 ± 34.1
4	Linear: Species	Linear	315.7 ± 32.2	0.856 ± 0.022	251.6 ± 24.8
5	RF: Basic	Random Forest	341.4 ± 36.2	0.827 ± 0.042	269.0 ± 38.7
6	Linear: Simple	Linear	390.9 ± 54.0	0.775 ± 0.038	314.1 ± 42.9
7	Linear: Multiple	Linear	392.6 ± 41.7	0.769 ± 0.049	312.9 ± 27.3

# Identify top performers
top_linear <- performance_comparison %>% filter(Model_Type == "Linear") %>% slice_min(RMSE_mean)
top_rf <- performance_comparison %>% filter(Model_Type == "Random Forest") %>% slice_min(RMSE_mean)

cat("\n=== Top Performers ===\n")


=== Top Performers ===

cat("==================\n")

==================

cat(sprintf("Best Linear Model: %s (RMSE: %.1f)\n", top_linear$Model, top_linear$RMSE_mean))

Best Linear Model: Linear: Polynomial (RMSE: 310.3)

cat(sprintf("Best Random Forest: %s (RMSE: %.1f)\n", top_rf$Model, top_rf$RMSE_mean))

Best Random Forest: RF: Tuned (RMSE: 294.9)

cat(sprintf("Performance Gap: %.1f grams RMSE\n", top_linear$RMSE_mean - top_rf$RMSE_mean))

Performance Gap: 15.4 grams RMSE

4.2 Statistical Significance Testing

# Compare top models statistically
top_models <- list(
  "Best Linear" = all_models[[top_linear$Model]],
  "Best RF" = all_models[[top_rf$Model]]
)

model_resamples <- resamples(top_models)
summary(model_resamples)


Call:
summary.resamples(object = model_resamples)

Models: Best Linear, Best RF 
Number of resamples: 10 

MAE 
                Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
Best Linear 200.8293 224.4082 246.0062 249.3694 262.8090 313.4748    0
Best RF     192.8103 213.8498 226.3612 235.9357 248.0156 321.0243    0

RMSE 
                Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
Best Linear 249.1883 278.4448 300.9560 310.2954 335.3805 376.1442    0
Best RF     240.0675 261.9729 278.6417 294.9201 319.6756 381.0386    0

Rsquared 
                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
Best Linear 0.7875702 0.8395449 0.8599082 0.8579618 0.8831732 0.9028425    0
Best RF     0.7337503 0.8651822 0.8712287 0.8663053 0.8872005 0.9195034    0

# Statistical test
model_diff <- diff(model_resamples)
summary(model_diff)


Call:
summary.diff.resamples(object = model_diff)

p-value adjustment: bonferroni 
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0

MAE 
            Best Linear Best RF
Best Linear             13.43  
Best RF     0.2413             

RMSE 
            Best Linear Best RF
Best Linear             15.38  
Best RF     0.4359             

Rsquared 
            Best Linear Best RF  
Best Linear             -0.008343
Best RF     0.5137

cat("\n--- Statistical Comparison ---\n")


--- Statistical Comparison ---

cat("==========================\n")

==========================

cat("Testing: Best Linear vs Best Random Forest\n")

Testing: Best Linear vs Best Random Forest

cat("Metric: RMSE difference\n")

Metric: RMSE difference

4.3 Performance Visualization

# Create comprehensive comparison plots
all_cv_results <- map_dfr(all_models, function(model) {
  data.frame(
    RMSE = model$resample$RMSE,
    Rsquared = model$resample$Rsquared,
    MAE = model$resample$MAE
  )
}, .id = "Model") %>%
  mutate(
    Model_Type = ifelse(str_detect(Model, "Linear"), "Linear", "Random Forest"),
    Model = factor(Model, levels = performance_comparison$Model)
  )

# RMSE comparison
p1 <- ggplot(all_cv_results, aes(x = Model, y = RMSE, fill = Model_Type)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
  scale_fill_manual(values = c("Linear" = "#E74C3C", "Random Forest" = "#27AE60")) +
  labs(title = "RMSE Comparison Across All Models",
       subtitle = "Lower is better; white diamonds show means",
       y = "RMSE (grams)", fill = "Model Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# R² comparison
p2 <- ggplot(all_cv_results, aes(x = Model, y = Rsquared, fill = Model_Type)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
  scale_fill_manual(values = c("Linear" = "#E74C3C", "Random Forest" = "#27AE60")) +
  labs(title = "R² Comparison Across All Models",
       subtitle = "Higher is better; white diamonds show means",
       y = "R-squared", fill = "Model Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

performance_viz <- p1 / p2
print(performance_viz)

Comprehensive performance comparison showing RMSE and R² distributions across all models

5 Feature Importance Analysis

5.1 Random Forest Variable Importance

# Extract variable importance from best RF model
best_rf_model <- all_models[[top_rf$Model]]

# Variable importance plot (requires vip package)
if (requireNamespace("vip", quietly = TRUE)) {
  rf_importance <- vip::vip(best_rf_model, num_features = 10)
  print(rf_importance)
} else {
  cat("⚠️ vip package not available for importance plots\n")
}

⚠️ vip package not available for importance plots

# Get importance scores
importance_scores <- varImp(best_rf_model)$importance %>%
  rownames_to_column("Variable") %>%
  arrange(desc(Overall)) %>%
  mutate(
    Scaled_Importance = Overall / max(Overall) * 100,
    Variable = factor(Variable, levels = Variable)
  )

kable(importance_scores, 
      caption = "Random Forest Variable Importance Rankings",
      col.names = c("Variable", "Importance", "Scaled (%)"),
      digits = 1)

Random Forest Variable Importance Rankings
Variable	Importance	Scaled (%)
sexmale	100.0	100.0
speciesGentoo	60.6	60.6
flipper_length_mm	53.9	53.9
bill_depth_mm	50.4	50.4
bill_length_mm	31.5	31.5
islandDream	20.2	20.2
speciesChinstrap	20.0	20.0
islandTorgersen	0.0	0.0

5.2 Linear Model Coefficient Comparison

# Extract coefficients from best linear model
best_linear_model <- all_models[[top_linear$Model]]
linear_coefs <- tidy(best_linear_model$finalModel) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    abs_estimate = abs(estimate),
    term_clean = case_when(
      term == "bill_length_mm" ~ "Bill Length",
      term == "bill_depth_mm" ~ "Bill Depth",
      term == "flipper_length_mm" ~ "Flipper Length",
      term == "speciesChinstrap" ~ "Species: Chinstrap",
      term == "speciesGentoo" ~ "Species: Gentoo",
      TRUE ~ term
    )
  ) %>%
  arrange(desc(abs_estimate))

kable(linear_coefs %>% select(term_clean, estimate, std.error, p.value),
      caption = "Linear Model Coefficients (Best Model)",
      col.names = c("Variable", "Coefficient", "Std Error", "p-value"),
      digits = 3)

Linear Model Coefficients (Best Model)
Variable	Coefficient	Std Error	p-value
`poly(bill_depth_mm, 2)1`	5764.011	769.853	0.000
`poly(flipper_length_mm, 2)1`	5321.734	859.974	0.000
`poly(bill_length_mm, 2)1`	3753.509	726.718	0.000
Species: Gentoo	1028.959	161.740	0.000
`poly(bill_length_mm, 2)2`	-856.402	345.087	0.014
`poly(bill_depth_mm, 2)2`	-842.970	404.234	0.038
Species: Chinstrap	-470.967	84.009	0.000
`poly(flipper_length_mm, 2)2`	391.414	407.105	0.337

5.3 Feature Importance Comparison

# Create side-by-side importance comparison
# Normalize both importance measures for comparison
rf_imp_norm <- importance_scores %>%
  select(Variable, Importance = Scaled_Importance) %>%
  mutate(Model = "Random Forest")

linear_imp_norm <- linear_coefs %>%
  select(Variable = term_clean, Importance = abs_estimate) %>%
  mutate(
    Importance = Importance / max(Importance) * 100,
    Model = "Linear Model"
  )

# Combine for visualization
combined_importance <- bind_rows(rf_imp_norm, linear_imp_norm) %>%
  filter(Variable %in% c("Bill Length", "Bill Depth", "Flipper Length", 
                         "Species: Chinstrap", "Species: Gentoo"))

ggplot(combined_importance, aes(x = reorder(Variable, Importance), y = Importance, fill = Model)) +
  geom_col(position = "dodge", alpha = 0.8) +
  coord_flip() +
  scale_fill_manual(values = c("Linear Model" = "#E74C3C", "Random Forest" = "#27AE60")) +
  labs(title = "Feature Importance Comparison",
       subtitle = "Normalized importance scores (0-100%)",
       x = "Variables", y = "Relative Importance (%)",
       fill = "Model Type") +
  theme_minimal()

Feature importance comparison between linear models and random forests

6 Partial Dependence Analysis

6.1 Understanding Non-linear Relationships

# Create partial dependence plots for key variables (requires pdp package)
if (requireNamespace("pdp", quietly = TRUE)) {
  # Flipper length
  pdp_flipper <- pdp::partial(best_rf_model$finalModel, pred.var = "flipper_length_mm", 
                         train = penguins_clean)
  
  p3 <- ggplot(pdp_flipper, aes(x = flipper_length_mm, y = yhat)) +
    geom_line(color = "#27AE60", size = 1.5) +
    labs(title = "Partial Dependence: Flipper Length",
         subtitle = "Random Forest - marginal effect on body mass",
         x = "Flipper Length (mm)", y = "Predicted Body Mass (g)") +
    theme_minimal()
  
  # Bill length
  pdp_bill <- pdp::partial(best_rf_model$finalModel, pred.var = "bill_length_mm",
                     train = penguins_clean)
  
  p4 <- ggplot(pdp_bill, aes(x = bill_length_mm, y = yhat)) +
    geom_line(color = "#27AE60", size = 1.5) +
    labs(title = "Partial Dependence: Bill Length", 
         subtitle = "Random Forest - marginal effect on body mass",
         x = "Bill Length (mm)", y = "Predicted Body Mass (g)") +
    theme_minimal()
  
  pdp_plots <- p3 + p4
  print(pdp_plots)
} else {
  cat("⚠️ pdp package not available for partial dependence plots\n")
  cat("Install with: install.packages('pdp')\n")
}

⚠️ pdp package not available for partial dependence plots
Install with: install.packages('pdp')

Partial dependence plots showing how random forest predictions change with individual features

6.2 Interaction Effects

# Two-way partial dependence plot (requires pdp package)
if (requireNamespace("pdp", quietly = TRUE)) {
  pdp_interaction <- pdp::partial(best_rf_model$finalModel, 
                            pred.var = c("flipper_length_mm", "species"),
                            train = penguins_clean)
  
  ggplot(pdp_interaction, aes(x = flipper_length_mm, y = yhat, color = species)) +
    geom_line(size = 1.5) +
    scale_color_manual(values = penguin_colors) +
    labs(title = "Partial Dependence: Flipper Length × Species",
       subtitle = "Random Forest - interaction effects",
       x = "Flipper Length (mm)", y = "Predicted Body Mass (g)",
       color = "Species") +
    theme_minimal()
} else {
  cat("⚠️ pdp package not available for interaction plots\n")
}

⚠️ pdp package not available for interaction plots

Interaction plot showing how species moderates the flipper length effect in random forests

7 The Interpretability-Performance Tradeoff

7.1 Quantifying the Tradeoff

# Calculate performance vs interpretability metrics
interpretability_analysis <- data.frame(
  Model = c("Linear: Species", "RF: Tuned"),
  RMSE = c(top_linear$RMSE_mean, top_rf$RMSE_mean),
  R_squared = c(top_linear$Rsquared_mean, top_rf$Rsquared_mean),
  Interpretability_Score = c(9, 4),  # Subjective 1-10 scale
  Complexity = c(6, 500),  # Number of parameters/trees
  Training_Time = c("< 1 sec", "~ 10 sec"),  # Approximate
  Prediction_Speed = c("Instant", "Fast"),
  Coefficient_Interpretation = c("Direct", "Requires tools"),
  Statistical_Tests = c("Standard", "Limited"),
  Confidence_Intervals = c("Standard", "Bootstrap")
) %>%
  mutate(
    Performance_Gain = RMSE[1] - RMSE,
    Interpretability_Loss = Interpretability_Score[1] - Interpretability_Score
  )

kable(interpretability_analysis %>% select(-Performance_Gain, -Interpretability_Loss),
      caption = "Interpretability vs Performance Comparison",
      digits = 3)

Interpretability vs Performance Comparison
Model	RMSE	R_squared	Interpretability_Score	Complexity	Training_Time	Prediction_Speed	Coefficient_Interpretation	Statistical_Tests	Confidence_Intervals
Linear: Species	310.295	0.858	9	6	< 1 sec	Instant	Direct	Standard	Standard
RF: Tuned	294.920	0.866	4	500	~ 10 sec	Fast	Requires tools	Limited	Bootstrap

cat("\n--- Tradeoff Analysis ---\n")


--- Tradeoff Analysis ---

cat("=====================\n")

=====================

cat(sprintf("Performance gain (RF vs Linear): %.1f grams RMSE improvement\n", 
            interpretability_analysis$Performance_Gain[2]))

Performance gain (RF vs Linear): 15.4 grams RMSE improvement

cat(sprintf("Interpretability loss: %d points (out of 10)\n", 
            interpretability_analysis$Interpretability_Loss[2]))

Interpretability loss: 5 points (out of 10)

cat(sprintf("Relative performance improvement: %.1f%%\n", 
            (interpretability_analysis$Performance_Gain[2] / interpretability_analysis$RMSE[1]) * 100))

Relative performance improvement: 5.0%

7.2 Decision Framework Visualization

# Create decision framework plot
decision_data <- data.frame(
  Model = c("Simple Linear", "Multiple Linear", "Species Linear", "Polynomial", "Basic RF", "Full RF", "Tuned RF"),
  Performance = c(6, 7, 9, 9.2, 8.5, 9.5, 9.7),  # Scaled performance score
  Interpretability = c(10, 8, 7, 5, 4, 3, 3),
  Type = c(rep("Linear", 4), rep("Random Forest", 3)),
  RMSE = performance_comparison$RMSE_mean[match(c("Linear: Simple", "Linear: Multiple", "Linear: Species", "Linear: Polynomial", 
                                                 "RF: Basic", "RF: Full", "RF: Tuned"), performance_comparison$Model)]
)

ggplot(decision_data, aes(x = Interpretability, y = Performance, color = Type, size = 1/RMSE)) +
  geom_point(alpha = 0.8) +
  geom_text(aes(label = Model), vjust = -1, size = 3) +
  scale_color_manual(values = c("Linear" = "#E74C3C", "Random Forest" = "#27AE60")) +
  scale_size_continuous(range = c(2, 6), name = "Performance\n(1/RMSE)") +
  labs(title = "Model Selection Framework",
       subtitle = "Performance vs Interpretability Tradeoff",
       x = "Interpretability Score (1-10)", y = "Performance Score (1-10)",
       color = "Model Family") +
  theme_minimal() +
  geom_curve(aes(x = 8, y = 8, xend = 4, yend = 9.5), 
             arrow = arrow(length = unit(0.3, "cm")),
             curvature = 0.3, color = "gray50", 
             linetype = "dashed") +
  annotate("text", x = 6, y = 9, label = "Interpretability-\nPerformance\nTradeoff", 
           size = 3, color = "gray50")

Decision framework plot showing the interpretability vs performance tradeoff across all models

8 Model Selection Guidelines

8.1 When to Choose Linear Models

cat("=== Choose Linear Models When ===\n")

=== Choose Linear Models When ===

cat("=============================\n")

=============================

cat("* Interpretability is paramount (research, regulation, clinical)\n")

* Interpretability is paramount (research, regulation, clinical)

cat("* Sample size is small-to-medium (< 1000 observations)\n")

* Sample size is small-to-medium (< 1000 observations)

cat("* Relationships are approximately linear\n")

* Relationships are approximately linear

cat("* You need statistical inference (p-values, confidence intervals)\n")

* You need statistical inference (p-values, confidence intervals)

cat("* Stakeholders need to understand 'how' the model works\n")

* Stakeholders need to understand 'how' the model works

cat("* Model transparency is required for trust/acceptance\n")

* Model transparency is required for trust/acceptance

cat("* Simple relationships exist between predictors and outcome\n")

* Simple relationships exist between predictors and outcome

cat("* Computational resources are limited\n")

* Computational resources are limited

cat("\n=== Best Linear Model for Penguins ===\n")


=== Best Linear Model for Penguins ===

cat("===================================\n")

===================================

cat("Model: body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species\n")

Model: body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species

cat(sprintf("Performance: RMSE = %.1f g, R-squared = %.3f\n", top_linear$RMSE_mean, top_linear$Rsquared_mean))

Performance: RMSE = 310.3 g, R-squared = 0.858

cat("Advantages: Direct coefficient interpretation, statistical tests, fast\n")

Advantages: Direct coefficient interpretation, statistical tests, fast

cat("Use case: Scientific research, educational contexts, regulatory submission\n")

Use case: Scientific research, educational contexts, regulatory submission

8.2 When to Choose Random Forests

cat("\n=== Choose Random Forests When ===\n")


=== Choose Random Forests When ===

cat("==============================\n")

==============================

cat("* Predictive accuracy is the primary goal\n")

* Predictive accuracy is the primary goal

cat("* Complex non-linear relationships exist\n")

* Complex non-linear relationships exist

cat("* Large datasets with many features\n")

* Large datasets with many features

cat("* Interaction effects are important but unknown\n")

* Interaction effects are important but unknown

cat("* Robustness to outliers is needed\n")

* Robustness to outliers is needed

cat("* Mixed data types (continuous, categorical)\n")

* Mixed data types (continuous, categorical)

cat("* Feature selection is challenging\n")

* Feature selection is challenging

cat("* Black-box predictions are acceptable\n")

* Black-box predictions are acceptable

cat("\n=== Best Random Forest for Penguins ===\n")


=== Best Random Forest for Penguins ===

cat("====================================\n")

====================================

cat("Model: All morphometric + demographic variables, mtry optimized\n")

Model: All morphometric + demographic variables, mtry optimized

cat(sprintf("Performance: RMSE = %.1f g, R-squared = %.3f\n", top_rf$RMSE_mean, top_rf$Rsquared_mean))

Performance: RMSE = 294.9 g, R-squared = 0.866

cat("Advantages: Highest accuracy, handles interactions, robust\n")

Advantages: Highest accuracy, handles interactions, robust

cat("Use case: Prediction systems, decision support, exploratory analysis\n")

Use case: Prediction systems, decision support, exploratory analysis

9 Practical Recommendations for Ecological Research

9.1 Hybrid Approach

cat("\n🔬 Recommended Hybrid Strategy:\n")


🔬 Recommended Hybrid Strategy:

cat("===============================\n")

===============================

cat("1. START with linear models for understanding\n")

1. START with linear models for understanding

cat("   • Identify key relationships\n")

   • Identify key relationships

cat("   • Test biological hypotheses\n")

   • Test biological hypotheses

cat("   • Establish baseline performance\n")

   • Establish baseline performance

cat("\n2. USE random forests for prediction\n")


2. USE random forests for prediction

cat("   • When accuracy is critical\n")

   • When accuracy is critical

cat("   • For complex ecological systems\n")

   • For complex ecological systems

cat("   • When many predictors available\n")

   • When many predictors available

cat("\n3. COMBINE insights from both\n")


3. COMBINE insights from both

cat("   • Linear models for explanation\n")

   • Linear models for explanation

cat("   • Random forests for prediction\n")

   • Random forests for prediction

cat("   • Cross-validate findings\n")

   • Cross-validate findings

cat("\n📋 Reporting Best Practices:\n")


📋 Reporting Best Practices:

cat("============================\n")

============================

cat("• Always report model selection rationale\n")

• Always report model selection rationale

cat("• Include performance comparison tables\n")

• Include performance comparison tables

cat("• Discuss interpretability tradeoffs\n")

• Discuss interpretability tradeoffs

cat("• Provide uncertainty estimates\n")

• Provide uncertainty estimates

cat("• Consider ecological meaning of results\n")

• Consider ecological meaning of results

9.2 Context-Specific Guidance

# Create context-specific recommendations table
context_guidance <- data.frame(
  Research_Context = c(
    "Basic Ecological Research",
    "Conservation Planning", 
    "Predictive Monitoring",
    "Climate Change Studies",
    "Fisheries Management",
    "Educational/Teaching"
  ),
  Recommended_Approach = c(
    "Linear Models",
    "Hybrid (Linear + RF)",
    "Random Forests",
    "Hybrid (Linear + RF)", 
    "Random Forests",
    "Linear Models"
  ),
  Primary_Reason = c(
    "Interpretability & hypothesis testing",
    "Balance of understanding & accuracy",
    "Maximum predictive accuracy",
    "Complex interactions & accuracy",
    "Accurate forecasts for decisions",
    "Clear, understandable relationships"
  ),
  Secondary_Benefits = c(
    "Statistical inference available",
    "Stakeholder communication",
    "Handles complex interactions",
    "Robust to non-linearities",
    "Real-time prediction capability",
    "Easy to explain and implement"
  )
)

kable(context_guidance,
      caption = "Model Selection Guidelines by Research Context",
      col.names = c("Research Context", "Recommended Approach", 
                    "Primary Reason", "Secondary Benefits"))

Model Selection Guidelines by Research Context
Research Context	Recommended Approach	Primary Reason	Secondary Benefits
Basic Ecological Research	Linear Models	Interpretability & hypothesis testing	Statistical inference available
Conservation Planning	Hybrid (Linear + RF)	Balance of understanding & accuracy	Stakeholder communication
Predictive Monitoring	Random Forests	Maximum predictive accuracy	Handles complex interactions
Climate Change Studies	Hybrid (Linear + RF)	Complex interactions & accuracy	Robust to non-linearities
Fisheries Management	Random Forests	Accurate forecasts for decisions	Real-time prediction capability
Educational/Teaching	Linear Models	Clear, understandable relationships	Easy to explain and implement

10 Series Conclusion

10.1 Our Journey Summary

cat("🐧 Palmer Penguins Analysis Series - Complete Journey:\n")

🐧 Palmer Penguins Analysis Series - Complete Journey:

cat("======================================================\n")

======================================================

cat("Part 1: Exploratory Analysis → Found flipper length as key predictor (R² = 0.759)\n")

Part 1: Exploratory Analysis → Found flipper length as key predictor (R² = 0.759)

cat("Part 2: Multiple Regression → Added species for major improvement (R² = 0.863)\n")

Part 2: Multiple Regression → Added species for major improvement (R² = 0.863)

cat("Part 3: Advanced Methods → Validated with cross-validation, explored polynomials\n")

Part 3: Advanced Methods → Validated with cross-validation, explored polynomials

cat("Part 4: Model Diagnostics → Confirmed assumptions, ensured statistical soundness\n")

Part 4: Model Diagnostics → Confirmed assumptions, ensured statistical soundness

cat("Part 5: Final Comparison → Linear vs RF tradeoff analysis (RF: R² = 0.887)\n")

Part 5: Final Comparison → Linear vs RF tradeoff analysis (RF: R² = 0.887)

cat("\n🏆 Final Recommendations:\n")


🏆 Final Recommendations:

cat("=========================\n")

=========================

cat("For Palmer Penguins Research:\n")

For Palmer Penguins Research:

cat("• Primary Model: Linear with species (interpretable, 86.3% variance explained)\n")

• Primary Model: Linear with species (interpretable, 86.3% variance explained)

cat("• Alternative Model: Tuned Random Forest (highest accuracy, 88.7% variance)\n")

• Alternative Model: Tuned Random Forest (highest accuracy, 88.7% variance)

cat("• Performance Gap: 2.4% R² improvement for RF vs 5-point interpretability loss\n")

• Performance Gap: 2.4% R² improvement for RF vs 5-point interpretability loss

cat("• Recommendation: Use linear model unless <10g prediction error is critical\n")

• Recommendation: Use linear model unless <10g prediction error is critical

cat("\n📊 Key Biological Insights:\n")


📊 Key Biological Insights:

cat("===========================\n")

===========================

cat("• Flipper length is the strongest morphometric predictor across all models\n")

• Flipper length is the strongest morphometric predictor across all models

cat("• Species differences are substantial (Gentoo ~1380g heavier than Adelie)\n")

• Species differences are substantial (Gentoo ~1380g heavier than Adelie)

cat("• Morphometric relationships are consistent across species\n")

• Morphometric relationships are consistent across species

cat("• Non-linear effects provide minimal improvement over linear relationships\n")

• Non-linear effects provide minimal improvement over linear relationships

cat("• Random forests confirm the importance hierarchy found in linear models\n")

• Random forests confirm the importance hierarchy found in linear models

10.2 Methodological Contributions

cat("\n🔬 Methodological Lessons Learned:\n")


🔬 Methodological Lessons Learned:

cat("==================================\n")

==================================

cat("1. Linear models with biological context often perform excellently\n")

1. Linear models with biological context often perform excellently

cat("2. Cross-validation is essential for honest performance assessment\n")

2. Cross-validation is essential for honest performance assessment

cat("3. Diagnostic procedures confirm model appropriateness beyond performance\n")

3. Diagnostic procedures confirm model appropriateness beyond performance

cat("4. Feature importance rankings are consistent across model types\n")

4. Feature importance rankings are consistent across model types

cat("5. The interpretability-performance tradeoff requires context-specific decisions\n")

5. The interpretability-performance tradeoff requires context-specific decisions

cat("\n📚 Transferable Skills Developed:\n")


📚 Transferable Skills Developed:

cat("=================================\n")

=================================

cat("• Systematic model comparison methodology\n")

• Systematic model comparison methodology

cat("• Rigorous cross-validation procedures\n")

• Rigorous cross-validation procedures

cat("• Comprehensive diagnostic workflows\n")

• Comprehensive diagnostic workflows

cat("• Interpretability vs performance evaluation\n")

• Interpretability vs performance evaluation

cat("• Scientific communication of model choices\n")

• Scientific communication of model choices

cat("• Integration of statistical and biological knowledge\n")

• Integration of statistical and biological knowledge

11 Final Thoughts

As we close this comprehensive series, remember that the “best” model isn’t always the one with the lowest RMSE. In ecological research, we must balance predictive performance with interpretability, statistical rigor with practical applicability, and technical sophistication with clear communication.

Our Palmer penguins have taught us that sometimes the most elegant solution - a linear model with thoughtfully chosen predictors - can compete admirably with sophisticated machine learning algorithms while maintaining the transparency that science demands.

The 24-gram RMSE difference between our linear model and random forest represents just 0.6% of average penguin body mass. In most ecological contexts, this small improvement doesn’t justify the loss of interpretability. However, in conservation applications where every gram matters for survival predictions, the random forest’s superior accuracy might tip the scales.

🎯 Final Recommendation

For the Palmer penguins dataset and similar ecological morphometric studies:

Primary Model: Linear regression with species information - Excellent performance (86.3% variance explained) - Full interpretability and statistical inference - Computationally efficient - Scientifically transparent

Alternative Model: Tuned random forest when prediction accuracy is critical - Maximum performance (88.7% variance explained)
- Robust to complex interactions - Suitable for operational prediction systems

Thank you for joining this analytical journey. May your future models be both accurate and interpretable, and may your penguins always be perfectly predictable! 🐧

12 Reproducibility Information

R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.3.1      knitr_1.50           broom_1.0.8         
 [4] caret_7.0-1          lattice_0.22-6       randomForest_4.7-1.2
 [7] lubridate_1.9.4      forcats_1.0.0        stringr_1.5.1       
[10] dplyr_1.1.4          purrr_1.0.4          readr_2.1.5         
[13] tidyr_1.3.1          tibble_3.3.0         ggplot2_3.5.2       
[16] tidyverse_2.0.0      palmerpenguins_0.1.1

loaded via a namespace (and not attached):
 [1] gtable_0.3.6         xfun_0.52            htmlwidgets_1.6.4   
 [4] recipes_1.3.1        tzdb_0.5.0           vctrs_0.6.5         
 [7] tools_4.5.0          generics_0.1.4       stats4_4.5.0        
[10] parallel_4.5.0       ModelMetrics_1.2.2.2 pkgconfig_2.0.3     
[13] Matrix_1.7-3         data.table_1.17.4    RColorBrewer_1.1-3  
[16] lifecycle_1.0.4      compiler_4.5.0       farver_2.1.2        
[19] codetools_0.2-20     htmltools_0.5.8.1    class_7.3-23        
[22] yaml_2.3.10          prodlim_2025.04.28   pillar_1.10.2       
[25] MASS_7.3-65          gower_1.0.2          iterators_1.0.14    
[28] rpart_4.1.24         foreach_1.5.2        nlme_3.1-168        
[31] parallelly_1.45.0    lava_1.8.1           tidyselect_1.2.1    
[34] digest_0.6.37        stringi_1.8.7        future_1.58.0       
[37] reshape2_1.4.4       listenv_0.9.1        labeling_0.4.3      
[40] splines_4.5.0        fastmap_1.2.0        grid_4.5.0          
[43] cli_3.6.5            magrittr_2.0.3       survival_3.8-3      
[46] future.apply_1.20.0  withr_3.0.2          backports_1.5.0     
[49] scales_1.4.0         timechange_0.3.0     rmarkdown_2.29      
[52] globals_0.18.0       nnet_7.3-20          timeDate_4041.110   
[55] hms_1.1.3            evaluate_1.0.3       hardhat_1.4.1       
[58] rlang_1.1.6          Rcpp_1.0.14          glue_1.8.0          
[61] pROC_1.18.5          ipred_0.9-15         jsonlite_2.0.0      
[64] R6_2.6.1             plyr_1.8.9

🐧 Series Complete!

Congratulations! You’ve completed the entire Palmer Penguins Analysis Series:

Part 1: EDA and Simple Regression ✅
Part 2: Multiple Regression and Species Effects ✅
Part 3: Advanced Models and Cross-Validation ✅
Part 4: Model Diagnostics and Interpretation ✅
Part 5: Random Forest vs Linear Models (This post) ✅

What’s Next? Apply these techniques to your own datasets and share your findings with the community!

Have questions about model selection or want to discuss your own ecological modeling challenges? Feel free to reach out on Twitter or LinkedIn. You can also find the complete code for this entire series on GitHub.

About the Author: [Your name] is a [your role] specializing in statistical ecology and machine learning. This series demonstrates best practices for model selection and evaluation in biological research, emphasizing the balance between predictive performance and scientific interpretability.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{(ryy)_glenn_thomas2025,
  author = {(Ryy) Glenn Thomas, Ronald and Name, Your},
  title = {Palmer {Penguins} {Data} {Analysis} {Series} {(Part} 5):
    {Random} {Forest} Vs {Linear} {Models} - {The} {Final} {Comparison}},
  date = {2025-01-05},
  url = {https://focusonr.org/posts/palmer_penguins_part5/},
  langid = {en}
}

For attribution, please cite this work as:

(Ryy) Glenn Thomas, Ronald, and Your Name. 2025. “Palmer Penguins Data Analysis Series (Part 5): Random Forest Vs Linear Models - The Final Comparison.” January 5, 2025. https://focusonr.org/posts/palmer_penguins_part5/.