initial commit

2026-02-10 04:52:37 -05:00
commit 0476f6f8f8
65 changed files with 15368 additions and 0 deletions

index.qmd Normal file

@@ -0,0 +1,919 @@
---
title: "From Mt. Olympus to the Okefenokee"
subtitle: "A Case Study in Spatial Modeling"
author: "Rob Wiederstein"
lang: en-US
smart: false
format:
  revealjs:
    from: markdown-smart
    theme: [default, custom.scss]
    css: assets/fonts.css
    embed-resources: true
    title-slide-attributes:
      data-background-image: assets/study_sites_globe.png
      data-background-size: 150%
      data-background-position: center
      style: "color: #222222;"
    transition: fade
    slide-number: true
    scrollable: true
    chalkboard: false
    tbl-cap-location: bottom
    toc: true
    toc-depth: 1
    toc-title: "Order"
    fig-dpi: 300
    fig-width: 10
    fig-asp: 0.5
    fig-align: center
resources:
  - assets/fonts
execute:
  echo: false
  cache: false
bibliography: references.bib
csl: ieee-access.csl
nocite: |
  @*
---
```{r}
#| label: setup
#| include: false
library(here)
library(targets)
library(gt)
library(ggplot2)
library(showtext)
source("R/functions.R")
setup_forestry_fonts()
```
# Introduction
## The Core Problem
::: {.incremental}
1. **Stationarity:** Do rules hold constant across different places?
2. **Spatial Leakage:** When location is added to a model, does it make it more accurate?
3. **Extrapolation:** How does a model perform when trained with location data and applied in a new location?
:::
::: {.notes}
* **Goal:** Test if "location" features trick the model into high accuracy that fails elsewhere.
:::
## The `forested` Package
:::: {.columns}
::: {.column width="25%"}
![](https://github.com/simonpcouch/forested/blob/main/inst/logo.png?raw=true){fig-align="center" width="80%"}
:::
::: {.column width="75%"}
- The `forested` data come from observers who visited each plot and recorded whether it was forest.
- They work for the Forest Inventory and Analysis (FIA) program, part of the USDA.
- It would be cheaper if forest cover could be predicted from weather data and land characteristics.
- The plots are in Georgia (GA) and Washington (WA).
:::
::::
::: {.notes}
**Speaker Notes:**
- The `forested` package is our primary data source, containing the raw measurements for both the Washington and Georgia "islands".
- We are auditing this package's features to see how well they predict the 'forested' outcome in two geographically distant locations.
:::
## The First Law of Geography
<br>
<br>
>"Everything is related to everything else, but near things are more related than distant things."[@tobler_computer_1970]
::: {.notes}
- This is the "First Law of Geography" and explains why our Random Forest "cheats" using Lat/Lon.
- Proximity Bias (Spatial Autocorrelation) creates high local accuracy but poor portability.
- We are testing for **Stationarity**: Do the biophysical rules of Washington still work in Georgia?
:::
## Caveat
<br/>
<br/>
>"It is not and has never been the case that Tobler's first law of geography . . . always holds absolutely. This is and has always been an oversimplification, disguising possible underlying entitation, support, and other misspecification problems."[@pebesma_spatial_2025]
::: {.notes}
:::
## Forest Locations
```{r}
#| label: fig-us-map-forest-locations
#| fig-cap: "Map shows the geographic distance separating Washington and Georgia."
knitr::include_graphics(here::here("figs", "map_us_forests.png"))
```
:::{.notes}
- Washington is approximately 15 degrees north of Georgia and 30 degrees west.
- The sheer distance suggests that the respective forests are different.
:::
## Regional Forestation
```{r}
#| label: fig-map-wa-ga
#| fig-cap: "Washington (a) and Georgia (b) showing forested areas. Note that the states are rescaled independently to maximize clarity."
#tar_read(map_wa_ga)
knitr::include_graphics(here::here("figs", "map_wa_ga_forests.png"))
```
## Regional Topography
```{r}
#| label: fig-topo-compare
#| echo: false
#| fig-align: "center"
#| out-width: "100%"
#| fig-cap: "Topographic relief map of Washington (a) and Georgia (b). Note: Regions are not to scale and elevation ramps are independent (WA range is ~3x GA)."
knitr::include_graphics(
here::here("figs", "combined_topo.png")
)
```
:::{.notes}
- **Scale Disparity:** Remind the audience that WA peaks reach ~4,400m
while GA peaks reach ~1,450m. The color ramps are local.
- **Rain Shadow:** Point out the Cascade barrier in WA; this is the
primary driver for the precipitation variance in the model.
- **Modeling Link:** This extreme relief is why we use a Yeo-Johnson
transformation on elevation in our tidymodels recipe—a linear
scale would over-emphasize alpine peaks while flattening
the Georgia Piedmont.
:::
## Regional Rainfall
```{r}
#| label: fig-precip-hex
#| fig-cap: "Mean annual precipitation (mm). Note the extreme gradient in WA (training) vs. the relative uniformity of GA (target)."
targets::tar_read(map_precip_hex)
```
## Level III Ecoregions
```{r}
#| label: fig-ecoregion-comparison
#| fig-cap: "Washington (a) has nine distinct regions while Georgia (b) has seven. Data sourced from U.S. EPA Level III Ecoregions [@epa_ecoregions_2013; @omernik_ecoregions_1987]."
knitr::include_graphics(here::here("figs", "map_wa_ga_ecoregions.png"))
```
:::{.notes}
- Ecoregions denote areas with similar ecosystems and resources.
- The EPA defines 105 Level III regions for management.
- James Omernik drew these lines using holistic expert synthesis.
- Washington and Georgia share zero common ecoregions.
- Washington transitions rapidly from rainforests to arid deserts.
- This extreme heterogeneity makes random spatial modeling difficult.
:::
# Explore
## Descriptive Summary
```{r}
#| label: display-summary
#| echo: false
tar_read(tbl_forest_wa)
```
## Distributions
```{r}
#| label: fig-distributions
#| fig-cap: "Comparison of environmental variable distributions for forested vs. non-forested areas."
targets::tar_read(plot_distrib_wa)
```
::: {.notes}
**1. Topic Introduction**
- This slide presents a univariate audit of our numeric predictors to identify which biophysical features provide the strongest signal for forestation.
- By comparing the "fingerprints" of forested (green) and non-forested (brown) plots, we can visually assess the potential for classification before we begin training models on our EPYC VM.
**2. Axis Definitions**
- **The X-Axis (Value)**: Represents the measurement for each specific biophysical variable, such as millimeters of rain or degrees Celsius.
- **The Y-Axis (Density)**: Represents the probability density for a given value; higher peaks indicate a higher frequency of observations at that specific value within the dataset.
**3. Significant Variables (High Contrast)**
- **Precipitation (`precip_annual`)**: This is a primary driver; forested plots are heavily concentrated in higher rainfall zones, while non-forested plots dominate the dry end of the spectrum.
- **Elevation**: There is a distinct "Forestation Window"; plots between 1,000 and 2,000 meters show a significant green peak, whereas non-forested plots cluster at lower elevations.
- **Temperature (`temp_annual_max` & `mean`)**: Forested plots consistently peak at lower maximum and mean temperatures compared to non-forested areas, suggesting a thermal threshold for forest growth.
- **Vapor Pressure (`vapor_max` & `min`)**: We see a strong bimodal separation; forested areas occupy a specific atmospheric moisture niche distinctly different from non-forested regions.
**4. Non-Significant Variables (High Overlap)**
- **Orientation (`eastness` & `northness`)**: These distributions are nearly identical for both classes, suggesting that cardinal direction alone is a weak predictor in this regional regime.
- **Roughness**: While there is a slight lean toward forested plots being in rougher terrain, the massive overlap indicates surface texture is not a primary discriminator.
:::
## Outliers
```{r}
#| label: outliers
tar_read(plt_outliers)
```
## Map Outliers
```{r}
#| label: fig-wa-outliers
#| fig-cap: "Map of observations with at least one value more than three standard deviations from the mean."
#| out-width: "100%"
knitr::include_graphics("figs/wa_outliers.png")
```
::: {.notes}
**1. Topic Introduction**
* This map visualizes our "3-Sigma" outliers, which are heavily concentrated along the mountainous west side of the state.
**2. The Orographic Factor**
* The concentration on the west is driven by the Cascades and Olympics. These regions host our most extreme biophysical values for precipitation and elevation.
**3. Intermixed Extremes**
* Note the intermixing of green and brown points. In these high-volatility alpine zones, a forest and a barren ridge often share the same coordinates.
* This suggests that local "nearness" is not enough to predict forestation here; the model must rely on the specific biophysical drivers we identified in our density distributions.
**4. The Audit Link**
* These outliers represent the "edge cases" of our Washington model. Their intermixed nature makes them the hardest points to classify, serving as a preview for our Georgia transfer test.
:::
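The "3-Sigma" rule used for this map can be sketched in a few lines of base R. This is a minimal illustration on made-up values, not the project's `targets` code; the data frame `plots`, its column values, and the helper `flag_outliers()` are hypothetical.

```r
# Hypothetical plot data: evenly spread values plus one extreme alpine plot.
plots <- data.frame(
  elevation     = c(seq(400, 800, length.out = 200), 4000),
  precip_annual = seq(900, 1500, length.out = 201)
)

# Flag a row when ANY predictor lies more than k standard deviations
# from that predictor's mean.
flag_outliers <- function(df, k = 3) {
  z <- scale(df)                 # column-wise (x - mean) / sd
  apply(abs(z) > k, 1, any)
}

plots$outlier <- flag_outliers(plots[, c("elevation", "precip_annual")])
which(plots$outlier)             # only the 4,000 m plot (row 201) is flagged
```

Because the mean and standard deviation are themselves pulled toward extreme values, a rule like this tends to flag only the most extreme plots.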
## Principal Component Analysis
```{r}
#| label: pca
tar_read(plt_wa_pca)
```
::: {.notes}
**1. What is PCA?**
PCA (Principal Component Analysis) is a dimensionality reduction tool that takes our 16 variables—including latitude, longitude, and climate data—and compresses them into two primary axes called Principal Components. It allows us to view the 'shape' of the entire Washington dataset in a single 2D space.
**2. Why use it here?**
We use it to explore the structural integrity of our data. Before building a model, we need to know if the environment of 'Forested' plots is actually mathematically different from 'Non-Forested' plots. By including lat and lon, we are seeing the combined power of geography and biophysics.
**3. Does it show anything?**
It shows that the data is not a random cloud; it has a clear orientation. The spread along PC1 captures the primary environmental gradient of Washington—likely moving from the moist coast to the arid east.
**4. Is there good separation on the outcome variable?**
Yes, the separation is significant. We see a distinct 'No' (Non-Forested) cluster forming a tail on the right and a dense 'Yes' (Forested) cluster on the left. While there is 'Alpine Mixing' in the center where categories overlap, the two groups occupy mostly different regions of the feature space.
**5. What does it foreshadow for modeling?**
This separation foreshadows high accuracy for our local Washington model. Because the classes are so distinct in this space, a logistic regression should have no trouble drawing a boundary between them. However, the tight coupling of biophysics with coordinates (lat/lon) here warns us that the model might 'memorize' Washington's map, which will be the primary challenge when we attempt to transfer it to Georgia.
:::
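For readers who want to reproduce the idea, here is a small PCA sketch with base R's `prcomp()`. The data are simulated stand-ins (a west-to-east moisture gradient tied to longitude), and the column names are assumptions rather than the exact `forested` columns.

```r
# Simulated stand-in for Washington plots (all values are illustrative).
set.seed(42)
n   <- 300
lon <- runif(n, -124.5, -117)
wa  <- data.frame(
  lon           = lon,
  lat           = runif(n, 45.5, 49),
  precip_annual = 3000 - 350 * (lon + 124.5) + rnorm(n, sd = 200),
  temp_annual   = 8 + 0.6 * (lon + 124.5) + rnorm(n, sd = 1)
)

# Center and scale so that no variable dominates because of its units.
pca <- prcomp(wa, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)

# Scores on the first two components: the 2D view shown on the slide.
scores <- pca$x[, 1:2]
head(scores)
```

In this simulation longitude, precipitation, and temperature load together on PC1, which is the same coupling of coordinates and biophysics that the notes warn about.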
## Correlogram
```{r}
#| label: correlogram
tar_read(plt_correlogram)
```
::: {.notes}
**1. Orientation**
- If you look at the very first column on the left, we can see exactly what drives our "Forested" classification.
- Blue means "More Forest," Orange means "Less Forest."
**2. The Sanity Check**
- First, look at the bottom square: **Canopy Cover (0.75)**.
- This is our sanity check. Obviously, forests have high canopy cover. If this wasn't blue, our data would be broken.
**3. The Biophysical Story: Water vs. Heat**
- The real story is the battle between water and heat.
- **Precipitation (0.52)** is a strong blue driver. In Washington, rain equals trees.
- **Vapor Pressure (-0.64)** is a deep orange driver. High vapor pressure—which correlates with hot, dry valleys—effectively kills the forest probability.
**4. The Terrain Factor**
- Look at **Roughness (0.39)**.
- Rugged, difficult terrain is more likely to be forested. This is likely a mix of biophysics (mountains catch rain) and human history (flat land gets cleared for farming).
**5. The Surprise**
- Finally, look at **Northness and Eastness**. They are near **zero**.
- This tells us that while the *direction* a slope faces might change *which* trees grow there, it doesn't determine *if* trees grow there.
:::
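The first column of a correlogram like this one is just the Pearson correlation between a 0/1 outcome and each predictor (the point-biserial correlation). A hedged sketch on simulated data, where forest probability depends on precipitation but not on aspect; none of these numbers come from the real dataset:

```r
set.seed(7)
n        <- 500
precip   <- rnorm(n, mean = 1200, sd = 400)   # hypothetical annual precipitation
eastness <- runif(n, -1, 1)                   # hypothetical aspect component

# Simulated outcome: forest probability rises with precipitation only.
forested <- rbinom(n, 1, plogis((precip - 1200) / 200))

# Pearson correlation with a 0/1 outcome.
cors <- c(
  precip   = cor(forested, precip),
  eastness = cor(forested, eastness)
)
round(cors, 2)   # precip clearly positive, eastness near zero
```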
## VIP
```{r}
#| label: variable-importance-plt
tar_read(plt_vip)
```
::: {.notes}
**1. The Comparison**
- "We just looked at Correlations (linear relationships). Now let's look at Variable Importance via Random Forest. This is what the model actually uses to make decisions."
**2. The Consistency**
- "The top three are the same: Canopy Cover, Rain, and Aridity (Vapor Pressure). This confirms our model is learning real biophysics."
**3. Spatial Factors**
- "But look at number 4: **Longitude**."
- "In the correlation chart, Longitude was just a moderate factor. Here, it is massive."
- "The model has learned that Washington is divided into two distinct climate zones—West and East."
- "Instead of learning the physics of *why* trees grow there, it's partially just memorizing *where* they grow. This confirms our hypothesis: the model is using geography as a shortcut, which is worth remembering when we apply it to the Georgia data."
:::
## UMAP
```{r}
#| label: umap-plot
tar_read(umap_plot)
```
## Spatial Dependency Analysis
```{r}
#| label: fig-moran
#| echo: false
#| fig-align: "center"
#| fig-cap: "<b>Moran Scatterplot.</b> The strong positive slope confirms significant spatial autocorrelation ($I > 0.6$)."
tar_read(p_moran_exploration)
```
:::{.notes}
SPEAKER NOTES:
1. THE VISUAL EVIDENCE: Point out the steep, positive slope of the
red dashed line. This slope is a visual representation of the
Global Moran's I. A positive slope confirms that high-elevation
plots are surrounded by other high-elevation plots (Top Right
Quadrant), while low-elevation areas are also clustered (Bottom Left).
2. THE "CHEATING" PROBLEM: Explain that this clustering is why
standard Random Cross-Validation is insufficient. If a training
point and a testing point are only 5km apart, the model can
effectively "cheat" by using local similarities rather than
learning the broader ecological relationships.
3. THE JUSTIFICATION: This plot is the primary justification for:
- Using Spatial Block Cross-Validation to force the model to
predict on entirely unseen regions.
- Removing "Northness" and "County" as predictors to prevent the
model from simply memorizing regional averages.
- Applying the Yeo-Johnson transformation to normalize the extreme
elevation variance seen in these clustered Cascade peaks.
4. THE SCALE: Note that we used a 5km fixed-distance neighborhood,
with coordinates projected to Washington State Plane North (meters),
so that neighbor distances are geographically accurate.
:::
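As a rough illustration of what the Global Moran's I measures, the statistic can be computed by hand in base R with a binary fixed-distance neighborhood. This is a sketch on a synthetic grid, not the pipeline behind `p_moran_exploration`; the function `morans_i()` and the grid are assumptions made for the example.

```r
# Global Moran's I with a fixed-distance (binary) neighbourhood.
# `xy` must hold projected coordinates in metres; `x` is the variable of interest.
morans_i <- function(x, xy, dist_m = 5000) {
  d <- as.matrix(dist(xy))            # pairwise Euclidean distances
  w <- (d > 0 & d <= dist_m) * 1      # 1 if two plots are neighbours, else 0
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Hypothetical plots on a 2 km grid whose "elevation" rises smoothly eastward.
xy   <- expand.grid(x = seq(0, 38000, by = 2000), y = seq(0, 38000, by = 2000))
elev <- xy$x / 10

set.seed(3)
morans_i(elev, xy)           # clustered values: strongly positive
morans_i(sample(elev), xy)   # same values shuffled in space: close to zero
```

Shuffling keeps the distribution of elevation identical and destroys only its spatial arrangement, which is why the second value collapses toward zero.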
# Resampling
## Spatial Autocorrelation
<br/>
<br/>
>"When data are not independent (e.g. due to spatial autocorrelation), random cross-validation yields optimistic estimates of predictive performance because training and test sets are not independent."[@roberts_crossvalidation_2017]
:::{.notes}
**1. Translation of the Quote**
This quote is a practical consequence of the First Law of Geography: "Everything is related to everything else, but near things are more related than distant things."
**2. Definition: Spatial Autocorrelation**
Spatial Autocorrelation just means that data points close to each other are practically clones. If it's raining at your house, it's probably raining at your neighbor's house.
**3. forested dataset**
Forests are "clumpy." If you stand next to a Douglas Fir in Washington and take one step to the left, you are almost certainly still in a forest. The elevation, soil, and rain are nearly identical.
**4. Why Random CV is "Optimistic" (The Cheating)**
- When the standard **Random Cross-Validation** is used, the first tree is assigned to the "Study Group" and the second tree (one step away) to the "Test Group."
- The model doesn't learn ecology. It just looks at the neighbor (lat and lon) and copies the answer.
- This gives us an **"Optimistic Estimate"**—a fancy way of saying our high score was fake because the model was cheating off its neighbor.
:::
## The Mechanics of Resampling
```{r}
#| label: fig-resampling
#| echo: false
#| fig-cap: "Visualizing the resampling process [@kuhn_tidy_2022]"
#| fig-align: "center"
#| out-width: "75%"
#| out-extra: 'style="width:75%;"'
knitr::include_graphics(here::here("images", "resampling.svg"))
```
:::{.notes}
- **The Concept:** Resampling methods (like cross-validation and bootstrapping) are **empirical simulation systems**. They generate different versions of our training set to simulate how the model handles new data.
- **The Golden Rule:** It is critical to remember: Resampling is *always* used with the **Training set**. The **Test set** is not involved.
- **The Vocabulary:** To avoid confusion with our initial Train/Test split, we use specific language for these internal loops:
- **Analysis Set:** The subset used to **fit** the model.
- **Assessment Set:** The subset used to **evaluate** performance.
- **The Mechanism:** In every iteration, these two sets are **mutually exclusive**. We fit on the Analysis set, and we measure performance on the Assessment set.
- **The Why:** As we discussed, simply re-predicting the training set is problematic (it leads to optimism bias). Resampling allows us to get a realistic appraisal using the training set without ever touching the final test data.
:::
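The analysis/assessment mechanics can be sketched without any modeling package. The snippet below builds five folds by hand on a hypothetical training set of 100 rows and checks the two properties described in the notes; in the project itself this bookkeeping is presumably handled by the resampling functions rather than written manually.

```r
set.seed(11)
n_train <- 100                        # rows in the (hypothetical) training set
v       <- 5                          # number of folds
fold_id <- sample(rep(seq_len(v), length.out = n_train))

folds <- lapply(seq_len(v), function(k) {
  list(
    analysis   = which(fold_id != k), # rows used to fit the model
    assessment = which(fold_id == k)  # rows used to measure performance
  )
})

# Within every fold the two sets share no rows (all zeros) ...
sapply(folds, function(f) length(intersect(f$analysis, f$assessment)))

# ... and across folds each training row is assessed exactly once.
assessed <- unlist(lapply(folds, `[[`, "assessment"))
all(sort(assessed) == seq_len(n_train))
```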
## Random K-Fold Cross-Validation
```{r}
#| label: fig-classic-cv
#| fig-cap: "Conceptual diagram showing the random assignment of observations to the analysis and assessment groups."
tar_read("fig_classic_cv")
```
## Cross Validation Strategies
```{r}
#| label: fig-cv-strategies
#| echo: false
#| fig-width: 14
#| fig-height: 5
#| out-width: "100%"
#| fig-cap: "Three validation strategies. **Left:** Random splitting mixes train/test points. **Middle:** Spatial blocking forces geographic separation. **Right:** Clustering blocks by environmental similarity. (Note the outline for the Columbia Plateau. See @fig-ecoregion-comparison.)"
tar_read(plot_cv_comparison)
```
::: {.notes}
**1. Left Panel: Random CV (The Illusion of Accuracy)**
- This visualizes why Random CV yields **over-optimistic estimates**.
- Because the colors are mixed (Random), the model can accurately predict a "Red" point simply by memorizing the "Blue" point next to it.
- This isn't "true" predictive power; it is **autocorrelation leakage**. The model is interpolating neighbors rather than learning the underlying ecological rules.
**2. Middle Panel: Spatial Blocking (Forcing Independence)**
- To get a **realistic assessment**, we must enforce spatial independence.
- The grid structure ensures that the test data (Red blocks) is geographically distinct from the training data.
- The performance score will likely drop compared to the first map, but that lower score is **more accurate**. It reflects how the model will actually perform on a new, unvisited site.
**3. Right Panel: Environmental Clustering (Testing Generalization)**
- This strategy tests for **ecological generalization**.
- Notice the large red area in the southeast—the algorithm identified the **Columbia Plateau** as a distinct environment.
- By holding out entire environments (e.g., training on "Wet Coastal" to predict "Dry Plateau"), we test if the model captures the fundamental biological relationships (e.g., how temp/rain affect trees) rather than just memorizing geographic trends.
:::
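One way to quantify the difference between the left and middle panels is the distance from each held-out point to its nearest training point. The sketch below uses simulated coordinates and a deliberately coarse 2 x 2 grid; the helper `median_gap()` is hypothetical, and real block CV would typically use many more blocks.

```r
set.seed(5)
n  <- 400
xy <- data.frame(x = runif(n, 0, 100), y = runif(n, 0, 100))  # km, simulated

# Random folds: every point gets one of four folds at random.
random_fold <- sample(rep(1:4, length.out = n))

# Block folds: a 2 x 2 grid of 50 km squares, one fold per square.
block_fold <- 1 + (xy$x >= 50) + 2 * (xy$y >= 50)

# Median distance from each assessment point to its nearest analysis point.
median_gap <- function(fold, k = 1) {
  d     <- as.matrix(dist(xy))
  test  <- which(fold == k)
  train <- which(fold != k)
  median(apply(d[test, train, drop = FALSE], 1, min))
}

median_gap(random_fold)   # a few km: a training neighbour is always nearby
median_gap(block_fold)    # much larger: held-out points sit inside a block
```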
## Analysis vs. Assessment
```{r}
#| label: fig-mechanics
#| fig-cap: "Visualization of Fold 1 across three cross-validation strategies. Magenta points represent the held-out assessment set."
tar_read(fig_fold_mechanics)
```
::: {.notes}
- **Visualizing the Split**: This slide illustrates Fold 1 of 5; gray points represent the "Analysis" set used for training, while magenta points represent the "Assessment" set the model must predict.
- **Confetti vs. Blocks**: The Random split (left) creates a "confetti" effect where every test point is surrounded by nearby training points, leading to the spatial autocorrelation and "optimism bias" we discussed earlier.
- **Geographic and Ecological Isolation**: The middle and right maps show how we force the model to predict across geographic and ecological gaps.
- **The Columbia Plateau Test**: Specifically in the Environmental Clustering map (right), the entire Columbia Plateau is isolated as a test set.
- **Validating Results**: Because the model had to "learn" forests in the mountains to predict the Plateau, we gained high confidence in its performance there, which was later confirmed by the near-zero error rate in that region.
- **Preparation for Georgia**: This level of isolation is a direct rehearsal for our next step, where we move from the Washington ecoregions to the completely unfamiliar landscapes of Georgia.
:::
# Models
## Engines
::: {.incremental}
1. Logistic Regression
2. MARS
3. Random Forest
4. XGBoost
:::
::: {.notes}
**Logistic Regression:**
Simple, interpretable baseline. Captures linear relationships efficiently.
**MARS:**
Models non-linearities automatically. Good balance between linear models and trees.
**Random Forest:**
Robust ensemble method. Reduces overfitting through averaging.
**XGBoost:**
High-performance gradient boosting. Often dominates on tabular data.
:::
## Recipe A: With Coords
```{.r code-line-numbers="2|3"}
recipe(forested ~ ., data = train_data) %>%
  # geometry is ID, but lat/lon remain as predictors
  update_role(geometry, new_role = "id") %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
```
::: {.notes}
**Base Strategy:** The standard approach uses latitude and longitude as predictive features. The risk is that the model memorizes locations instead of learning rules.
:::
## Recipe B: No Coords
```{.r code-line-numbers="2|3"}
recipe(forested ~ ., data = train_data) %>%
  # Explicitly remove lat/lon from training
  update_role(geometry, lat, lon, new_role = "id") %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
```
::: {.notes}
**Non-Spatial:** Removes explicit coordinates to prevent spatial overfitting. Forces the model to rely solely on biological environmental signals.
:::
## Recipe C: Extensible
```{.r code-line-numbers="4-10|12"}
recipe(forested ~ ., data = train_data) %>%
  update_role(geometry, lat, lon, new_role = "id") %>%
  # 1. Remove political/time markers
  step_rm(northness, county, year) %>%
  # 2. Add Physics (Aridity & Temp Range)
  step_ratio(precip_annual, denom = denom_vars(temp_annual_max)) %>%
  step_mutate(
    temp_range = temp_annual_max - temp_annual_min,
    vpd_range = vapor_max - vapor_min
  ) %>%
  # 3. Fix Skew (Critical for Logistic Regression)
  step_YeoJohnson(elevation) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
```
::: {.notes}
**Extensible:** Engineers physics-based features like aridity and temperature range. Transforms skewed variables to help linear models extrapolate to new regions.
:::
## YeoJohnson Transformation
```{r}
#| echo: false
#| fig-align: center
#| fig-width: 10
#| fig-height: 5
#| fig-cap: "<b>Normalizing Elevation via Yeo-Johnson Transformation.</b> The raw elevation data (left) exhibits strong right-skewness, which can degrade linear model performance. Applying a Yeo-Johnson transformation with λ=0.49 (right) makes the distribution far more symmetric, which better suits the linear (log-odds) form of the Extensible Logistic Regression model."
tar_read(plot_yeo)
```
:::{.notes}
Why this matters:
- **The Problem (Left):** Raw elevation data is highly skewed. Linear models (like Logistic Regression) struggle with this because they assume a consistent relationship across the range.
- **The Solution (Right):** The Yeo-Johnson transformation normalizes the distribution (bell curve).
- **The Result:** This allows the model to "see" the signal clearly, improving stability when moving to new regions like Georgia.
:::
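For reference, the Yeo-Johnson transformation itself is short enough to write out. The function below follows the standard piecewise definition; the simulated `elev` values and the `skew()` helper are illustrative assumptions, and λ = 0.49 is simply the value quoted in the figure caption (`step_YeoJohnson()` estimates λ from the training data rather than taking it as input).

```r
# Yeo-Johnson transformation for a single lambda.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: a shifted Box-Cox transform.
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])
  }
  # Negative values: mirrored transform with exponent 2 - lambda.
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-y[!pos])
  }
  out
}

# Simulated right-skewed "elevation" values in metres.
set.seed(9)
elev <- rexp(1000, rate = 1 / 600)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(elev)                      # strongly right-skewed
skew(yeo_johnson(elev, 0.49))   # noticeably closer to symmetric
```

With λ = 1 the function returns its input unchanged, which is a convenient sanity check.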
## Resampling Strategies
```{.r code-line-numbers="2|5|8"}
# A. Random Folds (Standard)
vfold_cv(train_data, v = 10, strata = forested)

# B. Spatial Blocks (Grid-based)
spatial_block_cv(train_data, v = 10)

# C. Spatial Clustering (Region-based)
spatial_clustering_cv(train_data, v = 10)
```
::: {.notes}
- **Random Folds:** Standard approach. Randomly shuffles data. Dangerous here because it allows "cheating" via nearby pixels.
- **Spatial Blocks:** Divides the map into a checkerboard. Forces the model to predict on a blind grid square.
- **Spatial Clustering:** Uses K-means to create distinct ecological zones. The hardest test—simulates moving to a totally new region.
:::
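To make the clustering strategy concrete, here is a base R sketch that groups simulated plots with `kmeans()` and treats each cluster as one fold. It assumes clustering on plot coordinates and is not the `spatial_clustering_cv()` implementation; the coordinates and the choice of five folds are made up for the example.

```r
set.seed(21)
n  <- 300
xy <- data.frame(x = runif(n, 0, 100), y = runif(n, 0, 100))  # km, simulated

# Group plots into v spatial clusters; each cluster becomes one fold.
v        <- 5
clusters <- kmeans(xy, centers = v, nstart = 10)
fold_id  <- clusters$cluster

table(fold_id)            # plots per fold; sizes are usually unequal

# Holding out fold 1 removes an entire contiguous region from training.
assessment <- xy[fold_id == 1, ]
analysis   <- xy[fold_id != 1, ]
```

Unlike random folds, the fold sizes here are not balanced, so per-fold metrics can rest on quite different numbers of plots.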
# Results
## Spatial Validation Analysis
```{r}
#| label: fig-spatial-results
#| fig-cap: "Comparison of Model Performance (ROC AUC) across three spatial validation strategies. Benchmark (0.96 ROC AUC) indicated by the horizontal dashed line."
#| echo: false
#| message: false
tar_read(fig_cv_comparison)
```
::: {.notes}
- **Optimism Bias:** Notice the "Random CV" column. It shows nearly perfect performance (>0.95 ROC AUC). This is often a "spatial mirage" where the model is simply memorizing locations (autocorrelation) rather than learning environmental drivers.
- **Spatial Honesty:** The "Block" and "Cluster" columns provide a more realistic estimate of how the model will perform on new, geographically distant areas. This represents the "true" performance we should expect for out-of-sample prediction.
- **Feature Leakage:** Compare the "With Coords" vs. "No Coords" rows. If the "With Coords" model crashes in performance during Block CV but holds steady in Random CV, it is a clear sign of overfitting to spatial coordinates (Lat/Lon) rather than the underlying forest ecology.
- **Performance Benchmark:** The dashed line at 0.96 represents an established, high-performance baseline for forest classification. It serves as a "line in the sand" to determine if our machine learning approach provides a meaningful improvement over traditional methods; a model is only truly successful if it can exceed this 0.96 threshold under the pressure of spatial cross-validation.
:::
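Since every panel reports ROC AUC, it may help to see what that number is. One standard reading is the probability that a randomly chosen forested plot receives a higher score than a randomly chosen non-forested plot. The helper `roc_auc()` below is a hand-rolled sketch of that rank-based formula on toy values, not the metric function used in the pipeline.

```r
# ROC AUC via the rank-sum (Mann-Whitney) formula; ties count as half.
roc_auc <- function(truth, prob) {
  r     <- rank(prob)                  # average ranks handle ties
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c(1, 1, 1, 0, 0, 0)

# Every positive outranks every negative: AUC = 1.
roc_auc(truth, c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1))

# 6 of the 9 positive/negative pairs are ordered correctly: AUC = 0.667.
roc_auc(truth, c(0.9, 0.2, 0.7, 0.8, 0.3, 0.1))
```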
## Performance Stability
```{r}
#| label: fig-stability
#| fig-cap: "Distribution of ROC AUC scores across individual cross-validation folds. Note the variance in scores by resampling method."
#| echo: false
#| message: false
tar_read(fig_model_stability)
```
::: {.notes}
- **Falsely Confident:** In **Random CV**, notice how tightly clustered the points are at the top; the model's performance is artificially stable because every fold contains a representative "sprinkling" of the entire dataset.
- **The Reality of Variance:** As we transition to **Cluster CV**, the "violin" stretches out, indicating that the model performs significantly better in some geographic regions than others.
- **Identifying Weak Spots:** Each point in the Cluster CV column represents a specific geographic area; the points near the bottom of the violin represent "hard-to-predict" regions where the model's current features might be insufficient.
- **Predictive Risk:** While Random CV suggests a ROC AUC of ~0.98 everywhere, this plot shows that in some clusters performance may dip toward 0.85.
- **Stakeholder Transparency:** This variance is a critical insight for stakeholders, as it defines the geographic boundaries of where the model's predictions can be most (and least) trusted.
:::
## Predict on Test Set
```{r}
#| label: fig-final-test
tar_read(fig_final_performance)
```
::: {.notes}
- **Beyond the Fold:** This result represents the model's performance on the 20% test set that was "locked away" at the beginning of the project.
- **The Spatial Paradox:** You will notice our Test AUC (0.97) is actually *higher* than our Validation AUC (0.927). In standard AI, this is rare. In forestry, this tells us two things:
1. **Interpolation Power:** The high test score proves the model is excellent at "filling in gaps" within Washington, where it can leverage the patterns of nearby trees.
2. **Extrapolation Power:** The lower (0.927) validation score is our "honest" baseline for new regions, where we stripped away those spatial clues.
- **Classification Nuance:** The 91% accuracy vs. 0.97 ROC AUC suggests our model is a better "ranker" than a "classifier." It understands the *gradient* of forest probability better than the hard binary of "Tree vs. No Tree."
- **Validation Success:** The fact that our "Honest" Spatial CV score (0.93) is so high suggests that the 0.97 on the test set isn't just a fluke of spatial memory; it's built on a solid foundation of learning biophysical signals.
:::
## Test vs. Resample Performance
<br/>
<br/>
<br/>
```{r}
#| label: tbl-performance
#| echo: false
#| tbl-cap: "Comparison of model performance on resamples versus on the test set. Note that ROC AUC increased."
targets::tar_read(tbl_performance)
```
## Confusion Matrix
```{r}
#| label: fig-confusion
#| fig-cap: "Confusion matrix showing the classification performance of the final model on the 20% held-out Washington test set."
tar_read(fig_confusion_matrix)
```
::: {.notes}
- **Anatomy of Error:** This matrix breaks down our 91.1% accuracy into specific types of successes and failures, helping us move beyond a single aggregate number.
- **Symmetry of Mistakes:** We are looking for balance between the off-diagonal squares; a heavy skew toward one side would indicate the model has a systematic bias toward over-predicting or under-predicting forest cover.
- **False Positives vs. Negatives:** In ecological terms, False Positives often represent "ghost forests" where the structure exists but the classification differs, while False Negatives are "missed forests" where the model failed to detect the canopy signal.
- **Probability Sensitivity:** Since our ROC AUC is a high 0.97, most of these errors likely occur at the "decision boundary"—meaning the model was nearly correct (e.g., 48% probability) but the hard 50% cutoff forced an error.
- **Production Readiness:** The high density in the True Positive and True Negative quadrants confirms that the model is robust enough for regional mapping, despite the inherent complexity of transition zones.
:::
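The point about the decision boundary can be illustrated with a toy example. The eight probabilities below are invented; they are chosen so that both errors sit just beside the 50% cutoff.

```r
# Hypothetical true classes and predicted probabilities of "Yes".
truth <- factor(c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
                levels = c("Yes", "No"))
prob_yes <- c(0.95, 0.80, 0.48, 0.60, 0.52, 0.30, 0.10, 0.05)

# Hard classification at the 50% cutoff.
pred <- factor(ifelse(prob_yes >= 0.5, "Yes", "No"), levels = c("Yes", "No"))

# Rows are predictions, columns are the truth.
cm <- table(Predicted = pred, Truth = truth)
cm

accuracy <- sum(diag(cm)) / sum(cm)
accuracy    # 6 of 8 correct = 0.75
```

Only one of the 16 forested/non-forested pairs is mis-ordered (0.48 versus 0.52), so the ranking is nearly perfect even though a quarter of the hard calls are wrong.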
## Benchmarks
::: {style="font-size: 75%;"}
| Authority | Study Context | Accuracy |
| :--- | :--- | :--- |
| **Ismail et al. (2013)** | Ideal Conditions (Sclerophyll Forest) | **96%** |
| **USGS NLCD** | Federal Standard (US Gov) | **91%** |
| **Our Model (WA)** | Pacific Northwest Training | **90.7%** |
| **Complex Boreal** | Difficult Terrain (Alaska) | **~78%** |
:::
## Spatial Error Analysis
```{r}
#| label: fig-map-wa-errors
#| out-width: "100%"
#| fig-cap: "<b>Map showing Type I and II errors from model.</b> Points are shaded from purple to bright yellow based upon the absolute error of the prediction probability. Note the lack of errors in the Columbia Basin."
knitr::include_graphics("figs/wa_errors.png")
```
::: {.notes}
- **The "Hallucinations":** We are looking at the ~130 mistakes the model made on the test set.
- **Confidence vs. Confusion:** The bright yellow points are where the model was "confidently wrong" (high error magnitude). These aren't just close calls; the model was >90% sure based on the physics features (like elevation/aridity) but missed the biological reality.
- **Geography of Error:** Notice the clustering. The errors aren't random; they hug the alpine transition zones and the rugged coastline, suggesting the model struggles most at the "biophysical edges" where the rules of the forest change rapidly.
:::
# Extrapolation
## The Goal of Prediction {text-align="center"}
<br/>
<br/>
> "The fundamental goal of a model is not to describe the data we have, but to predict the data we don't."[@kuhn_applied_2013]
::: {.notes}
- This quote from Kuhn and Johnson is the foundation of our entire project.
- If our model doesn't generalize to the "second island" (Georgia), it has failed its fundamental goal.
:::
## Assessing Domain Applicability
```{r}
#| label: fig-aoa-georgia
#| echo: false
#| out-width: "100%"
#| fig-cap: "Area of Applicability (AOA) Analysis. The Dissimilarity Index (DI) measures how different the Georgia environment is from Washington's. Note the similarity to the Level III ecoregion plot in @fig-ecoregion-comparison."
targets::tar_read(plot_aoa_ga)
```
::: {.notes}
- Before we even attempt to predict forests in Georgia, we have to ask a fundamental question: **Is it fair to ask a Washington model to understand Georgia?** We can't just assume the rules of nature are the same. We need to measure the mathematical distance between these two worlds.
**What am I looking at?**
- This map **does not** show predictions. It shows **familiarity**.
- We calculated a **Dissimilarity Index** for every pixel in Georgia. Essentially, we asked the model: *"Have you seen conditions like this before?"*
- **Dark Purple/Black:** These areas are the "safe zones." The elevation, temperature, and precipitation here fall within the ranges the model learned in the Cascades.
- **Bright Yellow:** These are the "alien" zones. The combination of variables here (likely the hot, humid lowlands) is completely outside the model's experience. This is pure extrapolation.
**The Takeaway:**
- This creates a **Risk Map**. If our model fails, we expect it to fail *here* [gesture to yellow areas].
- It tells us where our confidence should be high (the purple) and where any prediction is just a wild guess (the yellow).
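
If someone asks how the "familiarity" score is computed, the idea can be sketched in base R. This is a simplified, unweighted version of a dissimilarity index, not the pipeline's actual code or any package's API: the function name `dissimilarity_index()`, the two predictors, and all values are hypothetical. Each new point's distance to its nearest training point is divided by the average distance between training points, so values well above 1 mean the model has seen nothing like it.

```r
# Simplified, unweighted Dissimilarity Index (DI); an illustrative sketch only
dissimilarity_index <- function(train, new) {
  # Scale both sets using the training means and standard deviations
  mu   <- colMeans(train)
  sdev <- apply(train, 2, sd)
  train_s <- scale(train, center = mu, scale = sdev)
  new_s   <- scale(new,   center = mu, scale = sdev)

  # Average pairwise distance among training points (the yardstick)
  mean_train_dist <- mean(dist(train_s))

  # Euclidean distance from each new point to its nearest training point
  nearest <- apply(new_s, 1, function(p) {
    min(sqrt(colSums((t(train_s) - p)^2)))
  })

  nearest / mean_train_dist
}

set.seed(1)
# Hypothetical predictors: elevation (m) and mean annual temperature (deg C)
train <- cbind(elev = rnorm(50, 1000, 300), temp = rnorm(50, 8, 2))
new <- rbind(
  familiar = c(elev = 1000, temp = 8),  # near the centre of the training data
  alien    = c(elev = 20,   temp = 19)  # hot lowland, far outside training range
)

dissimilarity_index(train, new)
```

The full AOA method also weights predictors by their importance in the model and derives its cutoff from the training data; both steps are left out here to keep the idea visible.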
:::
## External Validation
```{r}
#| label: fig-ga-predictions
#| echo: false
#| out-width: "100%"
#| fig-cap: "The model predictions of forests in Georgia (a) versus the true forest inventory (b)."
targets::tar_read(map_ga_probs)
```
::: {.notes}
- We took the model trained in the Pacific Northwest and asked it: "Where are the forests in Georgia?"
- The Map: This shows the model's raw probability output.
- The Pattern: You can see it identifying the Blue Ridge Mountains (yellow/green) in the northeast.
- The Question: Does this match reality? Or is it seeing "forests" in places that are actually agricultural fields or swamps?
:::
## Quantifying the Error
```{r}
#| label: fig-ga-confusion
#| echo: false
#| out-width: "80%"
#| fig-align: "center"
#| fig-cap: "<b>Confusion Matrix (Georgia).</b> The model accuracy drops significantly compared to Washington. Note the high number of false negatives (Prediction: No / Truth: Yes)."
targets::tar_read(ga_conf_mat)
```
## Mapping the Failures
```{r}
#| label: fig-ga-errors
#| echo: false
#| out-width: "100%"
#| fig-cap: "Spatial Distribution of Errors. (a) shows the dissimilarity of Georgia from Washington. (b) shows error density increasing in southern Georgia."
targets::tar_read(map_failure_mechanism)
```
::: {.notes}
**Visualizing the "Phantom Forests"**
- This map only shows the mistakes. And unlike Washington, where we had a handful of dots, here the map is lit up.
- **Orange Points (False Positives):** Look at the massive cluster in the South/Southeast.
- These are the **"Phantom Forests."**
- Notice how they perfectly overlap with the "Yellow Zone" (high dissimilarity) we identified at the start. The model saw crops and scrubland and hallucinated trees.
- **The Verdict:** The error isn't random. It is geographically structured. We broke the model exactly where the AOA predicted it would break.
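
The claim that the errors are geographically structured can be checked numerically by comparing error rates inside and outside the area of applicability. The sketch below uses made-up values; the `di` and `wrong` columns and the 0.5 cutoff are assumptions for illustration (in practice the cutoff would be derived from the training data).

```r
# Hypothetical per-point results: DI value and whether the prediction was wrong
results <- data.frame(
  di    = c(0.1, 0.2, 0.3, 0.4, 1.2, 1.5, 1.8, 2.1),
  wrong = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Assumed cutoff separating inside from outside the area of applicability
threshold <- 0.5
results$zone <- ifelse(results$di > threshold, "outside AOA", "inside AOA")

# Share of wrong predictions in each zone
error_by_zone <- tapply(results$wrong, results$zone, mean)
error_by_zone
```

A clearly higher error rate outside the AOA than inside it is the pattern that would support "the model broke where the AOA said it would"; similar rates in both zones would argue against it.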
:::
## Lessons Learned
::: {.incremental}
* **Accuracy Collapse:** ~90% (WA) $\to$ ~54% (GA).
* **AOA Validation:** The "Yellow Zone" correctly flagged the risk.
* **The Trap:** High confidence in "Phantom Forests."
* **The Fix:** Quantify domain distance *before* deployment.
:::
::: {.notes}
**1. The Numbers Don't Lie**
We witnessed a catastrophic failure in performance. In Washington, we had a precision instrument (90% accuracy). In Georgia, we essentially flipped a coin (54%). If we had deployed this model in production without validation, we would be generating random noise.
**2. The "Yellow Zone" Was a Warning, Not a Bug**
Remember that bright yellow map? That wasn't just a pretty picture. The Area of Applicability (AOA) screamed at us that the Southeastern Plains were alien territory. The model failed exactly where the AOA said it would—predicting forests in flat, hot agricultural zones it didn't understand.
**3. The "Phantom Forest" Problem**
Our confusion matrix showed a massive spike in False Positives. This is dangerous. The model didn't say "I don't know"; it confidently declared "Yes, there is a forest here." We call these "Phantom Forests." In a real-world scenario—like carbon credit monitoring or fire risk assessment—phantom forests cost millions of dollars.
**4. The Ultimate Takeaway**
The model failed, but the **workflow succeeded**. By calculating the multidimensional distance between our training data and our target data, we predicted *where* the model would break before we even ran it.
**Conclusion:** In spatial data science, you cannot simply "train and deploy." You must respect the ecological boundaries of your training data.
:::
# References
::: {#refs}
:::