---
title: "BAF Fraud Modeling"
author: "Rob Wiederstein"
date: today
date-format: long
---
```{r}
#| label: setup
#| include: false
library(here)
library(targets)
library(knitr)
# Make chunk paths resolve relative to reports/
#knitr::opts_knit$set(root.dir = here::here("reports"))
# Declare deps for tar_quarto() (optional, but good)
invisible(targets::tar_read(report_assets))
```
# Introduction
## Bank Account Fraud Dataset {.incremental}
- Synthetic online account applications
- 1M rows (Base)
- 8 months (0–7)
- Base + 5 biased variants
- Label: Fraud vs Legit
- Fraud $\approx 1\%$
:::{.notes}
**What it is (plain English):** each row is a bank account opening application submitted online. Fraudsters may impersonate someone (identity theft) or invent a person; once approved they quickly exploit the credit line or use the account to move illicit funds.
**Why it exists:** the BAF *suite* was created as a large, realistic benchmark to stress-test ML performance and fairness under **dynamic / drifting** conditions and “extreme” class imbalance. The variants introduce controlled bias patterns; the Base set has no induced bias.
**How it was made:** the released data are **synthetic** (generated from a CTGAN trained on an anonymized, feature-engineered real dataset). Privacy protections mean no row corresponds to a real identifiable person.
**Time structure:** `month` ranges 0–7 (eight months). This is why we use chronological evaluation (train early months, test late months).
**Target variable:** datasheet label is `fraud_bool` (0/1). In our pipeline we rename/recode to `outcome` with labels “Legit” and “Fraud” for readability.
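
The chronological evaluation mentioned above can be sketched as follows. This is a minimal illustration, assuming a data frame with a `month` column coded 0–7. The helper name `split_by_month()` and the cutoff (train on months 0–5, test on 6–7) are hypothetical and are not taken from the pipeline.

```r
# Hypothetical helper: split rows chronologically on the `month` column.
# `last_train_month` is an assumed cutoff, not the pipeline's actual setting.
split_by_month <- function(df, last_train_month = 5) {
  stopifnot("month" %in% names(df))
  list(
    train = df[df$month <= last_train_month, , drop = FALSE],
    test  = df[df$month >  last_train_month, , drop = FALSE]
  )
}

# Toy data standing in for the BAF Base file
toy_months <- data.frame(month = rep(0:7, each = 3), x = seq_len(24))
parts <- split_by_month(toy_months)
range(parts$train$month)  # months 0 through 5
range(parts$test$month)   # months 6 through 7
```

Splitting on time rather than at random keeps every test row later than every training row, which mirrors how a fraud model would be used in production.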
:::
## Typical Scenario {.incremental}
Fraudsters will
1. Impersonate someone, or
2. Create a fake identity, then
3. Max out the credit line, or
4. Receive illicit payments
## Data Cleaning {.incremental}
- Relabel outcome.
- -1 → NA.
- Negative amount → NA.
- Write clean Parquet.
:::{.notes}
**Outcome**
- `fraud_bool` (0/1) → `outcome` ("Legit"/"Fraud"); drop `fraud_bool`.
**Missing encoded as values**
- Recode `-1` to `NA` for:
  - `prev_address_months_count`
  - `current_address_months_count`
  - `bank_months_count`
  - `session_length_in_minutes`
  - `device_distinct_emails` (named `device_distinct_emails_8w` in this dataset; the cleaning function handles either name)
**Range constraint**
- `intended_balcon_amount < 0` → `NA` (negative values encode missingness).
**Output**
- Saved cleaned dataset as Parquet under `03_primary/variant=Base/` partitioned by `month`.
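
The cleaning rules listed above could look roughly like this in base R. It is a sketch, not the pipeline's actual function: the name `clean_baf()` and the toy data are hypothetical, and the Parquet step is shown only as a comment because it assumes the `arrow` package is installed.

```r
# Columns where -1 is a missing-value sentinel (from the notes above)
sentinel_cols <- c(
  "prev_address_months_count", "current_address_months_count",
  "bank_months_count", "session_length_in_minutes",
  "device_distinct_emails", "device_distinct_emails_8w"
)

# Hypothetical helper illustrating the cleaning rules
clean_baf <- function(df) {
  # Relabel the target and drop the original 0/1 column
  df$outcome <- factor(df$fraud_bool, levels = c(0, 1),
                       labels = c("Legit", "Fraud"))
  df$fraud_bool <- NULL
  # Recode -1 to NA, touching only the sentinel columns that are present
  for (col in intersect(sentinel_cols, names(df))) {
    df[[col]][df[[col]] %in% -1] <- NA
  }
  # A negative intended balance also encodes missingness
  neg <- !is.na(df$intended_balcon_amount) & df$intended_balcon_amount < 0
  df$intended_balcon_amount[neg] <- NA
  df
}

toy_raw <- data.frame(
  fraud_bool = c(0, 1, 0),
  bank_months_count = c(12, -1, 3),
  device_distinct_emails_8w = c(1, 1, -1),
  intended_balcon_amount = c(50.5, -0.7, 10)
)
cleaned <- clean_baf(toy_raw)

# Writing step (not run here); assumes the arrow package and a `month` column:
# arrow::write_dataset(cleaned, "03_primary/variant=Base", partitioning = "month")
```

Using `intersect()` is one simple way to handle either spelling of the device email column without failing when one of them is absent.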
:::
# Explore
## Variable Importance
```{r}
#| label: fig-var-imp
#| fig-cap: "Top 15 features driving the diagnostic model."
knitr::include_graphics("reports/figures/fig_var_imp.png")
```
:::{.notes}
The diagnostic LightGBM model shows that behavior and identity structure dominate the early splits.
:::
## Feature Interaction
```{r}
#| label: fig-hexbin-interaction
#| fig-cap: "Interaction between Credit Risk Score and Address History."
knitr::include_graphics("reports/figures/fig_hexbin_interaction.png")
```
:::{.notes}
Fraud clusters noticeably in high credit risk profiles combined with specific address tenure patterns.
:::
## Missingness Signal
```{r}
#| label: fig-missingness
#| fig-cap: "Missingness rates by outcome."
knitr::include_graphics("reports/figures/fig_missingness.png")
```
:::{.notes}
Fraudulent applications are missing key tenure details (such as previous address and bank history) systematically more often than legitimate applications.
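
A comparison like the one in the figure can be computed as the share of `NA` values per column within each outcome group. The sketch below uses toy data; the helper name `missing_rate_by_outcome()` is hypothetical and the numbers are not results from the BAF data.

```r
# Hypothetical helper: share of missing values per column, split by outcome
missing_rate_by_outcome <- function(df, cols, outcome = "outcome") {
  sapply(cols, function(col) tapply(is.na(df[[col]]), df[[outcome]], mean))
}

toy_missing <- data.frame(
  outcome = factor(c("Legit", "Legit", "Legit", "Legit", "Fraud", "Fraud")),
  prev_address_months_count = c(10, NA, 24, 36, NA, NA),
  bank_months_count = c(5, 8, NA, 2, NA, 7)
)

# Rows are outcome groups, columns are features
rates <- missing_rate_by_outcome(
  toy_missing, c("prev_address_months_count", "bank_months_count")
)
rates
```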
:::
## Numeric Correlation
```{r}
#| label: fig-num-cor
#| fig-cap: "Core numeric correlation matrix."
knitr::include_graphics("reports/figures/fig_num_cor.png")
```
:::{.notes}
The structural anchor of the synthetic data is visible here, particularly the relationship between credit score and proposed limit.
:::
# LightGBM
## About {.incremental}
- Originally released in 2016
- Maintained by Microsoft
- Over 18,000 stars on GitHub
- King of Kaggle for tabular data
- Introductory paper cited over 23,000 times
- Trains up to 20x faster than conventional gradient boosting
## Academic Support
::: {.panel-tabset}
### Standard
>For tabular supervised learning, gradient boosted decision trees—most notably XGBoost and LightGBM—are strong, low-latency baselines because they exploit hand-engineered behavioral features; LightGBM remains a **standard** reference point for card and e-commerce fraud tasks [@aminian_fraudtransformer_2025]
### Accurate
>[W]e found that the LightGBM approach had the highest detection **accuracy** of fraudulent activity with 97% in the experiments conducted. An additional key objective of reducing false alerts was accomplished, as the number of false alarms went from 13,024 to 6,249 [@iscan_walletbased_2023]
### Efficient
>[W]e choose LightGBM as the base machine learning model due to its **efficiency** and widespread use in handling large-scale and structured datasets, particularly in financial domains such as credit card fraud detection.[@zhao_improved_2024]
:::
# Unbalanced Classes
## The Challenge
>The scarce occurrences of rare events impair the detection task …
:::{.notes}
**Citation:** Guo, H., Li, Y., Shang, J., Gu, M., Huang, Y., & Gong, B. (2017).
*Learning from class-imbalanced data: Review of methods and applications.*
**Expert Systems with Applications, 73**, 220–239. https://doi.org/10.1016/j.eswa.2016.12.035
:::
## Bank Fraud Prevalence
```{r}
#| label: fig-fraud-prevalence-plot
#| fig-cap: "Fraudulent versus legitimate applications by month."
knitr::include_graphics("reports/figures/fig_fraud_by_month.png")
```
:::{.notes}
Fraud represents approximately one percent of applications.
:::
## Fraud Prevalence
```{r}
#| label: tbl-fraud-by-month
#| tbl-cap: "Fraud prevalence by month."
readRDS("reports/tables/tbl_fraud_by_month.rds")
```
## Methods Tested {.incremental}
- **Standard:** Baseline (No sampling).
- **Weighted:** Cost-sensitive learning ($4\times$ penalty).
- **Undersampling:** Random removal of majority class.
- **SMOTE:** Synthetic Minority Over-sampling Technique.
- **ADASYN:** Adaptive Synthetic Sampling (hard examples).
- **Tomek Links:** Cleaning boundary ambiguity.
:::{.notes}
**Standard:** The control group. We let the gradient booster handle the 1% imbalance naturally.
**Weighted:** We used `scale_pos_weight` to tell LightGBM that missing a Fraud case is 4x worse than a false alarm.
**Undersampling:** We discarded about 75% of the Legit cases to reduce the imbalance (the fraud share rises from about 1% to about 4%). Fast, but risky.
**SMOTE & ADASYN:** The "heavy hitters." These generate synthetic fraud cases by interpolating between nearest neighbors. ADASYN focuses specifically on "hard to learn" fraud cases.
**Tomek:** A cleaning method that removes Legit cases that are "too close" to Fraud cases, theoretically making the decision boundary clearer.
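
The two simplest strategies can be sketched in a few lines. This is an illustration on toy data, not the pipeline's code: the helper `undersample_legit()` is hypothetical, and the only value taken from the notes is `scale_pos_weight = 4`. The parameter list has the shape accepted by `lightgbm::lgb.train()`, but no model is trained here.

```r
set.seed(42)  # reproducible toy example

# Toy labels with the roughly 1% prevalence described earlier
n <- 10000
toy_labels <- data.frame(
  outcome = factor(ifelse(seq_len(n) <= 100, "Fraud", "Legit"),
                   levels = c("Legit", "Fraud")),
  x = rnorm(n)
)

# Weighted: only scale_pos_weight = 4 comes from the notes above
params_weighted <- list(objective = "binary", scale_pos_weight = 4)

# Undersampling: keep every Fraud row and a random 25% of Legit rows
undersample_legit <- function(df, keep_frac = 0.25) {
  legit_idx <- which(df$outcome == "Legit")
  keep_legit <- sample(legit_idx, size = round(length(legit_idx) * keep_frac))
  df[sort(c(which(df$outcome == "Fraud"), keep_legit)), , drop = FALSE]
}

under <- undersample_legit(toy_labels)
mean(toy_labels$outcome == "Fraud")  # 0.01 before
mean(under$outcome == "Fraud")       # about 0.039 after
```

Both approaches shift the effective class balance by a factor of about four: weighting does it in the loss function, while undersampling does it by removing rows.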
:::
## Strategy Showdown: Results
```{r}
#| label: tbl-strategy-showdown
#| tbl-cap: "Performance comparison across imbalance strategies using 3-month rolling windows."
readRDS("reports/tables/tbl_strategy_showdown.rds")
```
:::{.notes}
The "Standard" baseline is statistically indistinguishable from more complex methods like SMOTE and ADASYN (p > 0.05). Complex sampling provides no significant predictive gain on this dataset.
:::
## Sampling Compared
```{r}
#| label: fig-strategy-showdown
#| fig-cap: "PR-AUC performance versus computational training time."
knitr::include_graphics("reports/figures/fig_strategy_showdown.png")
```
:::{.notes}
The Standard strategy sits on the efficient frontier: it achieves near-peak performance while training nearly twice as fast as SMOTE or ADASYN. Tomek sampling degraded performance while increasing compute time.
:::
## Sampling Methods Discarded {.incremental}
- No statistical gain
- Resource intensive
- Poor scalability
:::{.notes}
Complex sampling methods like SMOTE and ADASYN do not outperform the baseline "Standard" model: the differences are not statistically significant (p > 0.05).
Synthetic generation and neighbor calculations nearly double the average training time per fold compared to the standard approach.
On larger datasets, the simpler approach helps avoid memory bottlenecks and excessive compute costs.
Future efforts to improve performance should focus on areas other than sampling techniques.
:::
# Feature Creation
# Final Results
## The Confusion Matrix
```{r}
#| label: fig-confusion-matrix
#| echo: false
#| out-width: "100%"
knitr::include_graphics("resources/images/confusion-matrix.png")
```
:::{.notes}
The confusion matrix is the foundation of all classification metrics. Every metric we care about is derived from these four cells.
In the fraud context:
- **TN:** Legitimate application correctly approved. No harm done.
- **FP:** Legitimate application flagged as fraud. Customer friction, potential churn.
- **FN:** Fraud case missed. Direct financial loss — the costliest error.
- **TP:** Fraud correctly caught. The goal.
The key insight: not all errors are equal. A missed fraud case (FN) costs far more than a false alarm (FP). Our threshold and metric choices reflect this asymmetry.
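
The four cells can be read straight off a cross-tabulation. The sketch below uses six made-up applications, with "Fraud" as the positive class; the vectors are illustrative and are not model output.

```r
# Toy labels and predictions; "Fraud" is the positive class
actual    <- factor(c("Fraud", "Fraud", "Legit", "Legit", "Legit", "Legit"),
                    levels = c("Legit", "Fraud"))
predicted <- factor(c("Fraud", "Legit", "Legit", "Fraud", "Legit", "Legit"),
                    levels = c("Legit", "Fraud"))

# Rows are the truth, columns are the prediction
cm <- table(actual = actual, predicted = predicted)

tn <- cm["Legit", "Legit"]  # legitimate, approved
fp <- cm["Legit", "Fraud"]  # legitimate, flagged
fn <- cm["Fraud", "Legit"]  # fraud, missed
tp <- cm["Fraud", "Fraud"]  # fraud, caught
c(TN = tn, FP = fp, FN = fn, TP = tp)
```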
:::
## Precision & Recall
$$\text{Recall} = \frac{TP}{TP + FN}$$
> Of all actual frauds, how many did we catch?
$$\text{Precision} = \frac{TP}{TP + FP}$$
> Of all flagged cases, how many were real fraud?
:::{.notes}
**Recall** (also called **detection rate**) is the primary metric for fraud detection. Missing a fraud case (FN) is costly, so we want Recall as high as possible. A model that flags every application gets a perfect detection rate — but at the cost of Precision.
**Precision** captures that cost: if we flag everything, every legitimate customer gets rejected. Precision measures how trustworthy our fraud flags actually are.
The **Precision-Recall tradeoff** is the core tension in fraud modeling. Lowering the decision threshold increases Recall (catch more fraud) but decreases Precision (more false alarms). The right balance depends on the operational cost of each error type.
Our model targets **~49% Recall at a 5% False Positive Rate** — a deliberate operating point chosen to limit customer friction while catching nearly half of fraud.
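
One way to reach an operating point like this is to set the threshold at the 95th percentile of scores among legitimate applications, so that about 5% of them are flagged. The sketch below does that on simulated scores. It is an assumption about how such a threshold could be chosen, not a description of the pipeline, and the recall it produces is not the model's.

```r
set.seed(1)

# Simulated scores: fraud cases tend to score higher than legitimate ones
is_fraud <- c(rep(TRUE, 100), rep(FALSE, 9900))
score <- c(rnorm(100, mean = 2), rnorm(9900, mean = 0))

# Threshold = 95th percentile of legitimate scores, so about 5% of
# legitimate applications are flagged
threshold <- quantile(score[!is_fraud], probs = 0.95, names = FALSE)
flagged <- score > threshold

tp <- sum(flagged & is_fraud)
fp <- sum(flagged & !is_fraud)
fn <- sum(!flagged & is_fraud)
tn <- sum(!flagged & !is_fraud)

recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)
fpr       <- fp / (fp + tn)
round(c(recall = recall, precision = precision, fpr = fpr), 3)
```

Fixing the False Positive Rate first and reading off Recall second makes the customer-friction budget explicit before any fraud-capture figure is quoted.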
:::
## ROC vs Precision-Recall AUC
::: {.panel-tabset}
### ROC AUC
- Plots **Recall** vs **False Positive Rate**
- AUC = 0.5 is random; 1.0 is perfect
- Optimistic under class imbalance
- Inflated by the large TN pool
### PR AUC
- Plots **Precision** vs **Recall**
- Focuses entirely on the minority class
- Harder to game with a large Legit majority
- Preferred metric for fraud detection
:::
:::{.notes}
**Why ROC AUC can mislead on imbalanced data:** the False Positive Rate is FP / (FP + TN). With 99% legitimate applications the TN pool is enormous, so a model can raise many false alarms and still show a low False Positive Rate. ROC AUC rewards this, making models look better than they are.
**PR AUC** ignores true negatives entirely. It only asks: of the positive (fraud) predictions, how precise were we, and how much fraud did we recall? This makes it a far more honest scoreboard when positives are rare.
**Rule of thumb:** use ROC AUC for balanced classes; use PR AUC for imbalanced fraud/anomaly detection tasks. We report both, but optimise for PR AUC.
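
The gap between the two metrics is easy to reproduce on simulated data. The sketch below computes ROC AUC with the rank (Mann-Whitney) formulation and approximates PR AUC with average precision. Both helper functions and the simulated scores are illustrative assumptions and are not the pipeline's metric code.

```r
# ROC AUC via the rank (Mann-Whitney) formulation
roc_auc <- function(score, is_pos) {
  r <- rank(score)  # average ranks handle ties
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# PR AUC approximated by average precision: the mean of precision
# evaluated at the rank of each positive case
avg_precision <- function(score, is_pos) {
  ord <- order(score, decreasing = TRUE)
  hits <- is_pos[ord]
  precision_at_k <- cumsum(hits) / seq_along(hits)
  sum(precision_at_k[hits]) / sum(hits)
}

set.seed(7)
is_pos <- c(rep(TRUE, 100), rep(FALSE, 9900))  # 1% positives
sim_score <- c(rnorm(100, mean = 1.5), rnorm(9900, mean = 0))

roc_auc(sim_score, is_pos)        # well above 0.5
avg_precision(sim_score, is_pos)  # much lower than the ROC AUC here
```

The same scores look strong on one scale and modest on the other, which is why the choice of headline metric matters when positives are rare.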
:::
## Final Model Evaluation
```{r}
#| label: fig-conf-mat-heatmap
#| echo: false
#| out-width: "100%"
#| fig-cap: "Confusion Matrix Heatmap (5% Decision Threshold)"
knitr::include_graphics("reports/figures/fig_final_conf_mat.png")
```
## Diagnostic Metrics
```{r}
#| label: fig-final-curves
#| echo: false
#| out-width: "100%"
#| fig-cap: "ROC and Precision-Recall Curves for Out-of-Sample Data"
knitr::include_graphics("reports/figures/fig_final_curves.png")
```
# References {.smaller}