bank-fraud-baf-lakehouse/README.md
Rob Wiederstein 85bc257e7b
Rename package from baflakehouse to bankfraud
- DESCRIPTION: Package name and URL updated to /bank-fraud
- R/baflakehouse-package.R → R/bankfraud-package.R
- _pkgdown.yml: url and reference alias updated
- deploy.yaml: TARGET_DIR updated to /var/www/docs/bank-fraud/
- deploy/baflakehouse.caddy: deleted (stale, superseded by rsync workflow)
- tests and README updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-23 09:38:54 -05:00

---
output: github_document
---
- [bankfraud](#bankfraud)
  - [About](#about)
  - [Pipeline](#pipeline)
  - [Results](#results)
  - [Clone](#clone)
  - [Acknowledgements](#acknowledgements)
  - [Citation](#citation)
# bankfraud
## About
The bankfraud package is an end-to-end machine learning pipeline built to detect bank account fraud. Rather than relying on static local files, it implements a Lakehouse architecture: it ingests a 1-million-row dataset, partitions it into Parquet files via Apache Arrow, stores it on a MinIO object server, and trains a production LightGBM model, all orchestrated by the `targets` package.
**Significance**
Financial fraud datasets suffer from extreme class imbalance, making traditional accuracy metrics highly misleading. This pipeline is engineered specifically to handle that imbalance without aggressive synthetic oversampling.
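To make the point about accuracy concrete, here is a minimal sketch in base R. The 1% fraud rate and the "never flag anything" classifier are illustrative assumptions, not figures taken from this project.

```r
# Synthetic labels: 1% fraud, 99% legitimate (illustrative rate only).
n <- 10000
truth <- c(rep("Fraud", 100), rep("Legit", n - 100))

# A useless classifier that never flags an application.
pred <- rep("Legit", n)

# Accuracy looks excellent, yet not a single fraud case is caught.
accuracy <- mean(pred == truth)
recall <- sum(pred == "Fraud" & truth == "Fraud") / sum(truth == "Fraud")

cat(sprintf("accuracy = %.2f, recall = %.2f\n", accuracy, recall))
# accuracy = 0.99, recall = 0.00
```

This is why the tournament below ranks strategies by PR-AUC rather than accuracy.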
## Pipeline
The pipeline is orchestrated by the `targets` package and executes as a reproducible DAG. All data is stored remotely in MinIO and accessed via Apache Arrow — no local CSVs or intermediate files on disk.
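For readers new to `targets`, a pipeline is declared as a list of `tar_target()` calls in a `_targets.R` file, and the dependency graph is inferred from which targets each command mentions. The sketch below shows the general shape only; the function names `ingest_raw()` and `clean_data()` and the target names are hypothetical stand-ins, not the ones used in this repository.

```r
library(targets)

# Hypothetical stand-ins for the real layer functions.
ingest_raw <- function() {
  data.frame(month = c(0L, 0L, 1L), fraud_bool = c(0L, 1L, 0L))
}
clean_data <- function(d) {
  d$outcome <- ifelse(d$fraud_bool == 1L, "Fraud", "Legit")
  d
}

# `primary` depends on `intermediate` because its command refers to it,
# so targets builds them in that order and skips up-to-date steps.
pipeline <- list(
  tar_target(intermediate, ingest_raw()),
  tar_target(primary, clean_data(intermediate))
)

# A _targets.R file must end by returning the list of targets.
pipeline
```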
**Layer 01 → 02 | Ingest**
Raw CSVs are read from `baf-fraud/01_raw` and converted to Hive-partitioned Parquet files in `02_intermediate` using Arrow's `write_dataset()`.
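A minimal local sketch of this step, assuming `month` is the partition key and using a temporary directory in place of the MinIO bucket (the toy columns are illustrative). Against MinIO, the paths would presumably be replaced by an S3 filesystem location such as one created with `arrow::s3_bucket()`.

```r
library(arrow)

# Tiny stand-in for a raw CSV; the real files live under baf-fraud/01_raw.
raw <- data.frame(
  month = c(0L, 0L, 1L, 1L),
  fraud_bool = c(0L, 1L, 0L, 0L),
  income = c(0.3, 0.9, 0.5, 0.1)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(raw, csv_path, row.names = FALSE)

# Local directory standing in for the 02_intermediate layer.
out_dir <- tempfile("02_intermediate_")

# Open the CSV lazily and write one Hive-style directory per month value.
ds <- open_dataset(csv_path, format = "csv")
write_dataset(ds, out_dir, format = "parquet", partitioning = "month")

# Hive partitioning encodes the key in the directory name (month=<value>).
parts <- list.files(out_dir, recursive = TRUE)
print(parts)
```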
**Layer 02 → 03 | Clean**
Sentinel values (`-1`) are recoded to `NA`, the binary outcome is relabelled from `fraud_bool` to `outcome` ("Fraud"/"Legit"), and the cleaned data is written to `03_primary` partitioned by month.
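The cleaning rules can be sketched in base R on a toy data frame. The two sentinel columns named here and the mapping of `fraud_bool == 1` to "Fraud" are assumptions made for illustration; the package may recode a different set of columns.

```r
# Toy slice with -1 used as a "missing" sentinel.
d <- data.frame(
  month = c(0L, 0L, 1L),
  fraud_bool = c(0L, 1L, 0L),
  prev_address_months_count = c(-1, 24, -1),
  bank_months_count = c(12, -1, 3)
)

# Recode the sentinel only in columns where -1 means "missing" (assumed list).
sentinel_cols <- c("prev_address_months_count", "bank_months_count")
d[sentinel_cols] <- lapply(
  d[sentinel_cols],
  function(x) replace(x, which(x == -1), NA)
)

# Relabel the binary target as a two-level factor and drop the original.
d$outcome <- factor(
  ifelse(d$fraud_bool == 1L, "Fraud", "Legit"),
  levels = c("Fraud", "Legit")
)
d$fraud_bool <- NULL

print(d)
```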
**Layer 03 → 04 | Feature Engineering**
A missingness count feature (`n_missing`) is computed out-of-memory via Arrow compute and written to `04_feature`.
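As an in-memory illustration of what `n_missing` contains (the pipeline itself computes it through Arrow rather than base R), the feature is simply the number of `NA` values per row across the predictor columns. The column names `a` and `b` are placeholders.

```r
# Toy cleaned data with some missing predictor values.
d <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 2),
  outcome = c("Legit", "Fraud", "Legit")
)

# Count NAs per row over the predictors, excluding the outcome column.
feature_cols <- setdiff(names(d), "outcome")
d$n_missing <- as.integer(rowSums(is.na(d[feature_cols])))

d$n_missing
# 1 2 0
```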
**Layer 04 → 05 | Resampling**
Five versions of each monthly slice are generated and saved to `05_model_input`: Baseline (labelled Standard in the tournament table), Undersampling, SMOTE, ADASYN, and Tomek Links.
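Of these strategies, random undersampling is the simplest to show without extra packages. The sketch below keeps every fraud row and draws an equal number of legitimate rows; the 1:1 target ratio and the toy data are assumptions, and the package may use a different ratio or implementation.

```r
set.seed(42)

# Toy monthly slice: 20 fraud rows among 1000.
d <- data.frame(
  x = rnorm(1000),
  outcome = factor(
    c(rep("Fraud", 20), rep("Legit", 980)),
    levels = c("Fraud", "Legit")
  )
)

fraud_idx <- which(d$outcome == "Fraud")
legit_idx <- which(d$outcome == "Legit")

# Keep all fraud rows; sample the majority class down to the same size.
keep_legit <- sample(legit_idx, size = length(fraud_idx))
under <- d[sort(c(fraud_idx, keep_legit)), ]

table(under$outcome)
```

Only training slices should be resampled this way; a test slice needs to keep its natural class ratio for metrics such as PR-AUC to be meaningful.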
**Imbalance Tournament**
LightGBM models are trained on each resampled dataset, plus a Weighted variant that appears in the table below, using three sliding time windows (train on months t, t+1, t+2; test on t+3). Strategies are ranked by PR-AUC and tested for statistical significance with a paired t-test against the Standard baseline.
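The windowing scheme can be sketched as follows. It assumes the tournament runs over months 0 to 5, which would yield exactly three windows; `make_windows()` is a hypothetical helper, not a function from this package.

```r
# Assumed month range for the tournament (months 6 and 7 are held out later).
months <- 0:5

# Build every window of `train_len` consecutive months plus the month after.
make_windows <- function(months, train_len = 3) {
  n_windows <- length(months) - train_len
  lapply(seq_len(n_windows), function(i) {
    list(
      train = months[i:(i + train_len - 1)],
      test = months[i + train_len]
    )
  })
}

windows <- make_windows(months)
for (w in windows) {
  cat("train:", w$train, "| test:", w$test, "\n")
}
```

Each test month always follows its training months, so no future information leaks into training.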
| Strategy | PR-AUC | Avg Train Time (s) | Sig. vs Standard |
|---|---|---|---|
| Standard | 0.1650 | 2.19 | — |
| ADASYN | 0.1629 | 3.87 | No (p = 0.37) |
| SMOTE | 0.1617 | 3.79 | No (p = 0.15) |
| Weighted | 0.1577 | 2.18 | No (p = 0.15) |
| Tomek | 0.1483 | 2.16 | **Yes (p = 0.009)** |
| Undersampling | 0.1394 | 0.92 | **Yes (p = 0.029)** |
The Standard baseline has the highest PR-AUC, although its margin over ADASYN, SMOTE, and Weighted is not statistically significant. SMOTE and ADASYN add roughly 75% to training time (about 3.8 s versus 2.2 s) for no measurable gain. Tomek Links and Undersampling significantly *hurt* performance and are discarded.
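The significance column comes from a paired t-test, which compares two strategies window by window rather than pooling their scores. A minimal sketch with `t.test()` is shown below; the per-window PR-AUC values are made up purely for illustration and are not the project's results.

```r
# Made-up per-window PR-AUC values (NOT results from this pipeline).
standard <- c(0.160, 0.170, 0.165)
candidate <- c(0.150, 0.158, 0.157)

# Pair scores by window: the test is run on the per-window differences.
res <- t.test(candidate, standard, paired = TRUE)

# With three windows the test has only 2 degrees of freedom.
res$parameter
res$p.value
```

With so few windows the test has little power, so a non-significant result is weak evidence that two strategies are equivalent.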
**Layer 05 → 06 | Production**
The winning Standard strategy is retrained on months 0–5, evaluated on the held-out months 6–7, serialised to `baf_lgbm_prod_v1.txt`, and uploaded to `baf-fraud/06_models` in MinIO.
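A hedged sketch of the train-and-serialise step with the `lightgbm` R package follows. The synthetic data, the hyperparameters, and the temporary output directory are all assumptions; the upload to MinIO is not shown.

```r
library(lightgbm)

set.seed(1)

# Synthetic stand-in for the training months: 5 numeric features, rare positives.
n <- 2000
X <- matrix(rnorm(n * 5), ncol = 5, dimnames = list(NULL, paste0("f", 1:5)))
y <- as.integer(X[, 1] + rnorm(n) > 2)

dtrain <- lgb.Dataset(data = X, label = y)

# Illustrative hyperparameters, not the package's tuned values.
params <- list(objective = "binary", learning_rate = 0.1, num_leaves = 15)
model <- lgb.train(params = params, data = dtrain, nrounds = 50, verbose = -1)

# Serialise to LightGBM's text format, using the file name from this README.
model_path <- file.path(tempdir(), "baf_lgbm_prod_v1.txt")
lgb.save(model, model_path)

# Reload and score to confirm the saved model is usable.
reloaded <- lgb.load(model_path)
preds <- predict(reloaded, X)
summary(preds)
```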
**Reporting**
All figures and tables are written to `reports/` and assembled into a Quarto RevealJS slide deck via `tar_quarto()`.
## Results
Using LightGBM's native cost-sensitive learning (`scale_pos_weight`) and leaf-wise tree growth, the production model achieves roughly 49.1% recall at a 5% false positive rate (FPR). In other words, it catches about half of all fraudulent applications while flagging at most 1 in 20 legitimate customers for manual review.
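Recall at a fixed FPR is computed by choosing the score threshold that lets through at most 5% of legitimate cases and then measuring how much fraud lies above it. The sketch below uses synthetic scores; the quantile-based thresholding is one reasonable way to do this and is not necessarily how the package computes it.

```r
set.seed(7)

# Synthetic model scores: fraud cases tend to score higher.
is_fraud <- c(rep(TRUE, 50), rep(FALSE, 950))
score <- c(rnorm(50, mean = 1.5), rnorm(950, mean = 0))

# Threshold = 95th percentile of legitimate scores (type 1 returns an
# observed value, so at most 5% of legitimate scores lie strictly above it).
thr <- quantile(score[!is_fraud], probs = 0.95, names = FALSE, type = 1)

fpr <- mean(score[!is_fraud] > thr)
recall <- mean(score[is_fraud] > thr)

cat(sprintf("threshold = %.3f, FPR = %.4f, recall = %.3f\n", thr, fpr, recall))
```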
## Clone
To replicate this pipeline locally, you will need to clone the repository and set up your MinIO environment variables.
```
git clone
```
Once your `.Renviron` is configured with your `BAF_KEY`, `BAF_SECRET`, and `BAF_ENDPOINT`, you can execute the entire DAG with `targets::tar_make()`.
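Before launching a long run it can help to confirm that the three variables are visible to R. In `.Renviron` each one goes on its own line as `NAME=value`. The helper `missing_env()` below is a hypothetical convenience, not part of the package.

```r
# Variables named in this README.
required <- c("BAF_KEY", "BAF_SECRET", "BAF_ENDPOINT")

# Return the names of any variables that are unset or empty.
missing_env <- function(vars) {
  vars[!nzchar(Sys.getenv(vars))]
}

missing <- missing_env(required)
if (length(missing) > 0) {
  message("Missing in .Renviron: ", paste(missing, collapse = ", "))
} else {
  message("MinIO credentials found; ready to run targets::tar_make()")
}
```

R reads `.Renviron` at startup, so restart the session after editing the file.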
## Acknowledgements
This project utilizes the Bank Account Fraud (BAF) dataset, originally published and presented at NeurIPS 2022. It is a massive, privacy-preserving suite of realistic tabular data designed specifically for evaluating fairness and performance in machine learning fraud detection.
## Citation