---
output: github_document
---

- [baflakehouse](#baflakehouse)
  - [About](#about)
  - [Pipeline](#pipeline)
  - [Results](#results)
  - [Clone](#clone)
  - [Acknowledgements](#acknowledgements)
  - [Citation](#citation)

# baflakehouse

## About

The baflakehouse package is an end-to-end machine learning pipeline built to detect bank account fraud. Rather than relying on static local files, it implements a Lakehouse architecture: it ingests a 1-million-row dataset, partitions it into Parquet files via Apache Arrow, stores it on a MinIO object server, and trains a production-ready LightGBM model, with every step orchestrated by the targets package.

**Significance**

Financial fraud datasets suffer from extreme class imbalance, which makes plain accuracy highly misleading. This pipeline is engineered specifically to handle that imbalance without aggressive synthetic oversampling.

## Pipeline

The pipeline is orchestrated by the `targets` package and executes as a reproducible DAG. All data is stored remotely in MinIO and accessed via Apache Arrow, so no CSVs or intermediate files are kept on local disk.

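As a rough sketch of how such a DAG might be declared, the `_targets.R` fragment below chains three steps with `tar_target()`. The target names and the helper functions (`ingest_raw()`, `clean_layer()`, `add_features()`) are hypothetical placeholders, not names taken from this package.

```r
library(targets)

# Hypothetical pipeline definition. Each command mentions the previous target
# by name, which is how targets infers the edges of the DAG.
pipeline <- list(
  tar_target(intermediate_path, ingest_raw("01_raw", "02_intermediate")),
  tar_target(primary_path, clean_layer(intermediate_path, "03_primary")),
  tar_target(feature_path, add_features(primary_path, "04_feature"))
)

# In a real _targets.R file this list would be the last expression in the script.
pipeline
```

Because each target only reruns when its upstream dependencies change, a layout like this lets a single layer be rebuilt without repeating the whole pipeline.
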
**Layer 01 → 02 | Ingest**
Raw CSVs are read from `baf-fraud/01_raw` and converted to Hive-partitioned Parquet files in `02_intermediate` using Arrow's `write_dataset()`.

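The ingest step can be sketched with a toy CSV and local temporary directories standing in for the MinIO bucket. The file name and column values are made up for illustration; against MinIO the paths would instead point at an S3-compatible filesystem (for example one created with `arrow::s3_bucket()` and an `endpoint_override`), which is not shown here.

```r
library(arrow)

# Toy stand-in for the raw layer: one small CSV in a temporary directory.
raw_dir <- file.path(tempdir(), "01_raw")
out_dir <- file.path(tempdir(), "02_intermediate")
dir.create(raw_dir, showWarnings = FALSE)
write.csv(
  data.frame(
    month = c(0L, 0L, 1L),
    income = c(0.3, 0.8, 0.5),
    fraud_bool = c(0L, 1L, 0L)
  ),
  file.path(raw_dir, "toy.csv"),
  row.names = FALSE
)

# Open the CSV files lazily and write them back as Parquet,
# partitioned Hive-style by the month column.
raw <- open_dataset(raw_dir, format = "csv")
write_dataset(raw, out_dir, format = "parquet", partitioning = "month")

# Each month ends up in its own "month=<value>" directory.
list.files(out_dir)
```

Partitioning by `month` means later steps can read a single month without scanning the rest of the data.
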
**Layer 02 → 03 | Clean**
Sentinel values (`-1`) are recoded to `NA`, the binary `fraud_bool` column is replaced by a labelled `outcome` column ("Fraud"/"Legit"), and the cleaned data is written to `03_primary` partitioned by month.

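A minimal in-memory sketch of the two cleaning rules is shown below. The function name `clean_slice()` and the choice of sentinel column are assumptions for illustration; the package's own implementation may differ.

```r
# Toy slice; the columns are illustrative, not the full BAF schema.
toy <- data.frame(
  month = c(0L, 0L, 1L),
  bank_months_count = c(12, -1, 3),
  fraud_bool = c(0L, 1L, 0L)
)

# Hypothetical helper applying both cleaning rules to one data frame.
clean_slice <- function(df, sentinel_cols, sentinel = -1) {
  # Recode the sentinel to NA; %in% keeps existing NAs out of the index.
  for (col in sentinel_cols) {
    df[[col]][df[[col]] %in% sentinel] <- NA
  }
  # Replace the 0/1 flag with a two-level factor and drop the original column.
  df$outcome <- factor(
    ifelse(df$fraud_bool == 1, "Fraud", "Legit"),
    levels = c("Fraud", "Legit")
  )
  df$fraud_bool <- NULL
  df
}

cleaned <- clean_slice(toy, sentinel_cols = "bank_months_count")
cleaned
```
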
**Layer 03 → 04 | Feature Engineering**
A missingness count feature (`n_missing`) is computed with Arrow compute functions, without loading the full dataset into memory, and written to `04_feature`.

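One way to express a per-row missingness count through Arrow's dplyr interface is sketched below. The two columns `a` and `b` are placeholders; the real feature presumably sums over many more columns.

```r
library(arrow)
library(dplyr)

# Toy Arrow table; in the pipeline this would be a dataset opened from 03_primary.
tbl <- arrow_table(
  a = c(1, NA, 3),
  b = c(NA, NA, 2)
)

# mutate() is translated into Arrow compute expressions and is only
# materialised in R when collect() is called.
with_missing <- tbl |>
  mutate(n_missing = as.integer(is.na(a)) + as.integer(is.na(b))) |>
  collect()

with_missing$n_missing
```
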
**Layer 04 → 05 | Resampling**
Five versions of each monthly slice are generated (Baseline, Undersampling, SMOTE, ADASYN, and Tomek Links) and saved to `05_model_input`.

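Of the five variants, random undersampling is the simplest to illustrate. The sketch below assumes a 1:1 class ratio after resampling, which is a choice made here for illustration and not a documented setting of this pipeline; `undersample()` is a hypothetical helper.

```r
# Keep every minority row and an equally sized random sample of majority rows.
undersample <- function(df, outcome_col = "outcome", minority = "Fraud", seed = 1) {
  set.seed(seed)
  is_minority <- df[[outcome_col]] == minority
  minority_idx <- which(is_minority)
  majority_idx <- which(!is_minority)
  n_keep <- min(length(minority_idx), length(majority_idx))
  keep_majority <- majority_idx[sample.int(length(majority_idx), n_keep)]
  df[sort(c(minority_idx, keep_majority)), , drop = FALSE]
}

toy <- data.frame(
  x = seq_len(20),
  outcome = rep(c("Fraud", "Legit"), times = c(2, 18))
)

balanced <- undersample(toy)
table(balanced$outcome)
```
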
**Imbalance Tournament**
LightGBM models are trained for every strategy in the table below (the five datasets above plus a Weighted variant) using three sliding time windows (train on months t, t+1, t+2; test on t+3). Strategies are ranked by PR-AUC and compared against the Standard strategy (the unresampled Baseline) with a paired t-test.

| Strategy | PR-AUC | Avg. Train Time (s) | Significant vs. Standard |
|---|---|---|---|
| Standard | 0.1650 | 2.19 | n/a |
| ADASYN | 0.1629 | 3.87 | No (p = 0.37) |
| SMOTE | 0.1617 | 3.79 | No (p = 0.15) |
| Weighted | 0.1577 | 2.18 | No (p = 0.15) |
| Tomek | 0.1483 | 2.16 | **Yes (p = 0.009)** |
| Undersampling | 0.1394 | 0.92 | **Yes (p = 0.029)** |

The Standard baseline achieves the highest PR-AUC. SMOTE and ADASYN offer no statistically significant gain while adding roughly 75% to training time, and the Weighted variant is likewise statistically indistinguishable from the baseline. Tomek Links and Undersampling significantly *hurt* performance and are discarded.

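The sliding-window scheme used in the tournament can be sketched as a small helper that returns the train and test months for each window. The function `make_windows()` is hypothetical, and the example assumes months 0 to 5 are available to the tournament, which is what yields three windows.

```r
# Build sliding windows: train on `train_size` consecutive months,
# test on the month immediately after them.
make_windows <- function(months, train_size = 3) {
  n_windows <- max(0, length(months) - train_size)
  lapply(seq_len(n_windows), function(i) {
    list(
      train = months[i:(i + train_size - 1)],
      test = months[i + train_size]
    )
  })
}

# Assumption: the tournament draws on months 0 to 5.
windows <- make_windows(0:5)
str(windows)
```

Evaluating every strategy on the same windows is what makes a paired test appropriate: each strategy's per-window PR-AUC is compared with the baseline's PR-AUC on that same window.
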
**Layer 05 → 06 | Production**
The winning Standard strategy is retrained on months 0–5, evaluated on the held-out months 6–7, serialised to `baf_lgbm_prod_v1.txt`, and uploaded to `baf-fraud/06_models` in MinIO.

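A minimal sketch of the train, evaluate, and serialise steps with the `lightgbm` R package follows. The synthetic data, the parameter values, and the 400/100 split are assumptions for illustration only; they are not the tuned settings of the production model, and the upload to MinIO is not shown.

```r
library(lightgbm)

# Synthetic stand-in data: 500 rows, 4 numeric features, imbalanced 0/1 label.
set.seed(42)
n <- 500
x <- matrix(rnorm(n * 4), ncol = 4)
y <- as.integer(x[, 1] + rnorm(n) > 1.5)

# The first 400 rows play the role of the training months,
# the last 100 rows the held-out months.
train_idx <- seq_len(400)
dtrain <- lgb.Dataset(data = x[train_idx, ], label = y[train_idx])

# Illustrative parameters, not the tuned production values.
params <- list(
  objective = "binary",
  metric = "average_precision",
  verbose = -1
)
model <- lgb.train(params = params, data = dtrain, nrounds = 30)

# Serialise the booster to a text file, mirroring the artefact name above.
model_path <- file.path(tempdir(), "baf_lgbm_prod_v1.txt")
lgb.save(model, model_path)

# Predicted fraud probabilities for the held-out rows.
scores <- predict(model, x[-train_idx, ])
```
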
**Reporting**
All figures and tables are written to `reports/` and assembled into a Quarto RevealJS slide deck via `tar_quarto()`.

## Results

By combining LightGBM's native cost-sensitive learning (`scale_pos_weight`) with leaf-wise tree growth, the production model achieves approximately 49.1% recall at a fixed 5% false positive rate (FPR). In practice, it catches roughly half of the fraudulent applications while flagging only about 1 in 20 legitimate customers for manual review.

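For readers unfamiliar with the metric, recall at a fixed FPR can be computed as sketched below: scan the candidate thresholds, keep those whose false positive rate stays within the budget, and report the best recall among them. The helper `recall_at_fpr()` and the toy scores are illustrative and are not taken from this package.

```r
# Highest recall (true positive rate) achievable while the
# false positive rate stays at or below max_fpr.
recall_at_fpr <- function(scores, is_fraud, max_fpr = 0.05) {
  thresholds <- sort(unique(scores), decreasing = TRUE)
  fpr <- vapply(thresholds, function(t) mean(scores[!is_fraud] >= t), numeric(1))
  tpr <- vapply(thresholds, function(t) mean(scores[is_fraud] >= t), numeric(1))
  within_budget <- fpr <= max_fpr
  if (!any(within_budget)) {
    return(0)
  }
  max(tpr[within_budget])
}

# Toy example: 4 fraud cases and 10 legitimate cases with made-up scores.
fraud_scores <- c(0.90, 0.80, 0.50, 0.05)
legit_scores <- c(0.85, seq(0.01, 0.09, by = 0.01))
scores <- c(fraud_scores, legit_scores)
is_fraud <- rep(c(TRUE, FALSE), times = c(4, 10))

toy_recall <- recall_at_fpr(scores, is_fraud, max_fpr = 0.15)
toy_recall
```

With a 15% budget on 10 legitimate cases, one false positive is allowed, so the threshold can drop to 0.50 and three of the four fraud cases are caught.
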
## Clone

To replicate this pipeline locally, you will need to clone the repository and set up your MinIO environment variables.

```
git clone
```

Once your `.Renviron` defines `BAF_KEY`, `BAF_SECRET`, and `BAF_ENDPOINT`, you can execute the entire DAG.

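A pipeline built on targets is normally run with `targets::tar_make()`. The sketch below first checks that the three environment variables are set and only then starts the run; it assumes the default `_targets.R` script sits in the project root, which this README does not state explicitly.

```r
# Check that the MinIO credentials are available before running anything.
required <- c("BAF_KEY", "BAF_SECRET", "BAF_ENDPOINT")
missing_vars <- required[!nzchar(Sys.getenv(required))]

if (length(missing_vars) > 0) {
  message("Set these in .Renviron first: ", paste(missing_vars, collapse = ", "))
} else if (file.exists("_targets.R")) {
  # Assumption: the pipeline uses the default targets script location.
  targets::tar_make()
} else {
  message("No _targets.R found in the current working directory.")
}
```
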
## Acknowledgements

This project uses the Bank Account Fraud (BAF) dataset, presented at NeurIPS 2022. It is a large-scale, privacy-preserving suite of realistic tabular datasets designed for evaluating the fairness and performance of machine learning fraud detection models.

## Citation