Initial commit: BAF Lakehouse fraud detection pipeline
End-to-end LightGBM fraud detection pipeline built as an R package, orchestrated by targets with data stored in MinIO via Apache Arrow. Includes 6-layer Lakehouse architecture, class imbalance tournament, formally tuned hyperparameters (PR-AUC 0.198), and Quarto RevealJS slides.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
76
README.md
Normal file
@@ -0,0 +1,76 @@
---
output: github_document
---

- [baflakehouse](#baflakehouse)
- [About](#about)
- [Pipeline](#pipeline)
- [Results](#results)
- [Clone](#clone)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)

# baflakehouse

## About

The baflakehouse package is an end-to-end machine learning pipeline for detecting bank account fraud. Rather than relying on static local files, it implements a Lakehouse architecture: it ingests a 1-million-row dataset, partitions it into Parquet files via Apache Arrow, stores them on a MinIO object server, and trains a production LightGBM model, with every step orchestrated by the `targets` package.

### Significance

Financial fraud datasets suffer from extreme class imbalance, which makes accuracy a misleading metric: a model that labels every application as legitimate still scores near-perfect accuracy. This pipeline is built to handle that imbalance without relying on aggressive synthetic oversampling.

## Pipeline

The pipeline is orchestrated by the `targets` package and executes as a reproducible DAG. All data is stored remotely in MinIO and accessed via Apache Arrow, with no local CSVs or intermediate files on disk.

**Layer 01 → 02 | Ingest**
Raw CSVs are read from `baf-fraud/01_raw` and converted to Hive-partitioned Parquet files in `02_intermediate` using Arrow's `write_dataset()`.
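
The ingest step can be sketched roughly as follows. This is a minimal illustration rather than the package's actual code: it writes a tiny made-up data frame to a temporary local directory instead of MinIO, and both the column names and the choice of `month` as the partition column are assumptions.

```r
library(arrow)

# Toy stand-in for the raw CSV contents; column names are assumptions.
raw <- data.frame(
  month      = c(0L, 0L, 1L, 1L),
  fraud_bool = c(0L, 1L, 0L, 0L),
  income     = c(0.3, 0.9, 0.1, 0.5)
)

# Local temporary directory standing in for the `02_intermediate` layer.
out_dir <- file.path(tempdir(), "02_intermediate")
unlink(out_dir, recursive = TRUE)  # start from a clean directory

# Hive-style partitioning writes one `month=<value>` directory per month.
write_dataset(raw, out_dir, format = "parquet", partitioning = "month")

# Reopen lazily; the partition column is recovered from the directory names.
ds <- open_dataset(out_dir)
partition_dirs <- sort(list.files(out_dir))
```

Against MinIO, `out_dir` would instead be a path on an S3 filesystem, for example one created with `arrow::s3_bucket()` and its `endpoint_override` argument.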

**Layer 02 → 03 | Clean**
Sentinel values (`-1`) are recoded to `NA`, the binary outcome is relabelled from `fraud_bool` to `outcome` ("Fraud"/"Legit"), and the cleaned data is written to `03_primary` partitioned by month.
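
A small in-memory sketch of the two cleaning rules, assuming a toy data frame. The README does not say which columns carry the `-1` sentinel, so the two columns listed in `sentinel_cols` are illustrative choices only.

```r
# Toy slice of the intermediate layer; which columns use the -1 sentinel
# is an assumption made for this example.
df <- data.frame(
  fraud_bool                = c(0L, 1L, 0L),
  prev_address_months_count = c(-1, 12, 30),
  bank_months_count         = c(5, -1, -1),
  month                     = c(0L, 0L, 1L)
)

sentinel_cols <- c("prev_address_months_count", "bank_months_count")

# Recode the -1 sentinel to NA only in the columns known to use it.
for (col in sentinel_cols) {
  df[[col]][df[[col]] == -1] <- NA
}

# Relabel the 0/1 target as a factor and drop the original column.
df$outcome <- factor(
  ifelse(df$fraud_bool == 1L, "Fraud", "Legit"),
  levels = c("Fraud", "Legit")
)
df$fraud_bool <- NULL
```

Restricting the recode to an explicit column list avoids turning a legitimate `-1` in some other numeric column into a missing value.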

**Layer 03 → 04 | Feature Engineering**
A missingness count feature (`n_missing`) is computed out of core (without loading the full dataset into memory) via Arrow compute and written to `04_feature`.
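
The logic behind `n_missing` is a per-row count of `NA` values. The sketch below shows it in plain in-memory base R on a toy data frame; it is not the Arrow compute expression the pipeline uses, and the column names are made up.

```r
# Toy feature table; `outcome` is the target and is excluded from the count.
df <- data.frame(
  a       = c(1, NA, 3),
  b       = c(NA, NA, 2),
  c       = c("x", "y", NA),
  outcome = c("Legit", "Fraud", "Legit")
)

feature_cols <- setdiff(names(df), "outcome")

# is.na() yields a logical matrix; rowSums() counts the TRUEs in each row.
df$n_missing <- as.integer(rowSums(is.na(df[, feature_cols])))
```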

**Layer 04 → 05 | Resampling**
Five versions of each monthly slice are generated (Baseline, Undersampling, SMOTE, ADASYN, and Tomek Links) and saved to `05_model_input`.
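
As an illustration of the simplest of these strategies, here is a minimal random undersampling sketch in base R. The 1:1 target ratio, the toy data, and the helper name `undersample` are assumptions; the package may implement this differently.

```r
set.seed(42)

# Toy monthly slice with a 5% fraud rate.
slice <- data.frame(
  x       = rnorm(200),
  outcome = factor(rep(c("Fraud", "Legit"), times = c(10, 190)),
                   levels = c("Fraud", "Legit"))
)

# Random undersampling: keep every fraud row and an equally sized random
# sample of legitimate rows.
undersample <- function(data, target = "outcome", minority = "Fraud") {
  is_minority   <- data[[target]] == minority
  minority_idx  <- which(is_minority)
  majority_idx  <- which(!is_minority)
  keep_majority <- sample(majority_idx, size = length(minority_idx))
  data[sort(c(minority_idx, keep_majority)), , drop = FALSE]
}

balanced     <- undersample(slice)
class_counts <- table(balanced$outcome)
```

Sorting the retained indices keeps the rows in their original order, which matters if the slice is ordered by time.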

**Imbalance Tournament**
LightGBM models are trained for each strategy in the table below using three sliding time windows (train on months t, t+1, t+2; test on t+3). Strategies are ranked by PR-AUC, and each is compared against the Standard baseline with a paired t-test.

| Strategy | PR-AUC | Avg. Train Time (s) | Sig. vs Standard |
|---|---|---|---|
| Standard | 0.1650 | 2.19 | n/a |
| ADASYN | 0.1629 | 3.87 | No (p = 0.37) |
| SMOTE | 0.1617 | 3.79 | No (p = 0.15) |
| Weighted | 0.1577 | 2.18 | No (p = 0.15) |
| Tomek | 0.1483 | 2.16 | **Yes (p = 0.009)** |
| Undersampling | 0.1394 | 0.92 | **Yes (p = 0.029)** |

The Standard baseline achieves the highest PR-AUC. ADASYN, SMOTE, and Weighted are not significantly different from it, and SMOTE and ADASYN take roughly 1.7 to 1.8 times as long to train. Tomek Links and Undersampling significantly *hurt* performance and are discarded.
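
The windowing and the significance check can be sketched as below. The month range `0:5`, the 0.05 threshold, and the helper name `make_windows` are assumptions, and the per-window scores are made-up numbers for illustration, not output of the pipeline.

```r
# Build sliding windows: train on `train_len` consecutive months, test on
# the month that follows.
make_windows <- function(months, train_len = 3L) {
  starts <- seq_len(length(months) - train_len)
  lapply(starts, function(i) {
    list(
      train = months[i:(i + train_len - 1L)],
      test  = months[i + train_len]
    )
  })
}

windows <- make_windows(0:5)  # assumed month range; yields three windows

# Made-up per-window PR-AUC scores, for illustration only.
standard  <- c(0.170, 0.160, 0.165)
candidate <- c(0.150, 0.145, 0.150)

# Paired t-test: each window contributes one (candidate, standard) pair.
test_result <- t.test(candidate, standard, paired = TRUE)
significant <- test_result$p.value < 0.05  # assumed significance level
```

Pairing by window removes the month-to-month variation that both strategies share, so the test only looks at the per-window differences. With three windows the test has just two degrees of freedom, so it can detect only fairly consistent gaps.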

**Layer 05 → 06 | Production**
The winning Standard strategy is retrained on months 0–5, evaluated on the held-out months 6–7, serialised to `baf_lgbm_prod_v1.txt`, and uploaded to `baf-fraud/06_models` in MinIO.
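
A rough sketch of the train, score, and serialise steps with the `lightgbm` R package. Everything here is illustrative: the data is synthetic, the row split merely stands in for the month-based split, and the parameter values are placeholders rather than the tuned hyperparameters.

```r
library(lightgbm)

set.seed(1)

# Synthetic stand-in for the model-input layer: 500 rows, 3 numeric features.
n <- 500
X <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("f1", "f2", "f3")))
y <- as.integer(X[, 1] + rnorm(n) > 1.5)  # imbalanced binary label

# Months 0-5 would form the training set and 6-7 the hold-out; here the
# synthetic rows are simply split by position.
train_idx <- 1:400
dtrain <- lgb.Dataset(X[train_idx, ], label = y[train_idx])

params <- list(
  objective     = "binary",
  learning_rate = 0.1,  # placeholder values, not the tuned hyperparameters
  num_leaves    = 15
)
model <- lgb.train(params = params, data = dtrain, nrounds = 20, verbose = -1)

# Score the hold-out rows and serialise the booster as a text file.
holdout_scores <- predict(model, X[-train_idx, ])
model_path <- file.path(tempdir(), "baf_lgbm_prod_v1.txt")
lgb.save(model, model_path)
```

The upload to `baf-fraud/06_models` is left out because it needs a running MinIO server.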

**Reporting**
All figures and tables are written to `reports/` and assembled into a Quarto RevealJS slide deck via `tar_quarto()`.

## Results

By leveraging LightGBM's native cost-sensitive learning (`scale_pos_weight`) and leaf-wise tree growth, the production model achieves about 49.1% recall at a fixed 5% false positive rate (FPR). In other words, it detects roughly half of the fraudulent applications while flagging 5% of legitimate customers for manual review.
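
Recall at a fixed FPR can be computed by choosing the score threshold from the legitimate cases alone and then measuring how many fraud cases exceed it. The helper `recall_at_fpr` and the toy scores below are illustrative assumptions, not the package's evaluation code.

```r
# Recall (true positive rate) at a fixed false positive rate.
recall_at_fpr <- function(scores, is_fraud, max_fpr = 0.05) {
  legit_scores <- scores[!is_fraud]
  # Type-1 quantile: the smallest observed score with at least (1 - max_fpr)
  # of legitimate cases at or below it, so at most max_fpr lie strictly above.
  threshold <- quantile(legit_scores, probs = 1 - max_fpr,
                        type = 1, names = FALSE)
  mean(scores[is_fraud] > threshold)
}

# Toy scores: 100 legitimate cases followed by 4 fraud cases.
scores   <- c((1:100) / 1000, 0.2, 0.3, 0.05, 0.5)
is_fraud <- c(rep(FALSE, 100), rep(TRUE, 4))

recall <- recall_at_fpr(scores, is_fraud, max_fpr = 0.05)
```

Fixing the FPR first ties the metric to an operational budget, namely how many legitimate customers the review team can afford to check.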
## Clone

To replicate this pipeline locally, clone the repository and set up your MinIO environment variables.

```
git clone
```

Once your `.Renviron` is configured with `BAF_KEY`, `BAF_SECRET`, and `BAF_ENDPOINT`, you can execute the entire DAG with `targets::tar_make()`.
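
Before running the pipeline, it can help to confirm that the credentials are visible to R. A small sketch, assuming the pipeline script is the conventional `_targets.R` in the project root:

```r
required <- c("BAF_KEY", "BAF_SECRET", "BAF_ENDPOINT")

# Sys.getenv() returns "" for variables that are not set.
missing_vars <- required[unname(Sys.getenv(required)) == ""]

if (length(missing_vars) > 0) {
  message("Set these variables in .Renviron first: ",
          paste(missing_vars, collapse = ", "))
} else if (file.exists("_targets.R")) {
  # Builds every outdated target in dependency order.
  targets::tar_make()
}
```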
## Acknowledgements

This project uses the Bank Account Fraud (BAF) dataset, originally published and presented at NeurIPS 2022. It is a large-scale, privacy-preserving suite of realistic tabular data designed for evaluating fairness and performance in machine learning fraud detection.

## Citation