Initial commit: BAF Lakehouse fraud detection pipeline

End-to-end LightGBM fraud detection pipeline built as an R package, orchestrated by targets with data stored in MinIO via Apache Arrow. Includes 6-layer Lakehouse architecture, class imbalance tournament, formally tuned hyperparameters (PR-AUC 0.198), and Quarto RevealJS slides.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 21:19:09 -05:00
commit 33d0fc31c7
56 changed files with 15596 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,76 @@
+---
+output: github_document
+---
+
+- [baflakehouse](#baflakehouse)
+  - [About](#about)
+  - [Significance](#significance)
+  - [Pipeline](#pipeline)
+  - [Results](#results)
+  - [Clone](#clone)
+  - [Acknowledgements](#acknowledgements)
+  - [Citation](#citation)
+
+# baflakehouse
+
+## About
+
+The baflakehouse package is an end-to-end machine learning pipeline for detecting bank account fraud. Rather than relying on static local files, it implements a Lakehouse architecture: it ingests a 1-million-row dataset, partitions it into Parquet files via Apache Arrow, stores it on a MinIO object server, and trains a production LightGBM model, with every step orchestrated by the targets package.
+
+## Significance
+
+Financial fraud datasets suffer from extreme class imbalance, which makes plain accuracy misleading: a model that never flags fraud can still score near-perfect accuracy. This pipeline is engineered to handle that imbalance without aggressive synthetic oversampling.
+
+## Pipeline
+
+The pipeline is orchestrated by the `targets` package and executes as a reproducible DAG. All data is stored remotely in MinIO and accessed via Apache Arrow — no local CSVs or intermediate files on disk.
+
+**Layer 01 → 02 | Ingest**
+Raw CSVs are read from `baf-fraud/01_raw` and converted to Hive-partitioned Parquet files in `02_intermediate` using Arrow's `write_dataset()`.
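+
+A minimal sketch of this conversion, run against temporary local folders rather than MinIO. The toy rows and the choice of `month` as the partition column for this layer are assumptions made for illustration; against MinIO, the local paths would presumably be replaced by paths on an `arrow::s3_bucket()` filesystem.
+
+```r
+library(arrow)
+
+# Toy stand-in for the raw layer; the real pipeline reads from MinIO instead.
+raw_dir <- file.path(tempdir(), "01_raw")
+out_dir <- file.path(tempdir(), "02_intermediate")
+dir.create(raw_dir, showWarnings = FALSE)
+write.csv(
+  data.frame(month = c(0L, 0L, 1L), fraud_bool = c(0L, 1L, 0L)),
+  file.path(raw_dir, "toy.csv"),
+  row.names = FALSE
+)
+
+# Scan the CSV files lazily and write one Parquet directory per month value.
+raw <- open_dataset(raw_dir, format = "csv")
+write_dataset(raw, out_dir, format = "parquet", partitioning = "month")
+
+# Hive-style layout: directories are named month=0, month=1, ...
+list.files(out_dir, recursive = TRUE)
+```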
+
+**Layer 02 → 03 | Clean**
+Sentinel values (`-1`) are recoded to `NA`, the binary outcome is relabelled from `fraud_bool` to `outcome` ("Fraud"/"Legit"), and the cleaned data is written to `03_primary` partitioned by month.
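+
+The cleaning rules can be sketched in base R on an in-memory data frame. `clean_slice()` is a hypothetical helper, not the package's function, and `bank_months_count` is used only as an example of a column carrying the `-1` sentinel.
+
+```r
+# Hypothetical helper: recode -1 sentinels to NA and relabel the outcome.
+clean_slice <- function(df, sentinel_cols) {
+  for (col in sentinel_cols) {
+    # which() drops comparisons that are already NA
+    df[[col]][which(df[[col]] == -1)] <- NA
+  }
+  df$outcome <- factor(
+    ifelse(df$fraud_bool == 1, "Fraud", "Legit"),
+    levels = c("Fraud", "Legit")
+  )
+  df$fraud_bool <- NULL
+  df
+}
+
+# Toy rows for illustration only
+toy <- data.frame(
+  fraud_bool = c(0L, 1L, 0L),
+  bank_months_count = c(12, -1, 3)
+)
+clean_slice(toy, sentinel_cols = "bank_months_count")
+```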
+
+**Layer 03 → 04 | Feature Engineering**
+A missingness count feature (`n_missing`), the number of `NA` values in each row, is computed with Arrow compute functions without loading the full dataset into R memory, and the result is written to `04_feature`.
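+
+For intuition, the same feature on a small in-memory data frame looks like the sketch below. This is a base R stand-in, not the Arrow version the pipeline uses, and `add_n_missing()` is a hypothetical name.
+
+```r
+# In-memory stand-in for the Arrow computation: count NA cells per row.
+add_n_missing <- function(df) {
+  df$n_missing <- as.integer(rowSums(is.na(df)))
+  df
+}
+
+# Toy rows for illustration only
+toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 2))
+add_n_missing(toy)$n_missing
+# 1 2 0
+```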
+
+**Layer 04 → 05 | Resampling**
+Five versions of each monthly slice are generated (Baseline, Undersampling, SMOTE, ADASYN, and Tomek Links) and saved to `05_model_input`. The Baseline version is the one reported as Standard in the tournament results.
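+
+Of the five versions, random undersampling is the simplest to sketch. The helper below is hypothetical and assumes a 1:1 class ratio after sampling, which may differ from the ratio the pipeline uses.
+
+```r
+# Hypothetical helper: keep all rows of the smallest class and an
+# equally sized random sample of every other class.
+undersample <- function(df, outcome_col = "outcome", seed = 42) {
+  set.seed(seed)
+  idx_by_class <- split(seq_len(nrow(df)), df[[outcome_col]], drop = TRUE)
+  n_min <- min(lengths(idx_by_class))
+  keep <- unlist(lapply(idx_by_class, function(idx) {
+    idx[sample.int(length(idx), n_min)]
+  }))
+  df[sort(keep), , drop = FALSE]
+}
+
+# Toy rows for illustration only: 2 Fraud, 6 Legit
+toy <- data.frame(
+  x = 1:8,
+  outcome = c("Fraud", "Legit", "Legit", "Fraud", "Legit", "Legit", "Legit", "Legit")
+)
+table(undersample(toy)$outcome)
+```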
+
+**Imbalance Tournament**
+LightGBM models are trained for each of the six strategies in the table below using three sliding time windows (train on months t, t+1, t+2; test on month t+3). Strategies are ranked by PR-AUC, and each one is compared against the Standard baseline with a paired t-test.
+
+| Strategy | PR-AUC | Avg Train Time (s) | Sig. vs Standard |
+|---|---|---|---|
+| Standard | 0.1650 | 2.19 | — |
+| ADASYN | 0.1629 | 3.87 | No (p = 0.37) |
+| SMOTE | 0.1617 | 3.79 | No (p = 0.15) |
+| Weighted | 0.1577 | 2.18 | No (p = 0.15) |
+| Tomek | 0.1483 | 2.16 | **Yes (p = 0.009)** |
+| Undersampling | 0.1394 | 0.92 | **Yes (p = 0.029)** |
+
+The Standard baseline has the highest PR-AUC. SMOTE and ADASYN do not differ from it significantly, yet they nearly double training time (about 3.8 s versus 2.19 s). Weighted is also statistically indistinguishable from Standard. Tomek Links and Undersampling significantly *hurt* performance and are discarded.
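+
+The tournament's window scheme and significance test can be sketched as follows. Both helpers are hypothetical, and the use of months 0 to 5 is an assumption, chosen because it yields exactly three windows.
+
+```r
+# Hypothetical helper: sliding windows of `train_size` consecutive months,
+# each tested on the month that follows.
+make_windows <- function(months, train_size = 3) {
+  stopifnot(length(months) > train_size)
+  n_windows <- length(months) - train_size
+  lapply(seq_len(n_windows), function(i) {
+    list(
+      train = months[i:(i + train_size - 1)],
+      test  = months[i + train_size]
+    )
+  })
+}
+
+# Assumption: months 0-5 feed the tournament.
+windows <- make_windows(0:5)
+str(windows)
+
+# Paired t-test on per-window PR-AUC: each window contributes one
+# (Standard, candidate) pair, so both vectors must be in window order.
+compare_to_standard <- function(standard_prauc, candidate_prauc) {
+  t.test(candidate_prauc, standard_prauc, paired = TRUE)$p.value
+}
+```
+
+With only three windows the test has two degrees of freedom, so the p-values in the table are best read as rough guidance.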
+
+**Layer 05 → 06 | Production**
+The winning Standard strategy is retrained on months 0–5, evaluated on the held-out months 6–7, serialised to `baf_lgbm_prod_v1.txt`, and uploaded to `baf-fraud/06_models` in MinIO.
+
+**Reporting**
+All figures and tables are written to `reports/` and assembled into a Quarto RevealJS slide deck via `tar_quarto()`.
+
+## Results
+
+By leveraging LightGBM's native cost-sensitive learning (`scale_pos_weight`) and leaf-wise tree growth, the production model achieves ~49.1% recall at a fixed 5% false positive rate (FPR). In other words, it catches roughly half of the fraudulent applications while flagging at most 5% of legitimate customers for manual review.
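+
+Recall at a fixed FPR can be computed from predicted scores as in the sketch below. `recall_at_fpr()` is a hypothetical helper and the scores are toy values, not pipeline output.
+
+```r
+# Hypothetical helper: recall at the score threshold that lets through
+# at most `max_fpr` of the legitimate cases.
+recall_at_fpr <- function(scores, is_fraud, max_fpr = 0.05) {
+  # type = 1 picks an observed score, so at most `max_fpr` of the
+  # legitimate scores lie strictly above the threshold
+  threshold <- quantile(scores[!is_fraud], probs = 1 - max_fpr,
+                        type = 1, names = FALSE)
+  mean(scores[is_fraud] > threshold)
+}
+
+# Toy scores for illustration only: 20 legitimate cases, 4 fraud cases
+scores   <- c(seq(0.01, 0.20, by = 0.01), 0.05, 0.30, 0.50, 0.90)
+is_fraud <- c(rep(FALSE, 20), rep(TRUE, 4))
+recall_at_fpr(scores, is_fraud)
+```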
+
+## Clone
+
+To replicate this pipeline locally, you will need to clone the repository and set up your MinIO environment variables.
+
+```
+git clone
+```
+
+Once your `.Renviron` is configured with `BAF_KEY`, `BAF_SECRET`, and `BAF_ENDPOINT`, you can execute the entire DAG through `targets`.
+
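+Assuming the repository follows the usual `targets` layout, with a `_targets.R` script at its root, a run from an R session started in the repository would look roughly like this. The environment check is an added convenience, not part of the package.
+
+```r
+# Stop early if the MinIO credentials from .Renviron are not visible.
+vars <- c("BAF_KEY", "BAF_SECRET", "BAF_ENDPOINT")
+missing_vars <- vars[!nzchar(Sys.getenv(vars))]
+if (length(missing_vars) > 0) {
+  stop("Missing environment variables: ", paste(missing_vars, collapse = ", "))
+}
+
+# Build every outdated target in the DAG.
+targets::tar_make()
+```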
+
+## Acknowledgements
+
+This project uses the Bank Account Fraud (BAF) dataset, published and presented at NeurIPS 2022. It is a privacy-preserving suite of realistic tabular datasets designed for evaluating the fairness and performance of machine learning fraud detection models.
+
+## Citation