baflakehouse
About
The baflakehouse package is an end-to-end machine learning pipeline built to detect credit card fraud. Rather than relying on static local files, it implements a modern Lakehouse architecture. It ingests a massive 1-million-row dataset, partitions it into Parquet files via Apache Arrow, stores it on a MinIO object server, and trains a production-ready LightGBM model orchestrated entirely by the targets package.
Significance
Financial fraud datasets suffer from extreme class imbalance, making traditional accuracy metrics highly misleading. This pipeline is engineered specifically to handle that imbalance without aggressive synthetic oversampling.
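To make the point about accuracy concrete, here is a small worked example in plain Python (used for illustration only; the package itself is written in R). The class counts are hypothetical and are not the BAF fraud rate.

```python
# Hypothetical class balance chosen for illustration (not taken from the BAF data).
n_legit, n_fraud = 9_900, 100

labels = [0] * n_legit + [1] * n_fraud
# A degenerate "model" that never flags anything as fraud.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

accuracy = correct / len(labels)  # share of all rows classified correctly
recall = caught / n_fraud         # share of fraud rows that were detected

print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
# prints "accuracy = 99.00%, recall = 0.00%"
```

A model that detects no fraud at all still scores 99% accuracy here, which is why the pipeline ranks models by PR-AUC and recall at a fixed false positive rate instead.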
Pipeline
The pipeline is orchestrated by the targets package and executes as a reproducible DAG. All data is stored remotely in MinIO and accessed via Apache Arrow — no local CSVs or intermediate files on disk.
Layer 01 → 02 | Ingest
Raw CSVs are read from baf-fraud/01_raw and converted to Hive-partitioned Parquet files in 02_intermediate using Arrow's write_dataset().
Layer 02 → 03 | Clean
Sentinel values (-1) are recoded to NA, the binary outcome is relabelled from fraud_bool to outcome ("Fraud"/"Legit"), and the cleaned data is written to 03_primary partitioned by month.
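The cleaning rules above can be sketched row by row in plain Python. This is an illustration of the logic only, not the package's Arrow-based R code; the function name clean_row, the sentinel_columns argument, and the feature columns in the example are assumptions made for the sketch.

```python
SENTINEL = -1  # value the README says is recoded to NA


def clean_row(row, sentinel_columns):
    """Recode sentinel values to None and relabel fraud_bool as outcome.

    `sentinel_columns` (hypothetical argument) lists the columns in which
    -1 means "missing" rather than a real measurement.
    """
    cleaned = {}
    for name, value in row.items():
        if name == "fraud_bool":
            continue  # replaced by the "outcome" label below
        if name in sentinel_columns and value == SENTINEL:
            cleaned[name] = None  # Python stand-in for NA
        else:
            cleaned[name] = value
    cleaned["outcome"] = "Fraud" if row["fraud_bool"] == 1 else "Legit"
    return cleaned


# Illustrative row; the feature names are placeholders.
raw = {"fraud_bool": 1, "prev_address_months_count": -1, "income": 0.3, "month": 2}
print(clean_row(raw, sentinel_columns={"prev_address_months_count"}))
```

Restricting the recode to named columns avoids turning a legitimate -1 in some other column into a missing value.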
Layer 03 → 04 | Feature Engineering
A missingness count feature (n_missing) is computed out-of-memory via Arrow compute and written to 04_feature.
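As a minimal sketch of what n_missing represents, the count can be written per row in plain Python. The real pipeline computes it out-of-memory with Arrow; this in-memory version, the helper name add_n_missing, and the example columns are assumptions for illustration.

```python
def add_n_missing(row, feature_names):
    """Return a copy of `row` with n_missing = number of None-valued features."""
    n_missing = sum(row[name] is None for name in feature_names)
    return {**row, "n_missing": n_missing}


# Illustrative row after cleaning; column names are placeholders.
row = {"income": 0.3, "prev_address_months_count": None, "bank_months_count": None}
print(add_n_missing(row, feature_names=list(row)))
```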
Layer 04 → 05 | Resampling
Five versions of each monthly slice are generated and saved to 05_model_input: Baseline (reported as Standard in the tournament results), Undersampling, SMOTE, ADASYN, and Tomek Links.
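Of the strategies listed, random undersampling is the simplest to show. The sketch below is plain Python for illustration, not the package's R implementation; the 1:1 target ratio, the fixed seed, and the function name undersample are assumptions, since the README does not state how the slices are balanced.

```python
import random


def undersample(rows, label_key="outcome", minority="Fraud", seed=42):
    """Randomly drop majority-class rows until both classes are the same size.

    The 1:1 target ratio is an assumption made for this sketch.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    keep = min(len(minority_rows), len(majority_rows))
    balanced = minority_rows + rng.sample(majority_rows, keep)
    rng.shuffle(balanced)
    return balanced


rows = [{"id": i, "outcome": "Legit"} for i in range(10)]
rows += [{"id": 100 + i, "outcome": "Fraud"} for i in range(2)]
balanced = undersample(rows)
print(len(balanced))  # 4 rows: 2 Fraud + 2 Legit
```

Undersampling discards most of the legitimate rows, which helps explain both its short training time and its lower PR-AUC in the tournament.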
Imbalance Tournament
LightGBM models are trained for every strategy in the table below using three sliding time windows (train on months t, t+1, t+2; test on month t+3). Strategies are ranked by PR-AUC, and each one is compared with the Standard baseline using a paired t-test.
| Strategy | PR-AUC | Avg Train Time (s) | Sig. vs Standard |
|---|---|---|---|
| Standard | 0.1650 | 2.19 | — |
| ADASYN | 0.1629 | 3.87 | No (p = 0.37) |
| SMOTE | 0.1617 | 3.79 | No (p = 0.15) |
| Weighted | 0.1577 | 2.18 | No (p = 0.15) |
| Tomek | 0.1483 | 2.16 | Yes (p = 0.009) |
| Undersampling | 0.1394 | 0.92 | Yes (p = 0.029) |
The Standard baseline achieves the highest PR-AUC. ADASYN, SMOTE, and Weighted score slightly lower, but the differences are not statistically significant; SMOTE and ADASYN also take about 1.7 to 1.8 times as long to train. Tomek Links and Undersampling significantly hurt performance and are discarded.
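The tournament mechanics can be sketched in plain Python (illustration only; the package is R). The sketch assumes the windows slide over months 0 to 5 and that the paired t-test uses one PR-AUC per window, which gives 2 degrees of freedom; neither detail is stated in this README. The scores in the example are made up and are not the project's results.

```python
import math
from statistics import mean, stdev


def sliding_windows(months, train_size=3):
    """Pair each run of `train_size` consecutive months with the month after it."""
    return [
        (months[i:i + train_size], months[i + train_size])
        for i in range(len(months) - train_size)
    ]


def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-window score pairs.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1


def two_sided_p_df2(t):
    """Two-sided p-value for Student's t with exactly 2 degrees of freedom.

    Uses the closed-form CDF for df = 2: F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)).
    """
    return 1 - abs(t) / math.sqrt(t * t + 2)


print(sliding_windows([0, 1, 2, 3, 4, 5]))
# [([0, 1, 2], 3), ([1, 2, 3], 4), ([2, 3, 4], 5)]

# Made-up per-window PR-AUC values, for illustration only.
baseline = [0.170, 0.160, 0.165]
challenger = [0.150, 0.145, 0.148]
t, df = paired_t(baseline, challenger)
print(df, two_sided_p_df2(t) < 0.05)
```

With only three windows the test has little power, so a non-significant result means the strategies could not be told apart, not that they are equivalent. For other window counts, a statistics library's t distribution would replace the df = 2 shortcut.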
Layer 05 → 06 | Production
The winning Standard strategy is retrained on months 0–5, evaluated on the held-out months 6–7, serialised to baf_lgbm_prod_v1.txt, and uploaded to baf-fraud/06_models in MinIO.
Reporting
All figures and tables are written to reports/ and assembled into a Quarto RevealJS slide deck via tar_quarto().
Results
By leveraging LightGBM's native cost-sensitive learning (scale_pos_weight) and leaf-wise tree growth, the production model achieves ~49.1% recall at a fixed 5% false positive rate (FPR). In other words, it catches roughly half of the fraudulent applications while flagging 5% of legitimate customers for manual review.
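Recall at a fixed FPR is obtained by choosing the score threshold that lets through at most 5% of legitimate cases and then measuring how much fraud scores above it. The plain-Python sketch below illustrates that calculation; it is not the package's evaluation code, and the function name recall_at_fpr and the toy scores are assumptions.

```python
def recall_at_fpr(scores, labels, max_fpr=0.05):
    """Return (threshold, recall) with the false positive rate capped at max_fpr.

    Rows with score strictly above the threshold are flagged. Labels: 1 = fraud.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    allowed_fp = int(max_fpr * len(negatives))  # false positives we may accept
    # Placing the threshold at the next-highest negative score keeps the
    # number of flagged negatives at or below allowed_fp.
    threshold = negatives[allowed_fp] if allowed_fp < len(negatives) else float("-inf")
    recall = sum(s > threshold for s in positives) / len(positives)
    return threshold, recall


# Toy scores: 20 legitimate rows and 4 fraud rows (illustrative only).
legit_scores = [i / 20 for i in range(20)]
fraud_scores = [0.99, 0.93, 0.50, 0.20]
scores = legit_scores + fraud_scores
labels = [0] * 20 + [1] * 4

print(recall_at_fpr(scores, labels))  # (0.9, 0.5)
```

In the toy data one of the 20 legitimate rows is flagged (5% FPR) and two of the four fraud rows are caught (50% recall).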
Clone
To replicate this pipeline locally, you will need to clone the repository and set up your MinIO environment variables.
git clone
Once your .Renviron is configured with BAF_KEY, BAF_SECRET, and BAF_ENDPOINT, you can execute the entire DAG; with the targets package this is normally done by calling targets::tar_make() from the project root.
Acknowledgements
This project utilizes the Bank Account Fraud (BAF) dataset, originally published and presented at NeurIPS 2022. It is a massive, privacy-preserving suite of realistic tabular data designed specifically for evaluating fairness and performance in machine learning fraud detection.