Rob Wiederstein b38892f49e Refactor: consistent naming across functions, targets, and pkgdown
Functions: prepare_eda_recipe -> build_eda_recipe,
           create_efficiency_plot -> plot_efficiency,
           format_class_imbalance_tourney_gt -> format_tournament_gt

Targets: model_inputs_prefix -> baf_model_input_prefix,
         tbl_fraud_by_month_data -> fraud_by_month_summary,
         model_diag -> diag_fit, winning_params -> best_params,
         production_recipe_blueprint -> prod_recipe,
         final_eval_data -> test_predictions

pkgdown: restructured reference index into 6 logical sections,
         removed stale names and development comments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 03:52:34 -05:00

baflakehouse

About

The baflakehouse package is an end-to-end machine learning pipeline built to detect bank account fraud. Rather than relying on static local files, it implements a Lakehouse architecture: it ingests a 1-million-row dataset, partitions it into Parquet files via Apache Arrow, stores it on a MinIO object server, and trains a production-ready LightGBM model, with every step orchestrated by the targets package.

Significance

Financial fraud datasets suffer from extreme class imbalance, making traditional accuracy metrics highly misleading. This pipeline is engineered specifically to handle that imbalance without aggressive synthetic oversampling.
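To see why accuracy misleads here, consider a toy example. The 1% fraud rate below is an assumed round figure chosen for illustration, not a statistic taken from the dataset.

```r
# Illustrative only: the 1% fraud prevalence is an assumption.
n_fraud <- 1000L
n_legit <- 99000L
truth <- c(rep("Fraud", n_fraud), rep("Legit", n_legit))

# A "model" that never flags anything as fraud.
pred <- rep("Legit", length(truth))

accuracy <- mean(pred == truth)
recall <- sum(pred == "Fraud" & truth == "Fraud") / sum(truth == "Fraud")

accuracy  # 0.99
recall    # 0
```

The do-nothing classifier scores 99% accuracy while catching no fraud at all, which is why the pipeline ranks models by PR-AUC and recall at a fixed false positive rate instead.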

Pipeline

The pipeline is orchestrated by the targets package and executes as a reproducible DAG. All data is stored remotely in MinIO and accessed via Apache Arrow — no local CSVs or intermediate files on disk.

Layer 01 → 02 | Ingest: Raw CSVs are read from baf-fraud/01_raw and converted to Hive-partitioned Parquet files in 02_intermediate using Arrow's write_dataset().
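A minimal sketch of the ingest step, assuming the arrow package is installed. A local temporary directory stands in for the MinIO bucket, and the toy columns are hypothetical; the package's actual ingest function may differ.

```r
# Sketch only: a temp directory replaces the MinIO bucket,
# and the columns below are made up for illustration.
if (requireNamespace("arrow", quietly = TRUE)) {
  raw <- data.frame(
    month      = c(0L, 0L, 1L, 1L),
    income     = c(0.3, 0.8, 0.1, 0.9),
    fraud_bool = c(0L, 1L, 0L, 0L)
  )

  out_dir <- file.path(tempdir(), "02_intermediate")

  # Hive-style partitioning writes one directory per key value,
  # e.g. month=0/ and month=1/.
  arrow::write_dataset(raw, out_dir, format = "parquet", partitioning = "month")

  partitions <- list.files(out_dir)

  # Reopen lazily; the partition column is recovered from the directory names.
  ds <- arrow::open_dataset(out_dir)
}
```

Against MinIO, the local path would presumably be replaced by an S3 filesystem object, for example one created with arrow::s3_bucket() and an endpoint_override pointing at the MinIO server.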

Layer 02 → 03 | Clean: Sentinel values (-1) are recoded to NA, the binary outcome is relabelled from fraud_bool to outcome ("Fraud"/"Legit"), and the cleaned data is written to 03_primary partitioned by month.
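The cleaning rules can be sketched in base R as follows. The helper name clean_slice and the example column are hypothetical, and the sketch works in memory, whereas the pipeline operates on Arrow datasets.

```r
# Hypothetical helper: recode -1 sentinels to NA and relabel the outcome.
clean_slice <- function(df, sentinel_cols) {
  for (col in sentinel_cols) {
    # %in% avoids NA subscripts if the column already holds missing values.
    df[[col]][df[[col]] %in% -1] <- NA
  }
  df$outcome <- factor(
    ifelse(df$fraud_bool == 1, "Fraud", "Legit"),
    levels = c("Fraud", "Legit")
  )
  df$fraud_bool <- NULL
  df
}

slice <- data.frame(
  prev_address_months_count = c(-1, 12, 30),  # illustrative column
  fraud_bool = c(0, 1, 0)
)
cleaned <- clean_slice(slice, "prev_address_months_count")
```

Listing the sentinel columns explicitly keeps a legitimate -1 in some other numeric column from being recoded by accident.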

Layer 03 → 04 | Feature Engineering: A missingness count feature (n_missing) is computed out-of-memory via Arrow compute and written to 04_feature.
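For intuition, here is the same feature computed in memory with base R. This is a stand-in for the Arrow compute expression the pipeline uses, not a copy of it, and the columns are invented.

```r
# In-memory equivalent of the n_missing feature (toy columns).
features <- data.frame(
  a = c(NA, 1, 2),
  b = c(NA, NA, 5),
  c = c(1, 2, 3)
)

# Count NA values per row; compute before adding the new column.
features$n_missing <- rowSums(is.na(features))

features$n_missing  # 2 1 0
```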

Layer 04 → 05 | Resampling: Five versions of each monthly slice are generated (Baseline, Undersampling, SMOTE, ADASYN, and Tomek Links) and saved to 05_model_input.
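Of the five strategies, random undersampling is the simplest to sketch. The helper below is hypothetical and assumes a 1:1 class ratio after sampling; the ratio the pipeline actually uses is not stated here.

```r
# Hypothetical helper: keep every minority row and an equal-sized random
# sample of majority rows (1:1 ratio is an assumption).
undersample <- function(df, outcome_col = "outcome", minority = "Fraud", seed = 1) {
  set.seed(seed)
  is_min  <- df[[outcome_col]] == minority
  min_idx <- which(is_min)
  maj_idx <- which(!is_min)
  n_keep  <- min(length(min_idx), length(maj_idx))
  keep_maj <- maj_idx[sample.int(length(maj_idx), n_keep)]
  df[sort(c(min_idx, keep_maj)), , drop = FALSE]
}

toy <- data.frame(
  x = seq_len(100),
  outcome = c(rep("Fraud", 5), rep("Legit", 95))
)
balanced <- undersample(toy)
table(balanced$outcome)
```

Resampling should only ever be applied to training slices; the test month keeps its natural class ratio so that PR-AUC reflects real prevalence.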

Imbalance Tournament: LightGBM models are trained across all five strategies using three sliding time windows (train on months t, t+1, t+2; test on t+3). Strategies are ranked by PR-AUC and evaluated for statistical significance via paired t-test against the Standard baseline.
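The windowing scheme can be sketched as below. The helper make_windows is hypothetical, and the month range 0 to 5 is an assumption: six consecutive months happen to yield exactly three windows of the form described.

```r
# Hypothetical helper: build sliding train/test windows over ordered months.
make_windows <- function(months, train_size = 3) {
  n_windows <- length(months) - train_size
  lapply(seq_len(n_windows), function(i) {
    list(
      train = months[i:(i + train_size - 1)],
      test  = months[i + train_size]
    )
  })
}

windows <- make_windows(0:5)  # assumed month range

for (w in windows) {
  cat("train:", w$train, "| test:", w$test, "\n")
}
# train: 0 1 2 | test: 3
# train: 1 2 3 | test: 4
# train: 2 3 4 | test: 5
```

Because each test month always follows its training months, no future information leaks into training.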

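| Strategy      | PR-AUC | Avg Train Time (s) | Sig. vs Standard |
|---------------|--------|--------------------|------------------|
| Standard      | 0.1650 | 2.19               |                  |
| ADASYN        | 0.1629 | 3.87               | No (p = 0.37)    |
| SMOTE         | 0.1617 | 3.79               | No (p = 0.15)    |
| Weighted      | 0.1577 | 2.18               | No (p = 0.15)    |
| Tomek         | 0.1483 | 2.16               | Yes (p = 0.009)  |
| Undersampling | 0.1394 | 0.92               | Yes (p = 0.029)  |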
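As a reminder of what the PR-AUC column measures, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. Whether the pipeline uses this estimator or a trapezoidal one is not stated here, so treat the function as illustrative; it also ignores tied scores.

```r
# Average precision: mean of precision@k taken at each true-positive rank.
average_precision <- function(scores, is_positive) {
  ord  <- order(scores, decreasing = TRUE)
  hits <- is_positive[ord]
  precision_at_k <- cumsum(hits) / seq_along(hits)
  sum(precision_at_k[hits]) / sum(hits)
}

# Toy scores and labels, invented for illustration.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5)
labels <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

ap <- average_precision(scores, labels)
ap  # (1/1 + 2/3) / 2 = 0.8333...
```

For context, a model that ranks cases at random has an expected average precision roughly equal to the fraud prevalence, so PR-AUC values should be read against that baseline rather than against 0.5.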

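The Standard baseline achieves the highest mean PR-AUC. ADASYN, SMOTE, and Weighted score slightly lower, although none of those differences is statistically significant; ADASYN and SMOTE also take roughly 1.7 to 1.8 times as long to train. Tomek Links and Undersampling significantly hurt performance and are discarded.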
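The significance column comes from a paired t-test across windows, which can be sketched with base R's t.test(). The per-window PR-AUC values below are invented for illustration and are not the pipeline's results.

```r
# Hypothetical per-window PR-AUC values (one entry per sliding window).
standard  <- c(0.160, 0.170, 0.165)
candidate <- c(0.150, 0.158, 0.156)

# Paired test: each window contributes one candidate-minus-standard difference.
result <- t.test(candidate, standard, paired = TRUE)

result$statistic  # negative when the candidate underperforms
result$parameter  # degrees of freedom = number of windows - 1
result$p.value
```

With only three windows the test has two degrees of freedom, so it can detect only large, consistent differences; a "No" in the table means "not distinguishable from Standard", not "equivalent".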

Layer 05 → 06 | Production: The winning Standard strategy is retrained on months 0–5, evaluated on the held-out months 6–7, serialised to baf_lgbm_prod_v1.txt, and uploaded to baf-fraud/06_models in MinIO.

Reporting: All figures and tables are written to reports/ and assembled into a Quarto RevealJS slide deck via tar_quarto().

Results

By leveraging LightGBM's native cost-sensitive learning (scale_pos_weight) and leaf-wise tree growth, the production model achieves approximately 49.1% recall at a fixed 5% false positive rate (FPR). In other words, it detects about half of the fraudulent applications while flagging only 5% of legitimate customers for manual review.
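A sketch of how recall at a fixed FPR can be computed from predicted scores, together with the common negative-to-positive heuristic for scale_pos_weight. The helper recall_at_fpr and the simulated scores are hypothetical; the package's own evaluation code may differ.

```r
# Hypothetical helper: pick the threshold from legitimate scores only,
# then measure what share of fraud cases exceeds it.
recall_at_fpr <- function(scores, is_fraud, max_fpr = 0.05) {
  # type = 1 returns an observed score, so the realised FPR never exceeds max_fpr.
  threshold <- quantile(scores[!is_fraud], probs = 1 - max_fpr,
                        type = 1, names = FALSE)
  flagged <- scores > threshold
  c(threshold = threshold,
    fpr       = mean(flagged[!is_fraud]),
    recall    = mean(flagged[is_fraud]))
}

# Simulated scores, for illustration only.
set.seed(42)
is_fraud <- c(rep(FALSE, 1000), rep(TRUE, 50))
scores   <- c(rnorm(1000, mean = 0), rnorm(50, mean = 2))

out <- recall_at_fpr(scores, is_fraud)

# Common heuristic for LightGBM's scale_pos_weight: negatives / positives.
scale_pos_weight <- sum(!is_fraud) / sum(is_fraud)  # 20 for this toy data
```

Fixing the FPR first mirrors an operational constraint: the review team can handle only so many flagged applications, and recall is whatever the model delivers inside that budget.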

Clone

To replicate this pipeline locally, you will need to clone the repository and set up your MinIO environment variables.

git clone

Once your .Renviron is configured with your BAF_KEY, BAF_SECRET, and BAF_ENDPOINT, you can execute the entire DAG:
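A hedged sketch of that step: for a targets pipeline the usual entry point is targets::tar_make(), so the wrapper below assumes that call and first checks that the three variables are set. The helper names missing_env and run_pipeline are hypothetical.

```r
# Environment variables named in this README.
required <- c("BAF_KEY", "BAF_SECRET", "BAF_ENDPOINT")

# Hypothetical helper: return the names of any unset or empty variables.
missing_env <- function(vars = required) {
  vars[!nzchar(Sys.getenv(vars))]
}

# Hypothetical wrapper: fail fast on missing credentials, then run the DAG.
# Assumes the targets package is installed and _targets.R is in the
# working directory.
run_pipeline <- function() {
  absent <- missing_env()
  if (length(absent) > 0) {
    stop("Missing environment variables: ", paste(absent, collapse = ", "))
  }
  targets::tar_make()
}
```

Calling run_pipeline() from the repository root would then build only the targets that are out of date, which is the default behaviour of tar_make().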

Acknowledgements

This project utilizes the Bank Account Fraud (BAF) dataset, originally published and presented at NeurIPS 2022. It is a massive, privacy-preserving suite of realistic tabular data designed specifically for evaluating fairness and performance in machine learning fraud detection.

Citation
