Rob Wiederstein b38892f49e Refactor: consistent naming across functions, targets, and pkgdown

Functions: prepare_eda_recipe -> build_eda_recipe,
           create_efficiency_plot -> plot_efficiency,
           format_class_imbalance_tourney_gt -> format_tournament_gt

Targets: model_inputs_prefix -> baf_model_input_prefix,
         tbl_fraud_by_month_data -> fraud_by_month_summary,
         model_diag -> diag_fit, winning_params -> best_params,
         production_recipe_blueprint -> prod_recipe,
         final_eval_data -> test_predictions

pkgdown: restructured reference index into 6 logical sections,
         removed stale names and development comments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-22 03:52:34 -05:00

man

Refactor: consistent naming across functions, targets, and pkgdown

2026-02-22 03:52:34 -05:00

renv

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

reports/figures

Add tune_lgbm() and wire hyperparameter tuning into DAG

2026-02-22 03:25:35 -05:00

resources/images

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

_pkgdown.yml

Refactor: consistent naming across functions, targets, and pkgdown

2026-02-22 03:52:34 -05:00

_quarto.yml

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

_targets.R

Refactor: consistent naming across functions, targets, and pkgdown

2026-02-22 03:52:34 -05:00

.dockerignore

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

.gitignore

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

.Rbuildignore

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

deploy.R

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

DESCRIPTION

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

ieee.csl

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

index.qmd

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

LICENSE

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

LICENSE.md

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

NAMESPACE

Refactor: consistent naming across functions, targets, and pkgdown

2026-02-22 03:52:34 -05:00

README.md

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

references.bib

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

renv.lock

Initial commit: BAF Lakehouse fraud detection pipeline

2026-02-21 21:19:09 -05:00

README.md

baflakehouse
- About
- Results
- Clone
- Acknowledgements
- Citation

baflakehouse

About

The baflakehouse package is an end-to-end machine learning pipeline built to detect bank account fraud. Rather than relying on static local files, it implements a Lakehouse architecture: it ingests a 1-million-row dataset, partitions it into Parquet files via Apache Arrow, stores it on a MinIO object server, and trains a production-ready LightGBM model, all orchestrated by the targets package.

Significance

Financial fraud datasets suffer from extreme class imbalance, making traditional accuracy metrics highly misleading. This pipeline is engineered specifically to handle that imbalance without aggressive synthetic oversampling.
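
To make the point about accuracy concrete, here is a small sketch in Python (used for illustration only; the package itself is written in R). It assumes a 1% fraud rate, which is a made-up figure and not a statistic from the BAF data, and scores a classifier that labels every application as legitimate.

```python
# Illustration only: the 1% fraud rate is an assumed figure, not taken from the BAF data.

def accuracy_and_recall(labels, predictions):
    """Return (accuracy, recall) for binary labels where 1 marks fraud."""
    correct = sum(y == p for y, p in zip(labels, predictions))
    true_positives = sum(y == 1 and p == 1 for y, p in zip(labels, predictions))
    actual_positives = sum(labels)
    return correct / len(labels), true_positives / actual_positives

labels = [1] * 10 + [0] * 990   # 10 fraud cases among 1,000 applications
always_legit = [0] * 1000       # a "model" that never flags anything

accuracy, recall = accuracy_and_recall(labels, always_legit)
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The do-nothing classifier scores 99% accuracy while catching no fraud at all, which is why the pipeline reports PR-AUC and recall at a fixed false positive rate instead.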

Pipeline

The pipeline is orchestrated by the targets package and executes as a reproducible DAG. All data is stored remotely in MinIO and accessed via Apache Arrow — no local CSVs or intermediate files on disk.

Layer 01 → 02 | Ingest: Raw CSVs are read from baf-fraud/01_raw and converted to Hive-partitioned Parquet files in 02_intermediate using Arrow's write_dataset().
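
Hive partitioning encodes each partition column in the directory name as key=value. The Python sketch below (illustration only; the pipeline does this in R through Arrow) builds such a path. The partition key month and the file name part-0.parquet are assumptions for the example, not values confirmed for this layer.

```python
from pathlib import PurePosixPath

def hive_partition_path(base, partitions, filename="part-0.parquet"):
    """Build a Hive-style path: base/key1=value1/key2=value2/filename."""
    path = PurePosixPath(base)
    for key, value in partitions.items():
        path = path / f"{key}={value}"   # one directory level per partition column
    return str(path / filename)

# Hypothetical example: the partition key "month" is assumed for this layer.
example_path = hive_partition_path("baf-fraud/02_intermediate", {"month": 3})
print(example_path)
```

Because the partition value lives in the path, a reader can skip whole directories when a query filters on that column.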

Layer 02 → 03 | Clean: Sentinel values (-1) are recoded to NA, the binary outcome is relabelled from fraud_bool to outcome ("Fraud"/"Legit"), and the cleaned data is written to 03_primary partitioned by month.
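
A minimal Python sketch of the cleaning rule described above (illustration only; the package does this in R). It assumes that the sentinel applies to every non-outcome column and that fraud_bool equal to 1 means fraud. The column names col_a and col_b are hypothetical.

```python
SENTINEL = -1  # value the source says is recoded to NA

def clean_row(row):
    """Recode sentinel values to None and relabel fraud_bool as outcome."""
    cleaned = {
        key: (None if value == SENTINEL else value)
        for key, value in row.items()
        if key != "fraud_bool"
    }
    # Assumption: fraud_bool == 1 marks a fraudulent application.
    cleaned["outcome"] = "Fraud" if row["fraud_bool"] == 1 else "Legit"
    return cleaned

raw = {"fraud_bool": 1, "col_a": -1, "col_b": 12}   # col_a, col_b are hypothetical
cleaned_example = clean_row(raw)
print(cleaned_example)
```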

Layer 03 → 04 | Feature Engineering: A missingness count feature (n_missing) is computed out-of-memory via Arrow compute and written to 04_feature.
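
The n_missing feature is a per-row count of missing values. The Python sketch below shows the logic in memory on one record (illustration only; the pipeline computes it out-of-memory with Arrow in R). The column names are hypothetical.

```python
def add_n_missing(row):
    """Return a copy of the row with n_missing = number of None values."""
    n_missing = sum(value is None for value in row.values())
    return {**row, "n_missing": n_missing}

# Hypothetical record with two missing fields.
record = {"col_a": None, "col_b": 12, "col_c": None, "outcome": "Legit"}
with_feature = add_n_missing(record)
print(with_feature["n_missing"])
```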

Layer 04 → 05 | Resampling: Five versions of each monthly slice are generated (Baseline, Undersampling, SMOTE, ADASYN, and Tomek Links) and saved to 05_model_input.
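
As one example of these strategies, random undersampling can be sketched as follows in Python (illustration only; this is not the package's R implementation). The 1:1 class ratio and the fixed seed are assumptions made for the example.

```python
import random

def undersample(rows, label_key="outcome", minority="Fraud", seed=42):
    """Keep every minority row and an equal-sized random sample of the rest."""
    rng = random.Random(seed)   # seeded for reproducibility
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    n_keep = min(len(minority_rows), len(majority_rows))
    balanced = minority_rows + rng.sample(majority_rows, n_keep)
    rng.shuffle(balanced)       # avoid leaving the classes in blocks
    return balanced

# Toy slice: 5 fraud rows and 95 legitimate rows.
rows = [{"id": i, "outcome": "Fraud" if i < 5 else "Legit"} for i in range(100)]
balanced = undersample(rows)
print(len(balanced))
```

Undersampling shrinks the training set sharply, which is consistent with it being the fastest strategy to train in the tournament.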

Imbalance Tournament: LightGBM models are trained under each strategy using three sliding time windows (train on months t, t+1, t+2; test on month t+3). Strategies are ranked by PR-AUC, and each is compared with the Standard baseline using a paired t-test.

Strategy	PR-AUC	Avg Train Time (s)	Sig. vs Standard
Standard	0.1650	2.19	—
ADASYN	0.1629	3.87	No (p = 0.37)
SMOTE	0.1617	3.79	No (p = 0.15)
Weighted	0.1577	2.18	No (p = 0.15)
Tomek	0.1483	2.16	Yes (p = 0.009)
Undersampling	0.1394	0.92	Yes (p = 0.029)

The Standard baseline has the highest PR-AUC. ADASYN, SMOTE, and Weighted are not significantly different from it, and ADASYN and SMOTE take about 1.7 to 1.8 times as long to train (3.87 s and 3.79 s versus 2.19 s). Tomek Links and Undersampling score significantly lower than the baseline and are discarded.
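
The tournament design can be sketched in Python as follows (illustration only; the package is written in R). The sketch assumes the three windows start at month 0, and the PR-AUC values are invented for the example and are not results from this project. With three paired scores the test has 2 degrees of freedom, for which the Student t CDF has a closed form, so no statistics library is needed.

```python
import math
from statistics import mean, stdev

def sliding_windows(first_month, n_windows, train_len=3):
    """Return (train_months, test_month) pairs: train on t..t+2, test on t+3."""
    return [
        (list(range(t, t + train_len)), t + train_len)
        for t in range(first_month, first_month + n_windows)
    ]

def paired_t_test_three(baseline, candidate):
    """Two-sided paired t-test for exactly three pairs (2 degrees of freedom).

    For df = 2 the t CDF is 1/2 + t / (2 * sqrt(t^2 + 2)), so the two-sided
    p-value is 1 - |t| / sqrt(t^2 + 2). Identical differences (zero variance)
    raise ZeroDivisionError.
    """
    if len(baseline) != 3 or len(candidate) != 3:
        raise ValueError("this closed form only covers three paired scores")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(3))
    p_value = 1 - abs(t_stat) / math.sqrt(t_stat ** 2 + 2)
    return t_stat, p_value

windows = sliding_windows(first_month=0, n_windows=3)   # assumed start month
print(windows)

# Invented per-window PR-AUC scores, for illustration only.
t_stat, p_value = paired_t_test_three([0.10, 0.20, 0.30], [0.11, 0.22, 0.33])
print(round(t_stat, 3), round(p_value, 3))
```

Pairing by window removes month-to-month variation from the comparison, so the test looks only at how consistently one strategy beats the other on the same data.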

Layer 05 → 06 | Production: The winning Standard strategy is retrained on months 0–5, evaluated on the held-out months 6–7, serialised to baf_lgbm_prod_v1.txt, and uploaded to baf-fraud/06_models in MinIO.

Reporting: All figures and tables are written to reports/ and assembled into a Quarto RevealJS slide deck via tar_quarto().

Results

By leveraging LightGBM's native cost-sensitive learning (scale_pos_weight) and leaf-wise tree growth, the production model achieves roughly 49.1% recall at a fixed 5% false positive rate (FPR). In other words, it detects about half of the fraudulent applications while flagging at most 5% of legitimate customers for manual review.
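
Recall at a fixed FPR can be computed by choosing the score threshold that lets through at most the allowed share of legitimate cases and then measuring how many fraud cases clear it. The Python sketch below shows one way to do this (illustration only; it is not the package's R code, and the scores are invented).

```python
import math

def recall_at_fpr(scores, labels, max_fpr=0.05):
    """Return (recall, threshold) with at most max_fpr of negatives flagged.

    A case is flagged when its score is strictly greater than the threshold.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")
    # Number of negatives we may flag; the epsilon guards against float rounding.
    allowed = math.floor(max_fpr * len(negatives) + 1e-9)
    threshold = negatives[allowed] if allowed < len(negatives) else -math.inf
    recall = sum(s > threshold for s in positives) / len(positives)
    return recall, threshold

# Invented scores: 20 legitimate cases and 4 fraud cases.
neg_scores = [i / 100 for i in range(20)]
pos_scores = [0.5, 0.185, 0.1, 0.05]
scores = neg_scores + pos_scores
labels = [0] * 20 + [1] * 4

recall, threshold = recall_at_fpr(scores, labels, max_fpr=0.05)
print(recall, threshold)
```

In this toy data one of the 20 legitimate cases may be flagged, and two of the four fraud cases score above the resulting threshold.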

Clone

To replicate this pipeline locally, you will need to clone the repository and set up your MinIO environment variables.

git clone

Once your .Renviron is configured with your BAF_KEY, BAF_SECRET, and BAF_ENDPOINT, you can execute the entire DAG, typically by calling targets::tar_make() from the project root.

Acknowledgements

This project uses the Bank Account Fraud (BAF) dataset, originally presented at NeurIPS 2022. It is a large, privacy-preserving suite of realistic tabular data designed specifically for evaluating fairness and performance in machine learning fraud detection.
