---
output: github_document
---

- [baflakehouse](#baflakehouse)
  - [About](#about)
  - [Pipeline](#pipeline)
  - [Results](#results)
  - [Clone](#clone)
  - [Acknowledgements](#acknowledgements)
  - [Citation](#citation)

# baflakehouse

## About

The baflakehouse package is an end-to-end machine learning pipeline for detecting bank account fraud. Rather than relying on static local files, it implements a Lakehouse architecture: it ingests a 1-million-row dataset, partitions it into Parquet files via Apache Arrow, stores it on a MinIO object server, and trains a production LightGBM model, with every step orchestrated by the `targets` package.

### Significance

Financial fraud datasets suffer from extreme class imbalance, which makes plain accuracy misleading: a model that labels every application as legitimate scores near-perfect accuracy while catching no fraud. This pipeline is engineered specifically to handle that imbalance without aggressive synthetic oversampling.
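To make the accuracy problem concrete, here is a minimal sketch in base R. The 1% fraud rate and the toy labels are illustrative assumptions, not figures taken from the dataset.

```r
# Toy labels: 10 fraud cases among 1,000 applications (assumed 1% rate).
truth <- c(rep("Fraud", 10), rep("Legit", 990))

# A useless classifier that predicts "Legit" for every application.
pred <- rep("Legit", length(truth))

# Accuracy looks excellent, yet no fraud case is caught.
accuracy <- mean(pred == truth)
recall   <- sum(pred == "Fraud" & truth == "Fraud") / sum(truth == "Fraud")

cat(sprintf("accuracy = %.2f, fraud recall = %.2f\n", accuracy, recall))
```

This is why the pipeline ranks models on precision-recall metrics rather than accuracy.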

## Pipeline

The pipeline is orchestrated by the `targets` package and executes as a reproducible DAG. All data is stored remotely in MinIO and accessed via Apache Arrow — no local CSVs or intermediate files on disk.

**Layer 01 → 02 | Ingest**
Raw CSVs are read from `baf-fraud/01_raw` and converted to Hive-partitioned Parquet files in `02_intermediate` using Arrow's `write_dataset()`.
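As a rough illustration of the ingest step, the sketch below writes a toy data frame as a Hive-partitioned Parquet dataset. It is not the pipeline's code: it writes to a temporary local directory instead of MinIO, and both the toy columns and the choice of `month` as the partition key are assumptions.

```r
# Sketch only: writes to a temporary directory, not to MinIO.
if (requireNamespace("arrow", quietly = TRUE)) {
  # Toy stand-in for the raw CSV contents (columns are illustrative).
  raw <- data.frame(
    month      = c(0L, 0L, 1L, 1L),
    fraud_bool = c(0L, 1L, 0L, 0L),
    income     = c(0.3, 0.9, 0.1, 0.7)
  )
  out_dir <- file.path(tempdir(), "02_intermediate")

  # Hive-style partitioning (the default) yields directories such as month=0/.
  arrow::write_dataset(raw, out_dir, format = "parquet", partitioning = "month")

  # Re-open the dataset lazily and list the files that were written.
  ds <- arrow::open_dataset(out_dir)
  print(list.files(out_dir, recursive = TRUE))
}
```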

**Layer 02 → 03 | Clean**
Sentinel values (`-1`) are recoded to `NA`, the binary outcome is relabelled from `fraud_bool` to `outcome` ("Fraud"/"Legit"), and the cleaned data is written to `03_primary` partitioned by month.
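The cleaning rules can be sketched in base R on a toy slice. The predictor columns and the list of sentinel columns are assumptions chosen for illustration, and the real pipeline applies these rules to the remote dataset rather than an in-memory data frame.

```r
# Toy slice; column names other than fraud_bool are illustrative.
primary <- data.frame(
  fraud_bool                = c(0L, 1L, 0L),
  bank_months_count         = c(12, -1, 3),
  session_length_in_minutes = c(-1, 4.2, 7.5)
)

# Columns assumed to use -1 as a "missing" sentinel.
sentinel_cols <- c("bank_months_count", "session_length_in_minutes")
for (col in sentinel_cols) {
  primary[[col]][primary[[col]] == -1] <- NA
}

# Relabel the 0/1 target as a factor named `outcome`, then drop the original.
primary$outcome <- factor(
  ifelse(primary$fraud_bool == 1L, "Fraud", "Legit"),
  levels = c("Fraud", "Legit")
)
primary$fraud_bool <- NULL

print(primary)
```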

**Layer 03 → 04 | Feature Engineering**
A missingness count feature (`n_missing`) is computed out-of-memory via Arrow compute and written to `04_feature`.
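The feature itself is a per-row count of missing predictors. The sketch below shows the idea in memory with base R on invented data; the pipeline computes the same quantity out-of-memory through Arrow, which is not reproduced here.

```r
# Toy data with a few missing predictor values (columns are illustrative).
features <- data.frame(
  a       = c(1, NA, 3),
  b       = c(NA, NA, 2),
  outcome = c("Legit", "Fraud", "Legit")
)

# Count NAs across predictor columns only, excluding the target.
predictors <- setdiff(names(features), "outcome")
features$n_missing <- rowSums(is.na(features[predictors]))

print(features)
```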

**Layer 04 → 05 | Resampling**
Five versions of each monthly slice are generated (Baseline, Undersampling, SMOTE, ADASYN, and Tomek Links) and saved to `05_model_input`. The Baseline version appears as Standard in the tournament results.
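Of the five strategies, random undersampling is the simplest to sketch without extra packages. The function below is a hypothetical helper, not the package's implementation, and the 1:1 target ratio is an assumption.

```r
# Hypothetical helper: keep all minority rows and an equal-sized random
# sample of majority rows (1:1 ratio assumed).
undersample <- function(df, outcome_col = "outcome", minority = "Fraud", seed = 42) {
  set.seed(seed)
  is_minority  <- df[[outcome_col]] == minority
  minority_idx <- which(is_minority)
  majority_idx <- which(!is_minority)
  n_keep <- min(length(minority_idx), length(majority_idx))
  keep_majority <- majority_idx[sample.int(length(majority_idx), n_keep)]
  df[sort(c(minority_idx, keep_majority)), , drop = FALSE]
}

# Toy monthly slice: 5 fraud rows among 100.
slice <- data.frame(id = 1:100, outcome = c(rep("Fraud", 5), rep("Legit", 95)))
balanced <- undersample(slice)
print(table(balanced$outcome))
```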

**Imbalance Tournament**
LightGBM models are trained for each strategy in the table below using three sliding time windows (train on months t, t+1, t+2; test on t+3). Strategies are ranked by PR-AUC and evaluated for statistical significance via paired t-test against the Standard baseline.

| Strategy | PR-AUC | Avg Train Time (s) | Sig. vs Standard |
|---|---|---|---|
| Standard | 0.1650 | 2.19 | — |
| ADASYN | 0.1629 | 3.87 | No (p = 0.37) |
| SMOTE | 0.1617 | 3.79 | No (p = 0.15) |
| Weighted | 0.1577 | 2.18 | No (p = 0.15) |
| Tomek | 0.1483 | 2.16 | **Yes (p = 0.009)** |
| Undersampling | 0.1394 | 0.92 | **Yes (p = 0.029)** |

The Standard baseline achieves the highest PR-AUC. ADASYN, SMOTE, and Weighted score slightly lower, though not significantly so, and the two oversamplers take roughly 75% longer to train. Tomek Links and Undersampling significantly *hurt* performance and are discarded.
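The tournament mechanics can be sketched in base R as follows. Everything here is illustrative: `make_windows()` and `pr_auc()` are hypothetical helpers, the month range 0 to 5 is an assumption that happens to yield three windows, PR-AUC is approximated by average precision, and the per-window scores fed to the t-test are invented numbers rather than project results.

```r
# Sliding windows: train on t, t+1, t+2 and test on t+3.
make_windows <- function(months, train_len = 3) {
  n_windows <- length(months) - train_len
  lapply(seq_len(n_windows), function(i) {
    list(train = months[i:(i + train_len - 1)], test = months[i + train_len])
  })
}
windows <- make_windows(0:5)  # assumed month range; gives three windows

# Average precision, one common estimator of PR-AUC (score ties ignored).
pr_auc <- function(is_fraud, score) {
  hits <- is_fraud[order(score, decreasing = TRUE)]
  precision_at_k <- cumsum(hits) / seq_along(hits)
  sum(precision_at_k[hits == 1]) / sum(hits)
}

# Invented per-window PR-AUC values, paired by window.
standard  <- c(0.160, 0.172, 0.163)
candidate <- c(0.141, 0.150, 0.139)
paired_result <- t.test(candidate, standard, paired = TRUE)

str(windows[[1]])
print(paired_result$p.value)
```

Pairing by window matters because every strategy is scored on the same test months, so the test compares within-window differences rather than two independent samples.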

**Layer 05 → 06 | Production**
The winning Standard strategy is retrained on months 0–5, evaluated on the held-out months 6–7, serialised to `baf_lgbm_prod_v1.txt`, and uploaded to `baf-fraud/06_models` in MinIO.
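A minimal sketch of the train-and-serialise step with the `lightgbm` R package is shown below. The synthetic data, the hyperparameters, and the temporary output path are all assumptions for illustration; the real model is trained on months 0–5 of the dataset, and the upload to MinIO is not shown.

```r
# Sketch only: synthetic data and assumed hyperparameters.
if (requireNamespace("lightgbm", quietly = TRUE)) {
  set.seed(1)
  n <- 500
  x <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("f1", "f2", "f3")))
  y <- as.integer(x[, 1] + rnorm(n) > 2)  # rare positive class

  dtrain <- lightgbm::lgb.Dataset(data = x, label = y)
  params <- list(
    objective        = "binary",
    learning_rate    = 0.1,
    num_leaves       = 15L,
    min_data_in_leaf = 5L,
    verbose          = -1L
  )
  booster <- lightgbm::lgb.train(params = params, data = dtrain, nrounds = 20L)

  # Serialise to LightGBM's text format, then reload and score.
  model_path <- file.path(tempdir(), "baf_lgbm_prod_v1.txt")
  lightgbm::lgb.save(booster, model_path)
  reloaded <- lightgbm::lgb.load(model_path)
  preds <- predict(reloaded, x)
  print(summary(preds))
}
```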

**Reporting**
All figures and tables are written to `reports/` and assembled into a Quarto RevealJS slide deck via `tar_quarto()`.

## Results

By leveraging LightGBM's native cost-sensitive learning (`scale_pos_weight`) and leaf-wise tree growth, the production model achieves ~49.1% recall at a 5% false positive rate (FPR). At that operating point it catches roughly half of fraudulent applications while flagging 5% of legitimate customers for manual review.
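For readers unfamiliar with the metric, recall at a fixed FPR can be computed as sketched below. `recall_at_fpr()` is a hypothetical helper and the scores are simulated, so the printed values are unrelated to the figure reported above.

```r
# Hypothetical helper: pick the score threshold that flags at most
# `max_fpr` of legitimate cases, then measure recall on fraud cases.
recall_at_fpr <- function(is_fraud, score, max_fpr = 0.05) {
  legit_scores <- score[is_fraud == 0]
  # type = 1 returns an observed score, so the FPR cap is never exceeded.
  threshold <- quantile(legit_scores, probs = 1 - max_fpr, type = 1, names = FALSE)
  flagged <- score > threshold
  list(
    threshold = threshold,
    fpr       = mean(flagged[is_fraud == 0]),
    recall    = mean(flagged[is_fraud == 1])
  )
}

# Simulated scores: fraud cases tend to score higher than legitimate ones.
set.seed(7)
is_fraud <- c(rep(0, 1000), rep(1, 50))
score    <- c(runif(1000, 0, 0.7), runif(50, 0.3, 1))

res <- recall_at_fpr(is_fraud, score)
print(res)
```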

## Clone

To replicate this pipeline locally, you will need to clone the repository and set up your MinIO environment variables.

```
git clone
```

Once your `.Renviron` is configured with your `BAF_KEY`, `BAF_SECRET`, and `BAF_ENDPOINT`, you can execute the entire DAG with the `targets` package.
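A typical way to run a `targets` pipeline from the repository root is `targets::tar_make()`. The sketch below wraps that call in a pre-flight check; the check is a hypothetical convenience, not part of the package, and it assumes the pipeline is defined in the default `_targets.R` file.

```r
# Pre-flight check (hypothetical, not part of the package).
required_vars <- c("BAF_KEY", "BAF_SECRET", "BAF_ENDPOINT")
missing_vars  <- required_vars[Sys.getenv(required_vars) == ""]

if (length(missing_vars) > 0) {
  message("Set these in .Renviron first: ", paste(missing_vars, collapse = ", "))
} else if (file.exists("_targets.R") && requireNamespace("targets", quietly = TRUE)) {
  targets::tar_make()  # builds every outdated target in the DAG
}
```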


## Acknowledgements

This project uses the Bank Account Fraud (BAF) dataset, originally published and presented at NeurIPS 2022. It is a large-scale, privacy-preserving suite of realistic tabular datasets designed for evaluating fairness and performance in machine learning fraud detection.

## Citation