Challenge

How do you build a model that can catch rare fraud events without slowing down the user experience?

Solution

A combination of cost-sensitive learning (XGBoost) and a highly optimized Python inference service.
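Cost-sensitive learning here mostly comes down to up-weighting the rare fraud class. A minimal sketch of the usual starting point for XGBoost's `scale_pos_weight` parameter (the label counts below are illustrative, not the project's data):

```python
def compute_scale_pos_weight(labels):
    """Negative/positive count ratio -- XGBoost's documented starting
    point for scale_pos_weight on imbalanced binary data."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives

# Illustrative labels at a ~3% fraud rate, similar in spirit to IEEE-CIS.
labels = [0] * 970 + [1] * 30
weight = compute_scale_pos_weight(labels)
```

The resulting ratio would then be passed as `scale_pos_weight=weight` when constructing the `xgboost.XGBClassifier`.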

Project Files

  • src/pipeline.py
  • src/models/fraud_model.joblib

pipeline.py
import joblib
import pandas as pd


class PredictionPipeline:
    """Inference pipeline with training-time feature parity."""

    def __init__(self, model_path, encoders_path, medians_path):
        self.model = joblib.load(model_path)
        self.encoders = joblib.load(encoders_path)
        self.medians = joblib.load(medians_path)
        # Column order the model was trained on (set when fitting on a DataFrame).
        self.feature_cols = list(self.model.feature_names_in_)

    def preprocess(self, input_df: pd.DataFrame) -> pd.DataFrame:
        df = input_df.copy()

        # 1. Feature engineering: time-based signals
        df["hour"] = (df["TransactionDT"] // 3600) % 24

        # 2. Categorical encoding: reuse the fitted training encoders;
        #    map categories unseen in training to a known fallback class
        #    so transform() never raises at serving time.
        for col, le in self.encoders.items():
            if col in df.columns:
                known = set(le.classes_)
                df[col] = df[col].apply(lambda x: x if x in known else le.classes_[0])
                df[col] = le.transform(df[col])

        # 3. Numeric imputation: fill with the *training* medians,
        #    never statistics recomputed on the incoming batch.
        for col, median in self.medians.items():
            if col in df.columns:
                df[col] = df[col].fillna(median)

        return df[self.feature_cols]

Building the Pipeline

My goal was to ensure 100% feature parity between my training environment and this serving pipeline.

Design Choice

I learned that even small differences between training-time and serving-time preprocessing, such as imputing with a median recomputed on live data rather than the one fitted during training, can silently produce completely wrong predictions in production.
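One way to enforce that parity is to compute imputation statistics once, at training time, and persist them as artifacts (the project stores them with joblib; the column name below is from IEEE-CIS, but the values are made up for illustration):

```python
import pandas as pd

# Training time: compute the median once and keep it as an artifact.
train = pd.DataFrame({"TransactionAmt": [10.0, 25.0, None, 40.0]})
medians = {"TransactionAmt": train["TransactionAmt"].median()}  # 25.0

# Serving time: a missing value is filled with the *training* median,
# never a statistic recomputed on whatever batch happens to arrive.
serve = pd.DataFrame({"TransactionAmt": [None, 5.0]})
serve["TransactionAmt"] = serve["TransactionAmt"].fillna(medians["TransactionAmt"])
```

If the serving batch's own median were used instead (5.0 here), the same transaction would be imputed differently depending on its neighbours in the batch.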

Technical Specs

  • Data: IEEE-CIS Fraud
  • Architecture: XGBoost v2.0
  • Latency: < 50ms
  • Core Focus: Schema Alignment

Implementation Details

  • Imputation via Medians
  • Label Encoding Persistence
  • Temporal Engineering
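The encoding-persistence point can be illustrated with a pure-Python stand-in for the fitted, persisted scikit-learn LabelEncoder (the class and the card brands are hypothetical, not from the project):

```python
class PersistedEncoder:
    """Pure-Python stand-in for a fitted, persisted LabelEncoder:
    categories unseen at training time fall back to a fixed known
    class, so transform() never raises in production."""

    def __init__(self, classes):
        self.classes_ = sorted(classes)  # LabelEncoder stores sorted classes
        self._index = {c: i for i, c in enumerate(self.classes_)}

    def transform(self, values):
        fallback_idx = self._index[self.classes_[0]]  # same fallback as the pipeline
        return [self._index.get(v, fallback_idx) for v in values]

enc = PersistedEncoder({"mastercard", "visa"})
codes = enc.transform(["visa", "amex"])  # "amex" was never seen in training
```

Mapping unseen categories to a fixed known class is a deliberate trade-off: the encoding is slightly wrong for novel values, but the service never rejects a live transaction over an unfamiliar string.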