Challenge
How do you build a model that can catch rare fraud events without slowing down the user experience?
Solution
A cost-sensitive XGBoost model paired with a low-latency Python inference service.
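The cost-sensitive part boils down to making the rare fraud class expensive to misclassify. A minimal sketch of that idea with XGBoost's scikit-learn API, assuming a prepared train/validation split (X_train, y_train, X_valid, y_valid are placeholders) and purely illustrative hyperparameters:

from xgboost import XGBClassifier

# Weight the rare fraud class by the imbalance ratio so that missing a
# fraudulent transaction costs more than flagging a legitimate one.
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=8,
    scale_pos_weight=pos_weight,   # cost-sensitive weighting
    eval_metric="aucpr",           # PR-AUC is more informative than accuracy under heavy imbalance
    tree_method="hist",
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)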
Project Files
src/
  pipeline.py
models/
  fraud_model.joblib
pipeline.py
import joblib
import pandas as pd


class PredictionPipeline:
    """Inference pipeline with training-time feature parity."""

    def __init__(self, model_path, encoders_path, medians_path):
        self.model = joblib.load(model_path)
        self.encoders = joblib.load(encoders_path)
        self.medians = joblib.load(medians_path)
        # Feature order must match training; assumes the model was fit on a DataFrame
        # so the feature names were recorded.
        self.feature_cols = list(self.model.feature_names_in_)

    def preprocess(self, input_df: pd.DataFrame) -> pd.DataFrame:
        df = input_df.copy()

        # 1. Feature Engineering: time-based signals
        df["hour"] = (df["TransactionDT"] // 3600) % 24

        # 2. Categorical Encoding: reuse the encoders fitted at training time,
        #    mapping unseen categories to a known fallback class
        for col, le in self.encoders.items():
            if col in df.columns:
                known = set(le.classes_)
                df[col] = df[col].apply(lambda x: x if x in known else le.classes_[0])
                df[col] = le.transform(df[col])

        # 3. Numeric Imputation: fill missing values with training medians
        for col, median in self.medians.items():
            if col in df.columns:
                df[col] = df[col].fillna(median)

        return df[self.feature_cols]
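At serving time the pipeline is loaded once and reused across requests. A brief usage sketch (the artifact paths for the encoders and medians are illustrative, not part of the project files above):

import pandas as pd

pipeline = PredictionPipeline(
    model_path="models/fraud_model.joblib",
    encoders_path="models/encoders.joblib",
    medians_path="models/medians.joblib",
)

def score_transaction(payload: dict) -> float:
    """Return the fraud probability for a single transaction payload."""
    features = pipeline.preprocess(pd.DataFrame([payload]))
    return float(pipeline.model.predict_proba(features)[0, 1])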
Building the Pipeline
My goal was to ensure 100% feature parity between my training environment and this serving pipeline.
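One way to keep that parity checkable rather than aspirational is a small regression test that pushes a saved slice of raw data through the serving pipeline and compares the result against the features produced at training time. A sketch of that idea, where the fixture files and artifact paths are assumptions:

import joblib
import pandas as pd

def test_training_serving_parity():
    # Hypothetical fixtures: raw rows and the features the training job built for them.
    raw_sample = pd.read_parquet("tests/fixtures/raw_sample.parquet")
    expected = pd.read_parquet("tests/fixtures/train_features_sample.parquet")

    pipeline = PredictionPipeline(
        "models/fraud_model.joblib",
        "models/encoders.joblib",
        "models/medians.joblib",
    )
    served = pipeline.preprocess(raw_sample)

    # Same columns, same order, same values.
    assert list(served.columns) == list(expected.columns)
    pd.testing.assert_frame_equal(
        served.reset_index(drop=True),
        expected.reset_index(drop=True),
        check_dtype=False,
    )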
Design Choice
I learned that even small differences between how data is imputed at training time and at serving time can silently skew predictions in production.
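The fix was to persist every preprocessing statistic at training time and load it verbatim in serving, rather than recomputing anything on live traffic. Roughly, the training job ends with something like the following, where train_df, categorical_cols, and numeric_cols are placeholders and the artifact names are illustrative:

import joblib
from sklearn.preprocessing import LabelEncoder

# Fit encoders and compute medians on the training data only,
# then persist them next to the model so serving never recomputes them.
encoders = {}
for col in categorical_cols:          # placeholder list of categorical features
    le = LabelEncoder().fit(train_df[col])
    train_df[col] = le.transform(train_df[col])
    encoders[col] = le

medians = train_df[numeric_cols].median().to_dict()   # placeholder list of numeric features
train_df[numeric_cols] = train_df[numeric_cols].fillna(medians)

joblib.dump(encoders, "models/encoders.joblib")
joblib.dump(medians, "models/medians.joblib")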
Technical Specs
Data: IEEE-CIS Fraud
Architecture: XGBoost v2.0
Latency: < 50ms
Core Focus: Schema Alignment
Implementation Details
- Imputation via Medians
- Label Encoding Persistence
- Temporal Engineering