Challenge
How do you build a model that can catch rare fraud events without slowing down the user experience?
Solution
A cost-sensitive XGBoost model paired with a low-latency Python inference service.
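The cost-sensitive part boils down to making the rare fraud class expensive to misclassify. A minimal sketch of that idea with XGBoost's scikit-learn API, assuming a prepared train/validation split (X_train, y_train, X_valid, y_valid are placeholders) and purely illustrative hyperparameters:

from xgboost import XGBClassifier

# Weight the rare fraud class by the imbalance ratio so that missing a
# fraudulent transaction costs more than flagging a legitimate one.
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=8,
    scale_pos_weight=pos_weight,   # cost-sensitive weighting
    eval_metric="aucpr",           # PR-AUC is more informative than accuracy under heavy imbalance
    tree_method="hist",
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)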
Project Files
src/
  pipeline.py
models/
  fraud_model.joblib
pipeline.py
import joblib
import pandas as pd


class PredictionPipeline:
    """Inference pipeline with training-time feature parity."""

    def __init__(self, model_path, encoders_path, medians_path):
        self.model = joblib.load(model_path)
        self.encoders = joblib.load(encoders_path)
        self.medians = joblib.load(medians_path)
        # Feature order must match training; assumes the model was fit on a DataFrame
        # so the feature names were recorded.
        self.feature_cols = list(self.model.feature_names_in_)

    def preprocess(self, input_df: pd.DataFrame) -> pd.DataFrame:
        df = input_df.copy()

        # 1. Feature Engineering: time-based signals
        df["hour"] = (df["TransactionDT"] // 3600) % 24

        # 2. Categorical Encoding: reuse the encoders fitted at training time,
        #    mapping unseen categories to a known fallback class
        for col, le in self.encoders.items():
            if col in df.columns:
                known = set(le.classes_)
                df[col] = df[col].apply(lambda x: x if x in known else le.classes_[0])
                df[col] = le.transform(df[col])

        # 3. Numeric Imputation: fill missing values with training medians
        for col, median in self.medians.items():
            if col in df.columns:
                df[col] = df[col].fillna(median)

        return df[self.feature_cols]
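At serving time the pipeline is loaded once and reused across requests. A brief usage sketch (the artifact paths for the encoders and medians are illustrative, not part of the project files above):

import pandas as pd

pipeline = PredictionPipeline(
    model_path="models/fraud_model.joblib",
    encoders_path="models/encoders.joblib",
    medians_path="models/medians.joblib",
)

def score_transaction(payload: dict) -> float:
    """Return the fraud probability for a single transaction payload."""
    features = pipeline.preprocess(pd.DataFrame([payload]))
    return float(pipeline.model.predict_proba(features)[0, 1])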
Building the Pipeline
My goal was to ensure 100% feature parity between my training environment and this serving pipeline.
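One way to keep that parity checkable rather than aspirational is a small regression test that pushes a saved slice of raw data through the serving pipeline and compares the result against the features produced at training time. A sketch of that idea, where the fixture files and artifact paths are assumptions:

import joblib
import pandas as pd

def test_training_serving_parity():
    # Hypothetical fixtures: raw rows and the features the training job built for them.
    raw_sample = pd.read_parquet("tests/fixtures/raw_sample.parquet")
    expected = pd.read_parquet("tests/fixtures/train_features_sample.parquet")

    pipeline = PredictionPipeline(
        "models/fraud_model.joblib",
        "models/encoders.joblib",
        "models/medians.joblib",
    )
    served = pipeline.preprocess(raw_sample)

    # Same columns, same order, same values.
    assert list(served.columns) == list(expected.columns)
    pd.testing.assert_frame_equal(
        served.reset_index(drop=True),
        expected.reset_index(drop=True),
        check_dtype=False,
    )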
Design Choice
I learned that even small differences between how data is imputed at training time and at serving time can silently skew predictions in production.
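The fix was to persist every preprocessing statistic at training time and load it verbatim in serving, rather than recomputing anything on live traffic. Roughly, the training job ends with something like the following, where train_df, categorical_cols, and numeric_cols are placeholders and the artifact names are illustrative:

import joblib
from sklearn.preprocessing import LabelEncoder

# Fit encoders and compute medians on the training data only,
# then persist them next to the model so serving never recomputes them.
encoders = {}
for col in categorical_cols:          # placeholder list of categorical features
    le = LabelEncoder().fit(train_df[col])
    train_df[col] = le.transform(train_df[col])
    encoders[col] = le

medians = train_df[numeric_cols].median().to_dict()   # placeholder list of numeric features
train_df[numeric_cols] = train_df[numeric_cols].fillna(medians)

joblib.dump(encoders, "models/encoders.joblib")
joblib.dump(medians, "models/medians.joblib")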
Technical Specs
Data: IEEE-CIS Fraud
Architecture: XGBoost v2.0
Latency: < 50ms
Core Focus: Schema Alignment
Implementation Details
- Imputation via Medians
- Label Encoding Persistence
- Temporal Engineering