Real-time fraud detection lakehouse
Streaming lakehouse on Apache Iceberg + Kafka processing 180K events/sec for a European payments processor, feeding ML fraud scores back to the transaction engine in under 200ms.
Tech stack
Debezium · Kafka (Amazon MSK) · Apache Flink · Apache Iceberg on S3 · Redis · XGBoost · FastAPI on ECS · MLflow · Airflow · PostgreSQL
Problem
A payments processor was detecting fraud with a nightly batch job; by the time it ran each morning, €2M+ in fraudulent transactions had already cleared and had to be reversed. The new SLA target: a fraud score delivered to the transaction engine within 200ms of event ingestion.
Architecture
Ingestion: Debezium CDC from core banking PostgreSQL → Kafka (MSK) → Flink consumer writing raw events to Iceberg on S3 (bronze layer). Throughput: 180,000 events/second at peak.
Feature pipeline: Flink stateful operators computing 48 real-time features per transaction (velocity checks, geo-anomaly, merchant risk score) with 30-day sliding windows materialised in Iceberg snapshots.
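As one concrete example of the velocity checks, a per-card transaction count and amount sum over a sliding window can be held in keyed state and updated per event. A sketch of the window logic in plain Python (the Flink job keeps equivalent per-key state; class and field names are illustrative):

```python
from collections import defaultdict, deque

class VelocityWindow:
    """Per-card sliding-window velocity features (count + amount sum).

    Mirrors what a keyed stateful operator would hold; here the window
    is in-process and driven by event time.
    """
    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        self.events = defaultdict(deque)  # card_id -> deque[(ts_ms, amount)]

    def update(self, card_id: str, ts_ms: int, amount: int) -> dict:
        q = self.events[card_id]
        q.append((ts_ms, amount))
        while q and q[0][0] <= ts_ms - self.window_ms:  # evict expired events
            q.popleft()
        return {"txn_count": len(q), "amount_sum": sum(a for _, a in q)}

w = VelocityWindow(window_ms=3_600_000)  # 1-hour window for illustration
w.update("c-9", 0, 1000)
print(w.update("c-9", 1_800_000, 500))   # both events still inside the window
print(w.update("c-9", 4_000_000, 200))   # first event has expired
```

The production features use the same evict-then-aggregate shape, just with 30-day windows checkpointed to Iceberg snapshots rather than an in-memory deque.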
Scoring: online feature store (Redis) populated from the Flink pipeline; XGBoost model served via FastAPI on ECS; model artefacts versioned in MLflow, promoted via a shadow-mode A/B framework before production cutover.
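The shadow-mode framework amounts to: every transaction is scored by the production (champion) model and, silently, by the candidate (challenger); only the champion's score is returned to the transaction engine, while both are logged for offline comparison before cutover. A hedged sketch with stand-in models (the real service loads versioned XGBoost artefacts from MLflow):

```python
from typing import Callable

ScoreFn = Callable[[dict], float]

def make_shadow_scorer(champion: ScoreFn, challenger: ScoreFn, log: list):
    """Score with both models; act only on the champion's output."""
    def score(features: dict) -> float:
        champ = champion(features)
        shadow = challenger(features)          # never affects the decision
        log.append({"champion": champ, "challenger": shadow})
        return champ                           # only this reaches the engine
    return score

# Stand-in models for illustration; production calls XGBoost predict_proba.
champion = lambda f: min(1.0, f["amount_sum"] / 100_000)
challenger = lambda f: min(1.0, f["txn_count"] / 50)

audit_log: list = []
scorer = make_shadow_scorer(champion, challenger, audit_log)
print(scorer({"amount_sum": 25_000, "txn_count": 40}))  # 0.25 (champion only)
```

Because the challenger's scores accumulate in the log without influencing decisions, precision/recall on live traffic can be measured before the challenger is promoted.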
Feedback loop: fraud labels from the chargeback system written back via Kafka → Iceberg gold layer → weekly model retraining Airflow DAG.
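The core step of the weekly retraining DAG is a label join: chargeback outcomes, which arrive days after the transaction, are matched back to the original feature rows; transactions with no chargeback inside the label window are treated as negatives. A minimal sketch of that join (function and field names are illustrative, not the DAG's actual task names):

```python
def build_training_set(transactions: list[dict], chargebacks: set[str]) -> list[dict]:
    """Join delayed fraud labels onto transaction feature rows.

    A transaction is labelled 1 if a chargeback was filed against it
    within the label window, else 0 (presumed legitimate).
    """
    return [{**txn, "label": int(txn["txn_id"] in chargebacks)}
            for txn in transactions]

txns = [{"txn_id": "t-1", "amount_cents": 4200},
        {"txn_id": "t-2", "amount_cents": 900}]
charged_back = {"t-2"}
print(build_training_set(txns, charged_back))
```

In the pipeline this join runs over the Iceberg gold layer, and the resulting labelled set feeds the XGBoost retraining job.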
Results
- Fraud detection latency: batch (8+ hours) → 180ms p95.
- Fraud loss rate: down 23% in the first 90 days post-launch.
- False positive rate: 0.4%, below the 0.5% SLA target.
- System handles Black Friday peaks (3× normal throughput) with no horizontal scaling changes.