Transfer Agent Platform
Rebuilding the shareholder register pipeline for a Tier-1 Luxembourgish bank: 70+ microservices, 300K+ transactions per hour, and the operational tax that came with it.
The bank's legacy Transfer Agent was missing the daily NAV cutoff often enough that operations had built a whole spreadsheet workflow around it. When a fund's net asset value publishes late, downstream pricing breaks, transfer agents on the other side of Europe miss their windows, and someone has to write an apology letter. The mandate was to replace that system without disrupting a single trading day.
I joined the team rebuilding it from scratch as an event-driven distributed system on AWS. The shape we landed on was 70+ microservices around real banking operations: subscriptions, redemptions, settlement, NAV calculation, compliance reporting. Kafka as the spine, exactly-once on the critical paths, SQS for the messier fan-outs where ordering didn't matter.
What was actually hard
Throughput wasn't the problem. Kafka and EKS will happily push 300,000 transactions per hour if you let them. The hard parts were ordering and exactly-once semantics on the subscription and redemption streams, where a duplicate event means a shareholder gets billed twice and someone in operations has to unwind it.
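The "billed twice" failure mode comes down to idempotency on the consumer side: even with broker-level exactly-once, a replayed event must not re-apply. A minimal in-memory sketch of that guard, assuming each event carries a unique ID — in production the seen-IDs would live in the service's own database, written in the same transaction as the balance update:

```python
class IdempotentApplier:
    """Applies each event at most once, keyed by event ID."""

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.balances: dict[str, float] = {}

    def apply(self, event_id: str, account: str, amount: float) -> bool:
        """Apply a balance movement; return False for a duplicate delivery."""
        if event_id in self.seen:
            return False  # redelivery: ignore, nobody gets billed twice
        self.seen.add(event_id)
        self.balances[account] = self.balances.get(account, 0.0) + amount
        return True
```

Replaying the same redemption event moves the balance once; the second delivery is acknowledged but not applied, so operations never has to unwind it.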
We learned this the hard way. The first time we ran a load test that touched production-shape data, the NAV pipeline lagged by 6 minutes and a junior engineer (me, that day) discovered our Kafka partitions were unbalanced. One partition had 40% of the traffic because of a hash collision on a frequent fund identifier. Fixing it took an afternoon. Catching it earlier with a partition-skew metric would have taken ten minutes.
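The partition-skew metric mentioned above is cheap to compute from per-partition message counts. A sketch (the 2.0 alert threshold is an illustrative choice, not the one we used):

```python
def partition_skew(counts: dict[int, int]) -> float:
    """Ratio of the busiest partition's message count to the mean count.

    1.0 means perfectly balanced; a value well above ~2.0 is worth an
    alert long before the consumer lag shows up as a missed NAV cutoff.
    """
    if not counts:
        return 0.0
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean if mean else 0.0
```

With one partition carrying 400 messages against three carrying 100 each, the skew is 400 / 175 ≈ 2.3 — visible in a ten-minute dashboard check rather than a six-minute production lag.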
Tradeoffs we paid for
Kafka was the right call for ordering, but it came with operational weight: schema registries, partition tuning, consumer-group lag dashboards, the whole apparatus. For half of our async use cases, plain SQS would have been simpler. We picked Kafka for everything inter-service for consistency, and then quietly used SQS anyway whenever the team needed something cheap and disposable.
Postgres-per-service held up well. Aurora for the heavy read paths, plain Postgres on RDS for the rest. S3 for archival and audit trails the regulator wants in cold storage for years.
What I'd flag for next time
70+ services is too many for this workload. I'd ship something close to it with 15 to 20 services if I started over. We'll get into that in a separate post. The team built the right system; we just built more of it than the domain demanded.
Throughput targets met with roughly 2x headroom. Independent deploys cut lead time from weeks to a day. Fault isolation works: a degraded service stops trading for one product line, not the platform.
Case study published in anonymized form. Specifics shared subject to NDA.