When Architecture Patterns Don't Match the Problem
Lessons from three attempts to build a distributed event processing platform
Executive Summary
Over six years, a fintech company built three successive solutions to process millions of financial transaction events for real-time customer alerts. Each solution lasted approximately two years before stakeholders demanded change. Each failed for different technical reasons, but a common organizational pattern persisted throughout: development teams remained out of alignment with stakeholder needs, and that gap was never closed.
I joined during Solution 2 and remained through Solution 3.
Solutions at a Glance
| Solution | Architecture | Why It Failed | Why It Was Abandoned |
|---|---|---|---|
| 1. Monolith | Legacy scheduler, non-containerized | Opaque, no extension points, manual scaling | Couldn't support multi-team extensibility |
| 2. Coordinator | Central coordinator + domain processors | Implementation bugs (deadlocks, scaling issues) | Deemed too expensive to refactor |
| 3. Pipes & Filters | Distributed actors, SQS, Akka | Pattern mismatch, no observability, message bloat | SLA violations, customer payouts |
The Business Context
The company operated as an intermediary layer between banks, credit unions, and end users. Rather than requiring smaller financial institutions to build their own user interfaces and complex integrations, the platform provided a rich UI experience backed by transaction processing infrastructure.
The primary feature under development was real-time transaction alerts: customers subscribe to events on their accounts, and they receive email or SMS notifications when transactions occur. The processing happened primarily after transactions had been accepted by the financial institutions.
Scale and constraints:
- Millions of financial transaction events to process
- 8+ development teams needing to plug processors and workflows into the platform
- Financial compliance requirements demanding audit trails and reliability
- Customer SLAs with real monetary penalties for failures
Solution 1: The Monolith
The first attempt was a monolithic system built around a legacy scheduler. The architecture was poorly documented and poorly understood by the teams who inherited it.
Why it failed:
- Teams could not plug their processors and workflows into the existing infrastructure
- The system's behavior was opaque, making modifications risky
- No clear extension points existed for new functionality
- Scaling required manual configuration and human intervention rather than automated elasticity
- Non-containerized infrastructure made scaling expensive and slow
After approximately two years, stakeholders demanded change. The diagnosis was correct: the monolith couldn't support multi-team extensibility or cost-effective scaling. The prescription was a move to microservices.
Solution 2: The Coordinator
The second solution introduced a microservices architecture with a central coordinator pattern:
┌────────────────────────────────────────────────────────────┐
│                    Central Coordinator                     │
│            (Queue management, event brokering)             │
└─────────┬────────────────────┬────────────────────┬────────┘
          │                    │                    │
          ▼                    ▼                    ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Domain Processor │ │ Domain Processor │ │ Domain Processor │
│     (Team A)     │ │     (Team B)     │ │     (Team C)     │
│                  │ │                  │ │                  │
│Registers metadata│ │Registers metadata│ │Registers metadata│
│   for alert UI   │ │   for alert UI   │ │   for alert UI   │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Each processor registered its metadata with the coordinator, which the alert management UI used to show customers what alerts were available.
What worked:
- Central location for audit and retry logic
- Each processor was a holistic domain processor owned by a single team
- Simple to understand: clear data flow, clear ownership
- Easy to reason about scaling at the coordinator level
Why it was abandoned:
- The implementation was poorly coded, leading to connection deadlocks
- The service wasn't designed to scale horizontally
- Queues were not used properly, creating resource exhaustion
- A refactor was deemed too expensive
The architecture's value was that it hadn't overextended itself. A simpler design leaves room to evolve: you can fix the implementation, add layers incrementally, or migrate components without wholesale replacement. Rather than investing in that path, the organization chose to start over with something more complex. That decision proved costly.
Solution 3: Pipes & Filters
The third solution adopted a distributed pipes and filters architecture. Each actor had complete autonomy: its own authentication, its own AWS SQS access, and its own scaling behavior via Akka. A single central service handled subscriptions, metadata, and workflow registration, but communication between actors was distributed via SQS rather than synchronous API calls.
┌─────────────────────────────────────────────────────────────────────────────┐
│                        Central Registration Service                         │
│                                                                             │
│           Subscriptions │ Metadata (UI) │ Workflow (Parent/Child)           │
│                                                                             │
│                         All in-memory registration                          │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                 Actors register on startup (concurrency bugs)
                                       │
                            ┌──────────┴──────────┐
                            │    Event Source     │
                            └──────────┬──────────┘
                                       │
                                      SQS
                                       │
                 ┌─────────────────────┼─────────────────────┐
                 │                     │                     │
                 ▼                     ▼                     ▼
          ┌─────────────┐       ┌─────────────┐       ┌─────────────┐
          │   Actor A   │──SQS──│   Actor B   │──SQS──│   Actor C   │
          │ (SQS+Auth)  │       │ (SQS+Auth)  │       │ (SQS+Auth)  │
          │  DynamoDB   │       │  DynamoDB   │       │  DynamoDB   │
          │ Akka scale  │       │ Akka scale  │       │ Akka scale  │
          └──────┬──────┘       └──────┬──────┘       └──────┬──────┘
                 │                     │                     │
                SQS                   SQS                   SQS
                 │                     │                     │
                 ▼                     ▼                     ▼
          ┌─────────────┐       ┌─────────────┐       ┌─────────────┐
          │   Actor D   │       │   Actor E   │       │   Actor F   │
          │ (SQS+Auth)  │       │ (SQS+Auth)  │       │ (SQS+Auth)  │
          │  DynamoDB   │       │  DynamoDB   │       │  DynamoDB   │
          └──────┬──────┘       └──────┬──────┘       └──────┬──────┘
                 │                     │                     │
                 └─────────────────────┼─────────────────────┘
                                       │
                                      SQS
                                       │
                                       ▼
                          ┌─────────────────────────┐
                          │  Notification Service   │
                          └─────────────────────────┘
Instead of using AWS EventBridge for fan-out, a custom solution was built to register parent/child processor relationships via the central registration service. Actors registered themselves with this single service on startup, which introduced concurrency bugs and created a single point of failure for subscriptions, metadata, and workflow routing.
What worked:
- Clean, well-structured code with good unit tests
- Queue-based communication throughout
- Testable locally using LocalStack
- DynamoDB for ephemeral workflow data
What failed:
| Problem | Impact |
|---|---|
| Message bloat: Actors added entire new context to payloads instead of transforming them | Enormous network costs, bandwidth exhaustion |
| No centralized control: Each actor operated independently with no coordination | Impossible to audit, conflicted with financial compliance needs |
| Custom fan-out: Built parent/child registration instead of using EventBridge | Concurrency bugs during startup, unnecessary complexity |
| Shared infrastructure: Actors ran on the same tier as UI APIs | Processing spikes caused UI unavailability |
| DLQs unused: Dead letter queues were added but never monitored or processed | Lost events in a financial system |
| No observability: No centralized monitoring or alerting | Weeks passed before critical problems were detected |
| Noisy neighbors: One tenant could consume most of an actor's processing capacity | Unfair resource allocation, SLA violations |
| SQS visibility timeout games: Constant tuning to account for processing latency | Fragile configuration, message reprocessing |
| Coupled deployments: Any change required updating every actor simultaneously | Massive deployment cost, developer burnout |
| No rollback capability: Pipes and filters provides no saga pattern | Failed workflows left partial state with no compensation |
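The message-bloat row is easiest to see in payload sizes. The following is a minimal sketch, not the production code: the event fields, the `enrich_by_appending` and `transform` helpers, and the three-actor chain are all hypothetical. It contrasts actors that append their own context to every payload with a filter-style step that emits only what the next stage needs.

```python
import json


def payload_size(message: dict) -> int:
    """Size in bytes of the message as it would be serialized onto a queue."""
    return len(json.dumps(message).encode("utf-8"))


def enrich_by_appending(message: dict, actor: str, context: dict) -> dict:
    """Anti-pattern: carry everything forward and add this actor's context."""
    grown = dict(message)
    grown[f"{actor}_context"] = context
    return grown


def transform(message: dict, keep: tuple, **derived) -> dict:
    """Filter-style step: emit only the fields downstream needs, plus derived values."""
    slim = {key: message[key] for key in keep}
    slim.update(derived)
    return slim


event = {"tenant_id": "t-1", "account_id": "a-9", "amount_cents": 12500, "memo": "x" * 200}
# Hypothetical context blob that each actor attaches to the payload.
context = {"rules": ["r"] * 50, "profile": {"segments": ["s"] * 50}}

appended = event
for actor in ("actor_a", "actor_b", "actor_c"):
    appended = enrich_by_appending(appended, actor, context)

transformed = transform(event, keep=("tenant_id", "account_id", "amount_cents"), alert="large_debit")

# The appended payload grows with every hop; the transformed one shrinks.
print(payload_size(event), payload_size(appended), payload_size(transformed))
```

Multiplied across millions of events and several hops, the difference between the second and third numbers is what shows up as network cost.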
After approximately two years, the accumulated failures led to SLA violations and significant customer payouts. Problems went undetected for weeks because there was no observability; by the time issues were discovered, the damage was done.
Why Pipes & Filters Was Wrong
The pipes and filters pattern makes specific assumptions about how data flows through a system:
| Pattern Assumption | Reality in This System |
|---|---|
| Each filter performs a stateless transformation | Actors added unrelated context; they didn't transform |
| Filters are independent and composable | Actors required access to shared concepts and integrations |
| Scaling is per-filter based on throughput | Per-actor Akka scaling made bottlenecks invisible |
| Failure handling is per-filter | No saga support meant partial failures couldn't roll back |
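For contrast with the last row, a saga pairs each step with a compensating action and runs those compensations in reverse order when a later step fails. The sketch below is a minimal illustration under assumed names: the three steps and the in-memory `state` list standing in for real side effects are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo the completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


state: List[str] = []  # stand-in for side effects in real systems


def fail() -> None:
    raise RuntimeError("notification provider unavailable")


steps: List[Step] = [
    ("reserve_alert", lambda: state.append("reserved"), lambda: state.remove("reserved")),
    ("record_audit", lambda: state.append("audited"), lambda: state.remove("audited")),
    ("send_notification", fail, lambda: None),
]

succeeded = run_saga(steps)
print(succeeded, state)
```

Because the third step fails, both earlier steps are compensated and no partial state remains, which is the guarantee the actor chain lacked.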
The pattern was selected without formal trade-off analysis. Development teams weren't allowed to see the proposal, and no documentation explained why this architecture was chosen or what trade-offs were accepted.
Financial systems require audit trails; pipes and filters distributes control. The use case wasn't transformation but enrichment and routing. Multi-tenant systems need fair resource allocation; per-actor scaling can't provide it. Fan-out complexity pointed to EventBridge, not custom registration.
The Organizational Pattern That Never Changed
Across all three solutions, the same dynamic persisted. Stakeholders demanded "change" every two years, but actual requirements were never crystallized in a way that could be validated. Each solution was a technical response to stakeholder frustration rather than a deliberate answer to clearly defined needs. Architecture decisions were made without documented rationale, and the teams building those systems weren't given visibility into the choices shaping their work.
No architecture can fix an alignment problem. The pattern persisted because the organizational issue was never addressed.
What Should Have Been Built
Separate Concerns Cleanly
┌───────────────────────────────────────┐
│          Alert Management UI          │
│     (Customer subscription mgmt)      │
└───────────────────┬───────────────────┘
                    │ reads
                    ▼
┌───────────────────────────────────────┐      ┌─────────────────────────┐
│       Alert Feature Metadata DB       │◀─────│   Versioned Migration   │
│        (Available alert types,        │      │          Tasks          │
│         subscription options)         │      │  (No code deployment)   │
└───────────────────────────────────────┘      └─────────────────────────┘
        (No coupling: UI metadata is completely
         separate from event processing)
┌───────────────────────────────────────┐
│          Transaction Events           │
│       (From financial systems)        │
└───────────────────┬───────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────────────────┐
│                      AWS EventBridge                       │
│         (Fan-out routing, no custom registration)          │
└─────────┬────────────────────┬────────────────────┬────────┘
          │                    │                    │
          ▼                    ▼                    ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Domain Processor │ │ Domain Processor │ │ Domain Processor │
│     (Team A)     │ │     (Team B)     │ │     (Team C)     │
│                  │ │                  │ │                  │
│ Just processes.  │ │ Just processes.  │ │ Just processes.  │
│ No registration. │ │ No registration. │ │ No registration. │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Alert feature metadata and subscriptions live in a versioned database, updated via separate migration tasks rather than code deployments. The UI reads what alerts are available, customers subscribe through this service, and processors query it to determine which customers to notify. The subscription service is a stable API that processors consume, not something processors register with.
EventBridge handles fan-out routing through infrastructure-as-code rules, eliminating custom parent/child registration and the concurrency bugs that came with it. Processors don't self-register on startup. When workflows span domains, event-driven choreography replaces orchestration: each domain publishes events to EventBridge when its work completes, and other domains subscribe to the events they care about.
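To illustrate routing declared as data rather than registered at startup, here is a minimal sketch that builds the arguments for boto3's EventBridge `put_rule` and `put_targets` calls. The bus name, event source, detail types, and queue ARN are hypothetical, and in practice the same definitions would more likely live in a CloudFormation or Terraform template.

```python
import json

EVENT_BUS = "transaction-events"  # hypothetical bus name


def rule_definition(team: str, detail_types: list) -> dict:
    """Arguments for events.put_rule: match the event types one team cares about."""
    pattern = {"source": ["bank.transactions"], "detail-type": detail_types}
    return {
        "Name": f"{team}-alerts",
        "EventBusName": EVENT_BUS,
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
    }


def target_definition(team: str, queue_arn: str) -> dict:
    """Arguments for events.put_targets: deliver matches to the team's queue."""
    return {
        "Rule": f"{team}-alerts",
        "EventBusName": EVENT_BUS,
        "Targets": [{"Id": f"{team}-queue", "Arn": queue_arn}],
    }


rule = rule_definition("team-a", ["TransactionPosted", "TransactionReversed"])
target = target_definition("team-a", "arn:aws:sqs:us-east-1:123456789012:team-a-events")

# With boto3 installed and credentials configured, these would be applied as:
#   events = boto3.client("events")
#   events.put_rule(**rule)
#   events.put_targets(**target)
print(json.dumps(rule, indent=2))
```

Adding a processor then means adding a rule and a target in version control; nothing in the processor's startup path talks to a registry.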
Distribute Only What Needs It
Solution 2 got the team ownership model right: each domain team owns their processor. Distribution is an optimization, not a starting point, and there was never a proven need to distribute work beyond single domain processors. Start simple, measure actual bottlenecks, then optimize.
Queuing Strategy Depends on Ordering Requirements
For domains where events can be processed in any order, SQS fair queues mitigate noisy neighbors automatically once producers set MessageGroupId to the tenant ID on each message. This would have addressed the noisy-neighbor problem in Solution 3 with almost no implementation effort.
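A minimal sketch of what that looks like at the producer, assuming a boto3-style SQS client whose `send_message` accepts `QueueUrl`, `MessageBody`, and `MessageGroupId`. The queue URL, the event shape, and the `RecordingClient` stand-in (used so the example runs without AWS access) are hypothetical.

```python
import json

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/alerts"  # hypothetical


def publish(sqs_client, event: dict) -> dict:
    """Send an event, tagging it with its tenant so the queue can balance tenants."""
    return sqs_client.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        MessageGroupId=event["tenant_id"],  # tenant ID is the fairness key
    )


class RecordingClient:
    """Stand-in for boto3.client("sqs") that records calls instead of sending."""

    def __init__(self):
        self.sent = []

    def send_message(self, **kwargs):
        self.sent.append(kwargs)
        return {"MessageId": str(len(self.sent))}


client = RecordingClient()
publish(client, {"tenant_id": "credit-union-17", "type": "TransactionPosted"})
publish(client, {"tenant_id": "bank-03", "type": "TransactionPosted"})
print([call["MessageGroupId"] for call in client.sent])
```

Consumers need no change in this sketch; the only difference from an ordinary send is the extra tenant key on each message.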
For domains requiring strict ordering, SQS FIFO throughput constraints create back-pressure during spikes. The alternative is to persist events first (DynamoDB with TenantID as partition key and timestamp as sort key) and consume at a configurable rate per tenant. This gives FIFO semantics per tenant without FIFO queue constraints, makes backlogs visible for predictive scaling, and gives processors explicit control over tenant fairness rather than fighting queue configuration.
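A minimal in-memory sketch of that persist-first approach. A dict of sorted lists stands in for a DynamoDB table keyed by tenant ID (partition key) and timestamp (sort key); the `EventStore` class, its method names, and the per-tenant rates are hypothetical.

```python
import bisect
from collections import defaultdict


class EventStore:
    """In-memory stand-in for a table with tenant ID as partition key, timestamp as sort key."""

    def __init__(self):
        self._by_tenant = defaultdict(list)  # tenant_id -> sorted [(timestamp, event)]

    def persist(self, tenant_id: str, timestamp: int, event: str) -> None:
        """Store the event in timestamp order within its tenant's partition."""
        bisect.insort(self._by_tenant[tenant_id], (timestamp, event))

    def backlog(self) -> dict:
        """Pending event count per tenant, usable as a scaling signal."""
        return {tenant: len(events) for tenant, events in self._by_tenant.items()}

    def take(self, tenant_id: str, limit: int) -> list:
        """Remove and return up to `limit` of the tenant's oldest events, in order."""
        events = self._by_tenant[tenant_id]
        batch, self._by_tenant[tenant_id] = events[:limit], events[limit:]
        return [event for _, event in batch]


def consume_round(store: EventStore, rate_per_tenant: dict, default_rate: int = 1) -> dict:
    """One polling round: each tenant gets at most its configured number of events."""
    return {
        tenant: store.take(tenant, rate_per_tenant.get(tenant, default_rate))
        for tenant in list(store.backlog())
    }


store = EventStore()
for ts in range(5):
    store.persist("noisy-tenant", ts, f"noisy-{ts}")
store.persist("quiet-tenant", 0, "quiet-0")

first_round = consume_round(store, {"noisy-tenant": 2})
print(first_round, store.backlog())
```

The quiet tenant is served in the first round despite the noisy tenant's backlog, each tenant's events come out in timestamp order, and the remaining backlog is a number that can be read directly rather than inferred from queue metrics.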
Domain-Centric Configuration and Observability
Configuration should be tied to business domain concepts, not component deployments. When support teams work with domain concepts instead of topology, component architecture can change freely underneath without breaking their mental model or their tooling.
Observability cannot be optional in financial systems. Each domain processor should audit all activity asynchronously, feeding into centralized monitoring. A problem discovered weeks after it begins has already triggered SLA payouts.
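One way to keep auditing off the processing path is to buffer records and ship them from a background worker. The sketch below is an assumption-laden illustration: the `AsyncAuditor` class, the record fields, and the list used as the central sink are hypothetical, and a production version would need a durable buffer and error handling around the sink.

```python
import queue
import threading


class AsyncAuditor:
    """Buffers audit records and ships them to a sink on a background thread."""

    def __init__(self, sink):
        self._records = queue.Queue()
        self._sink = sink
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, domain: str, action: str, event_id: str) -> None:
        """Called on the processing path; returns without waiting for the sink."""
        self._records.put({"domain": domain, "action": action, "event_id": event_id})

    def _drain(self) -> None:
        while True:
            item = self._records.get()
            try:
                if item is None:  # shutdown sentinel
                    return
                self._sink(item)
            finally:
                self._records.task_done()

    def close(self) -> None:
        """Flush pending records and stop the worker."""
        self._records.put(None)
        self._records.join()
        self._worker.join()


central_log = []  # stand-in for a centralized monitoring pipeline
auditor = AsyncAuditor(central_log.append)
auditor.record("card-alerts", "received", "evt-1")
auditor.record("card-alerts", "notified", "evt-1")
auditor.close()
print(central_log)
```

Records are keyed by business domain rather than by component, so the audit trail stays readable if the components underneath are rearranged.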
Conclusion
The cycle could have been broken at Solution 2, not because Solution 2 was good, but because it hadn't painted itself into a corner. A flawed but flexible architecture can be iteratively improved; an overextended one requires starting over.
Technical excellence matters, and Solution 3 had clean code and good tests. But technical excellence in service of the wrong pattern still fails. The architectural answers were available from the start: simple domain processors, actual measurements before distributing anything, and observability built in from day one. What wasn't available was the organizational alignment to act on them, and no architecture could substitute for that.