Processing high volumes of invoices efficiently while maintaining low latency, high availability, and business visibility is a challenge for many organizations. A customer recently consulted us on how they could implement a monitoring system to help them process and visualize large volumes of invoice status events.

This post demonstrates how to build a Business Event Monitoring System (BEMS) on AWS that handles over 86 million daily events with near real-time visibility, cross-Region controls, and automated alerts for stuck events. You might deploy this system to gain business-level insights into how events flow through your organization, or to visualize the flow of transactions in real time. Downstream services can also optionally process and respond to events originating within the system.

Business challenge

For our use case, a global enterprise wants to deploy a monitoring system for their invoice event pipeline. The pipeline processes millions of events per day, with volume projected to surge 40% within 18 months. Each invoice navigates a four-stage journey, and every event must be visible within 2 minutes. End-of-month invoice surges reach 60,000 events per minute, or up to 86 million per day. With payment terms spanning standard 30-day windows to year-long arrangements, the architecture demands zero tolerance for missing events. Finance executives require near real-time visibility through dashboards, and auditors demand comprehensive historical retrieval.

Solution overview

The architecture implements a serverless event-driven system broken into independently deployable Regional cells, as illustrated in the following diagram.

Architecture overview showing the serverless event-driven system with independently deployable Regional cells

The solution uses the following key services: Amazon API Gateway, Amazon EventBridge, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), AWS Lambda, Amazon Timestream, and Amazon QuickSight, with Amazon CloudWatch for monitoring.

Events flow through the layers of this architecture in four stages: producers publish events through API Gateway, EventBridge and SNS route and fan out the events, SQS and Lambda durably process them, and Timestream and QuickSight store and visualize the results.

Design tenets

The solution adheres to three key architectural principles:

Scaling constraints

Our solution will experience the following scaling bottlenecks with quotas sampled from the us-east-1 Region:

We can safely scale a single account to 10,000 requests per second (600,000 per minute, or 864 million per day) without increasing service quotas in the us-east-1 Region. Default quotas vary by Region, and the values can be increased by raising a support ticket. The architecture scales even further by deploying independent cells into multiple Regions or AWS accounts.

Scaling of QuickSight and Timestream depends on the computational complexity of analysis, the window of time being analyzed, and the number of users concurrently analyzing the data, which was not a scaling bottleneck in our use case.

Prerequisites

Before implementing this solution, make sure you have the following:

In the following sections, we walk through the steps for our implementation strategy.

Decide on partitioning strategies

First, you must decide how your solution will partition requests between cells. In our use case, dividing cells by Region lets us offer low-latency local processing for events while keeping each cell fully independent of the others.

Inside each cell, traffic is roughly evenly divided among the four stages of invoice processing. Our solution breaks each cell into four logical partitions, or flows, by invoice status (authorization, reconciliation, and so on). Partitioning makes it possible to fan out and scale resources independently based on traffic patterns specific to each partition.

To partition your cellular architecture, consider the volume, distribution, and access pattern of the events that will flow through each cell. You must allow independent scaling within your cells without encountering global service limits. Choose a strategy that allows each cell to be broken into 1–99 roughly equivalent partitions based on predictable attributes.
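As a minimal sketch of attribute-based partitioning (the hash function and default partition count are illustrative assumptions, not part of the original design), a stable hash of a predictable attribute such as the invoice ID keeps every event for a given invoice in the same partition:

```python
import hashlib

def assign_partition(invoice_id: str, partition_count: int = 4) -> int:
    """Map an invoice ID to a stable partition index in [0, partition_count).

    MD5 is used instead of Python's built-in hash() so the mapping is
    identical across processes, hosts, and restarts.
    """
    digest = hashlib.md5(invoice_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count
```

Because the mapping is deterministic, all four status events of an invoice flow through one partition, and the partition count can differ per cell without any shared state.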

Implement the event routing layer

The event routing layer combines EventBridge for intelligent routing with Amazon SNS for efficient fanout.

EventBridge custom event bus configuration

Create a custom event bus with rules to route events based on your partitioning strategy:
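As a hedged boto3 sketch (the bus name, rule naming convention, and the `status` detail field are illustrative assumptions), a rule per invoice status can route matching events to that partition's SNS topic:

```python
import json

def build_status_pattern(status: str) -> dict:
    """Event pattern matching events whose detail.status equals one partition."""
    return {"detail": {"status": [status]}}

def create_routing_rule(bus_name: str, status: str, topic_arn: str) -> None:
    """Create a rule on a custom event bus routing one status to its SNS topic."""
    import boto3  # deferred so build_status_pattern stays importable without the SDK
    events = boto3.client("events")
    events.put_rule(
        Name=f"route-{status}",
        EventBusName=bus_name,
        EventPattern=json.dumps(build_status_pattern(status)),
        State="ENABLED",
    )
    events.put_targets(
        Rule=f"route-{status}",
        EventBusName=bus_name,
        Targets=[{"Id": f"{status}-topic", "Arn": topic_arn}],
    )
```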

Define a standard event schema for common metadata, including:
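The exact fields depend on your producers; as an illustrative assumption, a minimal check for common metadata could look like the following:

```python
# Assumed common metadata fields; adjust to your producers' contract.
REQUIRED_FIELDS = {"eventId", "invoiceId", "status", "timestamp", "source"}

def missing_metadata(event: dict) -> list:
    """Return the required metadata fields absent from an event, sorted."""
    return sorted(REQUIRED_FIELDS - set(event))

sample = {
    "eventId": "evt-0001",
    "invoiceId": "INV-2024-0001",
    "status": "authorization",
    "timestamp": "2025-01-31T12:00:00Z",
    "source": "billing",
}
assert missing_metadata(sample) == []
```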

SNS topic structure

Create SNS topics for each logical partition:

Implement message filtering at the subscription level for granular control over which messages each subscribing consumer sees. Each topic can fan out to a wide variety of downstream consumers waiting for events that match the EventBridge custom event bus rules. Delivery failures are retried automatically up to a configurable limit.
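A subscription-level filter policy might be attached as follows (a sketch, assuming an SQS subscriber and a `status` message attribute; names are illustrative):

```python
import json

def build_filter_policy(statuses) -> dict:
    """Filter policy so the subscriber receives only the listed statuses."""
    return {"status": list(statuses)}

def subscribe_queue(topic_arn: str, queue_arn: str, statuses) -> None:
    """Subscribe an SQS queue to an SNS topic with message filtering."""
    import boto3  # deferred so build_filter_policy stays importable without the SDK
    sns = boto3.client("sns")
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        Attributes={"FilterPolicy": json.dumps(build_filter_policy(statuses))},
    )
```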

Event routing layer diagram showing EventBridge routing to SNS topics for fanout across partitions

Implement event producers

Configure API Gateway to receive events from existing systems with built-in throttling and error handling.

API design

Create a RESTful API with resources and a path for each logical partition inside your cell:

Implement request validation using a JSON schema for each endpoint. Use API Gateway request transformations to standardize incoming data and provide well-formatted error messages and response codes to clients in the event of failures.
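As one possible sketch (the schema fields and model name are assumptions; API Gateway models use JSON Schema draft-04), a request model can be registered like this:

```python
import json

# Illustrative schema for an invoice event; adjust fields to your contract.
INVOICE_EVENT_SCHEMA = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "InvoiceEvent",
    "type": "object",
    "required": ["invoiceId", "status", "timestamp"],
    "properties": {
        "invoiceId": {"type": "string"},
        "status": {"type": "string"},
        "timestamp": {"type": "string"},
    },
}

def register_model(rest_api_id: str) -> None:
    """Register the schema as an API Gateway model for request validation."""
    import boto3  # deferred so the schema stays importable without the SDK
    apigw = boto3.client("apigateway")
    apigw.create_model(
        restApiId=rest_api_id,
        name="InvoiceEvent",
        contentType="application/json",
        schema=json.dumps(INVOICE_EVENT_SCHEMA),
    )
```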

Security and throttling

Implement API keys and usage plans for client authentication and rate limiting to prevent a noisy upstream client from bringing down the system. Configure AWS WAF rules to protect against common attacks on API endpoints. Set up throttling at both the account level and the method level to handle burst traffic (60,000 events per minute).
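A usage plan sized for the 60,000 events-per-minute target could be sketched as follows (the plan name and burst headroom are assumptions for illustration):

```python
def events_per_minute_to_rps(events_per_minute: int) -> float:
    """Convert a per-minute throughput target into a per-second rate limit."""
    return events_per_minute / 60

def create_producer_usage_plan(api_id: str, stage: str):
    """Create a usage plan throttled at ~1,000 requests/second sustained."""
    import boto3  # deferred so the helper above stays importable without the SDK
    apigw = boto3.client("apigateway")
    return apigw.create_usage_plan(
        name="invoice-producers",
        throttle={
            "rateLimit": events_per_minute_to_rps(60_000),  # 60,000/min = 1,000/s
            "burstLimit": 2_000,  # assumed headroom for short spikes
        },
        apiStages=[{"apiId": api_id, "stage": stage}],
    )
```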

Monitoring and logging

Our partitioned event producer strategy allows your solution to independently monitor each event type by:

Event producers diagram showing API Gateway configuration with throttling and monitoring

Implement event consumers

Implement durable processing using SQS queues with dead-letter queues (DLQs) attached and serverless Lambda consumers.

SQS queue structure

Create SQS queues in front of each consumer to decouple message delivery and processing, in our case one per partition:

Set up DLQs for each main queue:
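One way to sketch the queue-plus-DLQ pair for a partition (queue naming and the receive-count threshold are illustrative assumptions):

```python
import json

def build_redrive_policy(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """After max_receive_count failed receives, SQS moves the message to the DLQ."""
    return {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": str(max_receive_count)}

def create_partition_queues(partition: str) -> str:
    """Create a DLQ plus main queue for one partition; return the main queue URL."""
    import boto3  # deferred so build_redrive_policy stays importable without the SDK
    sqs = boto3.client("sqs")
    dlq_url = sqs.create_queue(QueueName=f"{partition}-events-dlq")["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    return sqs.create_queue(
        QueueName=f"{partition}-events",
        Attributes={"RedrivePolicy": json.dumps(build_redrive_policy(dlq_arn))},
    )["QueueUrl"]
```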

Event consumers diagram showing SQS FIFO queues with dead-letter queues for each invoice processing partition

Lambda consumers

Attach Lambda functions to each queue for custom processing of events:

Functions handle necessary transformations, call downstream services, and load events into Timestream. Verify that concurrency limits and provisioned concurrency are sized to cover peak and sustained load, respectively.
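A consumer handler might look like the following sketch (the `process` body is a placeholder; it assumes the event source mapping has ReportBatchItemFailures enabled so only failed messages are retried and eventually redriven to the DLQ):

```python
import json

def process(body: dict) -> None:
    """Placeholder for transformation, downstream calls, and the Timestream load."""
    if "invoiceId" not in body:
        raise ValueError("event missing invoiceId")

def handler(event, context):
    """SQS-triggered consumer reporting per-message failures instead of
    failing the whole batch."""
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```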

Error handling and retry logic

Develop a custom retry mechanism for business logic failures and exponential backoff for transient errors. Create an operations dashboard with alerts and metrics for monitoring stuck events to redrive.
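A minimal sketch of the transient-error path (the error type, base delay, and cap are illustrative assumptions), using full-jitter exponential backoff:

```python
import random
import time

class TransientError(Exception):
    """Raised for errors worth retrying (throttling, timeouts, and so on)."""

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retry(fn, max_attempts: int = 5, base: float = 0.5):
    """Invoke fn, retrying transient errors with backoff; re-raise when exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt, base))
```

Business logic failures (for example, an invoice referencing an unknown account) should not be retried this way; route them to the DLQ and surface them on the operations dashboard for redrive.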

Build the business intelligence dashboard

Use Timestream and QuickSight to create real-time financial event dashboards.

Timestream data model

When modeling real-time invoice events in Timestream, using multi-measure records provides optimal efficiency by designating invoice ID as a dimension while storing processing timestamps, amounts, and status as measures within single records. This approach creates a cohesive time series view of each invoice's lifecycle while minimizing data fragmentation.

Multi-measure modeling is preferable because it significantly reduces storage requirements and query complexity, enabling more efficient time-based analytics. The resulting performance improvements are particularly valuable for dashboards that need to visualize invoice processing metrics in real time, because they can retrieve complete invoice histories with fewer operations and lower latency, ultimately delivering a more responsive monitoring solution.

Real-time data ingestion

Create a Lambda function to push metrics to Timestream:
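As a hedged sketch of such a function (database, table, and field names are assumptions), each record uses the multi-measure model described earlier: invoice ID as a dimension, with status and amount as measures:

```python
import json

def build_record(event_body: dict) -> dict:
    """One multi-measure record: invoice ID as dimension, status and amount as measures."""
    return {
        "Dimensions": [{"Name": "invoiceId", "Value": event_body["invoiceId"]}],
        "MeasureName": "invoice_event",
        "MeasureValueType": "MULTI",
        "MeasureValues": [
            {"Name": "status", "Value": event_body["status"], "Type": "VARCHAR"},
            {"Name": "amount", "Value": str(event_body["amount"]), "Type": "DOUBLE"},
        ],
        "Time": str(event_body["timestampMillis"]),
        "TimeUnit": "MILLISECONDS",
    }

def handler(event, context):
    """Write each SQS-delivered invoice event to Timestream."""
    import boto3  # deferred so build_record stays importable without the SDK
    records = [build_record(json.loads(r["body"])) for r in event.get("Records", [])]
    if records:
        boto3.client("timestream-write").write_records(
            DatabaseName="invoice_events",       # assumed name
            TableName="invoice_status_events",   # assumed name
            Records=records,
        )
```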

QuickSight dashboard design

Develop interactive QuickSight dashboards for different user personas:

QuickSight dashboard showing real-time financial event monitoring with executive, operations, and finance views

Don't forget to implement ML-powered anomaly detection for identifying unusual patterns in your events.

Monitoring and alerting

Set up CloudWatch alarms for key metrics:
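For example, a stuck-events alarm on each partition queue might use the SQS ApproximateAgeOfOldestMessage metric against the 2-minute visibility target (alarm naming and evaluation settings are illustrative assumptions):

```python
def build_stuck_events_alarm(queue_name: str, alarm_topic_arn: str,
                             max_age_seconds: int = 120) -> dict:
    """put_metric_alarm kwargs: fire when the oldest message in a partition
    queue exceeds the 2-minute visibility target."""
    return {
        "AlarmName": f"{queue_name}-stuck-events",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": max_age_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [alarm_topic_arn],
    }

def create_stuck_events_alarm(queue_name: str, alarm_topic_arn: str) -> None:
    import boto3  # deferred so the builder above stays importable without the SDK
    boto3.client("cloudwatch").put_metric_alarm(
        **build_stuck_events_alarm(queue_name, alarm_topic_arn)
    )
```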

Configure SNS topics for alerting finance teams and operations:

Develop custom CloudWatch dashboards for system-wide monitoring:

Security

Add permissions in a least privilege manner for each required service listed in the architecture:

Encrypt data at rest and in transit:

Set up AWS Config rules to maintain compliance with internal policies:

Use AWS CloudTrail for comprehensive auditing:

Conclusion

The serverless event-driven architecture presented in this post enables processing of over 86 million daily invoice events while maintaining near real-time visibility, strict compliance with internal policies, cellular scaling capabilities, and minimal operational overhead. This solution provides a robust foundation for modernizing financial operations, enabling organizations to handle the complexities of high-volume invoice processing with confidence and agility.

For further enhancements, consider exploring:


Grey Newell holds an M.S.E. in Distributed Systems and worked as a Senior Solutions Architect at Amazon Web Services.