
Event Sourcing Beyond the Tutorial: The Real Problems You Hit When You Build for Production


Everyone teaches you what event sourcing is. Nobody teaches you what happens next.


You’ve read the blog posts. You’ve watched the conference talks. You understand event sourcing - the append-only log, the aggregate, the projection. It clicks. It makes sense. You decide to build it.

And then, somewhere between “I get it” and “this is running in production,” everything gets hard in ways nobody warned you about.

I’ve spent a long time in this space - first as someone who hit every one of these walls, then as someone building tooling to help other engineers avoid them. What follows isn’t a theoretical overview of event sourcing. It’s a map of the problems the tutorials skip. If you’ve already read three introductory posts and want to know what comes after, this is the one I wish had existed when I first tried to build this properly.


The Tutorial Trap: Why 95% of Event Sourcing Content Stops Before It Gets Useful

The BankAccount aggregate is everywhere. Deposit. Withdraw. Balance. The event log fills up. The projection rebuilds. It works. The tutorial ends.

The problem is that the tutorial ends exactly where the real engineering begins.

Most event sourcing content is cargo-culted from the same handful of blog posts, which were themselves inspired by a handful of conference talks, which traced back to a small number of practitioners who actually built production systems. Somewhere in that chain of derivation, all the hard parts got dropped. What survived is the concept - elegant and correct - without any of the implementation detail that would make it actionable.

This isn’t a critique of the people who wrote those posts. The why of event sourcing is genuinely compelling, and explaining it clearly is valuable work. But the result is an information landscape that’s long on motivation and brutally short on implementation. You can find a thousand explanations of what an aggregate is. You will struggle to find a single thorough guide to what happens when you need to evolve your event schema in a system that’s been running for two years.

The tutorial BankAccount is a lie of omission. It has one aggregate type, one projection, no schema evolution, no concurrency concerns, no cross-aggregate queries, and a stream with maybe twelve events in it. Every production system you will ever build violates at least four of those constraints before the first deployment.

The concept is solid. The gap is between understanding it and building it. That gap is what this post is about.


Aggregate Design in the Real World: Getting Boundaries Wrong Is Expensive

The hardest decision in event sourcing isn’t a technology choice. It’s a boundary choice.

An aggregate boundary is a consistency boundary - the scope within which you guarantee that operations are atomic and ordered. Everything inside an aggregate can be treated as a single consistent unit. Everything outside it cannot.

The textbook guidance is deceptively simple: model your aggregates around your transactional consistency requirements. In practice, this means you need to understand your domain’s transactional requirements before you write a single line of code, which is often the thing you’re still figuring out when you’re writing the first line of code.

Get it wrong, and you pay a tax that compounds over time.

If your aggregate is too large - say, an Order that includes every line item, every status change, every payment event, and every fulfilment action - you end up with enormous, contended streams. Every concurrent write to the same Order risks an optimistic concurrency conflict. Your event stream becomes a kitchen sink where unrelated concerns are tangled together because they happened to the same entity.

If your aggregate is too small - say, splitting OrderPayment and OrderFulfilment into separate aggregates because “they’re different concerns” - you lose consistency across operations that should be atomic. Now you need two-phase commits, sagas, or process managers just to do things that should be simple. Welcome to distributed systems, whether you planned for it or not.

The rule-of-thumb that “one command modifies one aggregate” is useful but breaks down in real domains. Consider an inventory reservation: when an order is placed, you want to decrement available stock atomically with recording the order. That’s two aggregates. There’s no elegant solution here - you have sagas, eventual consistency, or you reconsider your aggregate design. All three are legitimate; none are free.

The practical heuristic I keep coming back to: model aggregates around invariants, not entities. What is the smallest scope within which all the business rules must hold simultaneously? That’s your aggregate boundary. An Order aggregate isn’t “the order entity” - it’s “the scope within which the business rules about order state are enforced.”
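To make "invariants, not entities" concrete, here is a minimal sketch - in TypeScript rather than the C# used later, and with illustrative event and command names that aren't from any particular framework - of an aggregate whose boundary is exactly one invariant: an order can only be paid while it is open.

```typescript
// Hypothetical Order aggregate. The boundary exists to enforce one rule:
// an order can be paid at most once, and only while it is still open.
type OrderEvent =
  | { type: "OrderPlaced"; orderId: string }
  | { type: "OrderPaid"; orderId: string }
  | { type: "OrderCancelled"; orderId: string };

type OrderState = { status: "none" | "open" | "paid" | "cancelled" };

// Pure reducer: rebuild current state from history.
function evolve(state: OrderState, event: OrderEvent): OrderState {
  switch (event.type) {
    case "OrderPlaced":    return { status: "open" };
    case "OrderPaid":      return { status: "paid" };
    case "OrderCancelled": return { status: "cancelled" };
  }
}

// Command handler: the invariant is checked here, inside the boundary.
function pay(history: OrderEvent[], orderId: string): OrderEvent {
  const state = history.reduce(evolve, { status: "none" } as OrderState);
  if (state.status !== "open") {
    throw new Error(`cannot pay an order in status "${state.status}"`);
  }
  return { type: "OrderPaid", orderId };
}
```

The point is that `pay` can enforce the rule only because everything it needs lives in one stream; if payment were a separate aggregate, this check would require cross-aggregate coordination.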

The reason this matters so much in event sourcing specifically: your event log is history. You can’t meaningfully refactor aggregate boundaries after the fact. If you decide six months in that Payment should be its own aggregate rather than part of Order, you’re either rewriting history (which violates the whole point) or building a migration layer that reconciles two incompatible event structures. Neither is fun. Getting boundaries right early is worth the design investment.


Projections That Actually Work: Replay, Scale, and the Catch-Up Problem

Projections sound simple. Reduce a stream of events into a view. A SELECT that never touches the write side.

In a hello-world system with one aggregate and one projection, they are simple. In a system that’s been running for eighteen months with forty-seven event types and a projection that serves a read-heavy API endpoint, the simplicity gets complicated.

Subscription management is the first thing that bites you. A projection needs to subscribe to an event stream and stay current. When does it start? From the beginning of time, or from a checkpoint? If the projection process crashes and restarts, does it know where it was? How do you store that checkpoint? What happens if the checkpoint state and the projection state become inconsistent - if the projection’s read model is ahead or behind where it thinks it is?
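A minimal TypeScript sketch of a checkpointed subscription, using an in-memory log and a running total as the read model - both simplifications. In production the checkpoint and the read model must be persisted together (ideally in the same transaction), or they can drift apart after a crash:

```typescript
// Each event carries a store-assigned, monotonically increasing position.
type LoggedEvent = { position: number; type: string; amount: number };

class Projection {
  total = 0;      // the read model
  checkpoint = 0; // last position successfully applied

  // Catch up from wherever we left off; safe to call again after a restart,
  // because anything at or below the checkpoint is skipped.
  catchUp(log: LoggedEvent[]): void {
    for (const e of log) {
      if (e.position <= this.checkpoint) continue; // already applied
      this.total += e.amount;
      this.checkpoint = e.position; // persist with the read model in real life
    }
  }
}
```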

These aren’t hypotheticals. They’re regular operational failures in any system that treats subscriptions as an afterthought.

The catch-up problem is related. When you deploy a new projection - or when you need to rebuild an existing one from scratch because you introduced a bug or changed the schema - you need to replay events from the beginning of the stream. For a system with a year of history and a high event volume, that replay can take minutes or hours. During that time, your projection is out of date. What do you do with queries that hit it? Do you serve stale data? Block? Have two versions of the projection live simultaneously?

There’s no universally correct answer. But you need an answer, and you need it before you’re staring at a lagging projection at 2am.

Idempotency is the third issue. During replay, events will be processed multiple times. Your projection reducer needs to be idempotent - processing the same event twice must produce the same result as processing it once. For simple additive projections, this is easy. For projections that maintain counters, aggregated totals, or derived state, idempotency is a constraint you have to design for deliberately.
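One deliberate design for this: have the projection remember which events it has already applied, so a replayed event is a no-op. A TypeScript sketch with a hypothetical signup counter:

```typescript
// Assumes every event carries a unique id; names are illustrative.
type CountedEvent = { id: string; kind: "signup" };

class SignupCounter {
  count = 0;
  private applied = new Set<string>();

  apply(e: CountedEvent): void {
    if (this.applied.has(e.id)) return; // replaying the same event is a no-op
    this.applied.add(e.id);
    if (e.kind === "signup") this.count += 1;
  }
}
```

Tracking every id doesn't scale forever; in practice a per-stream checkpoint position usually achieves the same guarantee more cheaply.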

Out-of-order delivery is rare in well-implemented event sourcing because events in a single stream have a guaranteed sequence. The problem appears when your projection spans multiple streams - when you’re building a cross-aggregate read model. Now you have to reason about relative ordering across streams that have independent sequence numbers. Wall clock time is not a reliable tie-breaker. This is where you start to appreciate why having a single source of ordered, globally-sequenced events is not just a convenience but a correctness requirement.
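The difference a global position makes can be shown in a few lines. This sketch assumes the store stamps every event with a monotonically increasing `globalPosition` across all streams - the property that makes the merge deterministic where wall-clock timestamps (ties, clock skew) are not:

```typescript
type AnyStreamEvent = { stream: string; globalPosition: number; type: string };

// Merge two independent streams into one totally ordered sequence
// for a cross-aggregate read model.
function mergeByGlobalPosition(
  a: AnyStreamEvent[],
  b: AnyStreamEvent[]
): AnyStreamEvent[] {
  return [...a, ...b].sort((x, y) => x.globalPosition - y.globalPosition);
}
```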


Schema Evolution Without Rewriting History: Versioning Events the Right Way

Your events are immutable. Your understanding of the domain isn’t.

At some point - guaranteed - you will need to add a field to an event, rename a field, remove a field, or split one event type into two. And your existing event log contains thousands of instances of the old version.

There are three main approaches, each with genuine tradeoffs.

Upcasting is the most common. You write a transformation function that converts an old event version to the current version at read time. The raw storage stays unchanged; your application never sees the old format. This is elegant and keeps your domain code clean.

public static class OrderPlacedUpcaster
{
    // V1 had a single "CustomerName" string.
    // V2 splits it into FirstName and LastName.
    public static OrderPlacedV2 Upcast(OrderPlacedV1 v1)
    {
        var nameParts = v1.CustomerName.Split(' ', 2);
        return new OrderPlacedV2(
            OrderId: v1.OrderId,
            FirstName: nameParts[0],
            LastName: nameParts.Length > 1 ? nameParts[1] : string.Empty,
            PlacedAt: v1.PlacedAt
        );
    }
}

The limitation: upcasters accumulate. If you’ve gone through five versions of an event, you’re chaining five transformations. Each one adds cognitive overhead and a potential transformation bug. You also need to test the full upcast chain, not just the latest version.
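What an accumulating chain looks like in practice - a TypeScript sketch with three hypothetical versions, where every read runs the full chain from whatever version was stored:

```typescript
// Illustrative shapes, not a real schema.
type V1 = { version: 1; customerName: string };
type V2 = { version: 2; firstName: string; lastName: string };
type V3 = { version: 3; firstName: string; lastName: string; locale: string };

const v1ToV2 = (e: V1): V2 => {
  const [first, ...rest] = e.customerName.split(" ");
  return { version: 2, firstName: first, lastName: rest.join(" ") };
};

// The V2 -> V3 step has to invent a default for old events ("en-GB" is an
// assumed placeholder here) - the kind of decision that hides in upcasters.
const v2ToV3 = (e: V2): V3 => ({ ...e, version: 3, locale: "en-GB" });

// Reads always run the whole chain, so the whole chain needs testing.
function upcast(e: V1 | V2 | V3): V3 {
  if (e.version === 1) e = v1ToV2(e);
  if (e.version === 2) e = v2ToV3(e);
  return e;
}
```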

Versioned event types are more explicit. OrderPlacedV1 and OrderPlacedV2 are distinct types in your domain. Your event handlers dispatch on the full type name. New code handles V2; old code handles V1. You can deprecate V1 once all readers have been updated.

This is clear but verbose. You end up with a proliferation of event types and switch statements that need updating every time you add a version.

Copy-and-transform is the nuclear option - you run a migration that reads your existing event log, transforms the events, and writes a new log with the updated schema. This is the only approach that actually gets rid of old event versions.

The cost: it’s a one-way door. You’re rewriting history, which is philosophically uncomfortable and operationally risky. You need to migrate all downstream systems simultaneously. If anything goes wrong, rollback is painful. Save this for genuine breaking changes where accumulated upcaster complexity has become unmanageable.

Most teams should start with upcasting. It handles the majority of schema changes gracefully. The important thing is to have a strategy before you need it - because the alternative is discovering, in production, that your event deserialiser is throwing exceptions because a field you removed six months ago is still referenced in your projections.


The Infrastructure You Didn’t Sign Up For: Storage, Ordering, and Exactly-Once Delivery

This is the section that explains why DIY event sourcing kills teams.

To build event sourcing correctly, your storage layer needs to provide some specific guarantees. They’re not exotic, but they’re precise:

  • Sequential, ordered writes within a stream - events within an aggregate’s stream must be written and read in order, always
  • Optimistic concurrency - you must be able to say “append this event only if the stream is currently at version N”, and have that fail atomically if it isn’t
  • Stream isolation - a write to stream A must never affect the read consistency of stream B
  • Durable appends - once an event is written and acknowledged, it must survive process crashes, network partitions, and reboots
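The optimistic-concurrency requirement in particular has a precise shape worth seeing as code. This in-memory TypeScript sketch shows the contract only; a real store has to enforce the same check atomically under concurrent writers across processes:

```typescript
class ConcurrencyError extends Error {}

// "Append this event only if the stream is currently at version N."
class EventStream<E> {
  private events: E[] = [];

  get version(): number { return this.events.length; }

  append(expectedVersion: number, ...newEvents: E[]): number {
    if (this.version !== expectedVersion) {
      // The failed append must leave the stream untouched.
      throw new ConcurrencyError(
        `expected version ${expectedVersion}, stream is at ${this.version}`
      );
    }
    this.events.push(...newEvents);
    return this.version;
  }
}
```

Two writers that both read version N race to append; exactly one succeeds, and the loser retries against the new state - that retry loop is where your command handler re-checks its invariants.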

These requirements rule out naive implementations faster than you’d expect.

“Just use DynamoDB” is a common first instinct. DynamoDB is fast, scalable, and serverless. It also has no native concept of ordered appends within a partition beyond sorting by a sort key, optimistic concurrency only if you assemble it yourself from conditional writes, and eventually consistent reads by default unless you explicitly request strongly consistent ones. You can implement all of these guarantees yourself - many teams have - but you’re writing a small distributed systems library before you’ve written a line of business logic.

“Just use Kafka” is a different kind of wrong. Kafka is an excellent distributed log for event streaming, but it’s designed for consumer offset management and high-throughput fan-out, not for the fine-grained, per-aggregate consistency guarantees that event sourcing requires. You can use Kafka as an output from an event sourcing system - broadcasting events to downstream consumers is exactly what it’s good at - but it’s not the right primary storage layer.

The ordering problem deserves special mention. Wall clock time is not a reliable event sequence. Two events written in the same millisecond, or written to different nodes in a distributed system, cannot be reliably ordered by timestamp. You need a monotonic sequence number per stream, ideally with a globally-ordered event position for cross-stream queries. Implementing that correctly under load, without creating a serialisation bottleneck, is a non-trivial distributed systems problem.

I’ve talked to teams who spent three to six months building reliable event storage infrastructure before they shipped a single domain feature. That’s not a failure of engineering skill - the problems are genuinely hard. It’s a failure of the ecosystem to provide a better answer.


What It Looks Like When the Plumbing Just Works

I want to be concrete about what the other side looks like, because I think it’s achievable and worth aiming for.

When the infrastructure is solved, working with event sourcing is genuinely pleasurable. You write a reducer - a pure function that takes the current state and an event and returns the new state. You push it. The platform handles storage, concurrency, and sequence. You write a projection - a function that transforms a stream of events into a read model. You push it. The platform subscribes it, keeps it live, and handles catch-up when it falls behind.
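Concretely, the developer-facing surface can shrink to a pair of pure functions. This TypeScript sketch is illustrative - the names are hypothetical, not an actual platform API:

```typescript
type CartEvent =
  | { type: "ItemAdded"; price: number }
  | { type: "ItemRemoved"; price: number };

// Reducer: current state + event -> new state. No I/O, no infrastructure.
const reduceCart = (total: number, e: CartEvent): number =>
  e.type === "ItemAdded" ? total + e.price : total - e.price;

// Projection: a stream of events -> a read model, by folding the reducer.
const cartTotal = (events: CartEvent[]): number =>
  events.reduce(reduceCart, 0);
```

Everything else - storage, sequencing, checkpoints, catch-up - lives below this line.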

When you need to change your event schema, you write an upcaster, register it, and move on. The platform runs it transparently at read time. Your domain code only ever sees the current version.

Cross-aggregate projections that would require three joins and a batch job in a relational system are just another subscription. The read model is whatever you need it to be, not whatever your write model happens to look like.

None of this is magic. It’s a well-specified set of infrastructure concerns, solved once, run reliably. The reason it feels good is that every unit of engineering effort goes into the domain - the thing that’s actually different about your product - rather than into the plumbing underneath it.


The Part Where I’m Honest About the Landscape

There are existing tools worth knowing about.

EventStoreDB (recently rebranded to Kurrent) is the incumbent. It’s a serious, purpose-built event store with a long track record. If you’re evaluating options, it deserves respect. The honest constraint is that you still own the infrastructure - you’re running a database, which means you’re handling upgrades, backups, clustering, and monitoring. Their cloud offering exists, but it’s a managed version of a complex system, not a simplified one.


Marten is a .NET library that gives you event sourcing on top of PostgreSQL. It’s clever and genuinely useful if you’re already on Postgres. Marten has been around for over 10 years, has a large contributor community, and JasperFx offer commercial support plans for production users.


DIY on DynamoDB, Kafka, or a relational database: as above. A legitimate choice - as long as you go in with a realistic view of the scope.

None of these options are wrong. They’re tradeoffs, and the right choice depends on how much infrastructure you want to own and operate long-term.


What You Should Take Away From This

Event sourcing is worth the investment. I genuinely believe that - not as a marketing position, but as an engineering conviction. The audit trail, the temporal query capability, the decoupling of write models from read models, the ability to replay history with new projections: these are real advantages that pay off over the lifetime of a system.

But the investment has to go into the right thing. Spending six months building reliable event storage infrastructure is not the investment. That’s a tax you pay to get to the starting line.

Every problem in this post is solvable. The patterns exist - aggregate boundary design, projection subscriptions, schema evolution through upcasting, correct storage guarantees. The solutions exist. The question is whether you want to own all of that infrastructure, or whether you want to spend that time on the business logic that actually differentiates your product.

That’s a genuine decision, and it’s worth making consciously rather than discovering the answer after six months of debugging event ordering bugs at 2am.


If you’d rather push your reducers and projections and let someone else own the infrastructure underneath them, that’s exactly what I built Hapnd for. It’s in beta right now - no credit card, no sales call, no commitment. If it sounds like the right fit, sign up at hapnd.dev and be part of what we’re building.