
Taming backend complexity: lessons from a decade of TDD

Published in Technology

Author

Esko Luontola
Senior Software Architect

Esko Luontola is an avid programmer and usability enthusiast. Esko has been using TDD in every project since 2007 and is also the author of the University of Helsinki’s TDD MOOC (tdd.mooc.fi). He is continually pushing for technical excellence and better software development practices.

Article

17 September 2025 · 20 min read

In the first part of his article series, Esko Luontola, Senior Software Architect at Nitor, explored user interface testing. In this second article, he dives into the often-overlooked challenge of testing database access code – where stateful data, complex queries, and side effects make testing significantly trickier.

You can find the first part here.

Testing backend systems isn’t just about writing tests – it’s about designing software that’s inherently testable. And it goes to the core of well-designed systems. As Michael Feathers observes, there’s deep synergy between testability and good design. When writing tests becomes painful, it’s often a sign of deeper structural issues: tight coupling, hidden complexity, or unclear responsibilities.

From my experience, testability and good design form a powerful feedback loop. When you design for testability, you naturally produce cleaner, more maintainable code – and as your design improves, testing becomes easier in turn.

In this article, I’ll share insights from years of working on enterprise applications and experimenting with persistence strategies. From test-driven development to document databases and event sourcing, these lessons aim to help you design backends that are easier to test and evolve.

My TDD journey

I started using Test-Driven Development (TDD) in 2007. For the first year, I was mainly focused on learning how to name tests. I had read Dan North's article introducing Behavior-Driven Development (BDD), and my main takeaways were: "Test method names should be sentences" and "What’s the next most important thing the system doesn’t do?" 

My goal was to write test names that were that descriptive. It took me about 6-12 months and seven projects to become proficient at naming tests.
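For example, sentence-like test names could look roughly like this (a hypothetical JUnit 5 sketch; the class and behaviours are made up for illustration):

import org.junit.jupiter.api.Test;

class RoomReservationTest {

    @Test
    void a_vacant_room_can_be_reserved() {
        // ...
    }

    @Test
    void reserving_an_already_occupied_room_fails() {
        // ...
    }

    @Test
    void a_reservation_cannot_end_before_it_starts() {
        // ...
    }
}

Each name reads as a statement about what the system does, so the test report also reads like a specification.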

For the next few years, I practised TDD regularly, and everything was flowers and sunshine. But after about five years of using it, I started noticing some recurring difficulties.

Don't hide the pain

I like to think of TDD in terms of both direct and indirect effects. The direct effects come from simply following the rules of TDD mechanically:

  • ✅ Guarantees code coverage: With TDD, you only add the minimal code to make each test pass. By definition, this means you won't add untested code.

  • 🤕 Amplifies the pain caused by bad code: TDD requires writing code iteratively, instead of writing it all in one go. We define bad code as code that is hard to change – so if you’ve written something difficult to modify, you’ll discover it quickly the moment you need to update it.

Amplifying the pain is a good thing if you respond to it correctly. The following indirect effects require some skill and awareness of your development process:

  • ⛑ Enables changing code without breaking it: Code coverage helps you notice when something breaks. But to evolve the current design into your desired new design, you'll need the skill to identify a viable refactoring path – and the discipline to follow it carefully in small steps.

  • 💎 Improves code quality: When code is difficult to change, you need to analyse why your design resists change. Instead of ignoring the pain, you need to come up with a better, more adaptable design.

Observed pain points in enterprise applications

After using TDD for five years, I started noticing one pain point repeatedly: an application's core domain classes tend to get messier over time. Those classes are used all over the code, and they accumulate multiple responsibilities: business logic, persistence, APIs, read operations, write operations. These use cases can mostly be served by the same domain classes, but sometimes they apply conflicting design pressures.

A typical example: an API which must expose all fields of a domain class, except for one or two that must be kept private. Or you may need to expose some fields as read-only via the API. To solve this in an object-oriented language, you will either need to write new classes for each use case and copy data between them, introducing a lot of duplication, or you will reuse the same classes across multiple use cases, increasing complexity.
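As a hypothetical sketch of the first option (the class and field names are made up), a separate API model duplicates most of the domain class just to hide one field:

// Domain class: used everywhere, including fields the API must never expose.
class User {
    Long id;
    String name;
    String email;
    String passwordHash; // must stay private to the backend
}

// Separate API model: duplicates the other fields and omits the private one.
record UserResponse(Long id, String name, String email) {

    static UserResponse from(User user) {
        return new UserResponse(user.id, user.name, user.email);
    }
}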

🤕 Data attracts multiple responsibilities 
🤕 One model to serve multiple needs

Another recurring challenge: testing code that accesses the database is hard.

First of all, database access slows tests down. If running all the tests takes longer than 5 seconds, your working memory begins to lose track of what you were working on. Then you either risk losing your train of thought, or you start running only a subset of tests and miss regressions until much later, leading to frustrating debugging sessions.

Database tests also require attention to setting up and cleaning up the test data. If the database schema has deep dependency chains, you might need to populate many tables just to test one thing. Maintaining a shared set of test data can be burdensome – and it couples tests together. To avoid test pollution and random failures, you must clean up after each test. That again takes time, so sometimes developers may resort to hacks to avoid resetting the entire database between tests.

Object-relational mapping (ORM) is another eternal dilemma. If you do the mapping manually, you end up writing a lot of code to convert between object graphs and relational databases. On the other hand, if you use an ORM tool to automate it, you inherit its complexity. The code may work in subtly different ways with and without the ORM tool, so the tool may creep in as a required dependency for all tests, slowing them down and increasing setup friction.

🤕 Slow tests
🤕 Test data setup
🤕 Test cleanup
🤕 Object-relational mapping

In-memory vs. "real" database

I've seen multiple projects where the team used a different database engine during development than in production. For example, the production environment uses PostgreSQL, but the tests are run using HSQLDB. HSQLDB has features to make it compatible with PostgreSQL, so in theory, they should behave the same, right?

Why do teams create such an abomination, stitched together like Frankenstein's monster? One reason could be that this way, you don't need to install and run a real database server to run the tests. They may also think that tests would run faster against a database that isn't backed by files on disk.

Back in the day, the licensing costs of commercial databases might have prevented you from installing them on every developer's computer. The rise of open-source databases (e.g. PostgreSQL and MySQL) alleviated that problem, but today we see a return to the past with cloud platforms. Some databases now exist only as managed cloud services, making it impossible to install them locally.

Subtle differences between databases

Developing an app using a different database than the one in production introduces the risk of subtle differences. Even if the SQL syntax matches perfectly, there can still be differences in transactions and other non-functional requirements. So you will need to run all tests against both database engines anyway, which only adds complexity to the build and test setup.

If the database queries need to run against multiple database engines, you will either need to restrict yourself to using the common subset of their features or make multiple variants of the queries. One example of such a database-specific feature is PostgreSQL's range types. With PostgreSQL, you can write a query like this:

WHERE range @> tsrange(:start, :end)

Whereas with generic SQL, you would write:

WHERE (start IS NULL OR start <= :start)
AND (end IS NULL   OR end >= :end)

Not only are range types easier to use, but they also perform better, because you can create a GiST index on them. (With separate start and end columns, it's practically impossible to create indexes that perform well in every situation.) With range types, you can even create constraints to ensure that ranges do not overlap.
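As a sketch of what this looks like in PostgreSQL (the table and column names are hypothetical, and the period column is assumed to be a tsrange):

-- needed so that = and && can be combined in the same GiST index/constraint
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- a GiST index makes queries like "period @> tsrange(:start, :end)" fast
CREATE INDEX reservation_period_idx ON reservation USING gist (period);

-- an exclusion constraint guarantees that reservations for the same room never overlap
ALTER TABLE reservation
    ADD CONSTRAINT reservation_no_overlap
    EXCLUDE USING gist (room_id WITH =, period WITH &&);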

Avoiding database-specific features in the name of cross-database compatibility is simply not worth it: you would be missing out on a lot of benefits.

In-memory databases are not always faster

What about the claim that an in-memory database is always faster than a full-fledged database? I've measured this in real projects, and it’s simply not true.

For example, since HSQLDB runs in the same process as the Java application, the class-loading overhead of the database slows down every test run measurably. Even HSQLDB's startup overhead alone makes it slower than a PostgreSQL server that is already running in a separate process.

I also have doubts about other performance claims. And even if disk access were a real bottleneck, you could still disable fsync or use a RAM disk to make it faster while still using the same database engine as production.

Is post-test cleanup, then, any easier with in-memory databases? Not really. Even though no data remains on disk after running the test suite, you still need to ensure isolation between tests. Regardless of the database, each test must clean up after itself, typically by dropping and recreating the database schema. That again takes time, so the advantage here is negligible.

What about the goal of making it easier to run the tests? Nowadays, you can achieve the same goal with containers. You can define the database in a Compose file and start it with docker compose up -d db. Another method is to use Testcontainers. With containers, there’s no need to install the database server directly on your laptop. Plus, you can run exactly the same version of the database locally as you do in production, improving testing confidence.
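As a rough sketch, a Testcontainers-based setup could look like this (assuming the Testcontainers and PostgreSQL JDBC dependencies are on the test classpath; the image version is illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import org.testcontainers.containers.PostgreSQLContainer;

class DatabaseTestSupport {

    // one container shared by the whole test suite, started lazily;
    // use the same image version here as in production
    private static final PostgreSQLContainer<?> DB = new PostgreSQLContainer<>("postgres:16");

    static Connection openConnection() throws Exception {
        if (!DB.isRunning()) {
            DB.start();
        }
        return DriverManager.getConnection(DB.getJdbcUrl(), DB.getUsername(), DB.getPassword());
    }
}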

Database engines: conclusions

After investigating the use of different database engines in tests vs. in production over several years, here are my conclusions:

✅ Run tests out-of-the-box
❌ Restricted to compatible SQL subset, or
❌ Forced to duplicate implementations for each DB
❌ Slower test performance
➡️ Use Docker: docker compose up -d db

Looking back at the original pain points, it's debatable whether this strategy has actually helped with test cleanup. The problem of slow tests, on the other hand, may have gotten worse.

🤕 Data attracts multiple responsibilities 
🤕 One model to serve multiple needs
🤕🤕 Slow tests - WORSE?
🤕 Test data setup
🤕/💊 Test cleanup - IMPROVED?
🤕 Object-relational mapping

DIY fake database

We can replace the database in unit tests with a fake implementation. To do this, you implement the same interface as the data access object (DAO) or repository, but use a hashmap instead of a database:


import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class FakeThingDao implements ThingDao {

    private final Map<Long, Thing> thingsById = new HashMap<>();
    private final AtomicLong idSeq = new AtomicLong(0);

    @Override
    public void save(Thing thing) {
        if (thing.id == null) {
            thing.id = idSeq.incrementAndGet();
        }
        // store a defensive copy, so that later changes to the original object
        // will not leak into the "database"
        Thing persisted = thing.copy();
        thingsById.put(persisted.id, persisted);
    }

    @Override
    public Thing getById(Long id) {
        Thing persisted = thingsById.get(id);
        if (persisted == null) {
            throw new NotFoundException(id);
        }
        // return a copy, so that the caller cannot modify the stored object
        return persisted.copy();
    }
}

This kind of code is mostly very generic, making it easy to reuse across multiple DAOs. However, domain-specific queries (i.e. anything beyond get-by-id and list-all queries) must be implemented for each DAO. To ensure the fake DAO behaves exactly like the real DAO, the fake DAO should be tested using contract tests.
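A contract test can be a shared base class of test cases that every implementation must pass. A minimal sketch, reusing the fake DAO above (the ThingDao interface and the Thing.name field are assumed for illustration):

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

abstract class ThingDaoContract {

    // the fake implementation returns the in-memory DAO;
    // the real implementation returns a DAO backed by a test database
    protected abstract ThingDao createDao();

    @Test
    void saved_things_can_be_found_by_id() {
        ThingDao dao = createDao();
        Thing thing = new Thing();
        thing.name = "example";

        dao.save(thing);

        assertEquals("example", dao.getById(thing.id).name);
    }

    @Test
    void looking_up_a_missing_id_fails() {
        ThingDao dao = createDao();

        assertThrows(NotFoundException.class, () -> dao.getById(42L));
    }
}

class FakeThingDaoTest extends ThingDaoContract {
    @Override
    protected ThingDao createDao() {
        return new FakeThingDao();
    }
}

The DAO backed by the real database gets a similar subclass, so both implementations are forced to pass exactly the same tests.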

Tests using a fake database are super fast to run. They also don't require any test cleanup since they use plain old objects that the garbage collector will clean up automatically. You also don’t need to significantly redesign your application, so they are easy to get started with. However, fake databases are a leaky abstraction.

A real-world example: hotel reservations

I once worked on a hotel reservation system where the relational database had a long dependency chain via foreign keys: Hotels have rooms. Rooms can be occupied or vacant depending on the date ("room timeline"). Additionally, hotels are grouped together into hotel chains:

room timeline --> room --> hotel --> hotel chain

As I was implementing the logic for room timelines, which tracked the dates a room was occupied or vacant, I noticed something interesting:

Room timelines didn't need to know anything about the room besides the room's ID. With the fake database implementation, the room timeline tests could simply use random room IDs without actually creating any rooms. But because the relational database had foreign keys, it was not possible to use the same shortcut. Before I could add an entry to the room timeline, I had to first add a room to the database. Before adding a room, I had to add a hotel. Before adding a hotel, I had to add a hotel chain.

Foreign keys exposed the leaky abstraction. The fake database and the real database behaved differently in subtle but important ways. 

As a result, tests passing with the fake database didn’t necessarily guarantee correctness against the real one. Fakes and mocks are a crutch that can reduce the pain of some dependencies, but they don’t eliminate the dependency itself.

In this case, I decided to simplify testing room timelines against the real database by temporarily disabling the foreign key only for this test:

ALTER TABLE room_timelines DROP CONSTRAINT IF EXISTS room_timelines_room_id_fk;

This kind of ad-hoc database schema change requires the test suite to recreate the schema between tests. It slightly reduces the testing confidence, but at least a failed foreign key check will be clearly visible when it happens. Hopefully, an end-to-end test will notice if the foreign key constraint is violated in any situation.

Fake databases: conclusions

To summarise, hashmap-based fake databases are very fast and have some nice benefits regarding code reuse, but they also introduce complexity and leaky abstractions.

✅ Fast
✅ In-memory DAOs are mostly generic code
❌ …except for duplicated custom queries
❌ Writing SQL feels even more demeaning
❌ Contract tests add complexity
❌ Leaky abstraction

This strategy solves slow tests and test cleanup. But since the fake in-memory DAOs must be compatible with the real relational database DAOs, some pains inevitably leak through. Test data setup needs to be done using the DAO's API, so it's very similar to setting up test data in the real database. Though the fake DAOs can avoid ORM problems, the real DAOs are still the same as before.

🤕 Data attracts multiple responsibilities 
🤕 One model to serve multiple needs
💊 Slow tests - SOLVED
🤕/💊 Test data setup
💊 Test cleanup - SOLVED
🤕/💊 Object-relational mapping

Document database

What does object-relational mapping code look like in real life? Consider the following data class from a real production system:

Object-relational mapping code

This Facility class has 17 fields, some of which contain more fields. I think it's a fairly typical example of a deeply hierarchical core domain concept in an enterprise application. What do you think the corresponding FacilityDao class looks like?

The FacilityDao class, over 700 lines long.

The DAO is over 700 lines long! It's that long despite supporting only a few operations: insert, update, get-by-id, and a search based on four searchable fields. I've highlighted in color which parts of it deal with creating/reading/updating the database. Let's take a closer look at just one piece of the code:

A piece of the FacilityDao code: the complexity introduced by 1-to-N relationships.

Here you see the complexity introduced by 1-to-N relationships. 

For example, if the facility's pricing list changes, the DAO will first call deletePricing() and then insertPricing(). Since individual pricing rows don't have their own identity, they are always saved together with the top-level entity (also known as the aggregate root).

Another common concern in enterprise applications is history tracking. In this example, whenever the facility's status changes, the change is recorded in a separate table via facilityHistoryRepository.updateStatusHistory().

Whenever you see such big domain concepts stored in a dozen tables in a relational database, ask yourself whether those dozen tables could instead be one document. With a document database, the FacilityDao could easily shrink to around 50 lines of code. You wouldn't need to serialise each field separately, but just store the entire object as-is. Most of the remaining DAO code would then handle history tracking and search. (Schema verification and version migration would need to be done in the application code. That’s the price you pay for simplicity.)
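In rough outline, a document-style DAO could look like this (a sketch assuming a table with an id column and a jsonb data column, a long id field on Facility, and Jackson for serialisation):

import java.sql.Connection;
import com.fasterxml.jackson.databind.ObjectMapper;

class FacilityDao {

    private final Connection connection;
    private final ObjectMapper objectMapper = new ObjectMapper();

    FacilityDao(Connection connection) {
        this.connection = connection;
    }

    public void save(Facility facility) throws Exception {
        // store the whole aggregate as one JSON document,
        // instead of mapping each field to its own column or table
        String json = objectMapper.writeValueAsString(facility);
        try (var st = connection.prepareStatement(
                "INSERT INTO facility (id, data) VALUES (?, ?::jsonb) " +
                        "ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data")) {
            st.setLong(1, facility.id);
            st.setString(2, json);
            st.executeUpdate();
        }
    }

    public Facility getById(long id) throws Exception {
        try (var st = connection.prepareStatement("SELECT data FROM facility WHERE id = ?")) {
            st.setLong(1, id);
            try (var rs = st.executeQuery()) {
                if (!rs.next()) {
                    throw new NotFoundException(id);
                }
                return objectMapper.readValue(rs.getString(1), Facility.class);
            }
        }
    }
}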

You don't need to switch away from a relational database to use a document database. Most relational databases support a JSON data type. With PostgreSQL, you can even create indexes over individual fields inside the jsonb data type. There is also the option of denormalising parts of the JSON document into relational fields within the same database row.

Benefits of document databases

Using a document database avoids the ORM problem completely. It also helps test data setup thanks to relaxed data consistency requirements: you can create partial documents with just the fields your test cares about. You also won't need to worry about transitive dependencies because of foreign keys.

🤕 Data attracts multiple responsibilities 
🤕 One model to serve multiple needs
🤕 Slow tests
💊 Test data setup - SOLVED
🤕 Test cleanup
💊 Object-relational mapping - SOLVED

Document databases are also easy to replace with a hashmap-based fake database in tests, which gives you even more benefits: faster tests and no test cleanup needed.

That improvement eliminates four of the original pain points:

🤕 Data attracts multiple responsibilities 
🤕 One model to serve multiple needs
💊 Slow tests - SOLVED
💊 Test data setup - SOLVED
💊 Test cleanup - SOLVED
💊 Object-relational mapping - SOLVED

Document databases: conclusions

To summarise: document databases can greatly simplify the data persistence layer in enterprise applications. The trade-off is that schema verification and data migration must be handled in application code. The public API of DAOs stays the same regardless of whether it's backed by a relational or document database – for better and worse. 

But could we do better by changing the paradigm entirely?

✅ Simpler DAOs
✅ Better match for many applications
✅ Mix and match relations and documents
⚠️ Schema verification and data migration in application code
😐 Basically same paradigm as before

Event sourcing

Event sourcing is a persistence pattern where, instead of storing and updating the current state of your application, you store all the domain events that led to that state. To calculate the current state, you simply replay all the events in order.

This technique has been used successfully for thousands of years in accounting, banking, law, and many other domains. 

For example, your bank account is event sourced: transactions are only ever appended to your bank account, never removed. The current balance of your bank account can be calculated by summing all transactions since the account was created.
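As a minimal sketch of the same idea in code (the event types and the Account class are made up), the current balance is a pure function of the event history:

import java.math.BigDecimal;
import java.util.List;

sealed interface AccountEvent permits MoneyDeposited, MoneyWithdrawn {}
record MoneyDeposited(BigDecimal amount) implements AccountEvent {}
record MoneyWithdrawn(BigDecimal amount) implements AccountEvent {}

class Account {

    // the current state is calculated by replaying all past events in order
    static BigDecimal balance(List<AccountEvent> events) {
        BigDecimal balance = BigDecimal.ZERO;
        for (AccountEvent event : events) {
            if (event instanceof MoneyDeposited e) {
                balance = balance.add(e.amount());
            } else if (event instanceof MoneyWithdrawn e) {
                balance = balance.subtract(e.amount());
            }
        }
        return balance;
    }
}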

Architecture overview

Applications using event sourcing look like the diagram below. I will not go into details of each component, but the key thing to point out is that you can create as many projections of events into models as you like. Typically, you would have a write model for validating commands, and at least one read model to serve queries.

You can create different models optimised for different use cases, and decide for each what would be the best place to store the model (e.g. in memory or in a database).

Architecture overview, event sourcing

This separation of concerns improves maintainability, because code that serves one use case can be decoupled from the code serving another use case. Events are the master data, and in my experience, they rarely change. You'll sometimes get new events or add fields to existing events, but far less frequently than you would change the tables of a relational database. (For more details on the evolution of event schemas, see Versioning in an Event Sourced System.)

Why event sourcing works

Events are probably this stable because they describe what happens in the business domain, rather than how the application works. Projections and commands, on the other hand, are made to serve the needs of the application, so they will change when your application's user interface changes. You may create new projections for displaying data in new ways, or new command types to let users do more things, but they’ll usually reuse already existing event types.

Thanks to event sourcing, each piece of code only needs to serve one use case. Many use cases share the same events, but because events are stable, the use cases remain decoupled from one another. Data won't attract multiple responsibilities when you can easily create a new projection focused on just one responsibility. This keeps each piece simple.

Testing event-sourced systems is pleasant. Code interacts with events, but it doesn't care where the events come from. Events don't need to be stored in a database to use them – just represent them as simple lists. Projections, command handlers and query handlers can be pure functions, which are easy to unit test. I typically end up with one namespace of common test events, which I reuse in 80% of tests, with only small variations.
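For example, a projection test can be as simple as feeding a list of events into a pure function and asserting on the result. A sketch reusing the bank account events from the earlier example:

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.math.BigDecimal;
import java.util.List;
import org.junit.jupiter.api.Test;

class AccountBalanceTest {

    @Test
    void balance_is_the_sum_of_deposits_minus_withdrawals() {
        // events are plain values - no database or framework is needed to use them
        List<AccountEvent> events = List.of(
                new MoneyDeposited(new BigDecimal("100.00")),
                new MoneyWithdrawn(new BigDecimal("30.00")));

        assertEquals(new BigDecimal("70.00"), Account.balance(events));
    }
}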

You’ll only need minimal code coupled to the database: a single DAO for appending events to an event log (a database table) and reading them. That's only a couple of hundred lines of code, and you only need to write it once, because it's a fully generic DAO and not tied to any use case or domain concept.
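The public interface of such a generic event store DAO can be roughly this small (a sketch; the names, the Event type, and the optimistic concurrency details are illustrative):

import java.util.List;
import java.util.UUID;

// the only database-coupled component: a generic, append-only event log
public interface EventStore {

    // append new events to a stream; expectedVersion implements optimistic
    // concurrency control, failing the append if another writer got there first
    void append(UUID streamId, long expectedVersion, List<Event> newEvents);

    // read back all events of one stream, in the order they were appended
    List<Event> readStream(UUID streamId);

    // read all events of all streams, e.g. for rebuilding read model projections
    List<Event> readAll(long sinceGlobalPosition);
}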

Pain points solved

Event sourcing effectively solves all the original pain points:

💊 Data attracts multiple responsibilities - SOLVED
💊 One model to serve multiple needs - SOLVED
💊 Slow tests - SOLVED
💊 Test data setup - SOLVED
💊 Test cleanup - SOLVED
💊 Object-relational mapping - SOLVED

Additional benefits and trade-offs

Event sourcing also brings new possibilities. Built-in history tracking can be useful for analytics and auditing. If something goes wrong, it's always possible to go back in time to see where and when it happened, instead of trying to reverse engineer previous states from the current state and log files. 

It also makes bi-temporal systems (i.e. separating when something happened from when it was recorded) manageable: you can easily view the data as "what the reality was at time T" vs. "what we knew at time T". You can choose the optimal storage per use case: you can scale out databases or keep everything in memory for extreme performance. It also enables real-time replication to external systems.

Even then, it's not a free lunch. Event sourcing is different from CRUD, so there is a learning curve. You may need to implement more things yourself, though the necessary infrastructure is so simple that you shouldn't need a framework.

Also, not everything should be event sourced. A good example is personally identifiable information (PII) and privacy laws. Because events are immutable, it can be hard to remove data from old events. By default, I store PII in a CRUD model, referencing it by pseudonymized IDs. Another technique is to encrypt PII data in events with a private key, and then discard the key to “forget” the data. Yet another way to solve it is to organise events into streams (e.g. one per user) and support deleting entire streams when needed.
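A sketch of that key-discarding technique (sometimes called crypto-shredding), assuming a hypothetical per-user key store and standard AES-GCM from the JDK:

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

class PiiCrypto {

    // hypothetical store with one encryption key per user;
    // deleting a user's key makes their PII in old events unreadable
    private final UserKeyStore keys;

    PiiCrypto(UserKeyStore keys) {
        this.keys = keys;
    }

    byte[] encryptForUser(String userId, String plaintext) throws Exception {
        SecretKey key = keys.getOrCreate(userId);
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        // store the IV as a prefix of the ciphertext, so it can be decrypted later
        byte[] result = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, result, 0, iv.length);
        System.arraycopy(ciphertext, 0, result, iv.length, ciphertext.length);
        return result;
    }

    interface UserKeyStore {
        SecretKey getOrCreate(String userId);

        void delete(String userId); // "forget" the user's PII everywhere at once
    }
}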

Process managers

One area of event sourcing I’d like to explore further is process managers: components that listen for events and issue commands in response.  

They can do simple stuff, like sending email messages, or more complex tasks, like coordinating business processes (e.g. shipping and refunds in e-commerce). 

I've implemented process managers as projections that build a “to-do list”, but I’d like to gain more experience with other implementations and use cases.

Final thoughts on event sourcing

Of all the approaches we’ve covered, I find event sourcing the most promising. 

I have built two production systems using it, plus a few smaller projects. Working in an event-sourced system is simply more pleasant than working in a CRUD-based system. 

The individual parts are simpler, but the code is distributed differently, so there is a learning curve for new developers. It also requires careful planning around domain events, PII storage, and eventual consistency.

✅ Decoupled use cases
✅ Unit testable
✅ History tracking
✅ Opens interesting architectural possibilities

⚠️ Learning curve: it's different from CRUD
⚠️ Needs more DIY
⚠️ Storing PII needs planning
🤔 Process managers

Summary

TDD tends to teach developers a lot about design. 

You can learn the basics of TDD in 100-200 hours, but good design is a lifetime journey. It's important not to hide the pain, but to reflect on it and seek designs that are less painful – testable and decoupled.

This article covered insights from over a decade of experience about experimenting with persistence mechanisms in enterprise applications: 

  • Using a different database in test than in production is clearly a bad idea. The other ideas are more contextual: 

  • Faking the database can ease some pain points but introduces others. 

  • Document databases are a better fit for many applications than normalised relational models. 

  • Event sourcing can simplify many complex systems and open up new architectural possibilities. 

Hopefully, these lessons will accelerate your learning – and save you a few years of trial and error.
