
Technical coaching experience report: Part 1

Published under Technology

Author

Esko Luontola

Esko Luontola is an avid programmer and usability enthusiast. Esko has been using TDD in every project since 2007 and is also the author of the University of Helsinki’s TDD MOOC (tdd.mooc.fi). He is continually pushing for technical excellence and better software development practices.

Article

20 August 2024 · 12 min read

During the past year, Esko Luontola has been helping one of our customers improve their developer productivity. This is the story of what kind of technical coaching he did and how it improved the team’s results.

I joined the development team of about 20 people at a multinational company. They had already recognized their deficiencies in test automation and wanted me to help the team improve their technical practices.

The starting point

After joining the team, I started developing features along with the others, while familiarizing myself with their current development processes.

I found that there was lots of bureaucracy centered around JIRA tickets and pull requests (PR). Each PR had to have a unique JIRA ticket, and they used a Confluence page to list all implemented JIRA tickets per application per release. Once per week the product owners would have a meeting where they discussed the JIRA tickets and their dependencies on other systems, to decide which of them could be included in the next QA or production release. If some dependencies were not ready to be deployed, some PRs would need to be reverted before release.

The release process relied heavily on manual regression testing. There were multiple dedicated testers who would go through their testing checklists weekly for each application in the QA environment. This was in addition to the effort put into testing new features. There were also automated browser tests, but they were flaky and took a long time to run. There was a separate Jenkins instance which ran the browser tests daily, but the developers were not actively looking at it, so the tests were often broken. The developers used GitHub Actions for running unit tests on every PR, but the browser tests were not included in that.

What makes teams perform well?

DevOps Research and Assessment (DORA) has studied the capabilities that drive software delivery and operations performance, publishing their findings in the annual Accelerate State of DevOps Report. The report lists many cultural and other aspects, but particularly famous are the four key metrics of software delivery performance:

  • Deployment frequency: How frequently is code deployed to production, to the end users? The top performers deploy multiple times per day.

  • Change lead time: How long does it take for a code change to go from being committed to running in production? For the top performers it’s less than one day.

  • Change failure rate: How frequently does a deployment cause a failure that requires immediate intervention? For the top performers under 5% of deployments fail.

  • Failed deployment recovery time: How long does it take to recover from a failed deployment? For the top performers it takes less than one hour.

(The 2023 edition replaced “time to restore service” with “failed deployment recovery time” to better measure just the deployment activity, instead of mixing in events outside the team’s influence, such as an earthquake disrupting the data center.)

[Image: DORA clusters organizations into elite, high, medium and low performers based on these four metrics. Source: Accelerate State of DevOps Report 2023]

These are among the few software metrics that are backed by research, and they are worth measuring and optimizing. The first two productivity metrics are balanced by the latter two stability metrics – you need to go fast without breaking things. That’s why I wanted to start collecting them from the beginning of my involvement, to see what effect the process changes would have.

I wrote a script to collect the DORA metrics based on the commit and deployment history. For the most active project, deployment frequency was weekly, median change lead time 11 days, change failure rate 15%, and failed deployment recovery time 30 minutes. According to DORA’s cluster analysis, that’s medium performance level.
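
As an illustration, the core of such a script can be quite small. The sketch below computes the four metrics from a list of deployment records; the data model is made up for the example, and the actual script derived the equivalent information from the Git history and the deployment logs.

```typescript
// Hypothetical data model: the real script derived the equivalent records
// from the Git history and the deployment logs.
interface Deployment {
  deployedAt: Date;
  failed: boolean;       // did the deployment require immediate intervention?
  recoveredAt?: Date;    // when was a failed deployment recovered?
  commitTimes: Date[];   // timestamps of the commits included in this deployment
}

const median = (xs: number[]): number => {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const MS_PER_MINUTE = 60 * 1000;

function doraMetrics(deployments: Deployment[], periodDays: number) {
  // Change lead time: from committing a change to having it run in production.
  const leadTimesDays = deployments.flatMap((d) =>
    d.commitTimes.map((c) => (d.deployedAt.getTime() - c.getTime()) / MS_PER_DAY));

  const failed = deployments.filter((d) => d.failed);
  const recoveryTimesMinutes = failed
    .filter((d) => d.recoveredAt)
    .map((d) => (d.recoveredAt!.getTime() - d.deployedAt.getTime()) / MS_PER_MINUTE);

  return {
    deploymentsPerWeek: (deployments.length / periodDays) * 7,
    medianChangeLeadTimeDays: median(leadTimesDays),
    changeFailureRate: failed.length / deployments.length,
    medianRecoveryTimeMinutes: recoveryTimesMinutes.length ? median(recoveryTimesMinutes) : 0,
  };
}
```

The hard part is not the arithmetic but the data collection: deciding what counts as a deployment and a failure, and extracting that reliably from the tools the team already uses.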

There were also production freezes during holidays and marketing campaigns, which would bump the median change lead time to 20-30 days. For less active projects, the long-term median change lead time was much longer, around 20-30 days, with spikes of up to a couple of months.

The pilot project

We chose one of the applications maintained by the team as a pilot project for improving test automation. It was a critical application responsible for half of the company’s income, but it also wasn’t the biggest codebase, so it seemed like a good proving ground for new development techniques. 

The application had been implemented using TypeScript, React, Redux, TanStack Query and Next.js on the frontend, and Java and Spring Boot on the backend. The backend had quite a few JUnit tests, though the quality varied. The frontend had very few tests, mainly for some small helper functions, and no tests at all for UI components. There were about 20 browser tests, but they were fragile, slow and often broken. The browser tests were not run as part of the automated build, so the developers rarely looked at them.

We split into a subteam of 3 people to focus on this application. At the start of the pilot, we had about 3 weeks that we could dedicate to improving the test coverage, which was very helpful in building the foundation.

First we went through the existing browser tests to discover the most important user flows through the system. The role of end-to-end tests should be to check that the pieces of the system are working together, whereas unit tests should be used to cover all the edge cases of individual components. Due to the cost of end-to-end tests – they are slow and require regular maintenance – their number should be limited.

We came up with two happy path scenarios that would cover the application’s every type of data and external service. We wrote them down as browser tests, and started running them as part of the automated build. It’s important to run the end-to-end tests as part of the automated build, before merging a pull request, so that the developers will fix them immediately and put in effort to make them faster and more stable.
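
To give an idea of what such a test looks like, here is a minimal happy-path sketch using Playwright. Playwright, the URL and the selectors are assumptions for the sake of the example; the project’s actual tooling and flows were its own.

```typescript
import { test, expect } from "@playwright/test";

// Hypothetical happy path: search for a product and place an order.
// The real tests exercised the application's own flows and external services.
test("ordering a product end to end", async ({ page }) => {
  await page.goto("https://app.dev.example.com");

  await page.getByPlaceholder("Search products").fill("coffee");
  await page.getByRole("button", { name: "Search" }).click();
  await expect(page.getByRole("link", { name: "Colombian Coffee" })).toBeVisible();

  await page.getByRole("link", { name: "Colombian Coffee" }).click();
  await page.getByRole("button", { name: "Add to cart" }).click();
  await page.getByRole("button", { name: "Checkout" }).click();

  // An end-to-end test only checks that the pieces work together;
  // the edge cases belong to unit tests.
  await expect(page.getByText("Order confirmed")).toBeVisible();
});
```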

Next, we listed all of the application’s features and wrote them down in a markdown file, to keep track of the parts which needed unit tests. Then, we started writing unit tests for individual React UI components, starting from the most important features and ticking them off the list as we progressed.
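
For example, a component test could look roughly like the following sketch. It assumes React Testing Library, Vitest and a hypothetical ShoppingCart component; the project’s actual components and test setup naturally differed.

```tsx
import "@testing-library/jest-dom/vitest";
import { render, screen } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
import { describe, expect, it, vi } from "vitest";
import { ShoppingCart } from "./ShoppingCart"; // hypothetical component

describe("ShoppingCart", () => {
  it("shows the total price of the items", () => {
    render(<ShoppingCart items={[{ name: "Coffee", price: 4.5 }, { name: "Tea", price: 3.0 }]} />);

    expect(screen.getByText("Total: 7.50 €")).toBeInTheDocument();
  });

  it("notifies the parent when an item is removed", async () => {
    const onRemove = vi.fn();
    render(<ShoppingCart items={[{ name: "Coffee", price: 4.5 }]} onRemove={onRemove} />);

    await userEvent.click(screen.getByRole("button", { name: "Remove Coffee" }));

    expect(onRemove).toHaveBeenCalledWith("Coffee");
  });
});
```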

After the three weeks of improving test automation, we moved to normal feature development using the new practices. If a new feature required changing a component which didn’t yet have unit tests, we would first improve its test coverage to match our new standards, and then do the change. It might’ve taken a day to cover the legacy code with tests and refactor it to be testable, but after that making the feature changes was very quick.

Working as a true team

We worked together in mob programming style: “All the brilliant minds working together on the same thing, at the same time, in the same space, and at the same computer,” as Woody Zuill describes it. It’s the best method I know of for spreading knowledge within a team. I knew how to make maintainable tests, but knew nothing about the problem domain. Another person had been in the project since the beginning, knew the problem domain and why things had been done the way they had been, but knew very little about testing.

The mob approach combines the knowledge and skills of all the participants, breaking down silos and removing blockers. The knowledge transfer is further facilitated by strong-style pairing, which means that “for an idea to go from your head into the computer, it MUST go through someone else's hands.” 

After three months of working this way, the other team members had become reasonably competent at writing tests, and I rarely needed to get involved.

Working in a mob is especially useful when no one person knows everything. During the first few weeks of test automation, our way of writing tests evolved rapidly, while discovering ways to work around the application’s design. And when adding new features, it eliminated communication overheads and avoided merge conflicts. 

Unfortunately, not every stakeholder was part of the team – the SAP developers were a separate team, and we only corresponded with them during meetings. If they had been part of the mob, we could have avoided much miscommunication regarding a feature, and done in two days what eventually took two weeks.

We worked both remotely over video call, and in person. At the office, we would find a meeting room and reserve it for the whole day – the typical open office layouts are detrimental to collaboration. In an open office, the voices of unrelated people cause distractions, driving people to use noise cancelling headphones. In a dedicated team room, all the voices are related to the task at hand, so they improve focus instead of being a distraction. At companies where all teams do mob programming daily, there are dedicated mobbing setups, for example 1-3 big 80” 4K TVs and dual keyboards, so that everyone can see the screen easily and the driver position can be passed over quickly.

More frequent deployments

After the initial investment into test automation, we were able to get rid of manual regression testing. We could then deploy to production at any time. There was no need for quick fix processes, because even the normal process could get a change to production in under an hour, and we didn’t need to think about whether the main branch is deployable or not. If the automated tests passed, it could be deployed.

Doing frequent deployments, multiple times per day, requires a mindset shift from “N features per deployment” to “N deployments per feature”. We need the ability to deploy features to production before they are complete. This requires the use of feature flags, which make the new feature visible in the development environment, but hide it in production.

In its basic form, a feature flag could just hide the link to a new page. In more complex situations, for example when we made big changes to a component, we would make a copy of the whole component. We would add the word “doomed” to the old component’s name, so that we could find and delete it easily after the feature is finished.
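
In its simplest form, such a flag can be a constant derived from the environment. The sketch below is a made-up example: the flag name and the NEXT_PUBLIC_ENVIRONMENT variable are assumptions, and a real project might instead use a feature flag service.

```typescript
// featureFlags.ts – a hypothetical environment-based feature flag.
// The flag is on everywhere except production, so an unfinished feature
// can be deployed safely and still be used in the dev environment.
export const featureFlags = {
  newCheckoutPage: process.env.NEXT_PUBLIC_ENVIRONMENT !== "prod",
};

// Usage in a React component: hide the link to the new page in production.
//   {featureFlags.newCheckoutPage && <Link href="/checkout-v2">New checkout</Link>}
```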

The deployment process was already mostly automated, but it still included unnecessary manual steps. For example, a human had to set the version number and trigger a release build. The deployment scripts would create an AWS CloudFormation change set, but a human had to click around the AWS console to apply the change set. There were also manual processes like posting a message to Slack when doing a deployment, and updating a Confluence page.

It took about a week to build a new deployment pipeline able to get the code from commit to production without any human intervention. On every push to the main branch, the new deployment pipeline will:

  1. Assign a unique version number to the build. To keep things simple, we used just the current date and an incrementing build number.

  2. Build the application, run its tests and package it in a Docker image.

  3. Start the Docker image locally with Docker Compose, and run the end-to-end tests against it. For external services, it will use services from the dev environment.

  4. If all is good, the Docker image will be pushed to AWS ECS and the version will be tagged in Git.

  5. Next it will do a deployment to dev, qa and prod environments, in that order, if the previous deployment succeeded:

    1. Optionally wait for human approval before deployment.

    2. Send a Slack notification about the deployment. Mention the old and new version numbers. (The script reads the currently deployed version from the application’s status page.)

    3. Deploy the new version.

    4. Poll the application’s status page until it shows that the new version is running and that there are no failing health checks. (A sketch of this polling step follows the list.)

    5. Generate a GitHub Wiki page which shows all version numbers, the merged PRs and JIRA issues that were added in a version, and which version is currently running in which environment.
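
To make step 5.4 concrete, here is a minimal sketch of the polling. The status page format, field names and timeout are made up for the example; the real script used the application’s own status endpoint.

```typescript
// Hypothetical status page format:
//   { "version": "2024-08-20.3", "healthChecks": { "database": "UP", "sap": "UP" } }
async function waitForDeployment(statusUrl: string, expectedVersion: string, timeoutMs = 10 * 60 * 1000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const response = await fetch(statusUrl);
      if (response.ok) {
        const status = await response.json();
        const allHealthy = Object.values(status.healthChecks as Record<string, string>)
          .every((state) => state === "UP");
        if (status.version === expectedVersion && allHealthy) {
          console.log(`Version ${expectedVersion} is running and healthy`);
          return;
        }
      }
    } catch {
      // The application may be briefly unreachable while instances are being replaced.
    }
    await new Promise((resolve) => setTimeout(resolve, 10_000)); // poll every 10 seconds
  }
  throw new Error(`Deployment of ${expectedVersion} did not become healthy within the timeout`);
}
```

Waiting for the new version to be healthy before moving on is what allows the pipeline to stop automatically instead of promoting a broken build to the next environment.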

This enabled getting rid of all manual steps in the deployment process. The other team members were hesitant to move quickly, so at first we only deployed automatically to dev and qa environments, and kept prod deployment behind manual approval. But even with manual approval, the project’s DORA change lead time metric improved from 20-30 days down to 4 days.

The goal is to eventually move to continuous deployment, so that every change will automatically go all the way to production. That should reduce the change lead time to 1-3 days. Moving forward, were we to get rid of pull requests and start doing continuous integration (a.k.a. trunk-based development), we could reach the top performer levels of a change lead time of just a few hours.

One benefit of deploying more frequently is that it makes deployments more reliable. We noticed that some AWS health checks were too strict: they sometimes timed out when a newly started server began receiving traffic before the application had warmed up, causing the server to be restarted right after it started. After some tuning, we could safely deploy even in the middle of rush hour, without affecting any users.

Frequent deployments also mean that every deployment contains only a few changes, so the probability of the changes breaking production is low. And if something goes wrong, it is quick to locate the cause among the recent changes. The code will still be fresh in the memory of the developer who made the change, which enables fixing the problem faster.

Conclusions

By improving test automation, the team was able to get rid of manual regression testing and deploy much more frequently. After positive experiences in the pilot project, the test automation improvements are now being scaled to other projects.

Here is a refined order of steps for improving a project’s test automation, arranged to bring the biggest benefits forward, along with how long each step typically takes:

  1. Begin collecting the DORA metrics to discover the current situation and to begin optimizing the software delivery process. (~1-2 days)

  2. Set up a fully automated deployment pipeline. Getting code changes to production shouldn’t take more than a button click. (~1 week)

  3. Begin using feature flags to deploy unfinished features safely to production. (~1 day)

  4. Deploy to production daily, after manual approval. Start collecting information about bugs which leak to production, and do root cause analysis to improve development practices and the deployment pipeline.

  5. Add a couple end-to-end tests to the build pipeline, to cover the most important user flows and integrations to external systems. Delete all other end-to-end tests. (~1-5 days)

  6. Enable continuous deployment to production. (~5 minutes)

  7. Culture change for test automation. Work as a whole team in the mob programming style to spread test automation skills, so that all the developers get better at writing unit tests, also for UI components and other hard-to-test parts. (~1-6 months)

The next part of this article will go into more detail regarding the test automation techniques that were involved and developed during this project.

Read the second part of this article!
