
Technical coaching experience report: Part 2

Published under Technology

Author

Esko Luontola

Esko Luontola is an avid programmer and usability enthusiast. Esko has been using TDD in every project since 2007 and is also the author of the University of Helsinki’s TDD MOOC (tdd.mooc.fi). He is continually pushing for technical excellence and better software development practices.

Article

23 August 2024 · 9 min read

During the past year, Esko Luontola has been helping one of our customers improve their developer productivity. This second part of the experience report goes into the technical details of the test automation improvements.

In the first part of this article, I went through the technical coaching and process improvements at a high level. This second part will explain some of the technical details involved in the pilot project’s test automation.

Testing the untestable

Each codebase is different, and the design choices and use of frameworks affect how to write tests. In this case, the UI code was tightly coupled to React, Redux, TanStack Query, and Next.js. If some code is a pure function, it’s easy to test – just call the function with parameters and check the return value. But when most of the code is coupled to the above-mentioned UI frameworks, putting tests in place after the fact is hard.

React components cannot be tested as pure functions. Even if they are called “functional components,” in reality they are objects that depend on the global state. We used the React Testing Library to render and interact with them. If a React component takes all of its input data as props and there is no asynchrony, it is relatively easy to test.
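For instance, a component that takes all of its data as props can be tested roughly like this. This is a minimal sketch: the OrderSummary component and its props are made up for illustration, and it assumes the @testing-library/jest-dom matchers are registered.

```tsx
import { render, screen } from "@testing-library/react";

test("shows the order total", () => {
  // <OrderSummary> is a hypothetical component that takes all input as props
  render(<OrderSummary total="12.50 €" itemCount={3} />);

  // assumes @testing-library/jest-dom matchers are set up
  expect(screen.getByText("12.50 €")).toBeInTheDocument();
});
```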

Redux makes testing slightly harder since the Redux store is a global variable that is read and mutated all over the place. Thankfully, it’s relatively easy to isolate the Redux store for each test. Typically, we would set the current value of the store at the beginning of the test so that it was one kind of indirect input (and output).
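A minimal sketch of that isolation, assuming Redux Toolkit; the rootReducer, the Cart component, and the state shape are stand-ins for the application's own code:

```tsx
import { configureStore } from "@reduxjs/toolkit";
import { Provider } from "react-redux";
import { render } from "@testing-library/react";
import { ReactElement } from "react";

// Build a fresh store from a known initial state for every test,
// so the global store never leaks between tests.
function renderWithStore(ui: ReactElement, preloadedState: object) {
  const store = configureStore({ reducer: rootReducer, preloadedState });
  return render(<Provider store={store}>{ui}</Provider>);
}

test("shows the items already in the cart", () => {
  renderWithStore(<Cart />, { cart: { items: [{ name: "Coffee" }] } });
  // ...assert on the rendered HTML as usual
});
```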

TypeScript makes testing harder because the types often have far more required fields than the code under test actually uses. A key aspect of maintainable tests (and writing in general) is to emphasize the relevant parts and de-emphasize the irrelevant ones. For example, if a test is focused on showing the data from one field, the test code shouldn’t populate other fields. TypeScript has a Partial type, which makes an object’s fields optional. We expanded that into a RecursivePartial, which does the same thing recursively. That way we can leave out most fields in the test data and cast it to the full type when passing it to production code.
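A common way to define such a type looks like this (the Order type in the usage example is hypothetical):

```ts
// Makes every field optional, recursively. Arrays are handled separately
// so that their element type also becomes partial.
type RecursivePartial<T> = {
  [P in keyof T]?: T[P] extends (infer U)[]
    ? RecursivePartial<U>[]
    : T[P] extends object
      ? RecursivePartial<T[P]>
      : T[P];
};

// In test data, populate only the relevant fields and cast to the full
// type when passing it to production code:
const order = { customer: { name: "Alice" } } as RecursivePartial<Order> as Order;
```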

TanStack Query makes testing harder due to all the asynchrony. Nearly all of the UI components used the useQuery hook for data fetching. Mocking useQuery was discussed but quickly dismissed. Since the beginning, a basic rule of mocking has been to “only mock types you own.” You shouldn’t mock libraries that someone else has written because your mocks might not work the same way as the library, and breaking changes would be extra painful.

We ended up using the Nock library for mocking at the HTTP request level. It made testing some things possible, though not easy. Our guideline was not to check whether a Nock mock was completed, but instead to wrap the rendered HTML’s assertions in “await waitFor”. Every test also needed an “afterEach” block, which asserts that “nock.pendingMocks()” is empty and then calls “nock.cleanAll()” so that Nock’s mocks don’t leak between tests.
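A sketch of what such a test looked like under those guidelines; the API URL and the ProductList component are made up:

```tsx
import nock from "nock";
import { render, screen, waitFor } from "@testing-library/react";

afterEach(() => {
  // Fail the test if some mocked request was never made, then make sure
  // no mocks leak to the next test.
  expect(nock.pendingMocks()).toEqual([]);
  nock.cleanAll();
});

test("shows products fetched over HTTP", async () => {
  nock("https://api.example.com")
    .get("/products")
    .reply(200, [{ name: "Coffee" }]);

  render(<ProductList />); // wrapped in the app's providers as needed

  // Wrap the assertions in waitFor instead of waiting on the mock itself,
  // so the test passes as soon as the UI has updated.
  await waitFor(() => {
    expect(screen.getByText("Coffee")).toBeInTheDocument();
  });
});
```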

Mocks should be the last tool in the toolbox. We used Jest’s mocking features only for a couple of things. One is the “useRouter” hook from Next.js for query parameters and navigation, which coupled lots of UI components to Next.js but was thankfully a small enough API to break the dependency with mocks. The other thing to mock was one of our custom React hooks that returned which feature flags were active. Feature flags are used to hide incomplete features in production, which is a prerequisite to frequent deployments. The application also uses feature flags to enable country-specific customizations, so we had to test some features with their feature flags both enabled and disabled.
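A sketch of those two mocks with Jest. The module path and flag name for the feature flag hook are made up, and depending on the Next.js router in use, the mocked module might be next/router or next/navigation:

```ts
jest.mock("next/router", () => ({
  useRouter: () => ({
    query: { country: "FI" }, // query parameters seen by the component
    push: jest.fn(), // navigation becomes observable through this spy
  }),
}));

// Hypothetical custom hook for feature flags; vary the returned flags
// to test a feature both enabled and disabled.
jest.mock("../hooks/useFeatureFlags", () => ({
  useFeatureFlags: () => ({ newCheckout: true }),
}));
```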

Stringly asserted tests

In previous projects, I had found it useful to unit test UI components by reading the HTML element’s “innerText” property and asserting the text that the user sees (ignoring any whitespace differences). I used the same technique in this project. However, because this project’s tests ran on Node.js with jsdom, which doesn’t implement “innerText,” I resorted to stripping all HTML tags with a regex.
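A minimal version of that tag-stripping helper might look like this (the function name is mine, not the project's):

```ts
// A rough stand-in for innerText on jsdom: strip all HTML tags with a
// regex and normalize whitespace, so tests assert exactly the visible text.
export function visualizeHtml(html: string): string {
  return html
    .replace(/<[^>]*>/g, " ") // strip HTML tags
    .replace(/\s+/g, " ") // ignore whitespace differences
    .trim();
}

// usage: expect(visualizeHtml(container.innerHTML)).toBe("Total: 12.50 €");
```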

This approach makes UI tests more readable because the test code will match what the user sees. When a string equality comparison fails in tests, the test runner can show a substring diff of the expected and actual values, so you can quickly see which words were different. Using a single equality assertion for checking the whole result also has the benefit that in addition to testing what was shown, it also asserts that nothing extra was shown (unlike e.g. substring checks).

A breakthrough came when we had to also test non-textual information, for example testing which status icon color is displayed. I came up with a “data-test-icon” attribute and a regex which would make the value of that attribute visible in tests. Then we could replace SVG icons with Unicode emoji characters in tests, which made testing such visual information very easy.

The “data-test-icon” attribute turned out to be even more useful than we thought: it could also be used to visualize the values of form fields. For example, a text field containing the value 1.23 was visualized as “[1.23]”, while an empty text field with 1.23 as its placeholder was visualized as “[(1.23)]”.
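Extending the same helper with the data-test-icon handling could look roughly like this; the element and expected string in the comments are illustrative:

```ts
// Sketch: before stripping the remaining tags, replace any element that
// has a data-test-icon attribute with the attribute's value (e.g. an emoji).
export function visualizeHtmlWithIcons(html: string): string {
  return html
    .replace(/<[^>]*\bdata-test-icon="([^"]*)"[^>]*>/g, " $1 ")
    .replace(/<[^>]*>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Production code wraps the SVG icon, for example:
//   <span data-test-icon="✅"><CheckmarkIcon /></span>
// and a test can then expect the string "✅ Order confirmed".
```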

[Figure: example of a stringly asserted UI component test]

The above example demonstrates unit testing UI components using this approach. You can find some example code for doing the same things yourself at the end of the Test-Driving HTML Templates article. This testing approach doesn’t have an established name, but Martin Fowler suggested calling it “stringly asserted,” so let’s go with that for now.

More stable end-to-end tests

By “end-to-end test,” we mean a test which exercises a fully deployed application together with all the external services it uses. End-to-end tests are good for checking that all the components are connected together, but they are bad at checking edge cases, primarily because they are very slow and complex. That’s why even big projects should limit themselves to just a few (fewer than 10) end-to-end tests that cover the most important happy path user flows through the application. Unit tests should be responsible for checking basic correctness: does the code do what we think it should do, ignoring network issues and other external dependencies?

The pilot project’s old browser tests had multiple issues, which made them unnecessarily slow and unstable.

Because the UI’s form fields save their changes after a 250ms debounce delay, the tests must not move to another page before a change is saved, or the change will be lost. The old tests handled this by waiting a fixed moment after every field change, and they also typed all input at “human speed,” one character at a time. I removed all those waits so that the browser test would click and type as fast as the CPU could, and then added a tactical pause in just the couple of places where the test had to wait for all changes to be saved. I made that wait reliable by showing a loading spinner whenever a field had unsaved changes: previously the spinner was shown only while a network request was in progress, but not during the debounce delay before it. The browser tests could then simply wait until there were no loading spinners.
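In Playwright, that wait can be expressed as a one-liner; the spinner's test id here is made up:

```ts
import { expect, Page } from "@playwright/test";

// toHaveCount retries until the assertion passes or times out, so this
// waits until every pending save (debounce and network) has finished.
async function waitForAllChangesToBeSaved(page: Page) {
  await expect(page.getByTestId("loading-spinner")).toHaveCount(0);
}
```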

Changing the testing framework from Cypress to Playwright also improved things:

- In browser tests, nearly every operation does I/O when it communicates with the browser. This makes JavaScript an exceptionally bad language for writing browser tests, because it has no blocking operations, only async I/O. Cypress has invented its own language in which all function calls are queued and executed asynchronously, and because of that decision it has also had to reinvent basic programming language concepts such as conditionals, loops, and variables instead of just using JavaScript. With Playwright, you can write normal JavaScript, though you’ll need to sprinkle the await keyword all over the place (hopefully your IDE warns about missing await keywords).

- Cypress doesn’t support applications that span multiple domains. Our application requires the user to sign in via Auth0, which redirects to a different domain. The original tests therefore had a hack that generated a JWT token to bypass the login, which also meant that the sign-in flow was never tested. Cypress has since added some support for cross-origin testing with cy.origin(), but it was buggy at best and didn’t work with every web browser. Playwright just commands the web browser normally, so crossing domains is a non-issue.

- Cypress records a video of the test run, and compressing that video takes quite a long time: on an AWS EC2 m6a.large instance, video compression makes the test run take about 25% longer. Playwright instead takes screenshots for its own replayable trace format, which is much faster than recording a video. Playwright’s test replay also has a much better UX than Cypress’s, in particular because it shows which line of test code is being executed.

Speed was increased further by redesigning the end-to-end tests to run in parallel. In unit tests, each test can have its own copy of the world so that it runs in isolation, but for end-to-end tests that is too costly. Instead, end-to-end tests need to be isolated based on the data they touch; for example, each test could sign in with a different user account. In this application there was another option: shopping carts. A user can have multiple shopping carts, so each test could create and use its own cart, and we didn’t need multiple test user accounts. I just had to write slightly smarter cleanup code for removing shopping carts left over from failed test runs: it removes a cart only if it is more than an hour old, so as not to remove a cart still in use by a parallel test. A sketch of that cleanup is shown below.
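This is a hypothetical sketch: the TestApiClient and its methods are invented for illustration.

```ts
const ONE_HOUR_MS = 60 * 60 * 1000;

// Remove carts left over from failed test runs, but only when they are
// over an hour old, so a cart still in use by a parallel test run is
// left alone.
async function cleanUpLeftoverCarts(api: TestApiClient): Promise<void> {
  for (const cart of await api.listShoppingCarts()) {
    if (Date.now() - cart.createdAt.getTime() > ONE_HOUR_MS) {
      await api.deleteShoppingCart(cart.id);
    }
  }
}
```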

All in all, this reduced the end-to-end test run time by 94%. The original big set of Cypress tests took about 25 minutes to run on an AWS EC2 m7i.2xlarge instance. Running just the two chosen end-to-end tests took 8 minutes on an m6a.large (of which video compression took 2 minutes). After the redesign and rewrite in Playwright, the tests took only 1½ minutes. The time could be reduced further, to under 1 minute, by making the application itself faster: one of the external services refreshes its data only once per minute, and the tests need to wait for that. If that service were redesigned to update its data in near real time, the tests wouldn’t need to wait, and it would make the users happier too.

Conclusions

This article explained some techniques for unit testing the user interface and for more productive end-to-end tests. If you’d like to learn more about test automation, have a look at the Test-Driven Development MOOC at the University of Helsinki, which was sponsored by Nitor.
