Refactor Behind a Behavior Test

A refactor is a bet that behaviour stays the same. The only way to collect on the bet is a test that passes before the change and after it, written against the public interface so it survives the rewrite. No test means no bet, just hope.

The loop

For every refactor, in order:

  1. Find (or write) the behaviour test that covers the thing you're about to change.
  2. Run it. Confirm green before you touch the code.
  3. Make the change.
  4. Run it again. Still green means behaviour held.

Step 2 is the one people skip, and it's the one that matters. If the test was already red, or doesn't actually exercise the path, you learn that before you've entangled it with your change instead of after.

The test has to survive an implementation swap

The check for whether a test is worth anything: could you throw out the implementation, rewrite it with a different algorithm and different internal types, and have the test still pass unchanged? If yes, it tests behaviour. If it breaks when the internals change but the behaviour doesn't, it's coupled to the implementation and it punishes the exact refactor you're trying to do.

A concrete one from Spotuify: swapping a bounded FIFO from Vec + remove(0) to VecDeque (see Use VecDeque for Bounded FIFOs). The existing test asserted that after pushing five items into a cap-two buffer, the snapshot held the two newest in order. That assertion knows nothing about Vec vs VecDeque — it's pure behaviour, so it covered the swap for free. Green, change, green, done.
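A minimal sketch of that kind of test, not the project's actual code: the trait `BoundedFifo`, the two structs, and the helper `holds_two_newest` are hypothetical names chosen for illustration, and "in order" is assumed to mean oldest first. The same check runs against both backings without modification, which is what makes it a behaviour test.

```rust
use std::collections::VecDeque;

/// Hypothetical public interface; the test below only ever sees this.
pub trait BoundedFifo<T> {
    fn push(&mut self, item: T);
    fn snapshot(&self) -> Vec<T>;
}

/// "Before" implementation: Vec + remove(0), which shifts every element on eviction.
pub struct VecFifo<T> {
    cap: usize,
    items: Vec<T>,
}

impl<T> VecFifo<T> {
    pub fn new(cap: usize) -> Self {
        Self { cap, items: Vec::with_capacity(cap) }
    }
}

impl<T: Clone> BoundedFifo<T> for VecFifo<T> {
    fn push(&mut self, item: T) {
        if self.cap == 0 {
            return; // a zero-capacity buffer holds nothing
        }
        if self.items.len() == self.cap {
            self.items.remove(0); // evict the oldest
        }
        self.items.push(item);
    }

    fn snapshot(&self) -> Vec<T> {
        self.items.clone()
    }
}

/// "After" implementation: VecDeque, which evicts from the front without shifting.
pub struct DequeFifo<T> {
    cap: usize,
    items: VecDeque<T>,
}

impl<T> DequeFifo<T> {
    pub fn new(cap: usize) -> Self {
        Self { cap, items: VecDeque::with_capacity(cap) }
    }
}

impl<T: Clone> BoundedFifo<T> for DequeFifo<T> {
    fn push(&mut self, item: T) {
        if self.cap == 0 {
            return;
        }
        if self.items.len() == self.cap {
            self.items.pop_front(); // evict the oldest
        }
        self.items.push_back(item);
    }

    fn snapshot(&self) -> Vec<T> {
        self.items.iter().cloned().collect()
    }
}

/// The behaviour check: five pushes into a cap-two buffer leave the two
/// newest items, oldest first. Nothing here names Vec or VecDeque.
pub fn holds_two_newest(mut fifo: impl BoundedFifo<u32>) -> bool {
    for n in 1..=5 {
        fifo.push(n);
    }
    fifo.snapshot() == vec![4, 5]
}

#[test]
fn vec_backed_keeps_two_newest() {
    assert!(holds_two_newest(VecFifo::<u32>::new(2)));
}

#[test]
fn deque_backed_keeps_two_newest() {
    assert!(holds_two_newest(DequeFifo::<u32>::new(2)));
}
```

A test that instead asserted on the backing collection's capacity or on how many elements were shifted would break on the swap even though callers see no difference.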

When there's no safe test, defer — don't refactor blind

This is the harder discipline. Some of the highest-value changes have no test you can run safely in your environment. In spotuify, the worst case of blocking inside async code was the auth token refresh, which reads the real macOS keychain. There's no fake token-store seam, and exercising the real path risks a storm of keychain prompts. So the spawn_blocking fix, correct as it was, could not be verified with the loop above: there was no safe test to run green before and after.

The move is not "do it carefully anyway." It's: write down the finding, the fix, and the test harness it needs, and defer it. A documented deferral is a better outcome than an unverified change to the hottest path in the system, especially when the brief was "don't break it." Deferring with a reason is engineering; refactoring blind is gambling with someone else's production.
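When writing down the harness a deferred fix needs, it can help to sketch the seam itself. The following is one possible shape under stated assumptions, not anything that exists in the project: `TokenStore`, `FakeTokenStore`, and `refresh_token` are hypothetical names, and the real refresh logic and keychain-backed store are not shown. The fake uses a `Mutex` so it could later be shared with a blocking thread.

```rust
use std::sync::Mutex;

/// Hypothetical seam: the only storage operations the refresh path needs.
/// A keychain-backed implementation would live behind the same trait.
pub trait TokenStore {
    fn load(&self) -> Option<String>;
    fn save(&self, token: &str);
}

/// In-memory fake for tests; it never touches the OS keychain.
#[derive(Default)]
pub struct FakeTokenStore {
    slot: Mutex<Option<String>>,
}

impl TokenStore for FakeTokenStore {
    fn load(&self) -> Option<String> {
        self.slot.lock().unwrap().clone()
    }

    fn save(&self, token: &str) {
        *self.slot.lock().unwrap() = Some(token.to_owned());
    }
}

/// Stand-in for the refresh logic (an assumption, not the real code):
/// read the previous token, ask `fetch` for a new one, store the result.
pub fn refresh_token<S: TokenStore>(
    store: &S,
    fetch: impl FnOnce(Option<&str>) -> String,
) -> String {
    let previous = store.load();
    let fresh = fetch(previous.as_deref());
    store.save(&fresh);
    fresh
}

#[test]
fn refresh_replaces_stored_token() {
    let store = FakeTokenStore::default();
    store.save("old");

    let fresh = refresh_token(&store, |prev| {
        assert_eq!(prev, Some("old"));
        "new".to_owned()
    });

    assert_eq!(fresh, "new");
    assert_eq!(store.load().as_deref(), Some("new"));
}
```

With a seam like this in place, the behaviour test can go green before the blocking call is moved, and the fix can then run through the usual loop instead of staying parked.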

Why this pairs with idiomatic work

"Make it idiomatic" tends to produce sprawling diffs. Each individual change looks small and safe, but a hundred of them at once is unreviewable, and any one of them could be the regression. Gating every change behind a before/after behaviour test keeps the diff honest: each step is checked against observable behaviour, and the ones that can't be checked get parked, not smuggled in. See Idiomatic Rust Rubric.
