Property testing with eris: the bug nobody wrote • Jonhathan Rodas

There’s a class of bug your tests will never catch — not because they’re bad, but because they were written with the same head that wrote the code. If the author misunderstood a rule, they misunderstand it in the implementation and in the test that checks it. Green on both sides, wrong in production.

A few weeks ago, while hardening the tax module of an accounting ERP, I ran into one of those bugs. I wasn’t looking for it. It was found by a technique I’d been meaning to try for real: property-based testing. This is the story of what it gave me, what it didn’t, and how it ended up wired into the workflow of two projects at once.

The example versus the law

A traditional test asserts a single case: “for an invoice of 100 with 12% VAT, the total is 112.” You pick the numbers. The catch is that you only verify the cases you thought of.

Property-based testing flips the relationship. Instead of giving a case, you declare a law that must hold for any input, and the tool generates hundreds of cases trying to break it. In PHP that tool is eris.

A real example, over a pure function that splits an amount into equal installments:

use Eris\Generator;
use Eris\TestTrait;
 
uses(TestTrait::class);
 
it('splits a total without losing a cent, for any amount and term', function () {
    $this->forAll(
        Generator\choose(1, 100_000_00), // amount in cents: 0.01 .. 1,000,000.00
        Generator\choose(1, 120),        // number of installments
    )->then(function (int $cents, int $periods) {
        $total = bcdiv((string) $cents, '100', 2);
        $rows  = (new StraightLineAmortizer)->amortize($total, $periods, 2026, 1);
 
        // LAW: the installments sum EXACTLY to the total. No rounding drift.
        $sum = array_reduce($rows, fn ($a, $r) => bcadd($a, $r['amount'], 2), '0.00');
        expect(bccomp($sum, $total, 2))->toBe(0);
    });
});

I didn’t write a single case. I wrote the law — “the installments sum to the total” — and eris tried hundreds of combinations of amount and term. The real value shows up when something fails: eris does shrinking, reducing the failing case to its minimal reproducible form (say, total=3.00, installments=3) and handing you an ERIS_SEED to reproduce it deterministically. Instead of “failed with 84729 and 47 installments,” you get the smallest possible counterexample, served on a plate.

The real case: two calculators that didn’t talk to each other

The ERP computes Guatemalan income tax, which is progressive by brackets. And it had two different classes computing those brackets, written at different times:

Withholding used an absolute model: for each bracket, it taxed the slice of the base between the bracket floor and its ceiling, slice = min(base, ceiling) − floor.
Monthly tax used a consumption model: it “spent” the base against each bracket’s width, cap = ceiling − consumed, and never read the bracket floor.

For Guatemala’s real data, where the first bracket starts at zero, the two agreed. But I imagined a bracket table with an exempt initial band — the first bracket starting at, say, 1000 — and the numbers diverged:

With brackets [1000, 2000) at 5% and [2000, ∞) at 7%, over a base of 1500:

Correct (marginal): only 500 fall in the taxed bracket → 500 × 5% = 25.00.
Monthly tax: it taxed 1500 × 5% = 75.00, ignoring the first 1000 that should have been exempt.

Twenty-five versus seventy-five. The consumption model assumed brackets always start at zero. It’s a country assumption disguised as code: true for Guatemala today, false the moment you extend the catalog to another jurisdiction with an exempt minimum — exactly the kind of extensibility the product is built for.

The honest part: the bug was latent. With production data it never fired, because the real tables start at zero. No example test was ever going to find it, because every example used those same tables. The bug lived in the gap between what the code assumed and what the code promised.

The fix was to unify: a single pure function, ProgressiveBracketCalculator, that respects the bracket floor and is order-independent, consumed by both calculators. And the properties stay on guard so it can’t drift apart again.

The uncomfortable lesson

Before declaring victory, I ran the same loop on mature, well-tested modules — sales, accruals — with several money calculators. Result: eris green, zero bugs. I added mutation testing (which injects deliberate errors into the code and measures whether the tests catch them) and the score was a misleading 66%: nearly all the “survivors” were equivalent mutants — changing the scale of a BCMath operation from 2 to 3 decimals over values that already have 2 decimals doesn’t change the result, so no test can kill them. Noise, not gaps.

Out of that came the thesis I found most useful:

The value of property testing is inversely proportional to the quality of the tests you already have.

On mature, well-covered code, it confirms what you already knew (and that counts too: it gives you confidence over hundreds of cases instead of five). On new or heavily branched code — brackets, regimes, country-specific rules, rounding — that’s where it catches real bugs. Don’t point it at your CRUD; point it at your money logic.

The trigger: when to write a property

The practical rule, the one that actually matters, is being able to finish this sentence:

“For ANY input, ____ must hold.”

If you can name the invariant, it’s a candidate. If you can’t, stick with example tests. The laws that recur most:

Conservation: Σ parts == total (money in BCMath, never float).
Monotonicity: more base ⇒ the tax never goes down.
Idempotence: f(f(x)) == f(x).
Round-trip: undoing what was done returns the original state (void(post(x)) leaves the balances as before).
Bounds: 0 ≤ result ≤ limit.
Order independence: reordering the input doesn’t change the result.

And one non-negotiable condition: pure functions only. If the thing needs a database to run, property testing isn’t the tool (and it gets slow anyway). Fortunately, a healthy architecture already pushes money logic into pure services that take arrays or DTOs.

How we wired it into the workflow (without friction)

The hard part of property testing isn’t the how — the pattern is thirty lines — it’s the when. So we wired it in three layers, with no new machinery:

A written convention. The trigger (“for any input, ___”), pure functions as the only target, and the equivalent-mutant caveat all went into the repo’s testing conventions. Everyone — or every agent — who joins the project sees it.
In the cycle. During the TDD step, if a stage produces a pure function with a law, the property gets written next to the examples, right then. Property tests are plain tests: the runner already discovers and runs them, zero extra infrastructure.
In the generator. The project scaffolds new modules from a manifest. That generator now seeds a property-test stub in every module — a commented skeleton with the list of laws — so writing the first property is renaming a file, not remembering a practice.

Mutation testing stayed a one-off audit, once per new core, never a CI gate: the equivalent-mutant noise would do more harm than good if it blocked every commit.

The same concept, across both stacks

Property testing isn’t a PHP thing; it’s a portable idea. Only the tool changes per stack. We took it to both fronts.

In the ERP (PHP / Laravel), eris, as we saw.

In the portal and data lake (TypeScript / Nuxt), fast-check:

import fc from "fast-check";
 
test("the invoice total conserves the cents", () => {
  fc.assert(
    fc.property(
      fc.array(fc.record({ qty: fc.nat(1000), priceCents: fc.nat(10_000_00) })),
      (lines) => {
        const totals = calcTotals(lines);
        // LAW: the subtotal is exactly the sum of the lines.
        expect(sumLineTotals(totals)).toBe(totals.subtotal);
      },
    ),
  );
});

In the Python functions on Lambda of the data lake, Hypothesis:

from hypothesis import given, strategies as st
 
@given(st.lists(st.integers(min_value=0, max_value=10_000)))
def test_margin_never_exceeds_revenue(amounts):
    result = compute_margin(amounts)
    # LAW: the margin lives between zero and total revenue, for any set.
    assert 0 <= result <= sum(amounts)

Three languages, the same discipline: declare the law, let the machine hunt for the counterexample.

When yes, when no

Yes: money logic, branched computations (brackets, country rules, FX rounding), greenfield code, algorithm refactors.
No: CRUD, code that only orchestrates the database, and as a CI gate before filtering out equivalent mutants.

The best part of the whole experience wasn’t catching the bug. It was realizing it was the kind of bug no example was ever going to touch: it lived in the space between what the code assumed and what it promised. Property testing doesn’t replace your tests; it completes them right where your imagination — and the imagination of whoever wrote the code — doesn’t reach.