Ship the Spec

· acceptance-criteria · ai-assisted-delivery · gherkin

The AI prompt for acceptance criteria that gives you testable Thens

Ask ChatGPT or Copilot for acceptance criteria and you get 'the system handles it correctly' — untestable slop. Here is the one rule that fixes the prompt, with a before/after on a real story.

Search “AI prompt for acceptance criteria” and you will find a dozen listicles that hand you some variant of “Write acceptance criteria for this user story: [paste].” Run it. You get back something like this:

Scenario: User requests a quote
  Given a valid user
  When they submit a quote request
  Then the system handles the request correctly
  And the appropriate response is returned

Those are not acceptance criteria. They are a paraphrase of the story with Given/When/Then punctuation. A developer cannot build against them, a tester cannot fail them, and they will get bounced back the moment refinement starts, now with the added insult that a machine wrote them and you pasted them without reading.

The problem is not the model. The problem is that “correct” and “appropriate” are the two words a language model reaches for when it does not know what should actually happen — and the generic prompt never forces it to know.

The one rule: every Then must be observable

An acceptance criterion earns its place only if a person or a test can watch the system and say “yes, that happened” or “no, it didn’t”, with no interpretation in between. Concretely, every Then has to resolve to one of four observable things:

  • A screen state — a specific element, label, badge, or field value visible to a specific role.
  • A message — an exact string, error code, or notification the user or caller receives.
  • A stored value — a record, flag, or status that persists and can be queried afterwards.
  • An emitted event — a call, a webhook, a message on a queue, a row in an audit log.

“The system handles it correctly” is none of those. It describes the developer’s intention, not the system’s behaviour. The fix is to make the prompt refuse it — to instruct the model that a Then which is not one of the four categories above is a defect, not a criterion.

That single constraint is the difference between a prompt that produces documentation theatre and one that produces criteria you can hand to a developer without a follow-up meeting.
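The same rule can be applied to the output as a cheap safety net. What follows is a minimal sketch, not part of the prompt itself: the function name `vague_thens` and the deny-list are my own assumptions, and a word filter only catches the symptom. It cannot prove a Then is observable, only that it contains a word that usually means it is not.

```python
import re

# Hypothetical deny-list: words that usually signal an unobservable Then.
VAGUE_WORDS = {"correctly", "appropriate", "appropriately", "properly", "accurate"}


def vague_thens(gherkin: str) -> list[str]:
    """Return the Then steps (and their And/But continuations) that use a vague word."""
    flagged = []
    in_then = False
    for raw in gherkin.splitlines():
        line = raw.strip()
        if not line:
            continue
        keyword = line.split(maxsplit=1)[0]
        if keyword == "Then":
            in_then = True
        elif keyword in ("Given", "When", "Scenario:"):
            in_then = False
        # And/But inherit the type of the step before them.
        if in_then and keyword in ("Then", "And", "But"):
            words = set(re.findall(r"[a-z]+", line.lower()))
            if words & VAGUE_WORDS:
                flagged.append(line)
    return flagged
```

Run against the generic output at the top of this post, it flags both the Then and its And continuation and leaves the Given and When alone.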

Before and after, on a real story

Here is a story from Parcelio — a fictional B2B parcel-logistics platform I use for every worked example on this site, so nothing resembles a real client. Coordinators quote shipments across three carriers (Nordfreight, TransIber, VeloCargo), and some parcels carry dangerous-goods surcharges.

As a logistics coordinator, I want dangerous-goods surcharges applied and shown on my quote, so that the price I book is the price I’m invoiced.

The generic prompt (“Write Gherkin acceptance criteria for this story”) gives you three scenarios that all bottom out in “the surcharge is applied correctly” and “the quote is accurate.” Untestable. Worse, the model silently assumes there is exactly one surcharge and one carrier, because you gave it nothing else.

The disciplined prompt does three things the generic one skips: it feeds the model the business rules as explicit bullets, it demands one scenario per rule, and it forbids the word “correctly.” Here is what the same story returns:

Scenario: Nominal — DG surcharge itemised per carrier
  Given a coordinator requests a quote with 2 parcels, parcel B flagged DG with UN number 1263
  When the quote is returned
  Then the Nordfreight option shows a surcharge line "Dangerous goods — parcel B: €12.00"
  And the TransIber option shows a surcharge line equal to 4% of parcel B's base price
  And each option's total equals base prices + its surcharge line

Scenario: Carrier that refuses DG is excluded
  Given a quote request contains at least one DG-flagged parcel
  When carrier options are returned
  Then no VeloCargo option appears
  And the quote response includes exclusion reason code CARRIER_DG_NOT_ACCEPTED for VeloCargo

Scenario: DG flag without UN number rejects the quote
  Given a quote request where parcel A is flagged DG with no UN number
  When the coordinator submits the request
  Then no quote is created
  And the response returns error code QUOTE_DG_UN_NUMBER_MISSING referencing parcel A

Read the Then lines. A visible surcharge line with an exact label and amount is a screen state. A returned exclusion code is a message. “No quote is created” is a stored value, or rather the checkable absence of one: query for the quote afterwards and nothing is there. Every one is something you can watch happen. There is nothing for a developer to interpret and nothing for a tester to argue about. That is the whole game.
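One way to convince yourself a Then is observable is to try writing it as an assertion. Here is a rough sketch of that for the three scenarios. Everything structural is an assumption for illustration: the `Parcel` and `quote` names, the response shape, the €40 and €50 base prices, the idea that every carrier shares the same base price, and my reading of the €12 as a flat amount per DG parcel. Only the rules and the two codes come from the scenarios.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Parcel:
    name: str
    base_price: Decimal
    dg: bool = False
    un_number: Optional[str] = None


def quote(parcels):
    """Toy quote engine (hypothetical) that encodes the three Parcelio rules."""
    # Rule: a DG flag without a UN number rejects the whole quote.
    for p in parcels:
        if p.dg and not p.un_number:
            return {"error": "QUOTE_DG_UN_NUMBER_MISSING", "parcel": p.name}

    dg_parcels = [p for p in parcels if p.dg]
    base = sum((p.base_price for p in parcels), Decimal("0"))
    options, exclusions = {}, {}
    for carrier in ("Nordfreight", "TransIber", "VeloCargo"):
        # Rule: VeloCargo refuses dangerous goods.
        if carrier == "VeloCargo" and dg_parcels:
            exclusions[carrier] = "CARRIER_DG_NOT_ACCEPTED"
            continue
        if carrier == "Nordfreight":
            # Assumed: flat 12.00 per DG parcel.
            surcharge = Decimal("12.00") * len(dg_parcels)
        elif carrier == "TransIber":
            # 4% of each DG parcel's base price.
            surcharge = sum(
                (p.base_price * Decimal("0.04") for p in dg_parcels), Decimal("0")
            )
        else:
            surcharge = Decimal("0")
        options[carrier] = {"surcharge": surcharge, "total": base + surcharge}
    return {"options": options, "exclusions": exclusions}


# Each Then becomes one assertion, with nothing left to interpret.
result = quote([
    Parcel("A", Decimal("40.00")),
    Parcel("B", Decimal("50.00"), dg=True, un_number="1263"),
])
assert result["options"]["Nordfreight"]["surcharge"] == Decimal("12.00")
assert result["options"]["TransIber"]["surcharge"] == Decimal("2.00")
assert "VeloCargo" not in result["options"]
assert result["exclusions"]["VeloCargo"] == "CARRIER_DG_NOT_ACCEPTED"

rejected = quote([Parcel("A", Decimal("40.00"), dg=True)])
assert rejected == {"error": "QUOTE_DG_UN_NUMBER_MISSING", "parcel": "A"}
```

Try the same exercise with “the surcharge is applied correctly” and there is no assertion to write, which is the point.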

The disciplined output did one more thing, beyond the three scenarios shown here: where a rule was genuinely undecided (whether a re-quoted surcharge after expiry is recalculated or held from the original), it wrote both scenarios and tagged them [AMBIGUOUS — decide] instead of inventing an answer. A prompt that is allowed to guess will guess, and it will guess plausibly enough that you ship the guess. A good prompt is instructed to surface the ambiguity and stop.
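Those tags are only useful if somebody pulls them out and reads them. A small sketch of that step follows. The function name is hypothetical, and because I am not assuming where in the scenario the model puts the tag, it matches on the `[AMBIGUOUS` prefix anywhere in the scenario's text.

```python
def ambiguous_scenarios(gherkin: str) -> list[str]:
    """Return the titles of scenarios that carry an [AMBIGUOUS ...] tag."""
    flagged = []
    title = None
    for raw in gherkin.splitlines():
        line = raw.strip()
        if line.startswith("Scenario:"):
            # Remember which scenario the following lines belong to.
            title = line[len("Scenario:"):].strip()
        if "[AMBIGUOUS" in line and title is not None and title not in flagged:
            flagged.append(title)
    return flagged
```

The returned titles are the decisions still owed, in the order the model raised them.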

The one thing the prompt cannot fix

Here is the uncomfortable part, and it is the reason “just use AI for acceptance criteria” quietly fails on real teams: the quality of the criteria out is capped by the quality of the business rules in.

The disciplined prompt above works because it was fed real rules — flat €12 for Nordfreight, 4% for TransIber, VeloCargo refuses DG, no UN number means rejection. Feed the same prompt a story with no rules attached and it has two options: return vague Thens (“the surcharge is applied”), or invent specific ones (“a 15% surcharge applies”). The second is more dangerous, because an invented rule that looks precise sails through review. Nobody bounces “15%” — they bounce “correctly.”

No prompt engineering closes that gap. If the surcharge percentages live only in a pricing lead’s head, or in a wiki three versions out of date, or in nobody’s head at all, then the model is not the bottleneck — the undecided rule is. The most valuable output of a good AC prompt is often not the criteria. It is the short list of things it couldn’t write a testable Then for, because that list is your real backlog of decisions to go get.

So the workflow that holds up is not “paste story, get criteria.” It is:

  1. Write the story.
  2. Bullet every business rule you actually know, and mark the ones you are guessing.
  3. Prompt for one scenario per rule, every Then observable, ambiguities flagged not filled.
  4. Read the flagged ambiguities. Those are your questions for refinement — go answer them before the story enters the sprint, not during it.

The AI compresses steps 1 and 3 to minutes. Step 2 is yours, and step 4 is where the story is actually saved. That is the honest version of “AI for acceptance criteria” — not a machine that knows your domain, but a fast, tireless drafter that exposes exactly where you don’t.
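Steps 2 and 3 can be sketched as a small prompt builder. To be clear, this is an illustration of the workflow and not the prompt from the free slice: the `Rule` type, the `UNCONFIRMED` marker, the shortened `[AMBIGUOUS]` tag, and the instruction wording are all my assumptions here. The one design choice worth copying is that it refuses to build a prompt when there are no rules to feed it.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    guess: bool = False  # True when you are not sure the rule is right


def build_prompt(story: str, rules: list[Rule]) -> str:
    """Assemble a prompt from a story plus explicit, labelled business rules."""
    if not rules:
        # No rules in means vague or invented Thens out.
        raise ValueError("no business rules: go get them before prompting")
    bullets = "\n".join(
        f"- {r.text}" + (" (UNCONFIRMED)" if r.guess else "") for r in rules
    )
    return (
        "Write Gherkin acceptance criteria for the story below.\n"
        "Constraints:\n"
        "- Write exactly one scenario per business rule.\n"
        "- Every Then must be a screen state, a message, a stored value, "
        "or an emitted event.\n"
        '- Never use the word "correctly".\n'
        "- For any rule marked UNCONFIRMED, write each possible outcome as its "
        "own scenario and tag it [AMBIGUOUS]. Do not invent an answer.\n"
        f"\nStory:\n{story}\n\nBusiness rules:\n{bullets}\n"
    )
```

Marking a rule with `guess=True` is step 2 made explicit: the guess travels into the prompt labelled as a guess, so it can come back as a flagged question rather than as a precise-looking criterion.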

Get the prompt

The full Gherkin acceptance-criteria prompt (with the “every Then must be observable” constraint, the per-rule scenario rule, the ambiguity-flagging instruction, and its known failure modes) is prompt 05 in the free slice: 10 documented AI prompts for product owners, drafted from a production AI-assisted specification pipeline, not from a listicle. Each ships with when to use it, what to paste in, a worked Parcelio example, and where it breaks. It is email-gated and free.

Subscribe and get the free slice →

— Pierre K.