Your Backlog Can Be Shared Working Memory For Humans And AI
The most expensive part of a handoff is rebuilding intent.
Not status. Not the ticket number. Not which branch the code is on.
Intent.
What were we trying to learn? Why did the acceptance criteria change? Which assumption was still open? What did the last person discover before they had to stop?
On a real team, that context is scattered across Jira, Slack, meeting notes, screenshots, and memory. AI coding agents face the same problem: they only know what they can see. When the real story lives in tickets, docs, chat, browser tabs, and people's heads, the AI gets a thin slice of the work and the team still has to replay the rest.
This is the shared-context problem.
HDD Isn't Just A Planning Template
Hypothesis Driven Development has been around for a while. Barry O'Reilly's Thoughtworks article, How to Implement Hypothesis-Driven Development, frames software work as an experiment: state the hypothesis, define the expected outcome, decide what signal will show whether it worked, run the experiment, and use the learning to decide what happens next.
HDD is useful because it pushes teams away from order-taking.
Instead of treating a story as "build this thing," the team treats it as "we believe this change will create this outcome, and we will know by seeing this signal."
The part I care about most right now is what happens after the hypothesis is written.
Does the work stay understandable tomorrow?
Can a developer pick it up after lunch without asking someone to replay the last meeting?
Can an AI coding assistant read enough context to help without hallucinating intent?
Can a manager or product lead see the decision path without forcing the team back into status-reporting mode?
This is where HDD becomes more than a planning ritual. It becomes a way to create shared working memory.
The Story Is The Memory
For HDD to work beyond the planning conversation, the story has to carry the pieces people usually have to reconstruct:
- acceptance criteria
- hypothesis
- success metrics
- assumptions
- demo plan
- design notes
- diagrams
- implementation plan
- comments
- attachments
- feedback and retrospection
These things are valuable because they let the next person, or the next AI assistant, recover the state of the work without asking someone to replay it from memory.
The backlog becomes shared working memory for humans and AI.
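To make that concrete, here is a sketch of what a single story file might carry. Every field name, metric, and log entry below is illustrative, not a prescribed imdone format:

```markdown
# Story: Reduce drop-off on onboarding step 2

## Hypothesis
We believe simplifying the step-2 form will raise completion,
and we will know by a higher step-2 to step-3 conversion rate.

## Acceptance Criteria
- [ ] Form asks only for email and team name
- [ ] Conversion event fires on successful submit

## Success Metric
Step-2 completion rate (baseline and target still to be confirmed)

## Open Assumptions
- Users abandon because of form length, not page load time

## Demo Plan
Walk through the new form, then show the conversion dashboard

## Decision Log
- Dropped the phone-number field; support confirmed it is unused
```

The exact headings matter less than the fact that the hypothesis, the evidence gate, and the open assumptions live in one place the next reader, human or AI, will actually see.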
A Concrete Example
On my project at Flexion, the problem wasn't "Jira is bad."
The problem was flow recovery.
Teams lost context when work paused, priorities shifted, people rotated, or conversations happened across Slack, Jira, meetings, personal notes, and memory. Handoffs depended too much on whoever touched the story last.
In our HDD pilot, a story was given a predictable structure and worked through in phases:
- Define the outcome.
- Prove the outcome.
- Confirm the outcome.
The workflow captured acceptance criteria, success metrics, demo notes, design notes, implementation planning, and open assumptions directly in the story folder.
The important part wasn't that the team produced more artifacts; it was that the story became resumable.
When the session stopped and started again later, the next step was visible. The assumptions were still there. The plan was still there. The evidence gate was still there. A human could resume. An AI assistant could resume. The team didn't need to reconstruct the whole story from scattered context.
That's a different kind of backlog.
Why This Matters More With AI
DORA's 2025 State of AI-assisted Software Development frames AI as an amplifier of an organization's existing strengths and weaknesses. That matches what I see in day-to-day delivery work.
There's a stronger research thread underneath this than "AI needs better prompts." Team cognition research has long studied how teams build the shared mental state they need to anticipate and coordinate. In a 2021 meta-analysis, Niler, Mesmer-Magnus, Larson, Plummer, DeChurch, and Contractor synthesized 107 independent studies with 7,778 participants. The chart below plots rho, the correlation coefficient from that meta-analysis, so higher bars mean a stronger positive relationship between team cognition and team performance. The overall relationship is positive, and it grows stronger under conditions that sound a lot like modern software work: high external interdependence, temporal dispersion, and geographic dispersion.
```mermaid
xychart-beta
    title "Team cognition is associated with team performance"
    x-axis ["Overall", "External", "Time", "Location"]
    y-axis "rho" 0 --> 0.5
    bar [0.35, 0.41, 0.36, 0.35]
```
Source: Conditioning team cognition: A meta-analysis, Organizational Psychology Review, 2021. In the graph, "External" means high external interdependence, "Time" means temporal dispersion, and "Location" means geographic dispersion. This isn't direct proof that HDD improves delivery. I read it as evidence for the underlying bet: making team context explicit and shared is a real lever, not just a documentation preference.
If the team already has clear context, AI can amplify that.
If the context is scattered, stale, or implicit, AI can amplify that too.
This is why I don't think the leadership question is only "Which AI coding tool should we standardize on?"
A better question is:
If a developer or AI assistant picked up this story tomorrow, would they know the outcome, current decision path, and evidence gate without asking someone to replay it?
If the answer is no, the team has a shared-working-memory problem.
What The HDD Skill Adds
The HDD skill is a guide for keeping that memory useful. It pushes for the smallest slice that can prove the outcome. It asks for the hypothesis and what success looks like, and it records open assumptions. It creates a demo plan early and keeps it current as the work changes.
When design is needed, it captures it. When implementation happens, it uses red/green/refactor so the next person sees the test-first intent, the smallest green change, and the cleanup step. And crucially: when the work changes, it records why instead of leaving that reasoning only in chat.
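As an illustration of the trail red/green/refactor leaves behind, here is a minimal, hypothetical example; the `slugify` function and its story are invented, not part of the HDD skill itself:

```python
# 1. RED: the test is written first, so the next reader sees the
#    intended outcome before any implementation exists.
def test_slugify_joins_words_with_hyphens():
    assert slugify("Shared Working Memory") == "shared-working-memory"

# 2. GREEN: the smallest change that makes the test pass.
# 3. REFACTOR: tidy while the test stays green -- here, str.split()
#    with no arguments also collapses extra whitespace for free.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

test_slugify_joins_words_with_hyphens()
```

The point isn't the function; it's that the commit history now records the intent (the test), the minimal change, and the cleanup as separate, readable steps.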
That last part matters most. A lot of delivery risk isn't in the code. It's in the unrecorded reasoning around the code.
What I Want Engineering Leaders To Look For
You don't need this workflow for every tiny task. Use it where context loss is expensive: vague stories, multi-day work, multiple hands, likely handoffs, AI collaboration beyond narrow edits, or when the decision path matters as much as the final code.
Then inspect one active story and ask:
- Could a teammate pick this up tomorrow?
- Could an AI assistant read the real intent?
- Is the success signal visible?
- Are the open assumptions written down?
- Is the next step clear without asking the author?
If not, the issue isn't story quality; it's shared-memory quality.
Try It On One Story
Pick one story that's currently at risk of becoming a handoff problem.
Write the hypothesis.
Write the success signal.
Capture the assumption that would make the story fail.
Add the smallest demo plan.
Then ask whether the story is easier for a human and an AI assistant to resume.
If you want to see the workflow, I recorded a short walkthrough of how I use the HDD skill with imdone: https://youtu.be/GE48aDZwfPQ
If you want to try it in a repo, install imdone-cli from npm.
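Assuming a standard Node.js setup, a global install is the usual route; the `-g` flag is my assumption here, so check the package README for the recommended invocation:

```shell
# Install the imdone CLI globally via npm (requires Node.js and npm).
npm install -g imdone-cli
```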
The goal is less replay.
Your backlog should help your team remember what it's trying to learn.