Beyond the System Prompt: Engineering ‘Clinical Memory’ into LLMs
Memory is the missing layer between conversation and understanding
Most AI products today have the memory of a goldfish.
They either forget everything, or they store fragments of chat history and hope retrieval does the rest. This typically isn’t an issue for simple use cases, but it breaks quickly when context actually matters.
Therapy is one of those domains.
A therapist doesn’t operate on chat history. They build a model of the person, something that evolves over time. Patterns, emotional triggers, relationship dynamics, personal history. None of this is extracted from a single conversation. It’s built gradually, session after session.
Most LLM systems don’t build context this way. They remember conversations, not people. Until now.
To make this work, we had to introduce a hard constraint: the chat doesn’t build memory. Memory comes from structured therapist sessions, where signal is higher and context is curated. The chat only reads that state; it doesn’t create it.
This is the gap we ran into while building ORA, the chat co-pilot we’re developing at OurRitual. The goal isn’t to replace therapy, but to extend it, supporting users in the moments between sessions, when something surfaces and needs to be processed, but there’s no therapist in the room.
When you’re trying to create a tool with clinical value for users seeking relationship support, the distinction between remembering conversations and modeling people is the root of the problem.
The common approach to “memory” looks reasonable at first. Store past messages. Extract a few facts. Retrieve them later. Inject them into the prompt.
But this breaks in predictable ways.
The system loses track of time. It surfaces irrelevant details. It connects unrelated events. It starts telling a story that isn’t actually true.
You’ve probably seen this. A user once mentions wanting to travel to Sri Lanka, and at another point talks about a past injury. A typical system will try to connect everything.
When the user says, “I’ve been feeling stuck lately,” a typical model responds: “Maybe planning your Sri Lanka trip could help.”
This isn’t a bug in retrieval. It’s a missing abstraction. A good system would respond, “Can you tell me more about what ‘stuck’ feels like right now?” and only afterward, if relevant, connect past context.
The system has no concept of structure, no notion of temporal relevance, and no way to distinguish between types of information. So it fills the gaps by guessing.
In therapy, memory is something else entirely.
It’s not a list of facts or preferences. It’s a structured, evolving representation of a person across time. It includes personal history, like childhood events and past relationships, alongside current relationship dynamics. It captures patterns, like avoidance or a need for validation. It tracks key events, conflicts, and transitions. And it understands when things happened.
Crucially, these elements are not interchangeable. A past relationship is not the same as a current one. A recurring pattern is not the same as a single event. A childhood experience carries a different weight than something that happened last week.
If all of this is stored as free-form text, the model has to infer those distinctions every time, and it often gets them wrong.
So we don’t store memory as text. We store it as a structured state.
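To make "structured state" concrete, here is a minimal sketch of what such a representation could look like. Every class and field name below is a hypothetical illustration, not ORA's actual schema. The point is only that past versus current, pattern versus event, and "when" become explicit fields rather than something the model has to infer from prose.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional


class RelationshipStatus(Enum):
    PAST = "past"
    CURRENT = "current"


@dataclass
class Relationship:
    label: str                      # hypothetical identifier, e.g. "partner"
    status: RelationshipStatus      # past and current are never interchangeable


@dataclass
class Pattern:
    name: str                       # e.g. "avoidance"
    strength: float                 # assumed 0.0 to 1.0 scale
    last_observed: date


@dataclass
class Event:
    summary: str
    occurred_on: date               # explicit time, not inferred from text
    relationship: Optional[str] = None


@dataclass
class ClinicalState:
    history: list[str] = field(default_factory=list)
    relationships: dict[str, Relationship] = field(default_factory=dict)
    patterns: dict[str, Pattern] = field(default_factory=dict)
    events: list[Event] = field(default_factory=list)

    def current_relationships(self) -> list[Relationship]:
        """Return only relationships marked as current."""
        return [
            r for r in self.relationships.values()
            if r.status is RelationshipStatus.CURRENT
        ]


state = ClinicalState()
state.relationships["partner"] = Relationship("partner", RelationshipStatus.CURRENT)
state.relationships["ex"] = Relationship("ex", RelationshipStatus.PAST)
```

With types like these, a question such as "which relationship is this event about?" is answered by a field lookup instead of a guess.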
At OurRitual, that state is not built from chat.
Conversations are input. Not memory.
Memory is derived from structured sessions with therapists. These sessions are transcribed, and from those transcripts we extract signals: events, patterns, and relationship dynamics. Those signals are then merged into an existing representation of the user.
This is not an append-only system. Each new session interacts with what already exists.
If a pattern repeats, it strengthens. If behavior changes, the system reflects that shift. If a new context emerges, like a past relationship being discussed, it is classified correctly and kept separate from the current one.
The result is a continuously evolving state, not an ever-growing list.
This distinction is what allows the system to maintain coherence over time.
Updating memory is the hardest part.
In most LLM systems, memory just accumulates. Nothing is reconciled. Nothing is reweighted.
That doesn’t work when you’re modeling people.
People change. Signals contradict each other. Old patterns lose relevance. New ones emerge.
So every update must do more than add information. It has to reinterpret what already exists.
We treat this as a reconciliation loop. The system takes the previous state, combines it with signals from the latest session, and produces a new version of the user’s context. Not longer. Updated.
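A toy version of that loop, restricted to patterns, might look like the sketch below. It is an illustration under stated assumptions, not our production logic: the `reconcile` function, the `Pattern` type, and the numeric constants are all hypothetical, and a real merge step would need to handle contradictions with far more nuance than a fixed decay rate.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class Pattern:
    name: str
    strength: float        # assumed 0.0 to 1.0 scale
    last_observed: date


# Hypothetical tuning constants, chosen only for illustration.
REINFORCE = 0.2            # added when a pattern shows up again
DECAY = 0.5                # multiplier when a pattern is absent from a session
DROP_BELOW = 0.1           # patterns weaker than this are removed


def reconcile(
    previous: dict[str, Pattern],
    observed: set[str],
    session_date: date,
) -> dict[str, Pattern]:
    """Build a new pattern map from the old one plus one session's signals."""
    updated: dict[str, Pattern] = {}
    for name, pattern in previous.items():
        if name in observed:
            # A repeated pattern strengthens, capped at 1.0.
            strength = min(1.0, pattern.strength + REINFORCE)
            updated[name] = replace(
                pattern, strength=strength, last_observed=session_date
            )
        else:
            # An unobserved pattern loses relevance and may drop out entirely.
            strength = pattern.strength * DECAY
            if strength >= DROP_BELOW:
                updated[name] = replace(pattern, strength=strength)
    # Patterns seen for the first time enter at a low starting strength.
    for name in sorted(observed - set(previous)):
        updated[name] = Pattern(name, REINFORCE, session_date)
    return updated
```

The function returns a new map rather than mutating the old one, so the previous version of the state stays available for comparison or audit.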
Memory becomes something that evolves, not something that grows.
Even with structured, evolving memory, another question remains: when should it be used?
In most systems, the answer is simple. Retrieve what seems relevant and inject it into the prompt.
That approach fails in practice.
In therapy, timing matters. If someone is just beginning to explain an issue, introducing historical context too early can derail the interaction. It can feel intrusive. Or worse, it can bias the conversation before the user has fully expressed themselves.
So memory usage needs to be controlled.
We implemented this using a state machine that mirrors how therapists operate in practice.
At the beginning, the system listens. It focuses only on understanding the current issue. No memory is injected. Then it moves into a deepening phase, where it can carefully introduce relevant patterns or past context to help the user explore their emotional state. Finally, there is an optional phase where the system can guide the user toward resolution.
Not every interaction reaches that stage. Sometimes the goal is not to solve, but to understand.
The state machine enforces this progression. It determines not just what memory is available, but when it is appropriate to use it.
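As a rough sketch, the gating can be expressed as a phase enum plus two lookup tables: one for allowed transitions and one for which memory categories each phase may read. The phase names, the category names, and the exact access policy here are assumptions made for illustration; in particular, what the resolution phase may read is a guess.

```python
from enum import Enum, auto


class Phase(Enum):
    LISTENING = auto()     # understand the current issue, no memory injected
    DEEPENING = auto()     # may introduce relevant patterns or past context
    RESOLUTION = auto()    # optional, not every interaction gets here


# Hypothetical policy: memory categories readable in each phase.
MEMORY_ACCESS = {
    Phase.LISTENING: frozenset(),
    Phase.DEEPENING: frozenset({"patterns", "history"}),
    Phase.RESOLUTION: frozenset({"patterns", "history", "events"}),
}

# Phases only move forward; there is no path back to LISTENING.
ALLOWED_TRANSITIONS = {
    Phase.LISTENING: {Phase.DEEPENING},
    Phase.DEEPENING: {Phase.RESOLUTION},
    Phase.RESOLUTION: set(),
}


class Conversation:
    def __init__(self) -> None:
        self.phase = Phase.LISTENING

    def advance(self, target: Phase) -> None:
        """Move to the next phase, rejecting transitions the table forbids."""
        if target not in ALLOWED_TRANSITIONS[self.phase]:
            raise ValueError(f"cannot move from {self.phase.name} to {target.name}")
        self.phase = target

    def visible_memory(self, memory: dict) -> dict:
        """Filter memory down to the categories the current phase may read."""
        allowed = MEMORY_ACCESS[self.phase]
        return {key: value for key, value in memory.items() if key in allowed}
```

Because the filter runs before anything reaches the prompt, a conversation in the listening phase cannot surface history even if retrieval would have ranked it as relevant.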
This leads to a broader architectural constraint.
The LLM does not own memory.
Memory is managed as a separate system of record. The model is a reasoning layer on top of it.
The flow is simple. The system retrieves relevant context based on the current state. The model is conditioned on that context and generates a response. Memory updates happen separately, through the structured pipeline that processes therapist sessions.
This separation prevents the model from inventing or distorting memory. It also ensures that updates are grounded in higher-quality signals rather than noisy conversational data.
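One way to picture the separation is a store with a single write path and a read-only view for the chat. The sketch below is hypothetical throughout: `MemoryStore`, `respond`, and the `generate` callable are stand-ins, and the prompt format is invented. It uses Python's standard `types.MappingProxyType` for the read-only view, which guards only the top-level mapping, so nested values would still need copying or freezing in a real system.

```python
from types import MappingProxyType
from typing import Callable, Mapping


class MemoryStore:
    """System of record. Only the session pipeline is meant to call write()."""

    def __init__(self) -> None:
        self._states: dict[str, dict] = {}

    def write(self, user_id: str, state: dict) -> None:
        # Store a copy so later changes by the caller do not leak in.
        self._states[user_id] = dict(state)

    def read_only(self, user_id: str) -> Mapping:
        # The chat path gets a view that rejects item assignment.
        return MappingProxyType(self._states.get(user_id, {}))


def respond(
    store: MemoryStore,
    user_id: str,
    message: str,
    generate: Callable[[str], str],
) -> str:
    """Chat path: read context, condition the model, never write memory."""
    context = store.read_only(user_id)
    prompt = f"Context: {dict(context)}\nUser: {message}"
    return generate(prompt)


store = MemoryStore()
store.write("user-1", {"patterns": ["avoidance"]})   # session pipeline side
view = store.read_only("user-1")                      # chat side
```

Trying `view["patterns"] = []` raises a `TypeError`, which is the property the chat path relies on.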
One implication of this design is that the chat itself does not build memory.
It uses memory, but it doesn’t create it.
Memory comes from therapist sessions, which are higher signal and more structured. These sessions are transcribed, processed, and used to update the user’s state. The chat co-pilot operates on top of that state, with its behavior constrained by the state machine.
This is a deliberate tradeoff. It prioritizes accuracy and consistency over immediacy.
Once you put these pieces together (structured state, reconciliation, controlled retrieval, and state-driven interaction), you get something fundamentally different from a typical chatbot.
You get a system that operates over time.
It can detect patterns across sessions. It can maintain continuity between interactions. It can respond not just to what the user says now, but to how they’ve been evolving.
At that point, you’re no longer building a prompt-driven interface.
You’re building a stateful system.
There are adjacent problems here: hallucination risks, privacy, safety, and the broader implications of storing sensitive human data. All of them matter. None of them are trivial.
But they sit on top of a more basic constraint.
If your system doesn’t have a coherent notion of memory, you don’t have a foundation to address any of them.
Most AI products today are still built around prompts and retrieval. Memory, if it exists at all, is thin and loosely defined.
That approach will plateau.
The next generation of systems will be memory-native. They will model users over time, maintain structured state, and control how context is applied.
The system prompt was never the bottleneck. Memory was.




