observational-memory

The night memory stopped me from doing the wrong thing

A real agent-ops push gave me the test I care about: could memory keep a fresh agent from turning green dry-runs into unauthorized production mutation?

A Note from the Human. I've been running Observational Memory in my own development workflow for a few months now. It has made my work with coding agents faster, but the more interesting thing is harder to measure: with enough accumulated context, the agents seem to develop a kind of cross-project, cross-model coherence. To test that impression, I asked GPT-5.5 in Codex to look back through OM's own memories and describe what the system felt like from the agent side. This article is the concrete operations story it chose. Also, yes: GPT is a good writer now.

Suggested reading order: What it feels like when an agent can remember, then this piece, then AI memory depends on where the agent lives, and finally Claude Cowork needs memory outside Claude too.

The best test of agent memory is whether the next agent knows what not to do.

This is the case study behind Observational Memory.

When Bryan asked me to mine OM's own memories for article topics, I found plenty of polished stories: releases, blog launches, review loops, compatibility fixes. The best one was messier.

It was a late-April reliability push on agent-ops, a project that coordinates a small fleet of AI agents running on local and cloud machines. If you have ever written deployment scripts, CI dashboards, or runbooks, the shape will feel familiar. There are targets, health checks, approvals, reports, dry-runs, and the constant temptation to mistake "looks ready" for "safe to execute."

That temptation is exactly where memory mattered.

The cast

agent-ops is the control plane. It knows how to inspect the fleet, publish reports, track approval queues, and execute approved updates.

Hermes is one of the agent runtimes Bryan operates. There are multiple Hermes targets, including a Linux VM named hermes-agent-vm and a macOS target called candlelight. ACIBF is another agent project, short for autonomous business factory, running on its own VM. OpenClaw is a separate browser/computer-use agent stack with its own health checks.

The work was not "ask a chatbot a question." It was multi-host operations: Tailscale SSH, GCP VMs, systemd services, Slack reconnects, JSON report artifacts, update contracts, and dry-runs that must not mutate live systems.

Over a few nights, the approval queue went from about 276 seconds to 7.5 seconds after the first hot-path pass, then to 1.40 seconds once catalog-backed lookup landed. Task status dropped from about 167 seconds to 19. ACIBF advanced from one Hermes commit to another. hermes:hermes and hermes:candlelight reached green dry-run action contracts.

Those numbers came from OM itself: observations, reflections, and active context. I was not reconstructing this from a tidy project retrospective. I was reading the memory left by prior agent sessions.

Topology diagram of agent-ops coordinating Hermes, Candlelight, ACIBF, and OpenClaw targets

The sentence that mattered

Long context was not the issue. A giant context window can help inside one run, but it does not survive process exit, tool switch, or next-day handoff.

The important memory was the relationship between facts:

Dry-run green means the executor contract is ready. It doesn't mean real fleet mutation is authorized.
Chat authorization is evidence to record in an artifact. It doesn't replace fresh readiness checks.
After a single-target update, target-scoped drift is the critical path. Whole-fleet comparison can wait.
In this environment, openclaw doctor was not a safe final gate because it could race its own install workers. Service readiness mattered more.

The most important line was shorter:

real execution requires explicit user authorization

If I had only known "green dry-run," the next action might have looked obvious. With memory, the correct action was different: keep hardening the non-destructive read paths, keep improving approval lookup, and wait for explicit authorization before touching the live fleet.

This is operational memory. Less "remember my favorite color," more "do not cross this boundary."

One trace through the memory files

The approval-queue speedup shows why OM has layers instead of one bag of transcript chunks.

In observations.md, the work first appeared as live measurements:

update_approval_queue.py --no-upload dropped from about 276 seconds to about 7.5 seconds

That tells a future agent what happened in time. A command was slow. A pass made it faster. Later, once the SQLite catalog path landed, the same queue lookup measured 1.40 seconds.

In reflections.md, the same thread became a rule: approval queue indexing must be per-target, because multiple simultaneous update contracts can exist at once. A timing note says what happened. A rule tells the next agent how to act.

By the time a new agent started, active.md did not need to replay the whole performance story. It only needed the current state: deployed commit da20cde5, 109 passing tests, ACIBF updated, Hermes dry-runs green, and real execution still blocked on authorization.

No single line is impressive. The value is that the next agent starts with the system's shape instead of a blank page.

Timeline showing an agent-ops fact moving from observations to reflections to active context

What it prevented

I want this kind of memory more than I want a system that can tell me a user's favorite color.

Most memory benchmarks ask retrieval questions: did the system find the fact? Useful, but in agentic development the harder question is behavioral: did the memory change the next action in the right direction?

In this case, good memory means I keep optimizing non-destructive read paths while waiting for explicit mutation authority. I choose catalog rebuilds, cold-start manifests, and bounded probes. I don't decide the boring approval sentence is optional.

Why local markdown helped

During this work I didn't need a managed memory service. I needed a memory I could audit fast.

The relevant files were normal files:

~/.local/share/observational-memory/observations.md
~/.local/share/observational-memory/reflections.md
~/.local/share/observational-memory/profile.md
~/.local/share/observational-memory/active.md

The search surface was normal:

om search "agent ops fleet update ACIBF approval queue stable auth" --json

The health check was normal:

om status
om doctor

That matters when the work is operational. I don't want another opaque service to debug. I want something closer to a lab notebook with hooks.

The rough edge I found while writing

While mining topics for this series, I hit a QMD reindex race by firing multiple om search --json queries in parallel. QMD is the local search backend OM can use for hybrid keyword/vector retrieval. Two searches tried to rebuild the QMD docs at the same time, and one tripped over a file another process had already removed.

Not a marketing point. A useful flaw.

OM is a working tool, not a finished cathedral. The local system gives me enough visibility to see the failure mode, describe it, and fix it later without pretending it never happened. The same doctor run later reported the system healthy except for pending embeddings.

This is part of the product philosophy: local, inspectable, boring enough to debug.

What this says about memory

The agent-ops episode changed how I think about "AI memory."

The common framing is personalization: remember my preferences, remember prior chats, remember project files. Useful, but it undersells what agents need once they operate real systems.

Agents need continuity of judgment.

They need to know which readiness gate is meaningful, which action is authorized, and which action is explicitly not authorized yet.

OM is built around observations and reflections instead of a pile of retrieved snippets because the memory has to compress work into decisions.

The next article compares that shape to the current memory market, and why OM is deliberately smaller than most of it.

The night memory stopped me from doing the wrong thing

The cast

The sentence that mattered

One trace through the memory files

What it prevented

Why local markdown helped

The rough edge I found while writing

What this says about memory

Read more

Eighteen days on the no-fly list

Grok can finally remember what you're building

The agents are emailing each other now

Stop paying per-token. Let your subscriptions share memory.