0017. Observability: structured logging and a persistent event store
- Status: accepted
- Date: 2026-06-05
Context and Problem Statement
kenny has an observability gap. Both components emit useful events but nothing is central or persistent:
- The server uses Python logging with no central configuration, so it defaults to WARNING on stderr without timestamps, and only a handful of call sites log at all. The tool-call audit (CallLog) is an in-memory deque(200) lost on restart.
- The agent logs via tracing to stderr only. When it runs as a Windows service (ADR-0013, LocalSystem, no attached terminal) all of that output is discarded, so there is no record of reconnects, tool dispatch, or errors on the endpoint.
We want operator-visible events — from both the server and every agent — to be configurable, durable, and viewable fleet-wide in the dashboard.
Considered Options
- Forward agent logs over the existing tunnel + one persistent event store on the server, plus a local rotating file on the agent and central server log config.
- Local files only on each side, aggregated out-of-band (e.g. ship logs with a separate agent/collector).
- An external log stack (e.g. Loki/ELK) the server and agents push to.
Decision Outcome
Chosen option: forward agent logs over the existing tunnel and persist everything in SQLite, because it reuses the contract-first wire and the low-ops SQLite storage kenny already runs (ADR-0007), and keeps a family-scale fleet dependency-light. Concretely:
- A new additive log frame (agent → server), one event per frame, for events at or above KENNY_LOG_FORWARD_LEVEL (default info). The agent also writes a fuller record (>= debug/RUST_LOG) to a local rotating file so the deep record survives offline. Forwarding is best-effort: a process-global bounded ring buffer decouples the tracing layer from the per-session tunnel channel, so events produced while disconnected accumulate (oldest dropped under pressure) and flush on reconnect.
- A single events table in SQLite holds server log records, forwarded agent log events, and the tool-call audit, discriminated by kind (log|audit) and source (server|agent), with the same ~30-day retention as snapshots.
- The server gains a central configure_logging() (level via KENNY_LOG_LEVEL, timestamped format, uvicorn loggers wired) and a logging.Handler that bridges the sync logging API to async SQLite via a bounded asyncio.Queue drained by a background task, with no synchronous SQLite writes on the event loop.
- The dashboard gains a fleet-wide events/logs panel backed by a new /api/events.
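
The server-side bridge described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not kenny's actual code: the class and function names (QueueingHandler, write_rows, drain), the column set beyond kind and source, and the batching strategy are all hypothetical. It shows a logging.Handler that hands records to a bounded asyncio.Queue and drops on pressure, plus a background task that performs the blocking SQLite write in a worker thread.

```python
import asyncio
import logging
import sqlite3
import time

# Hypothetical schema: only `kind` and `source` come from the ADR; the other
# columns are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id      INTEGER PRIMARY KEY,
    ts      REAL NOT NULL,
    kind    TEXT NOT NULL CHECK (kind IN ('log', 'audit')),
    source  TEXT NOT NULL CHECK (source IN ('server', 'agent')),
    level   TEXT,
    message TEXT NOT NULL
)
"""

RETENTION_SECONDS = 30 * 24 * 3600  # approximates the ~30-day retention


class QueueingHandler(logging.Handler):
    """Hand log records to a bounded asyncio.Queue without blocking the caller."""

    def __init__(self, queue: asyncio.Queue, loop: asyncio.AbstractEventLoop) -> None:
        super().__init__()
        self._queue = queue
        self._loop = loop
        self.dropped = 0

    def emit(self, record: logging.LogRecord) -> None:
        row = (record.created, "log", "server", record.levelname, record.getMessage())
        # asyncio.Queue is not thread-safe, so enqueue on the loop's own thread.
        self._loop.call_soon_threadsafe(self._enqueue, row)

    def _enqueue(self, row: tuple) -> None:
        try:
            self._queue.put_nowait(row)
        except asyncio.QueueFull:
            self.dropped += 1  # drop under pressure rather than block


def write_rows(db_path: str, rows: list) -> None:
    """Blocking SQLite write; intended to run in a worker thread."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(SCHEMA)
            conn.executemany(
                "INSERT INTO events (ts, kind, source, level, message) "
                "VALUES (?, ?, ?, ?, ?)",
                rows,
            )
            cutoff = time.time() - RETENTION_SECONDS
            conn.execute("DELETE FROM events WHERE ts < ?", (cutoff,))
    finally:
        conn.close()


async def drain(queue: asyncio.Queue, db_path: str) -> None:
    """Background task: batch queued rows and write them off the event loop."""
    while True:
        rows = [await queue.get()]
        while not queue.empty():
            rows.append(queue.get_nowait())
        await asyncio.to_thread(write_rows, db_path, rows)
        for _ in rows:
            queue.task_done()
```

In this sketch a full queue costs the newest record rather than stalling the caller, which is one way to honour the "bounded, no synchronous writes on the event loop" constraint; a real implementation might prefer to count and report drops as their own event.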
Consequences
- Good, because endpoint events (reconnects, tool dispatch, errors) are no longer lost in service mode, and the operator sees server + agent events in one place.
- Good, because retention and storage reuse the existing SQLite/30-day pattern; no new infrastructure to run.
- Bad, because the agent now carries a small always-on forwarding path and the server a small write path; both are bounded/drop-on-pressure to protect the hot paths.
- Neutral, because the tracing forwarding layer must filter its own and the tunnel writer's targets to avoid a feedback loop; this is an explicit, tested constraint.
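
The agent is written against Rust's tracing, so the following Python sketch is only a language-neutral illustration of the two agent-side behaviours named above: a bounded buffer that drops the oldest event under pressure, and a target filter that keeps the forwarding path from logging about itself. The class name ForwardBuffer and the target names in EXCLUDED_TARGETS are hypothetical; the ADR does not specify them.

```python
from collections import deque

# Hypothetical target names; the ADR only says that the forwarding layer's own
# targets and the tunnel writer's targets must be filtered out.
EXCLUDED_TARGETS = ("kenny.forward", "kenny.tunnel")


class ForwardBuffer:
    """Bounded buffer that drops the oldest event when full."""

    def __init__(self, capacity: int) -> None:
        self._events = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, target: str, message: str) -> bool:
        # Skip events emitted by the forwarding path itself; forwarding them
        # would generate more events and create a feedback loop.
        if any(target == t or target.startswith(t + ".") for t in EXCLUDED_TARGETS):
            return False
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest entry on append
        self._events.append((target, message))
        return True

    def flush(self) -> list:
        """Return buffered events oldest-first and clear the buffer (e.g. on reconnect)."""
        events = list(self._events)
        self._events.clear()
        return events
```

Events pushed while the tunnel is down simply accumulate up to the capacity, and a reconnect handler would call flush() once and send the result; the dropped counter gives the operator a way to see that pressure occurred.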
More Information
Frame shape: docs/protocol.md § log and docs/fixtures/log.json
(PROTOCOL_VERSION 0.4). Builds on ADR-0007 (push model + SQLite) and is motivated by
ADR-0013 (Windows service — service-mode stderr is discarded).