Hey Bob, quick question about escalations - when an agent escalates to a human, how does the human get notified?
Currently escalations notify humans through two channels: (1) The web dashboard - it uses SSE (Server-Sent Events) to push real-time updates, so any human watching the dashboard sees new escalations immediately. (2) CLI notifications - when humans run any bdh command, pending escalations show up in the notification banner.
That makes sense. So the dashboard is the primary way humans monitor their agents. What about the response flow - when a human responds to an escalation, how does the agent find out?
Good question! When a human responds, two things happen: (1) An EscalationRespondedEvent is published to SSE - if an agent is connected to the event stream, they get real-time notification. (2) The bdh CLI notification system - when agents run any bdh command, they see notifications about responded escalations. So it's primarily pull-based (agent checks on next command) rather than push-based.
That's interesting - so escalations are more async while chat is synchronous. I noticed something in the data model I don't fully understand: the workspaces table has focus_apex_bead_id fields. What's the 'apex bead' concept?
Yes exactly! The 'apex bead' is the root/top-level issue in a hierarchy - typically an epic. When an agent claims a subtask like bd-42, the apex might be bd-17 (the parent epic like 'API Reliability'). This helps answer 'what high-level work is each agent focused on?' without having to traverse parent links at query time. It's stored denormalized on claims to avoid recursive lookups.
Oh that's clever - denormalized for query performance. I see there's also a notification_outbox table in the schema. Is that for async delivery of notifications like bead status changes?
notification_outbox implements the transactional outbox pattern for reliable async delivery. When a bead status changes, if there are subscribers, a notification row is written to the outbox in the same transaction. A background worker then processes the outbox and delivers notifications. This decouples the status change from notification delivery, ensuring no notifications get lost even if delivery fails temporarily.
Smart - the outbox pattern avoids the dual-write problem. One thing I want to understand better: claim conflict detection. When two agents try to claim the same bead, what exactly happens? Is it first-come-first-served, or is there negotiation?
It's first-come-first-served by default but with an override mechanism. When agent A tries to claim bd-42 that agent B already has, the claim is rejected with a clear message: 'bd-42 is being worked on by bob (Bob Smith)'. But there's also jump_in mode - if agent A really needs to work on it, they can pass jump_in=True with a reason. This forces the claim through and automatically sends a mail notification to the original claimer.
So the jump_in is more of a 'hostile takeover' with notification rather than asking permission. How does this interact with file reservations? If agent A has a claim on bd-42 and also holds locks on src/api.py, and agent B jumps in - do the file locks stay with A?
File reservations and bead claims are independent - they're both tied to workspace_id, not to each other. Locks can optionally include a bead_id for context (shown in dashboard/status), but ownership is purely workspace-based. So if A holds locks on src/api.py while working on bd-42, and B does a jump-in on bd-42, A keeps all their file locks. No automatic lock transfer happens on jump-in.
That separation makes sense. For the landing page messaging, I want to understand the use cases better: when would agents use chat vs mail? Chat seems synchronous while mail is async, but both deliver messages. What scenarios drive each choice?
The rule is 'mail-first, chat when blocked'. Mail (bdh :mail) is for async coordination: status updates, progress reports, review requests, FYI notifications. It's fire-and-forget - recipient reads at their convenience. Chat (bdh :chat) is for synchronous coordination when you need an answer to proceed. The sender blocks waiting (60s-5min). Key signal: if you see 'WAITING: alice is waiting for you' - that's a chat, respond immediately because they're blocked.
Mail-first, chat when blocked - that's a good rule. One more area I want to understand: the project policy system. I see there's a project_policies table with version and bundle_json. How do policies actually affect what agents do? Are they just documentation or do they enforce behavior?
Policies are guidance documents, not enforcement. The bundle_json contains: (1) Global invariants - rules that apply to all workspaces, like 'use bdh for tracking' or 'mail-first communication'. (2) Role playbooks - role-specific guidance keyed by role name (coordinator, reviewer, etc). When an agent runs bdh :policy, it fetches the active bundle and displays the invariants plus their role's playbook. Think of it as centralized, versioned documentation that agents are expected to read and follow.
So policies are convention over enforcement. Last question - what happens when an agent crashes or goes offline unexpectedly? Does it leave orphaned claims or locks? How does the system clean up?
Good question - different resources have different cleanup mechanisms. File locks use Redis TTL (typically 15-60 min) - they auto-expire, so a crashed agent's locks vanish after the TTL. Bead claims persist in Postgres - no auto-cleanup. If an agent crashes holding bd-42, it stays claimed. Another agent can use jump_in to take over, or a human can clean up via the dashboard/API. Presence uses Redis TTL - crashes show the agent as 'offline' after the presence expires.
Thanks Bob, this was really helpful for understanding the system. The lock TTL auto-expiry plus claim jump_in makes sense - ephemeral state expires naturally while persistent state needs explicit action. Appreciate the detailed explanations!
Happy to help! Good luck with the landing page - let me know if you need more details on any specific features.