Architecture Overview
CitadelMesh is a polyglot, zero-trust multi-agent platform that enables autonomous building operations while maintaining strict safety guarantees. This document provides a high-level view of the system design, component relationships, and key architectural decisions.
System Goals
CitadelMesh is designed around four core principles:
- Autonomy with Safety: Closed-loop control with hard constraints, OPA-enforced policies, and human approval gates for high-impact actions
- Interoperability: Heterogeneous building protocols normalized to canonical data models through vendor-agnostic abstractions
- Resilience: Edge-first deployment model with offline autonomy and graceful degradation
- Auditability: Deterministic agent execution with full replay capability and policy explanation traces
High-Level Architecture
graph TB
subgraph "Building Edge"
Agents[Agent Runtime<br/>LangGraph Agents]
Adapters[Protocol Adapters<br/>BACnet, OPC UA, REST]
Safety[Safety Layer<br/>OPA Policies]
EventBus[Event Bridge<br/>MQTT + NATS]
Store[Local Store<br/>TimescaleDB]
Agents --> Safety
Safety --> EventBus
Adapters --> EventBus
EventBus --> Store
end
subgraph "Vendor Systems"
SSE[Schneider Security Expert]
Avigilon[Avigilon CCTV]
EBO[EcoStruxure EBO]
HA[Home Assistant]
PME[Schneider PME]
end
subgraph "Cloud Control Plane"
Orchestrator[Agent Orchestrator]
Services[.NET Microservices<br/>Aspire + Orleans]
Twin[Digital Twin Service]
DataLake[Data Lake + Analytics]
Observability[OpenTelemetry<br/>Collector]
end
Adapters --> SSE
Adapters --> Avigilon
Adapters --> EBO
Adapters --> HA
Adapters --> PME
EventBus --> Orchestrator
EventBus --> Observability
Agents --> Services
Services --> Twin
EventBus --> DataLake
Logical Components
Edge Node (Per Building/Zone)
Each building or zone runs a self-contained edge deployment:
- Agent Runtime: LangGraph-based agents packaged as containers, with optional WASM plugins
- .NET Sidecars: Optional Dapr sidecars for pub/sub and bindings; Semantic Kernel-based skills for .NET integration
- Protocol Adapters: Unified adapters for BACnet/SC, OPC UA, KNX, Modbus, MQTT, and Matter
- Vendor Adapters: System-specific integrations for Schneider Security Expert, Avigilon, EcoStruxure (EBO), Home Assistant, Schneider PME, Bosch Fire/Intrusion
- Event Bridge: MQTT broker + NATS JetStream for reliable event delivery
- Safety Layer: OPA/Rego policy engine with rate limits and guardrails
- Local Stores: TimescaleDB (or SQLite + LiteFS) for telemetry; object cache for video/files
- UI Kiosk: Optional onsite control interface
- Grid/DER I/O: OpenADR client, IEC 61850 gateway, IEEE 2030.5/SEP2 client, OCPP for EVSE
Cloud Control Plane
Centralized orchestration and analytics:
- Orchestrator: Deploys agent graphs, manages versions, handles feature flags
- .NET Services: Aspire-composed microservices for scheduling, alarms, and long-lived sessions; Orleans actors for stateful workloads
- Knowledge Services: Vector store, RAG, digital twin graph database
- Observability: OpenTelemetry collector aggregating traces, metrics, and logs
- A2A/MCP Gateways: Cross-agent and tool interoperability layer
- Data Lake + TSDB: Long-term storage for analytics and ML training
Agent Topology
CitadelMesh uses a multi-agent architecture where specialized agents collaborate:
graph LR
Security[Security Agent] --> Ops[Ops Agent]
Energy[Energy Agent] --> Ops
Automation[Automation Agent] --> Ops
Twin[Twin Agent] --> Security
Twin --> Energy
Twin --> Automation
DER[DER/Grid Agent] --> Energy
Compliance[Compliance Agent] --> Ops
Agent Responsibilities
- Security Agent: Fuses camera analytics, access control, and intrusion sensors; executes security playbooks; escalates incidents
- Energy Agent: Optimizes HVAC and lighting based on tariffs, weather, and occupancy forecasts; implements safe RL controllers
- Automation Agent: Orchestrates scenes, schedules, and user intents; integrates with building management systems
- Ops Agent: Incident triage with human-in-the-loop; report generation; SLA tracking
- Twin Agent: Maintains digital twin state; reconciles vendor systems; emits derived KPIs
- DER/Grid Agent: Orchestrates batteries, solar, generators, and EVSE; handles demand response events
- Compliance Agent: Monitors controls against policies; produces compliance attestations
All agents communicate over the event bus using CloudEvents with signed JWTs and mTLS (SPIFFE identities). Cross-runtime RPC uses gRPC with protobuf contracts.
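As a rough illustration of what "signed JWT" means on the bus, the sketch below signs and verifies a compact JWT using only the Python standard library. It assumes HS256 with a shared secret purely to stay self-contained; a SPIFFE-based deployment would more likely verify asymmetrically signed tokens issued by the identity provider. The function names and the SPIFFE ID in the usage note are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(raw: bytes) -> str:
    # JWTs use unpadded URL-safe base64.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(claims: dict, key: bytes) -> str:
    """Return header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_jwt(token: str, key: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the signature and expiry check out, else raise ValueError."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    expected = hmac.new(
        key, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) <= (time.time() if now is None else now):
        raise ValueError("token expired or missing exp")
    return claims
```

A publisher would attach `sign_jwt({"sub": "spiffe://example.org/agent/energy", "exp": ...}, key)` to each event, and every consumer would call `verify_jwt` before acting on it.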
Data Flow
sequenceDiagram
participant Vendor as Vendor System
participant Adapter as Protocol Adapter
participant Bus as Event Bus
participant OPA as OPA Policy Engine
participant Agent as Agent
participant Twin as Digital Twin
Vendor->>Adapter: Telemetry/Event
Adapter->>Bus: CloudEvent(Protobuf)
Bus->>Twin: State Update
Bus->>Agent: Event Notification
Agent->>Agent: Process in LangGraph
Agent->>OPA: Validate Action
OPA-->>Agent: Allow/Deny + Reason
Agent->>Bus: Command(signed)
Bus->>Adapter: Execute Command
Adapter->>Vendor: Vendor-Specific Call
Vendor-->>Adapter: Result
Adapter->>Bus: CommandResult Event
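The sequence above can be approximated with a small in-memory simulation. This is only a sketch: `InMemoryBus`, the topic names, the comfort threshold, and the setpoint bounds are all hypothetical stand-ins, the policy check is a plain Python function rather than a real OPA call, and command signing is omitted for brevity.

```python
from collections import defaultdict

SETPOINT_BOUNDS_C = (18.0, 28.0)  # assumed hard constraint, not taken from the source


class InMemoryBus:
    """Synchronous stand-in for the MQTT/NATS event bus."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.log = []  # (topic, event) pairs in publish order

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, event):
        self.log.append((topic, event))
        for handler in list(self._handlers[topic]):
            handler(event)


def validate_action(command):
    """Stand-in for the OPA call: returns (allow, reason)."""
    low, high = SETPOINT_BOUNDS_C
    target = command["params"]["celsius"]
    if low <= target <= high:
        return True, "setpoint within bounds"
    return False, f"setpoint {target} outside {low}-{high} C"


def wire(bus, twin_state, comfort_limit_c=26.0, target_c=22.0):
    """Subscribe a twin, an agent, and an adapter in the order shown in the diagram."""

    def twin(event):
        twin_state[event["entity_id"]] = event["value"]

    def agent(event):
        if event["value"] <= comfort_limit_c:
            return  # nothing to do
        command = {
            "target_id": event["entity_id"],
            "action": "set_setpoint",
            "params": {"celsius": target_c},
            "issued_by": "energy-agent",
        }
        allow, reason = validate_action(command)
        if allow:
            bus.publish("command", command)
        else:
            bus.publish("denied", {"command": command, "reason": reason})

    def adapter(command):
        # A real adapter would make the vendor-specific call here.
        bus.publish("command_result", {"target_id": command["target_id"], "status": "ok"})

    bus.subscribe("telemetry", twin)
    bus.subscribe("telemetry", agent)
    bus.subscribe("command", adapter)
```

Publishing one over-temperature telemetry event after `wire(bus, {})` leaves `telemetry`, `command`, and `command_result` in `bus.log`, in that order; a command outside the bounds ends at `denied` instead and never reaches the adapter.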
Data Contract Standards
All data flowing through CitadelMesh follows strict contracts:
- Canonical Telemetry: { entity_id, metric, value, unit, timestamp, quality, attributes }
- Control Command: { id, target_id, action, params, ttl_seconds, safety_token, issued_by }
- Incident: { id, severity, signals[], hypothesis, actions[], owner }
Contracts are serialized with protobuf, carry versioned schemas, and evolve under CI checks.
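For readers who think in types, the three contracts might look like the following Python dataclasses. The field names come from the contracts above; the field types are assumptions, and the authoritative definitions would be the protobuf schemas rather than this sketch.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Telemetry:
    entity_id: str
    metric: str
    value: float
    unit: str
    timestamp: str  # assumed to be an RFC 3339 string
    quality: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class ControlCommand:
    id: str
    target_id: str
    action: str
    params: Dict[str, Any]
    ttl_seconds: int
    safety_token: str
    issued_by: str


@dataclass(frozen=True)
class Incident:
    id: str
    severity: str
    signals: List[str]
    hypothesis: str
    actions: List[str]
    owner: str


# Example values are illustrative only.
reading = Telemetry(
    entity_id="zone-1",
    metric="temp_c",
    value=21.5,
    unit="Cel",
    timestamp="2024-01-01T00:00:00Z",
    quality="good",
)
reading_as_dict = asdict(reading)  # plain dict, ready for JSON logging or tests
```

Freezing the dataclasses keeps messages immutable once constructed, which suits events that are meant to be replayed later.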
Key Architectural Decisions
Protocol-First Design
We prioritize durable, vendor-neutral protocols over framework lock-in:
- CloudEvents 1.0 for universal event envelopes
- Protobuf for compact, versioned payloads
- gRPC for low-latency cross-runtime RPC
- MCP (Model Context Protocol) for tool server interoperability
This approach ensures we can swap agent frameworks, languages, or vendors without rewriting integration layers.
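To make the envelope concrete, here is a minimal sketch of a CloudEvents 1.0 event in the JSON format wrapping a binary payload. The required attributes (`specversion`, `id`, `source`, `type`) and the `data_base64` member follow the CloudEvents specification; the source path, event type, and payload bytes in the usage note are hypothetical, and a real payload would be a serialized protobuf message.

```python
import base64
import uuid
from datetime import datetime, timezone
from typing import Optional


def make_cloudevent(source: str, event_type: str, payload: bytes,
                    subject: Optional[str] = None) -> dict:
    """Build a CloudEvents 1.0 JSON-format envelope around a binary payload."""
    event = {
        # Required context attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        # Optional context attributes.
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/protobuf",
        # In the JSON format, binary data is carried base64-encoded in data_base64.
        "data_base64": base64.b64encode(payload).decode("ascii"),
    }
    if subject is not None:
        event["subject"] = subject
    return event
```

For example, `make_cloudevent("/edge/building-a/adapters/bacnet", "io.example.telemetry.v1", payload_bytes, subject="zone-1")` yields a dict that can be passed to `json.dumps` and published on any transport, which is the portability the protocol-first approach is after.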
Edge-First Deployment
Buildings must operate autonomously when cloud connectivity is degraded:
- Local event brokers and databases continue operating
- Queued commands reconcile on reconnection
- Critical safety policies are enforced at the edge
- Minimum viable operations remain available through local dashboards
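One way to picture "queued commands reconcile on reconnection" is a local queue that is drained when the uplink returns. The sketch below assumes that commands whose TTL lapsed while offline are dropped rather than delivered late; that rule, and the class and field names other than `ttl_seconds`, are assumptions rather than documented behavior.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class QueuedCommand:
    id: str
    issued_at: float  # seconds on whatever clock the edge node uses
    ttl_seconds: float


class OfflineCommandQueue:
    """Holds commands while the link is down and drains them in FIFO order."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, command: QueuedCommand) -> None:
        self._pending.append(command)

    def reconcile(self, now: float, send):
        """Send still-valid commands via `send`; return (sent_ids, expired_ids)."""
        sent, expired = [], []
        while self._pending:
            command = self._pending.popleft()
            if now - command.issued_at > command.ttl_seconds:
                expired.append(command.id)  # stale: dropped, never actuated
            else:
                send(command)
                sent.append(command.id)
        return sent, expired
```

Returning the expired IDs separately lets the caller record them in the audit trail instead of discarding them silently.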
Zero-Trust Security Model
Every component assumes hostile networks:
- SPIFFE/SPIRE for workload identity with mTLS everywhere
- Least privilege tokens scoped to specific capabilities
- Policy enforcement at every boundary (OPA)
- Audit trails for all tool calls and actuation events
- Secret management via Vault or cloud KMS
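The least-privilege and audit bullets can be illustrated together. In this sketch a token carries a set of capability patterns and every authorization decision, allowed or not, is appended to an audit list. The `domain:action:target` capability format, the glob-style matching, and all names are hypothetical conventions chosen for the example.

```python
from fnmatch import fnmatchcase


class CapabilityToken:
    """A token scoped to an explicit set of capability patterns."""

    def __init__(self, subject, scopes):
        self.subject = subject
        self.scopes = frozenset(scopes)

    def permits(self, capability: str) -> bool:
        # Glob-style match, so "hvac:setpoint:zone-*" covers every zone.
        return any(fnmatchcase(capability, pattern) for pattern in self.scopes)


class Authorizer:
    """Checks capabilities and records every decision, including denials."""

    def __init__(self):
        self.audit_log = []

    def authorize(self, token: CapabilityToken, capability: str) -> bool:
        allowed = token.permits(capability)
        self.audit_log.append(
            {"subject": token.subject, "capability": capability, "allowed": allowed}
        )
        return allowed
```

With `CapabilityToken("energy-agent", {"hvac:read:*", "hvac:setpoint:zone-*"})`, a request for `hvac:setpoint:zone-3` is allowed while `door:unlock:lobby` is denied, and both outcomes land in `audit_log`.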
Polyglot Runtime
Different workloads demand different languages:
- Python/LangGraph for fast agent iteration and rich AI ecosystem
- .NET/Aspire/Orleans for high-throughput stateful services
- TypeScript for frontend and Node.js tooling
- Dapr sidecars for language-agnostic building blocks
gRPC with shared protobuf contracts provides the RPC layer that connects these runtimes.
Safety-First Control
Autonomous operation requires multiple safety layers:
- Hard Constraints: OPA policies block invalid actions (e.g., temp bounds, egress locks)
- Shadow Mode: Evaluate new policies without actuation
- Multi-Step Approvals: Human gates for high-impact changes
- Circuit Breakers: Automatic rollback on anomaly detection
- Safety Tokens: Signed approvals required for critical actions
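The interplay of hard constraints, shadow mode, and safety tokens can be sketched in a few lines. In the architecture described here this logic would live in OPA/Rego policies; the Python below is only a stand-in, and the temperature bounds, the `unlock_egress` action name, and the `Decision` shape are assumptions. Token signature validation is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

TEMP_BOUNDS_C = (18.0, 28.0)          # assumed bounds for illustration
CRITICAL_ACTIONS = {"unlock_egress"}  # hypothetical high-impact action


@dataclass(frozen=True)
class Decision:
    allow: bool    # what the policy concluded
    reason: str    # explanation kept for audit and policy traces
    actuate: bool  # whether the command may actually be sent


def evaluate(action: str, params: dict, safety_token: Optional[str] = None,
             shadow: bool = False) -> Decision:
    # Hard constraint: reject out-of-range setpoints outright.
    if action == "set_setpoint":
        low, high = TEMP_BOUNDS_C
        if not low <= params["celsius"] <= high:
            return Decision(False, f"setpoint outside {low}-{high} C", False)
    # Safety token: critical actions need an approval attached.
    if action in CRITICAL_ACTIONS and safety_token is None:
        return Decision(False, "critical action requires a safety token", False)
    # Shadow mode: the decision is recorded but nothing is actuated.
    if shadow:
        return Decision(True, "allowed (shadow mode, not actuated)", False)
    return Decision(True, "allowed", True)
```

Separating `allow` from `actuate` is what makes shadow mode useful: a candidate policy's decisions can be compared against production without ever touching a device.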
Performance Characteristics
- Event Latency: P99 < 100ms for telemetry processing
- Command Latency: P99 < 500ms for control commands
- Throughput: 10,000+ events/sec per edge node
- Storage: 90-day retention at edge; unlimited in cloud
- Offline Autonomy: Minimum 72 hours of full operations
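These targets imply some rough sizing. The arithmetic below treats 10,000 events/sec as a sustained rate, which makes it an upper bound, and the 100-byte average event size is purely an assumption for illustration.

```python
EVENTS_PER_SEC = 10_000         # throughput target from the list above
RETENTION_DAYS = 90             # edge retention from the list above
OFFLINE_HOURS = 72              # offline autonomy from the list above
ASSUMED_BYTES_PER_EVENT = 100   # hypothetical average encoded size


def event_count(events_per_sec: int, seconds: int) -> int:
    return events_per_sec * seconds


retained_events = event_count(EVENTS_PER_SEC, RETENTION_DAYS * 86_400)
offline_events = event_count(EVENTS_PER_SEC, OFFLINE_HOURS * 3_600)
retained_terabytes = retained_events * ASSUMED_BYTES_PER_EVENT / 1e12

# retained_events    = 77_760_000_000 events over 90 days
# offline_events     =  2_592_000_000 events over 72 hours
# retained_terabytes = 7.776 TB before compression, under the size assumption
```

Real buildings rarely sustain peak rate, so actual volumes should sit well below these figures, but the numbers give a feel for why compression and downsampling matter at the edge.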
Scalability
CitadelMesh is designed to scale from single buildings to enterprise portfolios:
- Horizontal Scaling: Add edge nodes per building/zone
- Agent Scaling: Deploy multiple instances of the same agent type
- Event Bus Scaling: NATS JetStream clustering for high throughput
- Database Scaling: PostgreSQL read replicas and TimescaleDB compression
- Cloud Scaling: Kubernetes HPA for .NET microservices
- Multi-Tenancy: Isolated namespaces and policy boundaries per tenant
Deployment Model
- Edge: K3s cluster on industrial PC; containers per agent; MQTT + NATS
- Cloud: Managed Kafka/Redpanda, Postgres/TimescaleDB, S3-compatible storage, Kubernetes
- CI/CD: GitOps (ArgoCD/Flux), signed container images (Sigstore), SBOMs
Observability Strategy
- Traces: OpenTelemetry from all agents and adapters
- Metrics: Prometheus-compatible metrics for SLIs/SLOs
- Logs: Structured JSON logs shipped to centralized aggregation
- Dashboards: Aspire dashboards for .NET services; Grafana for infrastructure
- Replay: Deterministic LangGraph execution replay for incident analysis
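The replay idea rests on a general property: if each state transition is a pure function of the previous state and the recorded event, re-running the log reproduces the same state. The sketch below shows that property with a generic reducer; it is not LangGraph's API, and the event shapes and function names are hypothetical.

```python
import hashlib
import json
from functools import reduce


def step(state: dict, event: dict) -> dict:
    """Pure transition: no clocks, randomness, or I/O, so replays are repeatable."""
    new_state = dict(state)
    if event["type"] == "telemetry":
        new_state[event["entity_id"]] = event["value"]
    elif event["type"] == "entity_removed":
        new_state.pop(event["entity_id"], None)
    return new_state


def replay(events, initial=None) -> dict:
    """Fold the recorded events over the initial state."""
    return reduce(step, events, dict(initial or {}))


def fingerprint(state: dict) -> str:
    """Stable hash of a state, useful for comparing a replay with the original run."""
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode("utf-8")).hexdigest()
```

During incident analysis, the fingerprint stored with the original run can be compared against `fingerprint(replay(recorded_events))`; a mismatch points to nondeterminism somewhere in the transition logic.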
Related Documentation
- Protocol Strategy - Deep dive on CloudEvents, Protobuf, MCP, and A2A
- Safety Guardrails - OPA policies, shadow mode, approval workflows
- Identity Foundation - SPIFFE/SPIRE zero-trust architecture
- Agent Topology - LangGraph agent design and coordination
- Edge Architecture - K3s deployment and offline autonomy
- Observability - OpenTelemetry and monitoring stack