Testing CitadelMesh Agents
Understanding how to validate autonomous agents - from philosophy to practice.
Why Testing Matters for Autonomous Agents
Traditional software testing focuses on correctness: does the function return the right output?
Autonomous agent testing is different. We're validating decision-making under uncertainty:
- Does the agent assess threats correctly?
- Does it escalate when confidence is low?
- Does it respect safety policies?
- Does it coordinate multiple systems correctly?
- Does it create proper audit trails?
The stakes are higher. A bug in a web form is annoying. A bug in a building automation agent could lock people out, waste energy, or create security vulnerabilities.
The Testing Philosophy
The Agent Testing Pyramid
For autonomous agents, we use a modified testing pyramid:
Unit Tests → Integration Tests → E2E Tests → Production Monitoring
(mock all)     (real services)     (full stack)   (real building)
- End-to-End Scenarios (15%): full workflows, multi-system coordination, real-world performance
- Integration Tests (25%): Agent → Policy → Events → Identity interactions, state transitions, error recovery
- Unit Tests (60%): individual nodes, decision algorithms, state logic
Why this matters:
- Unit tests validate algorithms and logic in isolation - fast feedback, high coverage
- Integration tests verify components work together correctly - real interactions
- E2E tests prove real-world scenarios work end-to-end - full confidence
What Makes Agent Testing Different
1. State Machine Complexity
Agents use LangGraph state machines with multiple nodes and conditional routing. You need to test:
- Individual node behavior (does each node do what it should?)
- State transitions (do states flow correctly?)
- Routing decisions (do conditions route properly?)
- Error handling at each step (graceful degradation?)
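Conditional routing is often just a pure function over the agent state, which makes it cheap to cover every branch. The sketch below assumes a hypothetical `route_after_analysis` edge function and state shape, not the actual CitadelMesh agent:

```python
# Hypothetical conditional-edge function and a test that covers every branch.
def route_after_analysis(state: dict) -> str:
    """Pick the next node based on the analyzed threat score."""
    if state.get("error"):
        return "handle_error"
    if state["threat_score"] >= 0.9:
        return "escalate"
    if state["threat_score"] >= 0.5:
        return "respond"
    return "monitor"


def test_routing_covers_all_branches():
    assert route_after_analysis({"threat_score": 0.95}) == "escalate"
    assert route_after_analysis({"threat_score": 0.6}) == "respond"
    assert route_after_analysis({"threat_score": 0.1}) == "monitor"
    assert route_after_analysis({"threat_score": 0.0, "error": "timeout"}) == "handle_error"
```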
2. Asynchronous Everything
Agents are inherently async - events, API calls, policy checks. Tests must handle:
- Concurrent event processing
- Async/await patterns
- Race conditions
- Timeouts and cancellation
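A minimal sketch of what this looks like with pytest-asyncio (assumed to be a dev dependency); `FakeAgent` stands in for the real agent's async entry point:

```python
import asyncio

import pytest


class FakeAgent:
    """Stand-in for the real agent's async event handler."""

    async def handle_event(self, event: dict) -> dict:
        await asyncio.sleep(0.01)  # simulate an async policy check / vendor call
        return {"event_id": event["id"], "action": "monitor"}


@pytest.mark.asyncio
async def test_concurrent_events_do_not_interfere():
    agent = FakeAgent()
    events = [{"id": i, "type": "door_forced"} for i in range(10)]
    # Fire all events concurrently; each result must map back to its own input.
    results = await asyncio.gather(*(agent.handle_event(e) for e in events))
    assert {r["event_id"] for r in results} == set(range(10))


@pytest.mark.asyncio
async def test_slow_dependency_hits_timeout():
    async def slow_handle(event: dict) -> dict:
        await asyncio.sleep(5)  # simulate a hung vendor API
        return {}

    with pytest.raises(asyncio.TimeoutError):
        await asyncio.wait_for(slow_handle({"id": 1}), timeout=0.1)
```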
3. External Dependencies
Agents depend on many systems:
- Policy engine (OPA)
- Event bus (NATS)
- Identity (SPIFFE)
- MCP adapters (vendor APIs)
- Observability (telemetry)
Tests need strategies for mocking vs. real integration.
4. Non-Deterministic Behavior
Some agent decisions involve:
- Confidence thresholds (is 0.7 confident enough?)
- Historical patterns (repeated events change scoring)
- Time-based logic (after-hours vs. business hours)
Tests must account for this variability while still being deterministic.
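One common way to keep such tests deterministic is to inject the clock and parametrize the thresholds rather than reading `datetime.now()` inside the logic under test. The scorer below is a hypothetical stand-in for illustration:

```python
from datetime import datetime

import pytest


def score_threat(event: dict, now: datetime) -> float:
    """Hypothetical scorer: the same event is riskier after hours."""
    base = 0.5 if event["type"] == "door_forced" else 0.2
    after_hours = now.hour < 6 or now.hour >= 20
    return min(base + (0.3 if after_hours else 0.0), 1.0)


@pytest.mark.parametrize(
    "hour, expected",
    [(14, 0.5), (2, 0.8)],  # business hours vs. 2 AM
)
def test_after_hours_raises_score(hour, expected):
    # Inject a fixed clock so the test never depends on when it runs.
    now = datetime(2024, 1, 15, hour, 0, 0)
    assert score_threat({"type": "door_forced"}, now) == pytest.approx(expected)
```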
Testing Strategies
Unit Testing: The Foundation
Test individual nodes in isolation - Does the threat analyzer score events correctly? Does the decision node choose the right response? Does the execution node handle errors gracefully?
Key insight: Use dependency injection. Pass mock clients so you can test logic without real infrastructure. Fast tests (milliseconds) with deterministic results.
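A sketch of that pattern, with an illustrative execution node and an `AsyncMock` standing in for the MCP client (the class and field names are assumptions, not the real CitadelMesh types):

```python
from unittest.mock import AsyncMock

import pytest


class ExecutionNode:
    """Illustrative node: asks a vendor (via MCP) to lock a door."""

    def __init__(self, mcp_client):
        self.mcp = mcp_client

    async def run(self, state: dict) -> dict:
        try:
            await self.mcp.lock_door(state["door_id"])
            return {**state, "status": "locked"}
        except Exception as exc:
            # Fail safely: record the error instead of crashing the graph.
            return {**state, "status": "error", "error": str(exc)}


@pytest.mark.asyncio
async def test_execution_node_handles_vendor_failure():
    mcp = AsyncMock()
    mcp.lock_door.side_effect = RuntimeError("vendor API unreachable")

    result = await ExecutionNode(mcp_client=mcp).run({"door_id": "D-12"})

    assert result["status"] == "error"
    assert "unreachable" in result["error"]
```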
Integration Testing: Real Interactions
Test components working together:
- Agent + OPA: Does policy enforcement actually work?
- Agent + NATS: Does event processing handle real message patterns?
- Agent + MCP: Does vendor coordination succeed and handle failures?
Key insight: Run real infrastructure (often in Docker) but use test data and short timeouts. Slower than unit tests but prove integration points work.
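For instance, an integration test against a real NATS broker started in Docker might look like the sketch below. It assumes the `nats-py` client and a broker on `localhost:4222`; the subject name and marker are illustrative:

```python
import nats
import pytest


@pytest.mark.integration
@pytest.mark.asyncio
async def test_decision_is_published_to_event_bus():
    # Requires a NATS broker, e.g. `docker run -p 4222:4222 nats`.
    nc = await nats.connect("nats://localhost:4222")
    try:
        sub = await nc.subscribe("citadel.test.decisions")

        # Stand-in for the agent publishing its decision after handling an event.
        await nc.publish("citadel.test.decisions", b'{"action": "monitor"}')

        msg = await sub.next_msg(timeout=2)  # short timeout keeps failures fast
        assert b"monitor" in msg.data
    finally:
        await nc.close()
```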
End-to-End Testing: Real Scenarios
Test complete workflows that matter:
- Low threat scenario: Monitor but don't escalate
- High threat scenario: Coordinate response across multiple vendors
- Critical threat scenario: Escalate to humans immediately
Key insight: These are slower but prove the system works as a whole. Test real business scenarios.
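Scenario tables map naturally onto parametrized tests. In the sketch below, `run_full_pipeline` is a placeholder for a helper that drives the deployed stack; here it only echoes the expected mapping so the example stays self-contained:

```python
import pytest


async def run_full_pipeline(event: dict) -> dict:
    # Placeholder: a real E2E helper would publish the event to the bus and
    # wait for the agent's final decision. This echo keeps the sketch runnable.
    severity_to_action = {
        "low": "monitor",
        "high": "coordinate_response",
        "critical": "escalate_to_human",
    }
    return {"action": severity_to_action[event["severity"]]}


@pytest.mark.e2e
@pytest.mark.asyncio
@pytest.mark.parametrize(
    "event, expected_action",
    [
        ({"type": "badge_reader_offline", "severity": "low"}, "monitor"),
        ({"type": "door_forced", "severity": "high"}, "coordinate_response"),
        ({"type": "door_forced", "severity": "critical"}, "escalate_to_human"),
    ],
)
async def test_threat_scenarios(event, expected_action):
    result = await run_full_pipeline(event)
    assert result["action"] == expected_action
```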
The Mock vs. Real Tradeoff
Use Mocks When:
- You want fast tests (milliseconds)
- You're testing logic, not integration
- External service would be slow/flaky
- You need deterministic, repeatable results
Use Real Services When:
- You're testing integration points
- You need to verify protocol compliance (does it really work with NATS?)
- You're testing error handling and recovery
- You're doing performance testing
CitadelMesh approach: Unit tests use mocks for speed. Integration tests use real services in Docker. E2E tests use the full stack.
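One way to keep those tiers separately runnable is a small `conftest.py` that registers markers and skips the slower tiers unless explicitly requested. The marker names and environment variable below are assumptions about project layout, not the actual CitadelMesh configuration:

```python
# conftest.py
import os

import pytest


def pytest_configure(config):
    config.addinivalue_line("markers", "integration: needs real services (Docker)")
    config.addinivalue_line("markers", "e2e: needs the full deployed stack")


def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_INTEGRATION") == "1":
        return  # explicitly opted in: run every tier
    skip = pytest.mark.skip(reason="set RUN_INTEGRATION=1 to run against real services")
    for item in items:
        if "integration" in item.keywords or "e2e" in item.keywords:
            item.add_marker(skip)
```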
Test Fixtures: Making Tests Easy
The Problem: Writing test setup is tedious. Every test needs agents, events, configs, mocks... 15 lines of boilerplate before you can test anything.
The Solution: Reusable fixtures that encapsulate common setup patterns.
Benefits:
- Write tests faster (1 line of setup vs. 15)
- Tests are more readable (focus on what's being tested)
- Setup is consistent across tests
- Easy to update when APIs change (one place to fix)
Think of fixtures as "test factories" - they create the objects you need for testing with sensible defaults.
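A sketch of that idea with pytest fixtures; the field names and default values are placeholders rather than the real event schema:

```python
from unittest.mock import AsyncMock

import pytest


@pytest.fixture
def mock_policy_client():
    client = AsyncMock()
    client.evaluate.return_value = {"allow": True}  # permissive by default
    return client


@pytest.fixture
def make_event():
    """Factory fixture: build a valid event, overriding only what a test cares about."""
    def _make(**overrides):
        event = {
            "id": "evt-001",
            "type": "door_forced",
            "severity": "high",
            "timestamp": "2024-01-15T02:00:00Z",
        }
        event.update(overrides)
        return event
    return _make


def test_low_severity_defaults(make_event):
    event = make_event(severity="low")  # one line of setup instead of fifteen
    assert event["severity"] == "low"
    assert event["type"] == "door_forced"  # other fields keep their sensible defaults
```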
What to Test
For Every Agent
✅ State Transitions
- Does each state do what it should?
- Do transitions happen correctly?
- Are terminal states reached?
- Does routing work for all conditions?
✅ Decision Logic
- Do decision nodes route correctly based on conditions?
- Are edge cases handled (null inputs, missing fields)?
- Is error handling correct (fail safely)?
✅ Policy Integration
- Are policies consulted before actions?
- Are denials handled correctly?
- Is there a complete audit trail?
- Does fail-safe work (deny when OPA unavailable)?
✅ Observability
- Are traces generated for every decision?
- Are metrics recorded correctly?
- Are errors logged with context?
- Can you reconstruct what happened from logs?
✅ Performance (a latency-check sketch follows this checklist)
- Does the agent meet SLAs (typically <200ms)?
- Can it handle concurrent events without interference?
- Does it degrade gracefully under load?
- Are there memory leaks over time?
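A minimal latency check against the ~200ms budget above; `StubAgent` is a stand-in, so point the test at your real agent's entry point:

```python
import asyncio
import time

import pytest


class StubAgent:
    """Stand-in: replace with the real agent's entry point."""

    async def handle_event(self, event: dict) -> dict:
        await asyncio.sleep(0.02)  # pretend to analyze, check policy, and decide
        return {"action": "monitor"}


@pytest.mark.asyncio
async def test_decision_latency_stays_under_200ms():
    agent = StubAgent()
    start = time.perf_counter()
    await agent.handle_event({"type": "door_forced", "severity": "low"})
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 200, f"decision took {elapsed_ms:.1f} ms, budget is 200 ms"
```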
Testing Pitfalls to Avoid
❌ Flaky Tests - Tests that pass/fail randomly. Usually due to timing issues, shared state, or external dependencies. Fix: Use deterministic mocks, avoid sleeps, isolate state between tests.
❌ Brittle Tests - Tests that break when implementation details change (even though behavior is the same). Test behavior and outcomes, not implementation. Focus on inputs/outputs, not internal state.
❌ Slow Test Suites - Tests that take minutes to run mean developers won't run them frequently. Fix: Use mocks for unit tests, parallelize integration tests, optimize slow operations.
❌ Missing Edge Cases - Tests only cover happy paths, but bugs hide in edge cases. Test: nulls, empty arrays, malformed input, boundary conditions, error states, timeouts.
Debugging Agents
When tests fail (or agents misbehave), you need good debugging tools:
State Inspection - Step through agent execution state-by-state to see what's happening. LangGraph supports streaming state changes.
Trace Logging - Enable debug logging to see all operations. Every agent logs decisions, state changes, and external calls.
Breakpoint Debugging - Use Python debugger (pdb) to pause execution and inspect variables. Essential for understanding complex state.
Visual Debugging - Generate Mermaid diagrams of your state machine to visualize the graph structure and routing.
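A hedged sketch of the last two techniques on a toy two-node graph; the LangGraph calls (`StateGraph`, `draw_mermaid`, `stream`) exist in recent releases, but verify against your installed version:

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class State(TypedDict):
    threat_score: float
    action: str


def analyze(state: State) -> dict:
    return {"threat_score": 0.8}


def decide(state: State) -> dict:
    return {"action": "respond" if state["threat_score"] >= 0.5 else "monitor"}


builder = StateGraph(State)
builder.add_node("analyze", analyze)
builder.add_node("decide", decide)
builder.add_edge(START, "analyze")
builder.add_edge("analyze", "decide")
builder.add_edge("decide", END)
graph = builder.compile()

# Visual debugging: emit a Mermaid diagram of nodes, edges, and routing.
print(graph.get_graph().draw_mermaid())

# State inspection: stream node-by-node state updates for one input.
for step in graph.stream({"threat_score": 0.0, "action": ""}, stream_mode="updates"):
    print(step)  # e.g. {"analyze": {"threat_score": 0.8}}
```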
The Testing Mindset
Testing isn't about finding bugs (though it does that).
Testing is about building confidence that your autonomous agent will make the right decisions in production.
Every test is a specification:
- "When there's a forced entry at 2 AM, escalate immediately"
- "When confidence is below 0.7, ask a human"
- "When policy denies an action, log and abort"
Write tests that document intent and prove correctness.
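For instance, the second specification above can be written almost verbatim as a test; the `decide` function here is a hypothetical placeholder for the agent's confidence gate:

```python
def decide(confidence: float) -> str:
    """Hypothetical confidence gate for an agent decision."""
    return "ask_human" if confidence < 0.7 else "act_autonomously"


def test_low_confidence_defers_to_a_human():
    """Specification: when confidence is below 0.7, ask a human."""
    assert decide(0.69) == "ask_human"
    assert decide(0.70) == "act_autonomously"
```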
Best Practices
- Start with unit tests - Fast feedback, high coverage on algorithms
- Mock external dependencies - Keep tests fast and deterministic
- Test edge cases - Nulls, errors, boundaries, malformed input
- Use fixtures - Reusable test setup makes tests easier to write
- Run tests in CI/CD - Automated validation on every commit
- Measure coverage - Aim for 80%+ on critical decision paths
- Performance test - Validate SLAs are actually met
- Keep tests maintainable - Readable, focused, non-brittle
The Testing Culture
In CitadelMesh, we believe:
Autonomous agents you can trust aren't built on hope; they're built on tests.
Every agent includes comprehensive tests. Every PR requires tests. Every bug gets a regression test.
Testing isn't an afterthought; it's integral to building reliable autonomous systems.
When an agent makes a decision that affects a real building, you want to know it's been thoroughly validated. Tests give you that confidence.
Next Steps:
- Learn about Policy Integration
- Explore LangGraph Basics
- See Agent Template
Remember: Well-tested agents are reliable agents. Build confidence through validation.