
Testing CitadelMesh Agents

Understanding how to validate autonomous agents - from philosophy to practice.

Why Testing Matters for Autonomous Agents

Traditional software testing focuses on correctness: does the function return the right output?

Autonomous agent testing is different. We're validating decision-making under uncertainty:

  • Does the agent assess threats correctly?
  • Does it escalate when confidence is low?
  • Does it respect safety policies?
  • Does it coordinate multiple systems correctly?
  • Does it create proper audit trails?

The stakes are higher. A bug in a web form is annoying. A bug in a building automation agent could lock people out, waste energy, or create security vulnerabilities.

The Testing Philosophy

The Agent Testing Pyramid

For autonomous agents, we use a modified testing pyramid:

Unit Tests → Integration Tests → E2E Tests → Production Monitoring
(mock all dependencies → real services → full stack → the real building)

  • Unit Tests (60%) - individual nodes, decision algorithms, state logic
  • Integration Tests (25%) - Agent ↔ Policy ↔ Events ↔ Identity, state transitions, error recovery
  • End-to-End Scenarios (15%) - full workflows, multi-system coordination, real-world performance

Why this matters:

  • Unit tests validate algorithms and logic in isolation - fast feedback, high coverage
  • Integration tests verify components work together correctly - real interactions
  • E2E tests prove real-world scenarios work end-to-end - full confidence

What Makes Agent Testing Different

1. State Machine Complexity

Agents use LangGraph state machines with multiple nodes and conditional routing. You need to test:

  • Individual node behavior (does each node do what it should?)
  • State transitions (do states flow correctly?)
  • Routing decisions (do conditions route properly?)
  • Error handling at each step (graceful degradation?)

2. Asynchronous Everything

Agents are inherently async - events, API calls, policy checks. Tests must handle:

  • Concurrent event processing
  • Async/await patterns
  • Race conditions
  • Timeouts and cancellation
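
For example, a concurrency test might look like the sketch below, assuming pytest-asyncio and a hypothetical `agent` fixture whose async handle_event() returns a result dict:

```python
# A minimal sketch using pytest-asyncio. The agent fixture and its
# handle_event() coroutine are hypothetical stand-ins for the real interface.
import asyncio

import pytest


@pytest.mark.asyncio
async def test_concurrent_events_do_not_interfere(agent, sample_events):
    # Process several events concurrently, and bound the whole run with a
    # timeout so a deadlock fails the test instead of hanging the suite.
    results = await asyncio.wait_for(
        asyncio.gather(*(agent.handle_event(e) for e in sample_events)),
        timeout=5.0,
    )
    # Each event should produce an independent, completed result.
    assert len(results) == len(sample_events)
    assert all(r["status"] == "completed" for r in results)
```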

3. External Dependencies

Agents depend on many systems:

  • Policy engine (OPA)
  • Event bus (NATS)
  • Identity (SPIFFE)
  • MCP adapters (vendor APIs)
  • Observability (telemetry)

Tests need strategies for mocking vs. real integration.

4. Non-Deterministic Behavior

Some agent decisions involve:

  • Confidence thresholds (is 0.7 confident enough?)
  • Historical patterns (repeated events change scoring)
  • Time-based logic (after-hours vs. business hours)

Tests must account for this variability while still being deterministic.
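
One way to keep such tests deterministic is to pass time (and thresholds) in as parameters instead of reading them globally. The sketch below uses a toy scoring function to show the pattern; the real analyzer's interface will differ:

```python
# Sketch: time-based logic stays testable when the clock is a parameter
# rather than a call to datetime.now() buried inside the agent.
from datetime import datetime


def score_event(event_type: str, now: datetime) -> float:
    # Toy stand-in for the real threat scorer.
    base = 0.6 if event_type == "forced_entry" else 0.2
    after_hours = now.hour < 6 or now.hour >= 22
    return min(base + (0.3 if after_hours else 0.0), 1.0)


def test_after_hours_forced_entry_scores_higher():
    at_2am = datetime(2024, 1, 15, 2, 0)    # fixed clock: always 2 AM
    at_noon = datetime(2024, 1, 15, 12, 0)  # fixed clock: business hours

    # Same inputs always give the same outputs, run the suite at any hour.
    assert score_event("forced_entry", now=at_2am) > score_event("forced_entry", now=at_noon)
```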

Testing Strategies

Unit Testing: The Foundation

Test individual nodes in isolation - Does the threat analyzer score events correctly? Does the decision node choose the right response? Does the execution node handle errors gracefully?

Key insight: Use dependency injection. Pass mock clients so you can test logic without real infrastructure. Fast tests (milliseconds) with deterministic results.
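
A minimal sketch of that pattern, assuming pytest-asyncio and a hypothetical DecisionNode that accepts an injected policy client:

```python
# Sketch: unit-test a decision node with a mocked policy client.
# DecisionNode, its constructor, and its run() method are hypothetical names.
from unittest.mock import AsyncMock

import pytest


@pytest.mark.asyncio
async def test_decision_node_escalates_high_confidence_threat():
    policy_client = AsyncMock()
    policy_client.evaluate.return_value = {"allow": True}   # deterministic mock

    node = DecisionNode(policy_client=policy_client)
    state = {"threat_score": 0.9, "confidence": 0.95}

    result = await node.run(state)

    assert result["action"] == "escalate"
    policy_client.evaluate.assert_awaited_once()   # policy was consulted
```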

Integration Testing: Real Interactions

Test components working together:

  • Agent + OPA: Does policy enforcement actually work?
  • Agent + NATS: Does event processing handle real message patterns?
  • Agent + MCP: Does vendor coordination succeed and handle failures?

Key insight: Run real infrastructure (often in Docker) but use test data and short timeouts. These tests are slower than unit tests, but they prove the integration points actually work.
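
For example, an OPA integration test can call the engine's REST Data API directly. In this sketch the port, the policy package path (citadelmesh/authz/allow), and the integration marker are assumptions for illustration:

```python
# Sketch: integration test against a real OPA instance (e.g. started in
# Docker). The policy path and expected result are illustrative assumptions.
import httpx
import pytest


@pytest.mark.integration   # custom marker; register it in pytest config
def test_opa_denies_unlock_during_lockdown():
    payload = {
        "input": {
            "action": "unlock_door",
            "zone": "lobby",
            "building_mode": "lockdown",
        }
    }
    resp = httpx.post(
        "http://localhost:8181/v1/data/citadelmesh/authz/allow",
        json=payload,
        timeout=2.0,   # short timeout: a missing container fails fast
    )
    resp.raise_for_status()
    assert resp.json().get("result") is False
```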

End-to-End Testing: Real Scenarios

Test complete workflows that matter:

  • Low threat scenario: Monitor but don't escalate
  • High threat scenario: Coordinate response across multiple vendors
  • Critical threat scenario: Escalate to humans immediately

Key insight: These are slower but prove the system works as a whole. Test real business scenarios.
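
A sketch of one such scenario test, assuming a hypothetical build_agent_graph() factory that wires up the full stack and a final state that carries the decision and audit trail:

```python
# Sketch: end-to-end scenario test. build_agent_graph() and the state keys
# are hypothetical; assert on the business outcome, not on internal steps.
import pytest


@pytest.mark.e2e        # custom marker; register it in pytest config
@pytest.mark.asyncio
async def test_critical_threat_escalates_to_human():
    graph = build_agent_graph()   # full stack: real policy, events, adapters

    final_state = await graph.ainvoke(
        {"event": {"type": "forced_entry", "zone": "server_room", "hour": 2}}
    )

    assert final_state["decision"] == "escalate_to_human"
    assert final_state["audit_trail"], "every escalation must leave an audit trail"
```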

The Mock vs. Real Tradeoff

Use Mocks When:

  • You want fast tests (milliseconds)
  • You're testing logic, not integration
  • External service would be slow/flaky
  • You need deterministic, repeatable results

Use Real Services When:

  • You're testing integration points
  • You need to verify protocol compliance (does it really work with NATS?)
  • You're testing error handling and recovery
  • You're doing performance testing

CitadelMesh approach: Unit tests use mocks for speed. Integration tests use real services in Docker. E2E tests use the full stack.

Test Fixtures: Making Tests Easy

The Problem: Writing test setup is tedious. Every test needs agents, events, configs, mocks... 15 lines of boilerplate before you can test anything.

The Solution: Reusable fixtures that encapsulate common setup patterns.

Benefits:

  • Write tests faster (1 line of setup vs. 15)
  • Tests are more readable (focus on what's being tested)
  • Setup is consistent across tests
  • Easy to update when APIs change (one place to fix)

Think of fixtures as "test factories" - they create the objects you need for testing with sensible defaults.
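
A sketch of what those fixtures might look like in a conftest.py; SecurityAgent and the default event values are hypothetical:

```python
# Sketch of a conftest.py with "test factory" fixtures. SecurityAgent is a
# hypothetical class; the defaults are placeholders.
from unittest.mock import AsyncMock

import pytest


@pytest.fixture
def mock_policy_client():
    client = AsyncMock()
    client.evaluate.return_value = {"allow": True}   # sensible default
    return client


@pytest.fixture
def agent(mock_policy_client):
    # Tests now need one line of setup: `async def test_x(agent): ...`
    return SecurityAgent(policy_client=mock_policy_client)


@pytest.fixture
def intrusion_event():
    return {"type": "forced_entry", "zone": "lobby", "confidence": 0.92}
```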

What to Test

For Every Agent

✅ State Transitions

  • Does each state do what it should?
  • Do transitions happen correctly?
  • Are terminal states reached?
  • Does routing work for all conditions?
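
The sketch below exercises routing on a deliberately tiny LangGraph state machine; the node names and AgentState fields are illustrative, not the production agent's:

```python
# Sketch: test state transitions and routing on a minimal LangGraph graph.
from typing import TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict, total=False):
    threat_score: float
    action: str


def analyze(state: AgentState) -> AgentState:
    return {"threat_score": state.get("threat_score", 0.0)}


def route(state: AgentState) -> str:
    # Routing decision under test: high scores escalate, the rest monitor.
    return "escalate" if state["threat_score"] >= 0.8 else "monitor"


def build_graph():
    g = StateGraph(AgentState)
    g.add_node("analyze", analyze)
    g.add_node("escalate", lambda s: {"action": "escalate"})
    g.add_node("monitor", lambda s: {"action": "monitor"})
    g.set_entry_point("analyze")
    g.add_conditional_edges("analyze", route, {"escalate": "escalate", "monitor": "monitor"})
    g.add_edge("escalate", END)
    g.add_edge("monitor", END)
    return g.compile()


def test_high_score_routes_to_escalate():
    assert build_graph().invoke({"threat_score": 0.9})["action"] == "escalate"


def test_low_score_routes_to_monitor():
    assert build_graph().invoke({"threat_score": 0.3})["action"] == "monitor"
```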

✅ Decision Logic

  • Do decision nodes route correctly based on conditions?
  • Are edge cases handled (null inputs, missing fields)?
  • Is error handling correct (fail safely)?

✅ Policy Integration

  • Are policies consulted before actions?
  • Are denials handled correctly?
  • Is there a complete audit trail?
  • Does fail-safe work (deny when OPA unavailable)?
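
The fail-safe case deserves an explicit test. A sketch, assuming a hypothetical agent whose execute() is expected to deny by default when the policy client errors:

```python
# Sketch: fail-safe behavior when the policy engine is unreachable.
# The agent fixture, execute(), and the result shape are hypothetical;
# the expectation is deny-by-default (fail closed, never fail open).
from unittest.mock import AsyncMock

import pytest


@pytest.mark.asyncio
async def test_action_is_denied_when_policy_engine_is_down(agent):
    agent.policy_client = AsyncMock()
    agent.policy_client.evaluate.side_effect = ConnectionError("OPA unreachable")

    result = await agent.execute({"action": "unlock_door", "zone": "lobby"})

    assert result["status"] == "denied"
    assert "policy_unavailable" in result["reasons"]
```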

✅ Observability

  • Are traces generated for every decision?
  • Are metrics recorded correctly?
  • Are errors logged with context?
  • Can you reconstruct what happened from logs?
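
A sketch of a log-context check using pytest's built-in caplog fixture; the agent fixture, event fixture, and logger name are assumptions:

```python
# Sketch: verify a decision leaves enough context in the logs to be
# reconstructed later. The logger name "citadelmesh.agent" is an assumption.
import logging

import pytest


@pytest.mark.asyncio
async def test_decision_is_logged_with_context(agent, intrusion_event, caplog):
    with caplog.at_level(logging.INFO, logger="citadelmesh.agent"):
        await agent.handle_event(intrusion_event)

    decision_logs = [r for r in caplog.records if "decision" in r.getMessage().lower()]
    assert decision_logs, "every decision should produce at least one log record"
    assert intrusion_event["zone"] in decision_logs[0].getMessage()
```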

✅ Performance

  • Does the agent meet SLAs (typically <200ms)?
  • Can it handle concurrent events without interference?
  • Does it degrade gracefully under load?
  • Are there memory leaks over time?
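
A sketch of a latency and concurrency check, reusing the hypothetical agent and event fixtures from the earlier examples; keep thresholds generous enough to stay stable on slow CI runners:

```python
# Sketch: validate the <200ms decision SLA and basic behavior under a burst
# of concurrent events. Fixtures and result shapes are hypothetical.
import asyncio
import time

import pytest


@pytest.mark.asyncio
async def test_single_decision_meets_latency_sla(agent, intrusion_event):
    start = time.perf_counter()
    await agent.handle_event(intrusion_event)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 200, f"decision took {elapsed_ms:.1f} ms"


@pytest.mark.asyncio
async def test_agent_handles_burst_of_concurrent_events(agent, intrusion_event):
    results = await asyncio.gather(*(agent.handle_event(intrusion_event) for _ in range(50)))
    assert all(r["status"] == "completed" for r in results)
```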

Testing Pitfalls to Avoid

❌ Flaky Tests - Tests that pass/fail randomly. Usually due to timing issues, shared state, or external dependencies. Fix: Use deterministic mocks, avoid sleeps, isolate state between tests.

❌ Brittle Tests - Tests that break when implementation details change (even though behavior is the same). Test behavior and outcomes, not implementation. Focus on inputs/outputs, not internal state.

❌ Slow Test Suites - Tests that take minutes to run mean developers won't run them frequently. Fix: Use mocks for unit tests, parallelize integration tests, optimize slow operations.

❌ Missing Edge Cases - Tests only cover happy paths, but bugs hide in edge cases. Test: nulls, empty arrays, malformed input, boundary conditions, error states, timeouts.

Debugging Agents

When tests fail (or agents misbehave), you need good debugging tools:

State Inspection - Step through agent execution state-by-state to see what's happening. LangGraph supports streaming state changes.

Trace Logging - Enable debug logging to see all operations. Every agent logs decisions, state changes, and external calls.

Breakpoint Debugging - Use the Python debugger (pdb) to pause execution and inspect variables. Essential for understanding complex state.

Visual Debugging - Generate Mermaid diagrams of your state machine to visualize the graph structure and routing.
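
The first and last of these techniques might look like the following sketch, using LangGraph's compiled-graph API (recent versions) and the toy build_graph() from the state-machine example above:

```python
# Sketch: state inspection and visual debugging on a compiled LangGraph graph.
graph = build_graph()

# State inspection: stream node-by-node updates instead of only the final state.
for step in graph.stream({"threat_score": 0.9}):
    print(step)   # one update per node, keyed by node name

# Visual debugging: emit a Mermaid diagram of the graph structure and routing.
print(graph.get_graph().draw_mermaid())
```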

The Testing Mindset

Testing isn't about finding bugs (though it does that).

Testing is about building confidence that your autonomous agent will make the right decisions in production.

Every test is a specification:

  • "When there's a forced entry at 2 AM, escalate immediately"
  • "When confidence is below 0.7, ask a human"
  • "When policy denies an action, log and abort"

Write tests that document intent and prove correctness.
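
For instance, the first specification above translates almost directly into a test (the agent fixture is hypothetical):

```python
# Sketch: a specification written as a test. The test name states the intent;
# the assertion proves it.
import pytest


@pytest.mark.asyncio
async def test_forced_entry_at_2am_escalates_immediately(agent):
    event = {"type": "forced_entry", "zone": "lobby", "hour": 2}

    result = await agent.handle_event(event)

    assert result["decision"] == "escalate_to_human"
```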

Best Practices

  1. Start with unit tests - Fast feedback, high coverage on algorithms
  2. Mock external dependencies - Keep tests fast and deterministic
  3. Test edge cases - Nulls, errors, boundaries, malformed input
  4. Use fixtures - Reusable test setup makes tests easier to write
  5. Run tests in CI/CD - Automated validation on every commit
  6. Measure coverage - Aim for 80%+ on critical decision paths
  7. Performance test - Validate SLAs are actually met
  8. Keep tests maintainable - Readable, focused, non-brittle

The Testing Culture

In CitadelMesh, we believe:

Autonomous agents you can trust aren't built on hope - they're built on tests.

Every agent includes comprehensive tests. Every PR requires tests. Every bug gets a regression test.

Testing isn't an afterthought - it's integral to building reliable autonomous systems.

When an agent makes a decision that affects a real building, you want to know it's been thoroughly validated. Tests give you that confidence.


Remember: Well-tested agents are reliable agents. Build confidence through validation.