
Testing CitadelMesh Agents

Understanding how to validate autonomous agents - from philosophy to practice.

Why Testing Matters for Autonomous Agents

Traditional software testing focuses on correctness: does the function return the right output?

Autonomous agent testing is different. We're validating decision-making under uncertainty:

  • Does the agent assess threats correctly?
  • Does it escalate when confidence is low?
  • Does it respect safety policies?
  • Does it coordinate multiple systems correctly?
  • Does it create proper audit trails?

The stakes are higher. A bug in a web form is annoying. A bug in a building automation agent could lock people out, waste energy, or create security vulnerabilities.

The Testing Philosophy

The Agent Testing Pyramid

For autonomous agents, we use a modified testing pyramid:

Unit Tests → Integration Tests → E2E Tests → Production Monitoring
(mock all dependencies → real services → full stack → the real building)

  • Unit Tests (60%) - individual nodes, decision algorithms, state logic
  • Integration Tests (25%) - Agent ↔ Policy ↔ Events ↔ Identity, state transitions, error recovery
  • End-to-End Scenarios (15%) - full workflows, multi-system coordination, real-world performance

Why this matters:

  • Unit tests validate algorithms and logic in isolation - fast feedback, high coverage
  • Integration tests verify components work together correctly - real interactions
  • E2E tests prove real-world scenarios work end-to-end - full confidence

What Makes Agent Testing Different

1. State Machine Complexity

Agents use LangGraph state machines with multiple nodes and conditional routing. You need to test:

  • Individual node behavior (does each node do what it should?)
  • State transitions (do states flow correctly?)
  • Routing decisions (do conditions route properly?)
  • Error handling at each step (graceful degradation?)

2. Asynchronous Everything

Agents are inherently async - events, API calls, policy checks. Tests must handle:

  • Concurrent event processing
  • Async/await patterns
  • Race conditions
  • Timeouts and cancellation
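
For example, a concurrency test might look like the sketch below, assuming pytest-asyncio and a hypothetical `agent` fixture whose async handle_event() returns a result dict:

```python
# A minimal sketch using pytest-asyncio. The agent fixture and its
# handle_event() coroutine are hypothetical stand-ins for the real interface.
import asyncio

import pytest


@pytest.mark.asyncio
async def test_concurrent_events_do_not_interfere(agent, sample_events):
    # Process several events concurrently, and bound the whole run with a
    # timeout so a deadlock fails the test instead of hanging the suite.
    results = await asyncio.wait_for(
        asyncio.gather(*(agent.handle_event(e) for e in sample_events)),
        timeout=5.0,
    )
    # Each event should produce an independent, completed result.
    assert len(results) == len(sample_events)
    assert all(r["status"] == "completed" for r in results)
```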

3. External Dependencies

Agents depend on many systems:

  • Policy engine (OPA)
  • Event bus (NATS)
  • Identity (SPIFFE)
  • MCP adapters (vendor APIs)
  • Observability (telemetry)

Tests need strategies for mocking vs. real integration.

4. Non-Deterministic Behavior

Some agent decisions involve:

  • Confidence thresholds (is 0.7 confident enough?)
  • Historical patterns (repeated events change scoring)
  • Time-based logic (after-hours vs. business hours)

Tests must account for this variability while still being deterministic.
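
One way to keep such tests deterministic is to pass time (and thresholds) in as parameters instead of reading them globally. The sketch below uses a toy scoring function to show the pattern; the real analyzer's interface will differ:

```python
# Sketch: time-based logic stays testable when the clock is a parameter
# rather than a call to datetime.now() buried inside the agent.
from datetime import datetime


def score_event(event_type: str, now: datetime) -> float:
    # Toy stand-in for the real threat scorer.
    base = 0.6 if event_type == "forced_entry" else 0.2
    after_hours = now.hour < 6 or now.hour >= 22
    return min(base + (0.3 if after_hours else 0.0), 1.0)


def test_after_hours_forced_entry_scores_higher():
    at_2am = datetime(2024, 1, 15, 2, 0)    # fixed clock: always 2 AM
    at_noon = datetime(2024, 1, 15, 12, 0)  # fixed clock: business hours

    # Same inputs always give the same outputs, run the suite at any hour.
    assert score_event("forced_entry", now=at_2am) > score_event("forced_entry", now=at_noon)
```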

Testing Strategies

Unit Testing: The Foundation

Test individual nodes in isolation - Does the threat analyzer score events correctly? Does the decision node choose the right response? Does the execution node handle errors gracefully?

Key insight: Use dependency injection. Pass mock clients so you can test logic without real infrastructure. Fast tests (milliseconds) with deterministic results.
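
A minimal sketch of that pattern, assuming pytest-asyncio and a hypothetical DecisionNode that accepts an injected policy client:

```python
# Sketch: unit-test a decision node with a mocked policy client.
# DecisionNode, its constructor, and its run() method are hypothetical names.
from unittest.mock import AsyncMock

import pytest


@pytest.mark.asyncio
async def test_decision_node_escalates_high_confidence_threat():
    policy_client = AsyncMock()
    policy_client.evaluate.return_value = {"allow": True}   # deterministic mock

    node = DecisionNode(policy_client=policy_client)
    state = {"threat_score": 0.9, "confidence": 0.95}

    result = await node.run(state)

    assert result["action"] == "escalate"
    policy_client.evaluate.assert_awaited_once()   # policy was consulted
```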

Integration Testing: Real Interactions

Test components working together:

  • Agent + OPA: Does policy enforcement actually work?
  • Agent + NATS: Does event processing handle real message patterns?
  • Agent + MCP: Does vendor coordination succeed and handle failures?

Key insight: Run real infrastructure (often in Docker) but use test data and short timeouts. These tests are slower than unit tests, but they prove the integration points actually work.
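
For example, an OPA integration test can call the engine's REST Data API directly. In this sketch the port, the policy package path (citadelmesh/authz/allow), and the integration marker are assumptions for illustration:

```python
# Sketch: integration test against a real OPA instance (e.g. started in
# Docker). The policy path and expected result are illustrative assumptions.
import httpx
import pytest


@pytest.mark.integration   # custom marker; register it in pytest config
def test_opa_denies_unlock_during_lockdown():
    payload = {
        "input": {
            "action": "unlock_door",
            "zone": "lobby",
            "building_mode": "lockdown",
        }
    }
    resp = httpx.post(
        "http://localhost:8181/v1/data/citadelmesh/authz/allow",
        json=payload,
        timeout=2.0,   # short timeout: a missing container fails fast
    )
    resp.raise_for_status()
    assert resp.json().get("result") is False
```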

End-to-End Testing: Real Scenarios

Test complete workflows that matter:

  • Low threat scenario: Monitor but don't escalate
  • High threat scenario: Coordinate response across multiple vendors
  • Critical threat scenario: Escalate to humans immediately

Key insight: These are slower but prove the system works as a whole. Test real business scenarios.
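
A sketch of one such scenario test, assuming a hypothetical build_agent_graph() factory that wires up the full stack and a final state that carries the decision and audit trail:

```python
# Sketch: end-to-end scenario test. build_agent_graph() and the state keys
# are hypothetical; assert on the business outcome, not on internal steps.
import pytest


@pytest.mark.e2e        # custom marker; register it in pytest config
@pytest.mark.asyncio
async def test_critical_threat_escalates_to_human():
    graph = build_agent_graph()   # full stack: real policy, events, adapters

    final_state = await graph.ainvoke(
        {"event": {"type": "forced_entry", "zone": "server_room", "hour": 2}}
    )

    assert final_state["decision"] == "escalate_to_human"
    assert final_state["audit_trail"], "every escalation must leave an audit trail"
```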

The Mock vs. Real Tradeoff

Use Mocks When:

  • You want fast tests (milliseconds)
  • You're testing logic, not integration
  • External service would be slow/flaky
  • You need deterministic, repeatable results

Use Real Services When:

  • You're testing integration points
  • You need to verify protocol compliance (does it really work with NATS?)
  • You're testing error handling and recovery
  • You're doing performance testing

CitadelMesh approach: Unit tests use mocks for speed. Integration tests use real services in Docker. E2E tests use the full stack.

Test Fixtures: Making Tests Easy

The Problem: Writing test setup is tedious. Every test needs agents, events, configs, mocks... 15 lines of boilerplate before you can test anything.

The Solution: Reusable fixtures that encapsulate common setup patterns.

Benefits:

  • Write tests faster (1 line of setup vs. 15)
  • Tests are more readable (focus on what's being tested)
  • Setup is consistent across tests
  • Easy to update when APIs change (one place to fix)

Think of fixtures as "test factories" - they create the objects you need for testing with sensible defaults.
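
A sketch of what those fixtures might look like in a conftest.py; SecurityAgent and the default event values are hypothetical:

```python
# Sketch of a conftest.py with "test factory" fixtures. SecurityAgent is a
# hypothetical class; the defaults are placeholders.
from unittest.mock import AsyncMock

import pytest


@pytest.fixture
def mock_policy_client():
    client = AsyncMock()
    client.evaluate.return_value = {"allow": True}   # sensible default
    return client


@pytest.fixture
def agent(mock_policy_client):
    # Tests now need one line of setup: `async def test_x(agent): ...`
    return SecurityAgent(policy_client=mock_policy_client)


@pytest.fixture
def intrusion_event():
    return {"type": "forced_entry", "zone": "lobby", "confidence": 0.92}
```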

What to Test

For Every Agent

✅ State Transitions

  • Does each state do what it should?
  • Do transitions happen correctly?
  • Are terminal states reached?
  • Does routing work for all conditions?
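
The sketch below exercises routing on a deliberately tiny LangGraph state machine; the node names and AgentState fields are illustrative, not the production agent's:

```python
# Sketch: test state transitions and routing on a minimal LangGraph graph.
from typing import TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict, total=False):
    threat_score: float
    action: str


def analyze(state: AgentState) -> AgentState:
    return {"threat_score": state.get("threat_score", 0.0)}


def route(state: AgentState) -> str:
    # Routing decision under test: high scores escalate, the rest monitor.
    return "escalate" if state["threat_score"] >= 0.8 else "monitor"


def build_graph():
    g = StateGraph(AgentState)
    g.add_node("analyze", analyze)
    g.add_node("escalate", lambda s: {"action": "escalate"})
    g.add_node("monitor", lambda s: {"action": "monitor"})
    g.set_entry_point("analyze")
    g.add_conditional_edges("analyze", route, {"escalate": "escalate", "monitor": "monitor"})
    g.add_edge("escalate", END)
    g.add_edge("monitor", END)
    return g.compile()


def test_high_score_routes_to_escalate():
    assert build_graph().invoke({"threat_score": 0.9})["action"] == "escalate"


def test_low_score_routes_to_monitor():
    assert build_graph().invoke({"threat_score": 0.3})["action"] == "monitor"
```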

✅ Decision Logic

  • Do decision nodes route correctly based on conditions?
  • Are edge cases handled (null inputs, missing fields)?
  • Is error handling correct (fail safely)?

✅ Policy Integration

  • Are policies consulted before actions?
  • Are denials handled correctly?
  • Is there a complete audit trail?
  • Does fail-safe work (deny when OPA unavailable)?
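
The fail-safe case deserves an explicit test. A sketch, assuming a hypothetical agent whose execute() is expected to deny by default when the policy client errors:

```python
# Sketch: fail-safe behavior when the policy engine is unreachable.
# The agent fixture, execute(), and the result shape are hypothetical;
# the expectation is deny-by-default (fail closed, never fail open).
from unittest.mock import AsyncMock

import pytest


@pytest.mark.asyncio
async def test_action_is_denied_when_policy_engine_is_down(agent):
    agent.policy_client = AsyncMock()
    agent.policy_client.evaluate.side_effect = ConnectionError("OPA unreachable")

    result = await agent.execute({"action": "unlock_door", "zone": "lobby"})

    assert result["status"] == "denied"
    assert "policy_unavailable" in result["reasons"]
```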

✅ Observability

  • Are traces generated for every decision?
  • Are metrics recorded correctly?
  • Are errors logged with context?
  • Can you reconstruct what happened from logs?
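
A sketch of a log-context check using pytest's built-in caplog fixture; the agent fixture, event fixture, and logger name are assumptions:

```python
# Sketch: verify a decision leaves enough context in the logs to be
# reconstructed later. The logger name "citadelmesh.agent" is an assumption.
import logging

import pytest


@pytest.mark.asyncio
async def test_decision_is_logged_with_context(agent, intrusion_event, caplog):
    with caplog.at_level(logging.INFO, logger="citadelmesh.agent"):
        await agent.handle_event(intrusion_event)

    decision_logs = [r for r in caplog.records if "decision" in r.getMessage().lower()]
    assert decision_logs, "every decision should produce at least one log record"
    assert intrusion_event["zone"] in decision_logs[0].getMessage()
```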

✅ Performance

  • Does the agent meet SLAs (typically <200ms)?
  • Can it handle concurrent events without interference?
  • Does it degrade gracefully under load?
  • Are there memory leaks over time?
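
A sketch of a latency and concurrency check, reusing the hypothetical agent and event fixtures from the earlier examples; keep thresholds generous enough to stay stable on slow CI runners:

```python
# Sketch: validate the <200ms decision SLA and basic behavior under a burst
# of concurrent events. Fixtures and result shapes are hypothetical.
import asyncio
import time

import pytest


@pytest.mark.asyncio
async def test_single_decision_meets_latency_sla(agent, intrusion_event):
    start = time.perf_counter()
    await agent.handle_event(intrusion_event)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 200, f"decision took {elapsed_ms:.1f} ms"


@pytest.mark.asyncio
async def test_agent_handles_burst_of_concurrent_events(agent, intrusion_event):
    results = await asyncio.gather(*(agent.handle_event(intrusion_event) for _ in range(50)))
    assert all(r["status"] == "completed" for r in results)
```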

Testing Pitfalls to Avoid

❌ Flaky Tests - Tests that pass/fail randomly. Usually due to timing issues, shared state, or external dependencies. Fix: Use deterministic mocks, avoid sleeps, isolate state between tests.

❌ Brittle Tests - Tests that break when implementation details change (even though behavior is the same). Test behavior and outcomes, not implementation. Focus on inputs/outputs, not internal state.

❌ Slow Test Suites - Tests that take minutes to run mean developers won't run them frequently. Fix: Use mocks for unit tests, parallelize integration tests, optimize slow operations.

❌ Missing Edge Cases - Tests only cover happy paths, but bugs hide in edge cases. Test: nulls, empty arrays, malformed input, boundary conditions, error states, timeouts.

Debugging Agents

When tests fail (or agents misbehave), you need good debugging tools:

State Inspection - Step through agent execution state-by-state to see what's happening. LangGraph supports streaming state changes.

Trace Logging - Enable debug logging to see all operations. Every agent logs decisions, state changes, and external calls.

Breakpoint Debugging - Use the Python debugger (pdb) to pause execution and inspect variables. Essential for understanding complex state.

Visual Debugging - Generate Mermaid diagrams of your state machine to visualize the graph structure and routing.
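
The first and last of these techniques might look like the following sketch, using LangGraph's compiled-graph API (recent versions) and the toy build_graph() from the state-machine example above:

```python
# Sketch: state inspection and visual debugging on a compiled LangGraph graph.
graph = build_graph()

# State inspection: stream node-by-node updates instead of only the final state.
for step in graph.stream({"threat_score": 0.9}):
    print(step)   # one update per node, keyed by node name

# Visual debugging: emit a Mermaid diagram of the graph structure and routing.
print(graph.get_graph().draw_mermaid())
```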

The Testing Mindset

Testing isn't about finding bugs (though it does that).

Testing is about building confidence that your autonomous agent will make the right decisions in production.

Every test is a specification:

  • "When there's a forced entry at 2 AM, escalate immediately"
  • "When confidence is below 0.7, ask a human"
  • "When policy denies an action, log and abort"

Write tests that document intent and prove correctness.
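
For instance, the first specification above translates almost directly into a test (the agent fixture is hypothetical):

```python
# Sketch: a specification written as a test. The test name states the intent;
# the assertion proves it.
import pytest


@pytest.mark.asyncio
async def test_forced_entry_at_2am_escalates_immediately(agent):
    event = {"type": "forced_entry", "zone": "lobby", "hour": 2}

    result = await agent.handle_event(event)

    assert result["decision"] == "escalate_to_human"
```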

Best Practices

  1. Start with unit tests - Fast feedback, high coverage on algorithms
  2. Mock external dependencies - Keep tests fast and deterministic
  3. Test edge cases - Nulls, errors, boundaries, malformed input
  4. Use fixtures - Reusable test setup makes tests easier to write
  5. Run tests in CI/CD - Automated validation on every commit
  6. Measure coverage - Aim for 80%+ on critical decision paths
  7. Performance test - Validate SLAs are actually met
  8. Keep tests maintainable - Readable, focused, non-brittle

The Testing Culture

In CitadelMesh, we believe:

Autonomous agents you can trust aren't built on hope - they're built on tests.

Every agent includes comprehensive tests. Every PR requires tests. Every bug gets a regression test.

Testing isn't an afterthought - it's integral to building reliable autonomous systems.

When an agent makes a decision that affects a real building, you want to know it's been thoroughly validated. Tests give you that confidence.


Remember: Well-tested agents are reliable agents. Build confidence through validation.