Why Multi-Agent?

Single-LLM approaches to test automation hit walls quickly. You either get a model that's great at understanding requirements but produces brittle selectors, or one that's technically precise but misses the intent of what you're trying to test.

The multi-agent approach splits responsibilities across specialized agents, each optimized for a specific task. They communicate through structured protocols, building on each other's outputs rather than trying to do everything in one prompt.
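The hand-off between agents can be sketched as a simple chain, where each agent consumes the previous one's structured output. This is an illustrative sketch, not the framework's actual API; the agent names and data shapes here are hypothetical stand-ins.

```python
# Illustrative sketch of the four-agent pipeline. Each "agent" is a
# callable that consumes the previous agent's structured output.
def run_pipeline(requirement, agents):
    """Chain agent outputs: Analyst -> Writer -> Executor -> Healer."""
    scenarios = agents["analyst"](requirement)   # requirements -> scenarios
    test_code = agents["writer"](scenarios)      # scenarios -> test code
    result = agents["executor"](test_code)       # run and collect traces
    if not result["passed"]:
        result = agents["healer"](test_code, result)  # propose a fix
    return result

# Stub agents, just to show the data flow between stages.
agents = {
    "analyst": lambda req: [{"name": "happy path", "steps": [req]}],
    "writer": lambda scs: f"// test for {scs[0]['name']}",
    "executor": lambda code: {"passed": True, "code": code},
    "healer": lambda code, res: res,
}

print(run_pipeline("reset password via email", agents)["passed"])  # True
```

The point of the structure is that each stage's output is data, not free-form text, so the next agent never has to re-interpret a prompt.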

The Four Agents

1. Analyst Agent

Takes user stories, acceptance criteria, or plain-language test descriptions and produces structured test scenarios. It understands the business domain and identifies edge cases that humans often miss.

Input: "Users should be able to reset their password via email"

Output: Structured scenarios covering happy path, invalid email, expired tokens, rate limiting, and email delivery failures.
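A structured scenario might look something like the following. The field names are my illustration of what "structured" means here, not the framework's actual schema.

```python
# Hypothetical shape for the Analyst's structured output.
from dataclasses import dataclass, field

@dataclass
class Scenario:
    name: str
    preconditions: list
    steps: list
    expected: str
    tags: list = field(default_factory=list)

scenarios = [
    Scenario(
        name="password reset - happy path",
        preconditions=["registered user with a verified email"],
        steps=["request reset", "open email link", "set new password"],
        expected="user can log in with the new password",
        tags=["auth", "email"],
    ),
    Scenario(
        name="password reset - expired token",
        preconditions=["reset token older than its TTL"],
        steps=["open stale email link"],
        expected="error shown; password unchanged",
        tags=["auth", "negative"],
    ),
]
```

Because the output is typed data rather than prose, the Writer can iterate over scenarios mechanically instead of parsing paragraphs.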

2. Writer Agent

Converts structured scenarios into executable Playwright test code. It consults the DOM Memory to use known-good selectors and follows team coding conventions stored in context.

The Writer is determinism-first: it prefers explicit waits, data-testid selectors, and predictable patterns over clever heuristics.
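That preference order can be expressed as a ranking over the candidate selectors stored in DOM Memory (the JSON structure shown later in this post). The specific ordering and tie-break below are illustrative assumptions, not the framework's published logic.

```python
# Sketch of determinism-first selector choice: prefer data-testid,
# then fall back by selector type, breaking ties by confidence.
PREFERRED_TYPES = {"data-testid": 0, "id": 1, "css": 2, "xpath": 3}

def pick_selector(entry):
    """Choose the most deterministic known-good selector for an element."""
    return min(
        entry["selectors"],
        key=lambda s: (PREFERRED_TYPES.get(s["type"], 99), -s["confidence"]),
    )["value"]

entry = {
    "element_id": "checkout-button",
    "selectors": [
        {"type": "css", "value": ".cart-summary button.primary", "confidence": 0.72},
        {"type": "data-testid", "value": "[data-testid='checkout-btn']", "confidence": 0.98},
    ],
}
print(pick_selector(entry))  # [data-testid='checkout-btn']
```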

3. Executor Agent

Runs tests through Playwright via MCP (Model Context Protocol), capturing detailed execution traces, screenshots, and network activity. When tests pass, it updates DOM Memory with confirmed-working selectors.

4. Healer Agent

When tests fail due to selector changes or UI updates, the Healer analyzes the failure, inspects the current DOM state, and proposes fixes. Unlike naive "self-healing" that just finds any matching element, the Healer validates that the proposed fix still tests the original intent.
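One way to make "still tests the original intent" concrete is to check the candidate element against the recorded properties of the original target before accepting a fix. This is a minimal sketch under that assumption; the field names and the role/name comparison are hypothetical.

```python
# Minimal sketch of intent-preserving healing: a candidate replacement
# element is accepted only if it still matches the recorded role and
# accessible name of the original target.
def validate_fix(original, candidate):
    """Reject fixes that find *an* element but not the *intended* one."""
    same_role = candidate.get("role") == original.get("role")
    same_name = candidate.get("accessible_name") == original.get("accessible_name")
    return same_role and same_name

original = {"role": "button", "accessible_name": "Checkout"}
good = {"role": "button", "accessible_name": "Checkout",
        "selector": "[data-testid='checkout-btn-v2']"}
bad = {"role": "link", "accessible_name": "Continue shopping",
       "selector": "a.primary"}

print(validate_fix(original, good))  # True
print(validate_fix(original, bad))   # False
```

A naive healer would happily accept `bad` because the selector resolves; checking role and accessible name is what distinguishes "an element matched" from "the right element matched."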

DOM Memory Architecture

The secret sauce is persistent DOM Memory—a knowledge base of selector mappings, element relationships, and historical stability data. When the Healer fixes a test, that fix propagates to all tests using the same element. When a selector proves unreliable across multiple runs, it gets flagged for human review.

Structure:

{
  "element_id": "checkout-button",
  "selectors": [
    { "type": "data-testid", "value": "[data-testid='checkout-btn']", "confidence": 0.98 },
    { "type": "css", "value": ".cart-summary button.primary", "confidence": 0.72 }
  ],
  "stability_score": 0.94,
  "last_seen": "2026-03-15T14:22:00Z",
  "related_elements": ["cart-total", "shipping-address"],
  "context": "E-commerce checkout flow"
}
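One plausible way to maintain those confidence scores is an exponential moving average over pass/fail observations, with a threshold that triggers the human-review flag. The weighting and threshold below are illustrative; the post doesn't specify the actual formula.

```python
# Sketch of per-selector confidence updates after each run, using an
# exponential moving average. ALPHA and REVIEW_THRESHOLD are assumptions.
ALPHA = 0.2             # weight given to the newest observation
REVIEW_THRESHOLD = 0.6  # below this, flag the selector for human review

def update_confidence(selector, worked):
    observed = 1.0 if worked else 0.0
    selector["confidence"] = round(
        (1 - ALPHA) * selector["confidence"] + ALPHA * observed, 3
    )
    selector["needs_review"] = selector["confidence"] < REVIEW_THRESHOLD
    return selector

sel = {"type": "css", "value": ".cart-summary button.primary", "confidence": 0.72}
update_confidence(sel, worked=False)
print(sel["confidence"], sel["needs_review"])  # 0.576 True
```

A decaying average like this means one flaky run dents confidence without destroying it, while a selector that fails repeatedly drifts below the threshold and surfaces for review.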

MCP Integration

The framework uses Anthropic's Model Context Protocol to connect agents with Playwright. MCP provides a standardized way for LLMs to invoke browser actions, read DOM state, and capture artifacts without custom integration code.

This means the same agent prompts work across different LLM providers—swap GPT-4o for Claude or a local model without rewriting the orchestration logic.
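The decoupling works because every model emits the same structured tool call, and a single dispatcher executes it. The sketch below illustrates that idea in miniature; the tool names and call shape are hypothetical, not the actual MCP wire format.

```python
# Illustrative sketch of provider-agnostic tool dispatch: any LLM backend
# emits the same {"tool": ..., "args": ...} structure, and one dispatcher
# maps it onto browser actions.
BROWSER_TOOLS = {
    "click": lambda args: f"clicked {args['selector']}",
    "snapshot_dom": lambda args: "<html>...</html>",
}

def dispatch(tool_call):
    """Execute a structured tool call, regardless of which model produced it."""
    return BROWSER_TOOLS[tool_call["tool"]](tool_call["args"])

# The same call shape works whether it came from Claude, GPT-4o, or a
# local model:
print(dispatch({"tool": "click",
                "args": {"selector": "[data-testid='checkout-btn']"}}))
```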

Current State

The framework is in active development at ALDAR, running against our Broker Portal and internal property management tools. Early results show a 40% reduction in test maintenance time and significantly faster coverage expansion for new features.

I'm not open-sourcing this yet—the current implementation is too tightly coupled to our internal systems. But I'm working toward a generalized version that others can use as a starting point.

Principles Driving the Design

  • Determinism first: AI suggestions are always validated by deterministic execution. If a fix can't be verified, it's not applied automatically.
  • Human in the loop: The system proposes; humans approve, especially for changes that affect test intent rather than just implementation.
  • Transparent reasoning: Every agent decision is logged with rationale. When something goes wrong, you can trace exactly why.
  • Cost-aware: LLM calls are expensive at scale. The framework batches operations, caches aggressively, and only invokes AI when heuristics fail.
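The cost-aware principle can be sketched as a three-tier lookup: cache first, cheap deterministic heuristic second, and the LLM only as a last resort. The function names here are hypothetical stand-ins, not the framework's actual API.

```python
# Sketch of "only invoke AI when heuristics fail": try a cache, then a
# deterministic heuristic, and fall back to an LLM call only as a last
# resort. call_llm is a hypothetical stand-in for the expensive path.
cache = {}

def resolve_selector(element_id, heuristic, call_llm):
    if element_id in cache:              # 1. cached answer: free
        return cache[element_id]
    guess = heuristic(element_id)        # 2. deterministic heuristic: cheap
    if guess is None:
        guess = call_llm(element_id)     # 3. LLM call: expensive, last resort
    cache[element_id] = guess
    return guess

calls = {"llm": 0}
def llm(eid):
    calls["llm"] += 1
    return f"[data-testid='{eid}']"

heuristic = lambda eid: None  # pretend heuristics can't resolve this one

resolve_selector("checkout-button", heuristic, llm)
resolve_selector("checkout-button", heuristic, llm)  # served from cache
print(calls["llm"])  # 1
```

Even in this toy version, the second request never reaches the model; at scale, that caching discipline is the difference between a usable tool and an unaffordable one.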