Flaky Test Detection

Flaky test detection in ContextQA — AI-powered failure classification identifies intermittent failures and separates them from real application bugs.


Who is this for? QA managers and engineering managers who need to separate intermittent test noise from real regressions — so failures that matter get attention and flaky tests don't block deployments.

Flaky test: A test case that produces inconsistent results — passing on some executions and failing on others — without any change to the application code or test definition, typically caused by timing issues, environment variability, or non-deterministic UI behavior.

Flaky tests are the primary reason development teams lose confidence in automated test suites. When every pipeline failure requires human triage to determine whether it is a real regression or noise, velocity drops and eventually the suite is ignored. ContextQA addresses this by classifying every failure with an AI-derived root cause category, making flakiness visible as a distinct failure type rather than an undifferentiated red status.

What is a flaky test?

A flaky test passes on some runs and fails on others under conditions that have not changed — same code, same environment, same test definition. Common causes include:

  • Timing dependencies: The test clicks a button before an async operation completes (see the sketch after this list).

  • Order dependence: The test relies on state left by a previous test case that sometimes runs in a different order.

  • Environment variability: Network latency spikes, DNS resolution delays, or shared database contention.

  • Non-deterministic UI: Animations, lazy-loaded components, or third-party widgets that render at unpredictable times.
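
As a concrete example of the first cause above, here is a minimal Playwright sketch in TypeScript. The URL and selector are placeholders, not ContextQA artifacts: the first test races an async data load, while the second uses a retrying assertion that absorbs it.

  import { test, expect } from '@playwright/test';

  test('flaky: reads the total before the async load finishes', async ({ page }) => {
    await page.goto('https://example.com/dashboard'); // placeholder URL
    // Flaky: #total renders with a placeholder and is filled in later
    // by an XHR, so this one-shot read races the network.
    const total = await page.locator('#total').textContent();
    expect(total).toBe('42');
  });

  test('stable: uses a retrying web-first assertion', async ({ page }) => {
    await page.goto('https://example.com/dashboard');
    // Stable: toHaveText() retries until it passes or times out,
    // absorbing the asynchronous population of the element.
    await expect(page.locator('#total')).toHaveText('42');
  });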

ContextQA defines flakiness as a failure category distinct from an application bug. If the application has a genuine regression, every run fails consistently and ContextQA classifies it as an application bug. If the failure is in the test logic, ContextQA classifies it as a test bug. If the failure is intermittent with no reproducible cause in the application, ContextQA classifies it as a flaky failure.

How ContextQA classifies failures

ContextQA uses AI root cause analysis on every failed test case. The analysis pipeline examines:

  • The failing step and the error message

  • The browser console log for JavaScript errors

  • The HAR network log for failed or slow requests

  • The DOM state at the time of failure (from the Playwright trace)

  • Historical execution data for the same test case

From this data, ContextQA assigns one of four failure categories:

  • Application Bug: The application behaved incorrectly; the test is functioning as designed.

  • Test Bug: The test definition has an error — incorrect selector, wrong expected value, missing wait.

  • Flaky Failure: The failure is intermittent; the root cause is timing, environment variability, or non-determinism.

  • Environment Issue: Infrastructure-level failure — network timeout, missing credential, environment not responding.
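
For programmatic consumers, these categories appear as the failureCategory values returned by the get_root_cause MCP tool described below. A minimal TypeScript model, assuming the field names documented later on this page (the shape of affectedStep is an assumption):

  // The four classification values returned by get_root_cause.
  type FailureCategory =
    | 'APPLICATION_BUG'
    | 'TEST_BUG'
    | 'FLAKY_FAILURE'
    | 'ENVIRONMENT_ISSUE';

  // Field names follow the get_root_cause response described below;
  // affectedStep is modeled as a plain string here, an assumption.
  interface RootCauseResult {
    failureCategory: FailureCategory;
    explanation: string;
    fixSuggestion: string;
    affectedStep: string;
  }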

The AI explanation accompanying each classification states specifically what evidence led to the classification. For a flaky failure, the explanation typically notes that the test has both passed and failed on identical code, and identifies the specific step and condition that is non-deterministic.

Accessing flaky test data in the Analytics Dashboard

  1. Open Analytics in the left navigation.

  2. Click the Execution Dashboard tab.

  3. Locate the Consistently Failing Tests widget. This widget lists test cases with repeated failures across recent runs, ranked by failure frequency.

  4. Click any test case in the widget to open the failure detail panel.

  5. The detail panel shows the failure category distribution (how many runs were classified as flaky vs. application bug vs. other) and the AI explanation for each failure type.

The Consistently Failing Tests widget surfaces test cases that failed in multiple consecutive runs. ContextQA distinguishes between cases that always fail (likely an application bug or test bug) and cases that alternate between passing and failing (likely flaky). The visual indicator for flakiness is a mixed pass/fail run history in the sparkline column.

For a broader view, the Failure Analysis report (accessed from Analytics → Failure Analysis) shows aggregate failure categories across an entire test suite or date range, allowing you to measure what percentage of your failures are flaky versus genuine regressions.
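
If you pull classification data out programmatically instead, the flaky share is a simple ratio. A sketch reusing the RootCauseResult type from above:

  // Percentage of failures in a result set classified as flaky.
  // Reuses the RootCauseResult interface sketched earlier.
  function flakyRatePercent(results: RootCauseResult[]): number {
    if (results.length === 0) return 0;
    const flaky = results.filter(
      (r) => r.failureCategory === 'FLAKY_FAILURE'
    ).length;
    return (100 * flaky) / results.length;
  }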

Using the get_root_cause MCP tool

The get_root_cause MCP tool returns the AI failure classification for a specific execution programmatically. The response includes:

  • failureCategory: one of APPLICATION_BUG, TEST_BUG, FLAKY_FAILURE, ENVIRONMENT_ISSUE

  • explanation: the AI's natural-language explanation of the root cause

  • fixSuggestion: a concrete suggestion for resolving the failure

  • affectedStep: the step number and action that failed

When building CI integrations that need to distinguish "block the release" from "likely noise," use the failureCategory field from get_root_cause to gate your pipeline logic. A pipeline that fails the build on APPLICATION_BUG but creates a Jira ticket and continues on FLAKY_FAILURE is a common pattern for teams managing large suites.

Example MCP invocation pattern. The sketch below uses the MCP TypeScript SDK (@modelcontextprotocol/sdk); the server launch command, package name, and executionId argument name are assumptions to check against your ContextQA MCP server's tool schema:
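
  import { Client } from '@modelcontextprotocol/sdk/client/index.js';
  import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

  // RootCauseResult is the interface sketched earlier on this page.
  async function gateOnRootCause(executionId: string): Promise<void> {
    // Launch and connect to the ContextQA MCP server. The command and
    // package name below are placeholders for your actual setup.
    const transport = new StdioClientTransport({
      command: 'npx',
      args: ['-y', 'contextqa-mcp-server'], // hypothetical package name
    });
    const client = new Client({ name: 'ci-gate', version: '1.0.0' });
    await client.connect(transport);

    // Request the AI classification for one execution. The argument
    // name executionId is an assumption; check the tool schema.
    const result = await client.callTool({
      name: 'get_root_cause',
      arguments: { executionId },
    });

    // Assume the tool responds with a single JSON-encoded text block.
    const content = result.content as unknown as Array<{ type: string; text: string }>;
    const rootCause = JSON.parse(content[0].text) as RootCauseResult;

    if (rootCause.failureCategory === 'APPLICATION_BUG') {
      console.error(`Blocking release: ${rootCause.explanation}`);
      process.exitCode = 1; // fail the pipeline step
    } else if (rootCause.failureCategory === 'FLAKY_FAILURE') {
      console.warn(`Likely noise: ${rootCause.fixSuggestion}`);
      // e.g. file a Jira ticket here and let the pipeline continue
    }
    await client.close();
  }

Setting process.exitCode rather than calling process.exit lets the step finish logging before it fails; FLAKY_FAILURE passes through, matching the block-on-bug, ticket-on-flake pattern described above.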

What to do when a test is classified as flaky

ContextQA classifies the failure; resolving it requires one of three approaches depending on the root cause:

1. Fix timing issues. If the AI explanation identifies a race condition (for example, clicking an element before it is interactable), add an explicit wait step to the test case: in ContextQA's step editor, add a Wait action before the problematic step, targeting either a specific element becoming visible or a network request completing.
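
For teams that also maintain code-based Playwright tests, the equivalent explicit waits look like the following sketch; the URL, selector, and endpoint are placeholders:

  import { test } from '@playwright/test';

  test('submit order only after async dependencies settle', async ({ page }) => {
    await page.goto('https://example.com/orders'); // placeholder URL
    // Register the network wait before triggering the request.
    const orderSaved = page.waitForResponse(
      (resp) => resp.url().includes('/api/orders') && resp.ok()
    );
    // Wait for the element to be visible before interacting with it.
    await page.locator('#submit-order').waitFor({ state: 'visible' });
    await page.locator('#submit-order').click();
    await orderSaved;
  });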

2. Configure retry behavior. ContextQA test plans support a recovery action for failed test cases. In Test Plans → [Plan] → Settings, the Recovery Action field accepts Run_Next_Testcase as a value. This instructs ContextQA to continue executing remaining test cases when one fails, rather than halting the plan. Combine this with post-execution flakiness analysis rather than treating every failure as a blocker.

3. Isolate environment dependencies. If ContextQA consistently classifies failures as ENVIRONMENT_ISSUE for a specific test case, the test may be hitting a dependency that is unreliable in your staging environment. Use a dedicated test environment or mock the external dependency.
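
In a code-based test, one way to isolate such a dependency is network interception. A Playwright sketch with a placeholder endpoint and payload:

  import { test } from '@playwright/test';

  test('checkout against a mocked payment provider', async ({ page }) => {
    // Replace calls to the unreliable third-party service with a
    // deterministic canned response.
    await page.route('**/api/payment-provider/**', (route) =>
      route.fulfill({
        status: 200,
        contentType: 'application/json',
        body: JSON.stringify({ status: 'approved' }),
      })
    );
    await page.goto('https://example.com/checkout'); // placeholder URL
    // ...the rest of the test now runs against the stable mock
  });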

Retry configuration and the recovery action

The Recovery Action in test plan settings controls what ContextQA does when a test case fails mid-plan:

  • Stop: Halt plan execution immediately on the first failure.

  • Run_Next_Testcase: Mark the failed case and continue executing the remaining cases.

Run_Next_Testcase is recommended for suites with known flaky tests. It allows the plan to complete, giving you a full picture of which failures are consistent and which are intermittent. After the run, use the failure category data from the Analytics Dashboard to triage.

Frequently Asked Questions

How many runs does ContextQA need before it can identify a flaky test?

ContextQA can classify a single failure as FLAKY_FAILURE on the first occurrence if the AI reasoning from the HAR, console log, and DOM snapshot is consistent with timing or environment variability. The Consistently Failing Tests widget in the Analytics Dashboard requires at least two runs to establish a pattern, but single-run classification is always available via the get_root_cause MCP tool.

Does ContextQA automatically retry flaky tests?

ContextQA does not automatically retry individual test cases within an execution. The platform classifies failures and surfaces them for human action. Automatic retries can mask genuine regressions, so ContextQA's design philosophy is to classify and report rather than silently re-run. Use the retry logic in your CI pipeline if you want to re-run an entire plan.

Can I mark a test case as "known flaky" to suppress notifications?

There is no explicit "known flaky" flag on test cases in the current UI. The recommended approach is to use the failure category data from get_root_cause in your CI integration to suppress notifications for FLAKY_FAILURE classifications while still alerting on APPLICATION_BUG.
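
A sketch of that suppression filter, reusing the RootCauseResult type from earlier on this page:

  // Page a human only for genuine regressions; log flaky failures for
  // later triage. How to treat TEST_BUG and ENVIRONMENT_ISSUE is a
  // team decision, so this sketch stays conservative.
  function shouldNotify(result: RootCauseResult): boolean {
    return result.failureCategory === 'APPLICATION_BUG';
  }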


Get release readiness reports your stakeholders understand. Book a Demo → See the analytics dashboard, failure analysis, and flaky test detection for your test suite.
