Agent Health Model

Objective

Agent Health measures how well an agent is performing within an evaluation window.

The model evaluates multiple aspects of an agent and converts each into a normalized score between 0 and 1, where:

  • 1 = fully healthy
  • 0 = fully unhealthy

Evaluation Window

All calculations are performed over a configurable rolling window.

Default: Last 24 Hours


Health States

Score     State
>= 85     Healthy
65-84     Degraded
< 65      Unhealthy
N = 0     Inactive
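
The state mapping above can be sketched as a small function. This is a minimal illustration, not the product's implementation: the name `health_state` is hypothetical, and treating fractional scores between 84 and 85 as Degraded is an assumption, since the table lists only whole-number boundaries.

```python
def health_state(health_score: float, executions: int) -> str:
    """Map a 0-100 health score and execution count N to a state name."""
    # N = 0 takes precedence: an agent with no executions is Inactive.
    if executions == 0:
        return "Inactive"
    if health_score >= 85:
        return "Healthy"
    # Assumption: scores in [65, 85) are Degraded, including fractions.
    if health_score >= 65:
        return "Degraded"
    return "Unhealthy"
```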

Final Health Score

The overall health of an agent is computed as a weighted combination of four dimension scores, each between 0 and 1. Because the weights sum to 100, HealthScore ranges from 0 to 100:

  • Reliability (40%) — correctness of outcomes
  • Performance (20%) — execution efficiency
  • Stability (20%) — runtime consistency
  • Operational (20%) — SLA adherence and trigger reliability
HealthScore =
40 * ReliabilityScore
+ 20 * PerformanceScore
+ 20 * StabilityScore
+ 20 * OperationalScore
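
As a rough sketch, the weighted sum can be written as follows. The function name `health_score` is hypothetical; the weights are the ones listed above, and each input is assumed to already be a score between 0 and 1.

```python
def health_score(reliability: float, performance: float,
                 stability: float, operational: float) -> float:
    """Combine four 0-1 dimension scores into a 0-100 health score."""
    return (40 * reliability
            + 20 * performance
            + 20 * stability
            + 20 * operational)
```

For example, dimension scores of 0.9, 0.5, 1.0, and 0.8 give 36 + 10 + 20 + 16 = 82.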

Reliability Score (Weight: 40)

What it represents

Reliability measures how consistently the agent produces successful outcomes.

Inputs

  • ESR = Expected Success Rate
  • ASR = Actual Success Rate
  • EFH = Expected Failures per Hour
  • AFH = Actual Failures per Hour
  • N = Total Executions

Success Score

Measures how well the actual success rate meets the expected success rate.

  • If actual success rate meets or exceeds expectation, the score is 1
  • Otherwise, the score is the ratio of the actual to the expected success rate
SuccessScore = Min(1, ASR / ESR)
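
A minimal sketch of this formula, assuming both rates are expressed on the same scale (for example, fractions between 0 and 1). The function name is hypothetical, and returning 1 when ESR is not positive is an assumption made only to avoid division by zero.

```python
def success_score(asr: float, esr: float) -> float:
    """SuccessScore = Min(1, ASR / ESR)."""
    if esr <= 0:
        # Assumption: with no positive expectation there is nothing to miss.
        return 1.0
    return min(1.0, asr / esr)
```

With ESR = 0.9, an ASR of 0.72 scores about 0.8, while an ASR of 0.95 scores 1.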

Failure Score

Measures whether the actual failure rate stays within the expected limit.

  • If actual failures are within expected limits (including AFH = 0), the score is 1
  • Otherwise, the score is the ratio of expected to actual failures per hour
FailureScore = Min(1, EFH / AFH)
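
The formula divides by AFH, so an implementation has to decide what happens when no failures occurred. The sketch below, with a hypothetical function name, assumes that zero actual failures counts as "within expected limits" and returns 1.

```python
def failure_score(afh: float, efh: float) -> float:
    """FailureScore = Min(1, EFH / AFH)."""
    if afh <= 0:
        # Assumption: no failures at all is within any expected limit.
        return 1.0
    return min(1.0, efh / afh)
```

With EFH = 2, an AFH of 4 scores 0.5, and an AFH of 1 scores 1.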

Combine

The reliability score is derived by combining success and failure behavior, with higher emphasis on success.

  • Success contributes 70%
  • Failure contributes 30%
ReliabilityRaw =
0.7 * SuccessScore
+ 0.3 * FailureScore
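
The 70/30 combination can be sketched as follows (the function name is hypothetical; both inputs are assumed to be scores between 0 and 1):

```python
def reliability_raw(success: float, failure: float) -> float:
    """ReliabilityRaw = 0.7 * SuccessScore + 0.3 * FailureScore."""
    return 0.7 * success + 0.3 * failure
```

For instance, a SuccessScore of 0.8 and a FailureScore of 0.5 give 0.56 + 0.15 = 0.71.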

Confidence Adjustment

To avoid misleading results when execution volume is low, a confidence factor is applied.

  • N = total number of executions in the evaluation window
  • N_full = minimum number of executions required for full confidence
ConfidenceFactor = Min(1, N / N_full)

Recommended:

N_full = 20

The final reliability score blends the observed reliability with a neutral baseline (1), based on confidence:

  • When execution count is low → score stays closer to 1 (neutral)
  • When execution count is sufficient → score reflects actual reliability
ReliabilityScore =
ConfidenceFactor * ReliabilityRaw
+ (1 - ConfidenceFactor) * 1
Tip: In low-volume scenarios, the system assumes the agent is healthy until sufficient data is available.
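
The confidence blend might look like the following sketch. The function name is hypothetical, and `n_full` defaults to the recommended value of 20.

```python
def reliability_score(raw: float, n: int, n_full: int = 20) -> float:
    """Blend ReliabilityRaw with the neutral baseline 1 by confidence."""
    confidence = min(1.0, n / n_full)
    return confidence * raw + (1 - confidence) * 1.0
```

With ReliabilityRaw = 0.6 and only 5 executions, the confidence factor is 0.25 and the score is 0.25 * 0.6 + 0.75 = 0.9. Once N reaches 20, the score equals the raw value of 0.6.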


Performance Score (Weight: 20)

What it represents

Performance measures how efficiently the agent executes compared to expected runtime.

Inputs

  • ER = Expected Runtime
  • AR = Actual Runtime
  • DeviationAllowed = Allowed Deviation, as a fraction of expected runtime (default 20%, i.e. 0.2)

Deviation

Deviation measures how much the actual runtime differs from the expected runtime.

Deviation = (AR - ER) / ER

Score

PerformanceScore evaluates how closely execution time matches expectations.

  • If the agent runs within expected time → score is 1
  • If the runtime exceeds expectation → score decreases linearly with the deviation
  • The score reaches 0 once the deviation equals or exceeds the allowed deviation
PerformanceScore =
if AR <= ER then 1
else 1 - Min(1, Deviation / DeviationAllowed)
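
A minimal sketch of the performance calculation, assuming AR and ER are in the same time unit and the allowed deviation is a fraction (0.2 for 20%). The function name and the error raised for a non-positive ER are illustrative choices, not part of the model.

```python
def performance_score(ar: float, er: float,
                      deviation_allowed: float = 0.2) -> float:
    """Score runtime against expectation; 1 is on time, 0 is at the cap."""
    if er <= 0:
        # Assumption: an expected runtime must be positive to compare against.
        raise ValueError("expected runtime must be positive")
    if ar <= er:
        return 1.0
    deviation = (ar - er) / er
    return 1.0 - min(1.0, deviation / deviation_allowed)
```

With ER = 100 seconds and the default allowance, AR = 110 scores 0.5 and AR = 120 or more scores 0.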

Stability Score (Weight: 20)

What it represents

Stability measures whether the agent is producing abnormal or long-running executions.

It focuses on identifying signs of instability such as:

  • executions taking significantly longer than expected
  • potential hangs, retries, or resource contention

Input

S = Count of long-running instances (> 5x expected runtime)

Threshold

The threshold defines the level at which instability is considered significant.

  • Below the threshold → impact on score is gradual
  • At or beyond the threshold → maximum penalty is applied

This ensures that occasional delays do not overly affect the agent's health.

S_threshold = 15

Score

The score decreases as the number of long-running instances increases.

  • If there are no long-running instances → score is 1
  • As long-running instances increase → score gradually decreases
  • The impact grows logarithmically: each additional long-running instance lowers the score by less than the previous one
  • The score reaches 0 at S = S_threshold and stays there for larger counts
StabilityScore =
1 - Min(1, log2(S + 1) / log2(S_threshold + 1))
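
The stability formula can be sketched as below. The function name is hypothetical; `s_threshold` defaults to the value of 15 given above.

```python
import math


def stability_score(s: int, s_threshold: int = 15) -> float:
    """StabilityScore = 1 - Min(1, log2(S + 1) / log2(S_threshold + 1))."""
    penalty = math.log2(s + 1) / math.log2(s_threshold + 1)
    return 1.0 - min(1.0, penalty)
```

With the default threshold, log2(16) = 4, so S = 1 scores 0.75, S = 3 scores 0.5, S = 7 scores 0.25, and S = 15 or more scores 0.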

Operational Score (Weight: 20)

What it represents

Operational score measures how reliably the agent functions in real-world conditions.

It captures:

  • Ability to meet defined SLAs (timeliness)
  • Reliability of execution triggers
  • Overall operational readiness of the agent

Inputs

  • B = SLA breaches
  • T = Trigger failures

Thresholds

The thresholds define the acceptable limits for operational deviations.

  • Below the threshold → score decreases in proportion to the count
  • At or beyond the threshold → maximum penalty is applied
B_threshold = 5
T_threshold = 5

Scores

Each component is normalized to a value between 0 and 1:

  • No breaches or failures → score is 1
  • Increasing breaches or failures → score decreases proportionally
  • Penalty is capped at the defined threshold
SLAScore = 1 - Min(1, B / B_threshold)
TriggerScore = 1 - Min(1, T / T_threshold)

Combine

Operational score combines SLA compliance and trigger reliability, with higher weight given to SLA adherence.

OperationalScore =
0.6 * SLAScore
+ 0.4 * TriggerScore
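
Putting the two component scores and the 60/40 combination together, a sketch could look like this. The function names are hypothetical, and the thresholds default to the values of 5 given above.

```python
def capped_count_score(count: int, threshold: int) -> float:
    """Shared shape of SLAScore and TriggerScore: 1 - Min(1, count / threshold)."""
    return 1.0 - min(1.0, count / threshold)


def operational_score(breaches: int, trigger_failures: int,
                      b_threshold: int = 5, t_threshold: int = 5) -> float:
    """OperationalScore = 0.6 * SLAScore + 0.4 * TriggerScore."""
    sla = capped_count_score(breaches, b_threshold)
    trigger = capped_count_score(trigger_failures, t_threshold)
    return 0.6 * sla + 0.4 * trigger
```

For example, 2 SLA breaches and 1 trigger failure give SLAScore = 0.6 and TriggerScore = 0.8, so OperationalScore = 0.36 + 0.32 = 0.68.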

Inactive Agents

An agent is considered Inactive when it has no executions within the evaluation window, that is, when N (the total number of executions in the window) is 0.

When N = 0:

  • The agent state is marked as Inactive
  • The agent is excluded from system health aggregation
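
The exclusion step can be illustrated with a simple filter. This sketch covers only the filtering: how system health is aggregated over the remaining agents is not specified here, and the `AgentWindow` type and `active_agents` function are hypothetical names.

```python
from typing import List, NamedTuple


class AgentWindow(NamedTuple):
    """Hypothetical per-agent summary for one evaluation window."""
    name: str
    executions: int  # N
    health_score: float


def active_agents(agents: List[AgentWindow]) -> List[AgentWindow]:
    """Drop Inactive agents (N = 0) before any system-level aggregation."""
    return [agent for agent in agents if agent.executions > 0]
```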