Agent Health Model
Objective
Agent Health measures how well an agent is performing within an evaluation window.
The model evaluates multiple aspects of an agent and converts each into a normalized score between 0 and 1, where:
- 1 = fully healthy
- 0 = fully unhealthy
Evaluation Window
All calculations are performed over a configurable rolling window.
Default: Last 24 Hours
Health States
| Condition | State |
|---|---|
| HealthScore >= 85 | Healthy |
| 65 <= HealthScore < 85 | Degraded |
| HealthScore < 65 | Unhealthy |
| N = 0 (no executions in the window) | Inactive |
Final Health Score
The overall health of an agent is computed as a weighted sum of four dimension scores. Each dimension score lies between 0 and 1, so HealthScore ranges from 0 to 100:
- Reliability (40%) — correctness of outcomes
- Performance (20%) — execution efficiency
- Stability (20%) — runtime consistency
- Operational (20%) — SLA adherence and trigger reliability
HealthScore =
40 * ReliabilityScore
+ 20 * PerformanceScore
+ 20 * StabilityScore
+ 20 * OperationalScore
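The weighted sum above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `health_score`, the `WEIGHTS` dictionary, and the range check are assumptions made for the example.

```python
# Weights from the model, expressed as points out of 100.
WEIGHTS = {
    "reliability": 40,
    "performance": 20,
    "stability": 20,
    "operational": 20,
}


def health_score(reliability, performance, stability, operational):
    """Combine four dimension scores (each in [0, 1]) into a 0-100 score."""
    scores = {
        "reliability": reliability,
        "performance": performance,
        "stability": stability,
        "operational": operational,
    }
    # Assumption: out-of-range inputs are rejected rather than clamped.
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For example, dimension scores of 0.9, 0.5, 1.0 and 0.8 contribute 36 + 10 + 20 + 16 points, for a HealthScore of about 82.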
Reliability Score (Weight: 40)
What it represents
Reliability measures how consistently the agent produces successful outcomes.
Inputs
- ESR = Expected Success Rate
- ASR = Actual Success Rate
- EFH = Expected Failures per Hour
- AFH = Actual Failures per Hour
- N = Total Executions
Success Score
Measures how well the actual success rate meets the expected success rate.
- If actual success rate meets or exceeds expectation, the score is 1
- Otherwise, the score is the ratio of actual to expected success rate (for example, ASR = 0.80 against ESR = 0.90 gives a score of about 0.89)
SuccessScore = Min(1, ASR / ESR)
Failure Score
Measures whether the actual failure rate stays within the expected limit.
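- If actual failures are within expected limits (AFH <= EFH, including AFH = 0, where the ratio below is undefined), the score is 1
- Otherwise, the score is the ratio of expected to actual failures per hour (for example, failing at twice the expected rate gives a score of 0.5)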
FailureScore = Min(1, EFH / AFH)
Combine
The reliability score is derived by combining success and failure behavior, with higher emphasis on success.
- Success contributes 70%
- Failure contributes 30%
ReliabilityRaw =
0.7 * SuccessScore
+ 0.3 * FailureScore
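A minimal Python sketch of the success, failure, and combined raw reliability calculations. The function names are illustrative, and the guards for ESR = 0 and AFH = 0 are assumptions added to avoid division by zero; the model itself does not state how those cases are handled.

```python
def success_score(asr, esr):
    """Min(1, ASR / ESR)."""
    if esr <= 0:
        # Assumption: with no success expectation, any outcome meets it.
        return 1.0
    return min(1.0, asr / esr)


def failure_score(efh, afh):
    """Min(1, EFH / AFH)."""
    if afh <= 0:
        # Assumption: zero actual failures is always within the limit.
        return 1.0
    return min(1.0, efh / afh)


def reliability_raw(asr, esr, efh, afh):
    """Blend success (70%) and failure (30%) behavior."""
    return 0.7 * success_score(asr, esr) + 0.3 * failure_score(efh, afh)
```

With ASR = 0.80, ESR = 0.90, EFH = 2 and AFH = 4, the success score is about 0.89, the failure score is 0.5, and the raw reliability is about 0.77.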
Confidence Adjustment
To avoid misleading results when execution volume is low, a confidence factor is applied.
- N = total number of executions in the evaluation window
- N_full = minimum number of executions required for full confidence
ConfidenceFactor = Min(1, N / N_full)
Recommended:
N_full = 20
The final reliability score blends the observed reliability with a baseline of 1 (fully healthy), weighted by confidence:
- When execution count is low → score stays closer to 1, regardless of observed outcomes
- When execution count is sufficient → score reflects actual reliability
ReliabilityScore =
ConfidenceFactor * ReliabilityRaw
+ (1 - ConfidenceFactor) * 1
In low-volume scenarios, the system assumes the agent is healthy until sufficient data is available.
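The confidence adjustment can be sketched as below. The function names and the validation of `n_full` are assumptions for illustration; `N_FULL = 20` is the recommended value given above.

```python
N_FULL = 20  # recommended minimum executions for full confidence


def confidence_factor(n, n_full=N_FULL):
    """Min(1, N / N_full)."""
    if n_full <= 0:
        raise ValueError("n_full must be positive")
    return min(1.0, n / n_full)


def reliability_score(reliability_raw, n, n_full=N_FULL):
    """Blend raw reliability with the baseline of 1 according to confidence."""
    c = confidence_factor(n, n_full)
    return c * reliability_raw + (1.0 - c) * 1.0
```

For a raw reliability of 0.5, five executions give a confidence factor of 0.25 and a score of 0.875, while 20 or more executions give the raw value of 0.5 unchanged.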
Performance Score (Weight: 20)
What it represents
Performance measures how efficiently the agent executes compared to expected runtime.
Inputs
- ER = Expected Runtime
- AR = Actual Runtime
- DeviationAllowed = Allowed Deviation, as a fraction of ER (default 20%, i.e. 0.20)
Deviation
Deviation measures how much the actual runtime differs from the expected runtime.
Deviation = (AR - ER) / ER
Score
PerformanceScore evaluates how closely execution time matches expectations.
- If the agent runs within expected time → score is 1
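- If the runtime exceeds expectation → score decreases linearly with the deviation
- The score reaches 0 once the deviation equals or exceeds the allowed deviation (with the default of 20%, a run that takes 20% longer than expected scores 0)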
PerformanceScore =
if AR <= ER then 1
else 1 - Min(1, Deviation / DeviationAllowed)
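A minimal Python sketch of the performance calculation. The function name and the check that ER is positive are assumptions for the example; the deviation is treated as a fraction, so the default allowed deviation is 0.20.

```python
def performance_score(ar, er, deviation_allowed=0.20):
    """Score runtime overrun relative to the expected runtime."""
    if er <= 0:
        # Assumption: an expected runtime must be positive to be meaningful.
        raise ValueError("expected runtime must be positive")
    if ar <= er:
        return 1.0
    deviation = (ar - er) / er
    return 1.0 - min(1.0, deviation / deviation_allowed)
```

With ER = 100 seconds, a 110-second run deviates by 10% and scores 0.5; runs of 120 seconds or longer score 0.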
Stability Score (Weight: 20)
What it represents
Stability measures whether the agent is producing abnormal or long-running executions.
It focuses on identifying signs of instability such as:
- executions taking significantly longer than expected
- potential hangs, retries, or resource contention
Input
- S = Count of long-running instances (runtime > 5x expected runtime)
Threshold
The threshold defines the level at which instability is considered significant.
- Below the threshold → impact on score is gradual
- At or beyond the threshold → maximum penalty is applied
This ensures that occasional delays do not overly affect the agent's health.
S_threshold = 15
Score
The score decreases as the number of long-running instances increases.
- If there are no long-running instances → score is 1
- As long-running instances increase → score gradually decreases
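- The penalty grows logarithmically, so the first few long-running instances cost the most and each additional one costs less (with S_threshold = 15: S = 1 gives 0.75, S = 3 gives 0.5, S = 7 gives 0.25)
- The score reaches 0 at S = S_threshold and stays there for larger counts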
StabilityScore =
1 - Min(1, log2(S + 1) / log2(S_threshold + 1))
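The stability formula translates directly to Python using `math.log2`. The function name and the rejection of negative counts are assumptions for illustration.

```python
import math


def stability_score(s, s_threshold=15):
    """1 - Min(1, log2(S + 1) / log2(S_threshold + 1))."""
    if s < 0:
        raise ValueError("count of long-running instances cannot be negative")
    penalty = math.log2(s + 1) / math.log2(s_threshold + 1)
    return 1.0 - min(1.0, penalty)


# Scores for a few counts with the default threshold of 15.
for count in (0, 1, 3, 7, 15):
    print(count, stability_score(count))
```

With the default threshold, log2(16) = 4, so counts of 0, 1, 3, 7 and 15 give scores of 1.0, 0.75, 0.5, 0.25 and 0.0.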
Operational Score (Weight: 20)
What it represents
Operational score measures how reliably the agent functions in real-world conditions.
It captures:
- Ability to meet defined SLAs (timeliness)
- Reliability of execution triggers
- Overall operational readiness of the agent
Inputs
- B = SLA breaches
- T = Trigger failures
Thresholds
The thresholds define the acceptable limits for operational deviations.
- Below the threshold → score decreases linearly with the count (with a threshold of 5, each breach or failure costs 0.2 of the component score)
- At or beyond the threshold → maximum penalty is applied
B_threshold = 5
T_threshold = 5
Scores
Each component is normalized to a value between 0 and 1:
- No breaches or failures → score is 1
- Increasing breaches or failures → score decreases proportionally
- Penalty is capped at the defined threshold
SLAScore = 1 - Min(1, B / B_threshold)
TriggerScore = 1 - Min(1, T / T_threshold)
Combine
Operational score combines SLA compliance and trigger reliability, with higher weight given to SLA adherence.
OperationalScore =
0.6 * SLAScore
+ 0.4 * TriggerScore
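A minimal Python sketch of the operational calculation. The helper `capped_linear_score` and the function names are illustrative; the thresholds default to the values given above.

```python
def capped_linear_score(count, threshold):
    """1 - Min(1, count / threshold)."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return 1.0 - min(1.0, count / threshold)


def operational_score(b, t, b_threshold=5, t_threshold=5):
    """Blend SLA compliance (60%) and trigger reliability (40%)."""
    sla_score = capped_linear_score(b, b_threshold)
    trigger_score = capped_linear_score(t, t_threshold)
    return 0.6 * sla_score + 0.4 * trigger_score
```

With one SLA breach and two trigger failures, the component scores are 0.8 and 0.6, giving an operational score of about 0.72.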
Inactive Agents
An agent is considered Inactive when it has no executions within the evaluation window.
- N represents the total number of executions of the agent in the evaluation window.
If:
N = 0
Then:
- The agent state is marked as Inactive
- The agent is excluded from system health aggregation
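State assignment, including the Inactive rule, can be sketched as below. The function names and the dictionary shape used for agents (`"name"`, `"n"`) are hypothetical; how system health is aggregated over the remaining agents is not specified here, so the sketch only filters them.

```python
def health_state(score, n):
    """Map a 0-100 HealthScore and execution count N to a state."""
    if n == 0:
        # Inactive takes precedence over any score.
        return "Inactive"
    if score >= 85:
        return "Healthy"
    if score >= 65:
        return "Degraded"
    return "Unhealthy"


def agents_for_aggregation(agents):
    """Keep only agents with at least one execution in the window."""
    return [agent for agent in agents if agent["n"] > 0]


agents = [
    {"name": "agent-a", "n": 42},  # hypothetical agents
    {"name": "agent-b", "n": 0},
]
print([a["name"] for a in agents_for_aggregation(agents)])
```

Checking N before the score means an agent with no executions is never reported as Healthy by default.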