Verified accuracy.
Not vibes.
Tested across 3 runs per configuration with full-text search (FTS) enabled and disabled. The top model hits 100% in both modes.
100% accuracy across 59 KB tests
Top model, FTS on and off
100% catalog accuracy
Across every tested model, 3 runs each
100% incident accuracy
Across every tested model, 3 runs each
Knowledge Base Agent
59 tests · 6 models · 3 runs each
Full-text search ON and OFF. Higher is better.
| Model | FTS ON | FTS OFF |
|---|---|---|
| Claude 4.5 Sonnet | 100.0% | 100.0% |
| Claude 4.6 Sonnet | 96.6% | 96.0% |
| GPT-5.4 | 98.8% | 99.4% |
| GPT-5.2 | 98.9% | 93.2% |
| GPT-4.1 | 97.7% | 83.4% |
| GPT-4o | 85.3% | 30.5% |
Catalog Retrieval
100%
across every tested model, 3 runs each
Incident Retrieval
100%
across every tested model, 3 runs each
Methodology
Built-in eval framework. LLM-judge scored.
Test design
- 59-test KB suite plus catalog and incident retrieval suites
- 3 runs per configuration (variability check)
- Full-text search ON and OFF (configuration grid sketched below)
- Real ServiceNow data, not synthetic
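The design above implies a simple configuration grid. A minimal sketch of that grid follows; the names are illustrative stand-ins of our own, not the product's eval API.

```python
# Illustrative sketch of the test matrix described above: every model runs the
# 59-test KB suite with full-text search ON and OFF, three times each.
from itertools import product

MODELS = [
    "Claude 4.5 Sonnet", "Claude 4.6 Sonnet",
    "GPT-4o", "GPT-4.1", "GPT-5.2", "GPT-5.4",
]
FTS_MODES = [True, False]   # full-text search ON / OFF
RUNS = 3                    # repeated runs as a variability check

configurations = [
    {"model": m, "fts_enabled": fts, "run": r}
    for m, fts, r in product(MODELS, FTS_MODES, range(1, RUNS + 1))
]

# 6 models x 2 FTS modes x 3 runs = 36 eval runs over the 59-test suite
assert len(configurations) == 36
```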
Scoring
- LLM-judge with structured rubric
- Pass/fail per test, averaged across runs (roll-up sketched below)
- Eval Run Console UI ships in the product
- Customers can re-run on their own data
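How those pass/fail verdicts roll up into the accuracy figures above, as a minimal sketch; the types and function names here are assumptions, not the shipped framework's interface.

```python
# Minimal roll-up sketch: each test gets a pass/fail verdict from the LLM judge,
# per-run accuracy is the pass rate, and the reported number is the average
# across the repeated runs of a configuration.
from dataclasses import dataclass

@dataclass
class TestResult:
    test_id: str
    passed: bool          # LLM-judge verdict against the structured rubric

def run_accuracy(results: list[TestResult]) -> float:
    """Pass rate for a single run of the suite."""
    return sum(r.passed for r in results) / len(results)

def configuration_accuracy(runs: list[list[TestResult]]) -> float:
    """Average per-run accuracy across repeated runs (3 per configuration here)."""
    return sum(run_accuracy(run) for run in runs) / len(runs)

# e.g. configuration_accuracy([run_1, run_2, run_3]) returns 1.0 for a model
# that passes all 59 tests in every run.
```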
Models tested
- Claude 4.5 Sonnet, Claude 4.6 Sonnet
- GPT-4o, GPT-4.1, GPT-5.2, GPT-5.4
- Same agent prompts for every provider
- Same retrieval logic for every provider
Reproduce on your data
- Built-in eval framework + Run Console
- LLM-judge scoring out of the box
- Test any model on customer data before going live (go/no-go sketch below)
- Documented in the benchmark setup guide
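A hypothetical pre-go-live gate built on the same roll-up idea: run the suite a few times per candidate model on your own data and keep the models that clear your accuracy bar. `run_suite` below is a stand-in for whatever your eval harness exposes, not a documented product call.

```python
# Hypothetical go/no-go check before deployment. run_suite(model) is assumed to
# return the pass rate (0.0-1.0) of one full suite run against your own data.
ACCURACY_BAR = 0.95
RUNS = 3

def models_clearing_bar(candidates, run_suite):
    """Keep the candidate models whose 3-run average accuracy meets the bar."""
    passing = []
    for model in candidates:
        scores = [run_suite(model) for _ in range(RUNS)]
        if sum(scores) / RUNS >= ACCURACY_BAR:
            passing.append(model)
    return passing
```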
Quantified business impact
Beyond accuracy: outcomes.
40–60%
Faster task completion
~50%
Less time writing scripts
+70%
KB accuracy vs keyword search
Reduced
L1 ticket volume (deflection)
Source: customer deployments and published L2H product documentation. Individual results vary by environment, KB quality, and operational maturity.
Test on your data.
The eval framework that produced these numbers ships in the product. Run it against your own KB before you sign anything.