Benchmarks

Verified accuracy.
Not vibes.

Tested across 3 runs per configuration with full-text search (FTS) enabled and disabled. The top model hits 100% in both modes.

100% accuracy across 59 KB tests

Top model, FTS on and off

100% catalog accuracy

Across every tested model, 3 runs each

100% incident accuracy

Across every tested model, 3 runs each

Knowledge Base Agent

59 tests · 6 models · 3 runs each

Full-text search ON and OFF. Higher is better.

Model                 FTS On     FTS Off
Claude 4.5 Sonnet     100.0%     100.0%
Claude 4.6 Sonnet      96.6%      96.0%
GPT-5.4                98.8%      99.4%
GPT-5.2                98.9%      93.2%
GPT-4.1                97.7%      83.4%
GPT-4o                 85.3%      30.5%

Even with FTS off, the top model holds 100%.
Older models degrade significantly without FTS.

Catalog Retrieval

100%

Across every tested model, 3 runs each

Incident Retrieval

100%

Across every tested model, 3 runs each

Methodology

Built-in eval framework. LLM-judge scored.

Test design

  • 59-test KB suite plus catalog and incident retrieval suites
  • 3 runs per configuration as a variability check (see the sketch after this list)
  • Full-text search ON and OFF
  • Real ServiceNow data, not synthetic
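
A minimal sketch of how each configuration's score comes together: per-run accuracy over the 59 tests, then the average and spread across the 3 runs. The pass/fail values below are illustrative placeholders, not real benchmark output (the GPT-4o row is simply shaped to match the 30.5% FTS-off figure in the chart above).

    # Minimal sketch of the run averaging behind the headline numbers.
    # All per-run results here are illustrative placeholders, not real data.
    from statistics import mean

    # One configuration = (model, fts_enabled). Each run is a list of
    # per-test pass/fail results across the 59-test KB suite.
    runs = {
        ("claude-4.5-sonnet", True): [[True] * 59, [True] * 59, [True] * 59],
        ("gpt-4o", False): [[True] * 18 + [False] * 41] * 3,  # 18/59 ≈ 30.5%
    }

    for (model, fts), results in runs.items():
        per_run = [sum(r) / len(r) for r in results]  # accuracy per run
        spread = max(per_run) - min(per_run)          # variability check
        label = "on" if fts else "off"
        print(f"{model} (FTS {label}): {mean(per_run):.1%} avg, {spread:.1%} spread")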

Scoring

  • LLM-judge with a structured rubric (sketched below)
  • Pass / fail per test, averaged across runs
  • Eval Run Console UI ships in the product
  • Customers can re-run on their own data
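
In outline, the judge sees the question, the expected answer, and the agent's answer, and returns a structured verdict that maps to a single pass/fail. The rubric fields and the call_judge_model stub below are hypothetical stand-ins, not the product's actual implementation:

    # Sketch of rubric-based LLM-judge scoring. `call_judge_model` is a
    # hypothetical stand-in for a chat-completion client, not a real API.
    import json

    RUBRIC = (
        "Score the agent's answer against the expected answer. Return JSON: "
        '{"grounded": bool, "complete": bool, "no_hallucination": bool}.'
    )

    def call_judge_model(prompt: str) -> str:
        raise NotImplementedError("wire up an LLM provider here")

    def judge(question: str, expected: str, actual: str) -> bool:
        prompt = (f"{RUBRIC}\n\nQuestion: {question}\n"
                  f"Expected: {expected}\nAgent answer: {actual}")
        verdict = json.loads(call_judge_model(prompt))
        # Strict pass/fail: the test passes only if every criterion holds.
        return all(verdict[k] for k in ("grounded", "complete", "no_hallucination"))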

Models tested

  • Claude 4.5 Sonnet, Claude 4.6 Sonnet
  • GPT-4o, GPT-4.1, GPT-5.2, GPT-5.4
  • Same agent prompts for every provider
  • Same retrieval logic for every provider (see the sketch below)
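
Holding everything but the model constant is what makes the chart above a fair comparison. A sketch of that control, with hypothetical names; only the model callable changes between providers:

    # Sketch of the controlled comparison: one prompt, one retrieval path,
    # and interchangeable model clients. All names here are hypothetical.
    from typing import Callable

    AGENT_PROMPT = "You are a KB agent. Answer only from the retrieved articles."

    def retrieve(query: str, fts_enabled: bool) -> list[str]:
        # Identical retrieval logic for every provider; FTS is the only toggle.
        return ["<retrieved article text>"]  # placeholder

    def run_test(model: Callable[[str], str], query: str, fts_enabled: bool) -> str:
        context = "\n".join(retrieve(query, fts_enabled))
        return model(f"{AGENT_PROMPT}\n\nContext:\n{context}\n\nQuestion: {query}")

    # Usage: the same run_test is called with each provider's client, e.g.
    # run_test(claude_client, "How do I reset VPN access?", fts_enabled=True)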

Reproduce on your data

  • Built-in eval framework + Run Console
  • LLM-judge scoring out of the box
  • Test any model on customer data before going live
  • Documented in the benchmark setup guide

Quantified business impact

Beyond accuracy: outcomes.

40–60%

Faster task completion

~50%

Less time writing scripts

+70%

KB accuracy vs keyword search

Reduced

L1 ticket volume (deflection)

Source: customer deployments and published L2H product documentation. Individual results vary by environment, KB quality, and operational maturity.

Test on your data.

The eval framework that produced these numbers ships in the product. Run it against your own KB before you sign anything.