Policy as Code Benchmark

2026-04-15 · 123 tasks · 3 tools · Bare-minimum prompts, containerized isolation

This benchmark evaluates AI-powered tools on Kyverno Policy as Code tasks — converting legacy ClusterPolicies to the Kyverno 1.16+ format and generating new policies from natural-language descriptions. Each tool receives identical inputs and bare-minimum prompts inside containerized isolation, and its output is validated through schema checks, CEL expression analysis, and functional tests using the Kyverno CLI.
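The layered validation described above can be pictured as a short-circuiting pipeline. The following is a minimal sketch, not the benchmark's actual harness: `validate_task` and the three stand-in checks are hypothetical names introduced for illustration, and a real functional layer would run the Kyverno CLI against test resources rather than a stub.

```python
from typing import Callable, List, Optional, Tuple

# A layer is a (name, check) pair; check returns True when the output passes.
Layer = Tuple[str, Callable[[str], bool]]


def validate_task(policy_yaml: str, layers: List[Layer]) -> Optional[str]:
    """Run layers in order; return the first failing layer's name, or None."""
    for name, check in layers:
        if not check(policy_yaml):
            return name
    return None


# Hypothetical stand-in checks, for illustration only.
def schema_ok(text: str) -> bool:
    return "apiVersion:" in text and "kind:" in text


def cel_ok(text: str) -> bool:
    return "expression:" in text


def functional_ok(text: str) -> bool:
    # Assumption: a real harness would call the Kyverno CLI here.
    return True


LAYERS: List[Layer] = [
    ("schema", schema_ok),
    ("cel", cel_ok),
    ("functional", functional_ok),
]
```

A task counts as passed only when `validate_task` returns `None`; returning the failing layer's name makes it possible to report where an output was rejected.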

Tools are ranked by pass rate only — the percentage of tasks that pass all validation layers. Speed and cost are reported as supplementary metrics. No composite scores or arbitrary weights.
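As a rough illustration of ranking by pass rate alone, the sketch below sorts tools by the percentage of tasks that passed every layer. The function name and the input shape (one boolean per task) are assumptions made for this example, not the benchmark's real data model.

```python
from typing import Dict, List, Tuple


def rank_by_pass_rate(results: Dict[str, List[bool]]) -> List[Tuple[str, float]]:
    """Return (tool, pass_rate_percent) pairs, best first.

    results maps a tool name to one boolean per task: True only if the
    task passed every validation layer.
    """
    rates = {
        tool: 100.0 * sum(passed) / len(passed)
        for tool, passed in results.items()
        if passed  # skip tools with no recorded tasks
    }
    # Only pass rate drives the order; time and cost play no part.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

For instance, `rank_by_pass_rate({"a": [True, False], "b": [True, True]})` places the hypothetical tool `b` ahead of `a`.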

Methodology: Pass rates shown are the mean across 3 independent runs per tool on the same dataset. LLM-backed benchmarks have inherent sampling variance, so single-run scores are noisy. Reporting the mean (and the per-policy pass rate) gives a more representative measure of each tool’s accuracy than a one-shot result.
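The averaging can be sketched as follows. This is an illustrative reading of the methodology, not the benchmark's code: the `Run` shape and both function names are assumptions, and the example data is made up.

```python
from statistics import mean
from typing import Dict, List

# One run maps each policy name to True if it passed all validation layers.
Run = Dict[str, bool]


def mean_pass_rate(runs: List[Run]) -> float:
    """Average the per-run pass rates, as a percentage."""
    return mean(100.0 * sum(run.values()) / len(run) for run in runs)


def per_policy_pass_counts(runs: List[Run]) -> Dict[str, int]:
    """Count, for each policy, how many runs it passed."""
    counts: Dict[str, int] = {}
    for run in runs:
        for policy, passed in run.items():
            counts[policy] = counts.get(policy, 0) + int(passed)
    return counts
```

With three made-up runs whose individual pass rates are 50%, 100%, and 0%, `mean_pass_rate` reports 50.0, while `per_policy_pass_counts` shows which policies were flaky across runs.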


Leaderboard — ranked by pass rate

Rank  Tool    Pass Rate  Schema + CEL  Functional  Avg Time  Avg Cost
1     nctl    98%        41 / 41       41 / 41     120.4s    $0.0073
2     cursor  65%        41 / 41       22 / 37     76.2s     $0.0078
3     claude  32%        10 / 41       10 / 41     100.1s    $0.0072

[Chart: Pass Rate Comparison]

[Chart: Accuracy vs Cost]

Per-Task Results

Policy                         Kind              claude      cursor      nctl
cp_add_default_labels          MutatingPolicy    FAIL (1/3)  PASS        PASS
cp_add_default_resources       MutatingPolicy    FAIL        FAIL        PASS
cp_add_ndots                   MutatingPolicy    PASS (2/3)  PASS        PASS
cp_add_ns_quota                GeneratingPolicy  FAIL        FAIL        PASS
cp_add_safe_to_evict           MutatingPolicy    FAIL (1/3)  FAIL        PASS
cp_add_securitycontext         MutatingPolicy    PASS        PASS        PASS
cp_add_tolerations             MutatingPolicy    PASS        PASS (2/3)  PASS
cp_always_pull_images          MutatingPolicy    FAIL (1/3)  FAIL        PASS
cp_block_stale_images          ValidatingPolicy  FAIL        FAIL        PASS
cp_check_nvidia_gpu            ValidatingPolicy  PASS (2/3)  FAIL (1/3)  PASS
cp_create_default_pdb          GeneratingPolicy  FAIL        FAIL        PASS
cp_disallow_cri_sock_mount     ValidatingPolicy  FAIL (1/3)  PASS        PASS
cp_disallow_default_namespace  ValidatingPolicy  FAIL        PASS        PASS
cp_disallow_host_namespaces    ValidatingPolicy  FAIL        PASS        PASS
cp_disallow_latest_tag         ValidatingPolicy  FAIL        PASS        PASS
cp_disallow_privileged         ValidatingPolicy  FAIL (1/3)  PASS        PASS
cp_enforce_resources_as_ratio  ValidatingPolicy  FAIL (1/3)  FAIL (1/3)  PASS
cp_inject_sidecar              MutatingPolicy    FAIL        FAIL        PASS (2/3)
cp_kasten_generate_backup      GeneratingPolicy  FAIL        FAIL        PASS
cp_kasten_generate_by_label    GeneratingPolicy  FAIL        FAIL        PASS (2/3)
cp_limit_configmap_for_sa      ValidatingPolicy  FAIL (1/3)  PASS (2/3)  PASS
cp_pdb_minavailable            ValidatingPolicy  FAIL        FAIL        PASS
cp_require_drop_all            ValidatingPolicy  FAIL (1/3)  PASS        PASS
cp_require_labels              ValidatingPolicy  FAIL        PASS        PASS
cp_require_pdb                 ValidatingPolicy  FAIL        FAIL (1/3)  PASS
cp_require_probes              ValidatingPolicy  FAIL (1/3)  PASS        PASS
cp_require_resource_limits     ValidatingPolicy  FAIL        PASS        PASS
cp_require_ro_rootfs           ValidatingPolicy  FAIL        PASS        PASS
cp_restrict_capabilities       ValidatingPolicy  FAIL        PASS        PASS
cp_restrict_image_registries   ValidatingPolicy  FAIL (1/3)  PASS        PASS
cp_restrict_ingress_host       ValidatingPolicy  FAIL        FAIL        PASS
cp_restrict_nodeport           ValidatingPolicy  FAIL        PASS        PASS
gen_add_default_labels         MutatingPolicy    FAIL (1/3)  PASS        PASS
gen_create_networkpolicy       GeneratingPolicy  FAIL        PASS        PASS
gen_disallow_capabilities      ValidatingPolicy  PASS        FAIL (1/3)  PASS
gen_disallow_host_namespaces   ValidatingPolicy  FAIL (1/3)  PASS        PASS
gen_disallow_host_path         ValidatingPolicy  PASS (2/3)  PASS        PASS
gen_disallow_privileged        ValidatingPolicy  PASS        PASS        PASS
gen_require_labels             ValidatingPolicy  PASS        PASS        PASS
gen_require_resource_limits    ValidatingPolicy  PASS        PASS        PASS
gen_restrict_registries        ValidatingPolicy  PASS        PASS        PASS
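
For readers who want to tally the rows above programmatically, here is a minimal parsing sketch. It rests on two assumptions that the page does not state outright: that the result columns appear in the order claude, cursor, nctl, and that an annotation such as (2/3) means the policy passed 2 of the 3 runs. `parse_row`, `Cell`, and `TOOLS` are hypothetical names introduced for this example.

```python
import re
from typing import Dict, NamedTuple, Optional

# Assumption: result columns appear in this order.
TOOLS = ("claude", "cursor", "nctl")

# A cell is PASS or FAIL, optionally followed by "(k/n)".
CELL = re.compile(r"(PASS|FAIL)(?:\s*\((\d+)/(\d+)\))?")


class Cell(NamedTuple):
    status: str
    passed_runs: Optional[int]  # None when the row carries no annotation
    total_runs: Optional[int]


def parse_row(line: str) -> Dict[str, object]:
    """Split one table row into policy name, kind, and per-tool cells."""
    policy, kind, rest = line.split(None, 2)
    cells = [
        Cell(
            m.group(1),
            int(m.group(2)) if m.group(2) else None,
            int(m.group(3)) if m.group(3) else None,
        )
        for m in CELL.finditer(rest)
    ]
    if len(cells) != len(TOOLS):
        raise ValueError(f"expected {len(TOOLS)} result cells: {line!r}")
    return {"policy": policy, "kind": kind, "results": dict(zip(TOOLS, cells))}
```

Calling `parse_row` on the `cp_check_nvidia_gpu` row yields a PASS cell annotated 2 of 3 for claude, a FAIL cell annotated 1 of 3 for cursor, and an unannotated PASS for nctl.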