Policy as Code Benchmark

2026-05-20 · 150 tasks · 3 tools · Bare-minimum prompts, containerized isolation

This benchmark evaluates AI-powered tools on Kyverno Policy as Code tasks — converting legacy ClusterPolicies to the Kyverno 1.16+ format and generating new policies from natural-language descriptions. Each tool receives identical inputs and bare-minimum prompts inside containerized isolation, and its output is validated through schema checks, CEL expression analysis, and functional tests using the Kyverno CLI.

Tools are ranked by pass rate only — the percentage of tasks that pass all validation layers. Speed and cost are reported as supplementary metrics. No composite scores or arbitrary weights.
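
As a rough illustration of the ranking rule, the sketch below computes each tool's pass rate and sorts by it, with no weights applied. The PASS counts are tallied from the Per-Task Results table on this page (50 tasks per tool). The function and variable names are illustrative assumptions, not part of the benchmark's code.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of tasks that passed every validation layer."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def rank_by_pass_rate(results: dict[str, tuple[int, int]]) -> list[tuple[str, float]]:
    """Sort tools by pass rate, highest first; speed and cost play no role."""
    rates = {tool: pass_rate(p, t) for tool, (p, t) in results.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# (passed, total) per tool, counted from the per-task PASS/FAIL table.
results = {"claude": (31, 50), "cursor": (29, 50), "nctl": (49, 50)}

for rank, (tool, rate) in enumerate(rank_by_pass_rate(results), start=1):
    print(f"{rank} {tool} {rate:.0%}")
```

Run as written, this reproduces the leaderboard order: nctl, then claude, then cursor.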


Leaderboard — ranked by pass rate

Rank  Tool    Pass Rate  Schema + CEL  Functional  Avg Time  Avg Cost
1     nctl    98%        50 / 50       49 / 50     72.2s     $0.0089
2     claude  62%        36 / 50       31 / 50     235.8s    $0.0075
3     cursor  58%        40 / 50       31 / 50     88.8s     $0.0097

Kyverno CLI Test Generation

Composite pass = schema valid, kyverno test exits 0, and the suite contains both pass and fail cases. Coverage score = generated tuples / oracle tuples, capped at 1.0; it is reported for reference and does not gate the composite pass.
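
The two definitions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's own scorer. The SuiteResult record and its field names are hypothetical. "Not gated" is read here as meaning that coverage is reported but never changes the composite pass.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Hypothetical record for one generated test suite (field names are assumptions)."""
    schema_valid: bool
    exit_code: int          # exit status of the test run; 0 means success
    has_pass_case: bool     # suite contains at least one expected-pass case
    has_fail_case: bool     # suite contains at least one expected-fail case
    generated_tuples: int   # tuples covered by the generated suite
    oracle_tuples: int      # tuples in the reference (oracle) suite


def composite_pass(r: SuiteResult) -> bool:
    """All three conditions must hold; coverage is deliberately not consulted."""
    return (
        r.schema_valid
        and r.exit_code == 0
        and r.has_pass_case
        and r.has_fail_case
    )


def coverage_score(r: SuiteResult) -> float:
    """Generated / oracle tuples, capped at 1.0."""
    if r.oracle_tuples <= 0:
        raise ValueError("oracle_tuples must be positive")
    return min(r.generated_tuples / r.oracle_tuples, 1.0)


# A suite that over-generates still scores at most 1.0 on coverage.
full = SuiteResult(True, 0, True, True, generated_tuples=9, oracle_tuples=8)
# A suite with low coverage can still pass, because coverage is not a gate.
sparse = SuiteResult(True, 0, True, True, generated_tuples=2, oracle_tuples=8)
# A suite with only passing cases fails the composite check.
one_sided = SuiteResult(True, 0, True, False, generated_tuples=8, oracle_tuples=8)
```

Keeping coverage out of composite_pass means a tool is not penalized twice for a thin but valid suite; the coverage column shows that separately.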

Tool    Composite Pass  Avg Coverage  Has Pass+Fail  Avg Time  Avg Cost
nctl    6 / 6           89%           4 / 6          66.4s     $0.0063
claude  6 / 6           90%           4 / 6          166.5s    $0.0057
cursor  4 / 6           94%           4 / 6          116.3s    $0.0058

Chainsaw Test Generation

Composite pass = schema valid, chainsaw test exits 0, and the suite contains both pass and fail scenarios. Coverage score = generated scenarios / oracle scenarios, capped at 1.0; it is reported for reference and does not gate the composite pass.

Tool    Composite Pass  Avg Coverage  Has Pass+Fail  Avg Time  Avg Cost
nctl    1 / 1           0%            1 / 1          186.0s    $0.0697
cursor  1 / 1           0%            1 / 1          274.9s    $0.0754
claude  0 / 1           0%            0 / 1          600.1s    -

[Chart: Pass Rate Comparison]

[Chart: Accuracy vs Cost]

Per-Task Results

Policy Kind claude cursor nctl
ch_kyverno_helm_install None FAIL PASS PASS
cp_add_default_labels MutatingPolicy PASS PASS PASS
cp_add_default_resources MutatingPolicy FAIL FAIL PASS
cp_add_ndots MutatingPolicy FAIL FAIL PASS
cp_add_ns_quota GeneratingPolicy FAIL FAIL FAIL
cp_add_safe_to_evict MutatingPolicy PASS PASS PASS
cp_add_securitycontext MutatingPolicy PASS PASS PASS
cp_add_tolerations MutatingPolicy PASS FAIL PASS
cp_always_pull_images MutatingPolicy FAIL FAIL PASS
cp_block_stale_images ValidatingPolicy FAIL FAIL PASS
cp_check_nvidia_gpu ValidatingPolicy PASS FAIL PASS
cp_create_default_pdb GeneratingPolicy FAIL FAIL PASS
cp_disallow_cri_sock_mount ValidatingPolicy FAIL FAIL PASS
cp_disallow_default_namespace ValidatingPolicy PASS FAIL PASS
cp_disallow_host_namespaces ValidatingPolicy PASS PASS PASS
cp_disallow_latest_tag ValidatingPolicy PASS PASS PASS
cp_disallow_privileged ValidatingPolicy PASS PASS PASS
cp_enforce_resources_as_ratio ValidatingPolicy PASS PASS PASS
cp_inject_sidecar MutatingPolicy FAIL FAIL PASS
cp_kasten_generate_backup GeneratingPolicy FAIL FAIL PASS
cp_kasten_generate_by_label GeneratingPolicy FAIL FAIL PASS
cp_limit_configmap_for_sa ValidatingPolicy PASS FAIL PASS
cp_pdb_minavailable ValidatingPolicy FAIL PASS PASS
cp_require_drop_all ValidatingPolicy FAIL PASS PASS
cp_require_labels ValidatingPolicy PASS PASS PASS
cp_require_pdb ValidatingPolicy FAIL FAIL PASS
cp_require_probes ValidatingPolicy FAIL PASS PASS
cp_require_resource_limits ValidatingPolicy PASS PASS PASS
cp_require_ro_rootfs ValidatingPolicy PASS PASS PASS
cp_restrict_capabilities ValidatingPolicy PASS PASS PASS
cp_restrict_image_registries ValidatingPolicy PASS PASS PASS
cp_restrict_ingress_host ValidatingPolicy FAIL FAIL PASS
cp_restrict_nodeport ValidatingPolicy PASS PASS PASS
df_check_apt_force_yes ValidatingPolicy FAIL PASS PASS
gen_add_default_labels MutatingPolicy PASS PASS PASS
gen_create_networkpolicy GeneratingPolicy FAIL PASS PASS
gen_disallow_capabilities ValidatingPolicy PASS FAIL PASS
gen_disallow_host_namespaces ValidatingPolicy PASS FAIL PASS
gen_disallow_host_path ValidatingPolicy PASS PASS PASS
gen_disallow_privileged ValidatingPolicy PASS PASS PASS
gen_require_labels ValidatingPolicy PASS PASS PASS
gen_require_resource_limits ValidatingPolicy FAIL PASS PASS
gen_restrict_registries ValidatingPolicy PASS PASS PASS
tf_s3_no_wildcard_principal ValidatingPolicy PASS FAIL PASS
tg_cp_disallow_default_namespace None PASS PASS PASS
tg_cp_inject_sidecar None PASS FAIL PASS
tg_cp_kasten_generate_backup None PASS FAIL PASS
tg_cp_require_drop_all None PASS PASS PASS
tg_cp_require_labels None PASS PASS PASS
tg_vpol_block_ephemeral_containers None PASS PASS PASS