โ† Back

ClawBench: Evaluating Browser Agents on Live Production Websites with Submission-Interception

Eval & Observability arxiv arXiv:2604.08523 PDF โ†—
browserlivesubmissionwebsitesproductioninterceptioninterceptionclawbenchsites
Benchmarks browser agents on 283 everyday tasks (V1 153 + V2 130) across 163 live production sites, with a Chrome-extension plus CDP layer that blocks only the final write request so agents can run end-to-end on real sites without real-world side effects. Two-stage scoring (interception + LLM judge); leaderboard at https://claw-bench.com.
5~10๋ถ„. ์ œ๋ชฉโ†’์ดˆ๋กโ†’์ธํŠธ๋กœโ†’์„น์…˜ํ—ค๋”โ†’๊ทธ๋ฆผโ†’๊ฒฐ๋ก ๋งŒ.
ํŒ๋‹จ: ์–ด๋–ค ๋ฌธ์ œ๋ฅผ ํ’€๊ณ  / ํ•ต์‹ฌ ์•„์ด๋””์–ด / ๋‚ด ์ž‘์—…๊ณผ ๊ด€๋ จ ์žˆ๋‚˜?
~1์‹œ๊ฐ„. ๊ทธ๋ฆผยทํ‘œ๋ฅผ ๊ผผ๊ผผํžˆ. ์ฆ๋ช…ยท์ˆ˜์‹ ๋””ํ…Œ์ผ์€ ๊ฑด๋„ˆ๋œ€.
์‚ฐ์ถœ๋ฌผ: "์ด๋“ค์ด ๋ญ˜ ํ–ˆ๊ณ  ์™œ ๊ทธ๊ฒŒ ํ†ตํ•˜๋Š”๊ฐ€" ํ•œ ๋ฌธ๋‹จ.
์žฌํ˜„ํ•˜๋“ฏ ์ฝ๊ธฐ. ๊ฐ€์ •์„ ์˜์‹ฌ. ์ง์ ‘ ์ธ์šฉ/๋ฐ˜๋ฐ•ํ•  ๋…ผ๋ฌธ๋งŒ.
๋ Œ์ฆˆ: "๋‚ด ํ”Œ๋ฆฟ์—์„œ ์ธก์ •ํ•˜๋ฉด ์ €์ž๊ฐ€ ๋ชป ํ•œ ๋ฌด์—‡์„ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ๋‚˜?"
View in Knowledge Graph โ†’