AI Agent Papers 2026
Benchmarks browser agents on 283 everyday tasks (V1 153 + V2 130) across 163 live production sites, with a Chrome-extension plus CDP layer that blocks only the final write request …
Compares attribution-based explanations with trace-based diagnostics across static and agentic settings to study how explainability methods translate to multi-step agent trajectori…
Investigates whether agents can accurately predict their own success rates in agentic tasks.
Introduces 20 research tasks from real ML papers covering idea generation, experiments, and refinement for benchmarking science agents.
Proposes evaluating agent outputs by decomposing responses into individual claims and checking each against expert knowledge.
Explores using multi-agent debate to fill missing labels in information retrieval benchmarks.
Proposes a specialized verifier that detects and locates errors in agent execution trajectories at runtime to enable precise rollback-and-retry.
Examines whether GPT-4/5 agents can reproduce aggregate human cognitive biases in interactive decision-making scenarios.
Proposes generating families of equivalent CTF challenges through code transformations to test whether agents truly understand exploits or just memorize patterns.
Introduces a negotiation benchmark where frontier LLM agents are evaluated against MBA students to reveal cross-model differences in deception, accuracy, and trustworthiness.
Benchmarks how well conversational agents retain and use personal information over long emotional support conversations.
Introduces a benchmark that replays published human-subject experiments with LLM agents to test how well they simulate real participants.
Proposes an expert-designed multi-turn insurance underwriting benchmark to evaluate agent performance under real-world enterprise conditions with noisy tools and proprietary knowle…
Proposes automated state abstraction from agent execution traces using predicate trees and counterexample refinement for probabilistic runtime verification of agent behavior.
Compares three LLM agent frameworks (Aider, OpenHands, SWE-agent) on vulnerability false positive filtering to study how agent design and backbone model affect triage performance.
Analyzes 8,106 fix-related pull requests from five AI coding agents to catalog the reasons agent-generated contributions are closed without merging.
Proposes a judge agent framework that evaluates query-response pairs jointly across a cohort rather than in isolation, using in-context neighborhoods for cross-instance pattern det…
Evaluates LLM reasoning under ReAct and Plan-and-Execute agentic workflows across 48,000 simulated failure scenarios, producing a taxonomy of 16 common reasoning failures.
Introduces a benchmark for evaluating LLM agent consistency, uncertainty handling, and capability awareness in multi-turn tool-using scenarios with incomplete or ambiguous user req…
Examines code quality, maintainability, and reviewer sentiment toward AI-agent-generated pull requests compared to human-authored contributions.
Analyzes silent (no-comment) AI-generated pull requests to examine their impact on code complexity, quality issues, and security vulnerabilities.
Analyzes over 1,300 agent benchmarks against public-sector requirements including process-based evaluation, realism, and domain-specific metrics.
Applies Shapley values to attribute emergent extreme events in LLM multi-agent systems to specific agent actions across time, agent, and behavior dimensions.
Analyzes AI agent contributions to documentation pull requests and examines how human developers review and intervene in agent-authored documentation changes.
Examines how core and peripheral developers differ in their use, review, modification, and verification of coding-agent-generated pull requests.
Introduces an end-to-end benchmark with 700+ real-world tasks across build, monitoring, issue resolving, and test generation for evaluating AI agents in full software DevOps workfl…
Proposes an architecture-informed evaluation approach that links agent components like planners, memory, and tool routers to observable behaviors and diagnostic metrics.
Investigates whether smaller-scale language models can reduce energy consumption in multi-agent agentic AI systems without compromising task quality.
Analyzes 19,450 inline review comments on agent-authored pull requests and derives a taxonomy of 12 review themes to understand how reviewers respond to AI-generated code.
Analyzes 40,214 developer and agentic pull requests to compare merge outcomes and identify how submitter attributes and review features differ between human and AI agent contributi…
Presents structural testing methods for LLM-based agents using OpenTelemetry traces, mocking for reproducible behavior, and automated assertions for component-level verification.
Analyzes how five AI coding agents interact with CI/CD configurations across 8,031 pull requests, examining modification frequency, merge rates, and build success.
Identifies behavioral signatures of five AI coding agents from 33,580 pull requests using commit, PR structure, and code features for agent attribution.
Assesses existing interpretability methods for agentic systems and identifies gaps in explaining temporal dynamics, compounding decisions, and context-dependent behaviors.
Investigates maintainability and security-related build code smells in AI-agent-generated pull requests across 364 identified quality issues.
Examines long-term survival of AI-agent-generated code through survival analysis of 200,000+ code units across 201 open-source projects.
Develops an oracle counterfactual framework for multi-turn agentic tasks that measures the criticality of individual capabilities like planning and state tracking.
Presents a 12-category error taxonomy and diagnostic framework for evaluating tool-use reliability across open-weight and proprietary LLMs in multi-agent systems on edge hardware.
Introduces the problem of agentic confidence calibration and proposes Holistic Trajectory Calibration, extracting process-level features across an agent's entire trajectory to diag…
Examines methodological challenges in evaluating AI agents across sensitive information leakage, fraud, and cybersecurity threats through a multi-national collaborative benchmarkin…
Introduces a multi-agent framework that generates verified, domain-specific, multimodal, multi-hop question-answer datasets for benchmarking retrieval-augmented generation systems.
Analyzes 1,187 bug reports from LLM agent software across seven frameworks to categorize bug types, root causes, and effects, and tests automated bug labeling with a ReAct agent.
Proposes a hierarchical framework for general agentic attribution that identifies internal factors driving agent actions through temporal likelihood dynamics and perturbation-based…
Analyzes token consumption patterns across software development lifecycle stages in a multi-agent system to identify where tokens are consumed and which stages drive cost.
Introduces a benchmark of 480 long-horizon, cross-application productivity tasks created by investment banking analysts, consultants, and lawyers for evaluating AI agent capabiliti…
Introduces a benchmark of 600+ collaborative coding tasks to evaluate whether coding agents can coordinate as effective teammates under various coordination structures.
Investigates how RAG systems can game nugget-based LLM judge evaluations through metric overfitting, demonstrating near-perfect scores when evaluation elements are leaked or predic…
Introduces the Determinism-Faithfulness Assurance Harness for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using LLM agents across 74 configuratio…
Presents a process-aware and auditable multi-agent evaluation framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under hum…
Introduces a curated benchmark of 89 hard tasks in computer terminal environments with unique environments, human-written solutions, and comprehensive tests for evaluating frontier…
Introduces a benchmark and evaluation framework for agentic task-oriented dialogue systems covering multi-goal coordination, dependency management, memory, adaptability, and proact…
Decouples task execution from environment understanding with a deterministic QA paradigm to study whether task success is actually a good proxy for how well agents understand their…
Evaluates frontier models on 150 workplace tasks to identify an empirical hierarchy of agentic capabilities spanning tool use, planning, adaptability, groundedness, and common-sens…
Introduces a multimodal RAG benchmark with 26K pages and 3,099 queries in 6 languages to evaluate retrieval across non-textual elements and open-ended queries.
Evaluates LLM agent social behaviors in mixed-motive games using process-aware analysis of both reasoning and communication rather than outcome-only metrics.
Benchmarks whether agents can proactively use long-term memory to execute tool-based actions, rather than just passively retrieving facts on demand.
Proposes a formal framework for actively evaluating general-purpose agents across multiple tasks, selecting which tasks and agents to sample next to minimize ranking error over tim…
Introduces an Unreal Engine 5 simulation platform for benchmarking LLM-driven agents on embodied tasks including navigation, object manipulation, and multi-agent coordination in pr…
Presents an open-source platform combining visual workflow orchestration with LLM-as-a-Judge evaluation for prototyping and validating RAG-based agent pipelines without infrastruct…
Benchmarks model robustness across 11 RAG, reasoning, alignment, and tool-use tasks against diverse contextual noise types including random documents, irrelevant histories, and har…
Introduces a project-oriented memory benchmark with 2,000+ cross-session dialogues across eleven scenarios to evaluate how well agents track evolving goals and dynamic context depe…
Introduces the first benchmark for interactive deep research combining a modular multi-agent framework with on-demand user interaction, a scalable user simulator, and interaction-a…
Introduces an open-world tool-using environment with 5,571 tools across 204 apps, a task engine for multi-tool workflows with wild constraints, and a state controller that injects …
Introduces a tower defense environment for evaluating LLM agent planning and decision-making with low computational demands, multimodal observation, and hallucination assessment su…
Introduces a user-authored benchmark for memory-aware LLM agents in Minecraft with parametric task templates, machine-checkable validators, and bounded-knowledge evaluation under a…
Proposes a framework for detecting tool-calling hallucinations in LLM agents by analyzing internal representations during a single forward pass, targeting incorrect tool selection,…
Surveys the evolution from LLM-as-a-Judge to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory f…
Introduces the first benchmark for evaluating tool-calling and agentic capabilities of LLMs in Arabic, measuring functional accuracy and robustness in Arabic agentic workflows.
Examines how Big Five personality steering affects cooperative behavior in LLM agents using repeated Prisoner's Dilemma games across multiple model generations.
Analyzes message-code inconsistency in pull requests authored by AI coding agents across five agent systems to study trustworthiness of agent-generated PR descriptions.
Proposes a multi-agent framework for autonomous exploratory GUI testing that decouples navigation from verification via planning-execution and hierarchical reflection modules.
Introduces the concept of agent drift and a composite metric framework for quantifying semantic, coordination, and behavioral degradation in multi-agent LLM systems over extended i…
Introduces a unified benchmark for evaluating Multi-Agent Debate methods across multiple domains, modalities, and efficiency metrics including token consumption and inference time.
Documents six recurring failure modes across four end-to-end attempts at autonomous ML research using a pipeline of LLM agents mapped to stages of the scientific workflow.
Introduces a data analysis benchmark for evaluating LLM agents under documentation-intensive analytical workflows requiring long document navigation and multi-step computation.
Proposes a closed-loop multi-agent testing framework with generation, execution analysis, and review optimization agents for autonomous software test refinement.
Proposes a causal framework using structural causal models and counterfactual interventions to audit whether reasoning traces in LLM agents are faithful generative drivers or post-…
Introduces a benchmark for evaluating agent reliability across consistency, robustness to perturbations, and fault tolerance under chaos-engineering-style tool failure injection.
Introduces an evaluation suite that standardizes MAS configuration and execution, exports framework-agnostic execution traces, and enables systematic reliability assessment across …
Introduces a benchmark for evaluating LLM agent function-calling under realistic API complexity including noisy outputs, detailed specifications, and runtime challenges.
LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the f…
Eval & Observability 2503.16416 notes → 💬 Tier 1. A complete map of the agent evaluation landscape. Trajectory evaluation, MCP Atlas, Too…
Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attem…
Eval & Observability 2603.29231 notes → 💬 Tier 1. A long-horizon framework that treats reliability as a science. Kyle's control-t…
Anthropic blog. Provides intuition for why agent evaluation is hard. Getting context from this post before reading the Tier 2 evaluation papers makes them much easier to follow.
Eval & Observability blog/anthropic/demys notes → 💬 Tier 0: not a paper. For building intuition about agent evaluation. Tier 2 evaluation papers (2503.…
Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fai…
Eval & Observability 2606.01365 notes → 💬 Tier 2. Early diagnosis of wasted computation in multi-agent systems: failure-aware observability. Execution recovery…