AI Agent Papers

AutoNumerics: An Autonomous, PDE-Agnostic Multi-Agent Pipeline for Scientific Computing

A multi-agent pipeline that reads a PDE problem description in plain text and writes, debugs, and validates a classical numerical solver end-to-end. Generates spectral and finite-d…

Multi-Agent 2602.17607 notes →

Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation

Evaluates recommender systems by simulating interactions between context-aware agents and the recommender system, going beyond offline A/B testing.

Multi-Agent 2604.09549 notes →

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

Introduces long-running multi-agent systems that self-evolve via shared persistent memory, asynchronous execution, and heartbeat-based interventions; 3–10× higher improvement rates…

Multi-Agent 2604.01658 notes →

DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching

Investigates dynamically rewiring agent-to-agent connections at each reasoning round via semantic matching instead of fixed communication topologies.

Multi-Agent 2602.06039 notes →

RuleSmith: Multi-Agent LLMs for Automated Game Balancing

Explores automated game balancing by combining multi-agent LLM self-play with Bayesian optimization on a civ-style game.

Multi-Agent 2602.06232 notes →

CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

Examines how conformal prediction can filter noisy inter-agent messages to improve multi-robot coordination.

Multi-Agent 2602.06038 notes →
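
As a rough illustration of the idea in the CommCP entry above, here is a minimal split conformal prediction sketch. It is not taken from the paper: the function names, the nonconformity score (one minus the probability assigned to a label), and the rule of forwarding a message only when the prediction set is a single label are all assumptions made for this example.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Conformal quantile of calibration nonconformity scores.

    cal_scores: scores (here 1 - p(true label)) on held-out calibration data.
    alpha: target miscoverage rate.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the finite-sample quantile
    if k > n:
        # Too few calibration points for this alpha: every label is kept.
        return float("inf")
    return sorted(cal_scores)[k - 1]

def prediction_set(label_probs, qhat):
    """Labels whose nonconformity score 1 - p does not exceed the threshold."""
    return {label for label, p in label_probs.items() if 1.0 - p <= qhat}

def should_forward(label_probs, qhat):
    """Hypothetical filter: forward a message only if the set is one label."""
    return len(prediction_set(label_probs, qhat)) == 1
```

Under this assumed rule, an ambiguous message (several labels in the set) or an unsupported one (empty set) would be held back rather than broadcast to other agents.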

AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Introduces a 110+ task benchmark to evaluate how well multi-agent LLM systems handle buyer-seller negotiation through natural language.

Multi-Agent 2602.06008 notes →

Gender Dynamics and Homophily in a Social Network of LLM Agents

Analyzes social network formation among 70K+ autonomous LLM agents on Chirper.ai to study emergent group behavior and bias.

Multi-Agent 2602.02606 notes →

ROMA: Recursive Open Meta-Agent Framework for Long-Horizon Multi-Agent Systems

Proposes breaking large tasks into subtask trees that run in parallel across multiple agents to handle long-horizon workflows without exceeding context windows.

Multi-Agent 2602.01848 notes →

ORCH: many analyses, one merge — a deterministic multi-agent orchestrator

Proposes a deterministic multi-agent orchestrator where multiple LLMs analyze a problem independently and a merge agent selects the best answer without any training.

Multi-Agent 2602.01797 notes →

H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows

Simulates end-to-end hospital administrative workflows with multi-agent LLMs and FHIR integration to test LLM-driven automation in healthcare settings.

Multi-Agent 2602.05407 notes →

Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering

Proposes a multi-agent system for autonomous software engineering that assigns specialized agents to roles like coordination, research, implementation, and review.

Multi-Agent 2602.01465 notes →

Multi-Agent Teams Hold Experts Back

Examines whether self-organizing LLM agent teams can match or beat their best member's performance across collaborative benchmarks.

Multi-Agent 2602.01011 notes →

Evolving Interpretable Constitutions for Multi-Agent Coordination

Explores using LLM-driven genetic programming to automatically discover behavioral norms for multi-agent coordination in a survival-pressure grid-world simulation.

Multi-Agent 2602.00755 notes →

Scaling Multiagent Systems with Process Rewards

Proposes per-action process rewards from AI feedback to improve credit assignment and sample efficiency when finetuning multi-agent LLM systems.

Multi-Agent 2601.23228 notes →

MonoScale: Scaling Multi-Agent System with Monotonic Improvement

Proposes a framework for safely growing multi-agent pools by generating familiarization tasks and building routing memory, with guaranteed non-decreasing performance across onboa…

Multi-Agent 2601.23219 notes →

Task-Aware LLM Council with Adaptive Decision Pathways for Decision Support

Proposes a task-adaptive multi-agent framework that routes control to the most suitable LLM at each decision step using semantic matching against each model's success history.

Multi-Agent 2601.22662 notes →

SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assembly

Explores using a pool of different LLM agents within MCTS planning to increase rollout diversity and improve multi-step reasoning.

Multi-Agent 2601.22623 notes →

Learning to Recommend Multi-Agent Subgraphs from Calling Trees

Proposes a recommendation framework that uses historical calling trees to select the best agents or agent teams for each subtask in multi-agent orchestration.

Multi-Agent 2601.22209 notes →

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Investigates actor-critic reinforcement learning methods for training decentralized LLM agent collaboration across writing, coding, and game-playing tasks.

Multi-Agent 2601.21972 notes →

AgenticSimLaw: A Juvenile Courtroom Multi-Agent Debate Simulation for Explainable High-Stakes Tabular Decision Making

Proposes a role-structured multi-agent courtroom debate framework with defined agent roles, interaction protocols, and private reasoning strategies for auditable high-stakes decisi…

Multi-Agent 2601.21936 notes →

Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems

Introduces a reasoning framework that builds peer reliability profiles from interaction history so agents in multi-agent systems learn which peers to trust when uncertain.

Multi-Agent 2601.21742 notes →

Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation

Explores structured multi-agent debate with three role-based agents and adaptive confidence gating to improve small language model code generation.

Multi-Agent 2601.21469 notes →

CASTER: Context-Aware Strategy for Task Efficient Routing in Multi-Agent Systems

Proposes a lightweight router for dynamic model selection in graph-based multi-agent systems that combines semantic embeddings with structural meta-features and self-optimizes thro…

Multi-Agent 2601.19793 notes →

Phase Transition for Budgeted Multi-Agent Synergy

Develops a theory for predicting when budgeted multi-agent LLM systems improve, saturate, or collapse based on context windows, communication fidelity, and shared-error correlation…

Multi-Agent 2601.17311 notes →

Dynamic Role Assignment for Multi-Agent Debate

Proposes a meta-debate framework that dynamically assigns roles in multi-agent systems by matching model capabilities to positions through proposal and peer review stages.

Multi-Agent 2601.17152 notes →

Learning to Collaborate: An Orchestrated-Decentralized Framework for Peer-to-Peer LLM Federation

Introduces orchestrated decentralized peer-to-peer LLM collaboration that uses contextual bandits to learn optimal matchmaking between heterogeneous agents via secure distillation.

Multi-Agent 2601.17133 notes →

Mixture-of-Models: Unifying Heterogeneous Agents via N-Way Self-Evaluating Deliberation

Explores a runtime Mixture-of-Models architecture with a dynamic expertise broker and quadratic voting consensus that enables small model ensembles to match frontier performance.

Multi-Agent 2601.16863 notes →

Multi-Agent Constraint Factorization Reveals Latent Invariant Solution Structure

Formalizes through operator theory why multi-agent LLM systems access invariant solutions that a single agent applying all constraints simultaneously cannot reach.

Multi-Agent 2601.15077 notes →

MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

Proposes a training-time framework that formulates multi-agent orchestration as function-calling reinforcement learning with holistic system-level reasoning and introduces MASBENCH…

Multi-Agent 2601.14652 notes →

MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

Proposes a bi-level optimization framework for multi-agent companions that aligns individual personas via RLAIF and optimizes collaborative dialogue through group-level meta-policy…

Multi-Agent 2601.14230 notes →

If You Want Coherence, Orchestrate a Team of Rivals: Multi-Agent Models of Organizational Intelligence

Explores a team-of-rivals multi-agent architecture with specialized roles and a remote code executor that separates reasoning from data execution to maintain clean context windows.

Multi-Agent 2601.14351 notes →

The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption

Formalizes a unified architectural framework for orchestrated multi-agent systems integrating MCP for tool access and Agent2Agent protocol for peer coordination, delegation, and po…

Multi-Agent 2601.13671 notes →

MARO: Learning Stronger Reasoning from Social Interaction

Proposes Multi-Agent Reward Optimization, a method that decomposes multi-agent social interaction outcomes into per-behavior learning signals to improve LLM reasoning through simul…

Multi-Agent 2601.12323 notes →

LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding

Introduces an LSTM-inspired multi-agent architecture with worker, filter, judge, and manager agents that emulate gated memory mechanisms to control information flow for long-contex…

Multi-Agent 2601.11913 notes →

Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation for Multi-Agent Systems

Examines whether query-level workflow generation is always necessary in multi-agent systems and proposes a low-cost task-level framework that uses self-prediction with few-shot cal…

Multi-Agent 2601.11147 notes →

Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems

Proposes a latency-aware multi-agent orchestration framework that explicitly optimizes the critical execution path under parallel execution to reduce end-to-end latency while maint…

Multi-Agent 2601.10560 notes →

TopoDIM: One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems

Proposes a one-shot topology generation framework with diverse interaction modes that enables decentralized agents to autonomously construct heterogeneous communication topologies …

Multi-Agent 2601.10120 notes →

Beyond Rule-Based Workflows: An Information-Flow-Orchestrated Multi-Agents Paradigm via A2A Communication from CORAL

Replaces predefined multi-agent workflows with a dynamic information-flow orchestrator that coordinates agents through natural-language A2A communication.

Multi-Agent 2601.09883 notes →

LLM-Based Agentic Systems for Software Engineering: Challenges and Opportunities

Reviews LLM-based multi-agent systems across the software development lifecycle, covering frameworks, communication protocols, and orchestration challenges from requirements to deb…

Multi-Agent 2601.09822 notes →

Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Explores injecting structured textual experience into multi-agent deliberation at test time to improve reasoning accuracy without any model tuning.

Multi-Agent 2601.09667 notes →

The End of Reward Engineering: How LLMs Are Redefining Multi-Agent Coordination

Argues that LLMs can replace hand-crafted numerical reward functions with language-based objective specifications for multi-agent coordination, drawing on EUREKA and RLVR as eviden…

Multi-Agent 2601.08237 notes →

A Large-Scale Study on the Development and Issues of Multi-Agent AI Systems

Analyzes over 42K commits and 4.7K resolved issues across eight leading multi-agent AI systems (LangChain, CrewAI, AutoGen, etc.) to study development patterns, maintenance practic…

Multi-Agent 2601.07136 notes →

StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management

Proposes a hierarchical multi-agent framework that decouples high-level coordination from subtask execution with active task-level memory control and reinforcement-learning-driven …

Multi-Agent 2601.05890 notes →

CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems

Proposes a constrained temporal hierarchical architecture for multi-agent LLM systems that projects inter-layer communication onto structured manifolds with typed message contracts…

Multi-Agent 2601.10738 notes →

DynaDebate: Breaking Homogeneity in Multi-Agent Debate with Dynamic Path Generation

Introduces dynamic path generation for multi-agent debate that allocates diverse solution paths to agents, shifts focus to step-by-step logic critique, and uses a trigger-based ver…

Multi-Agent 2601.05746 notes →

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Investigates how diversity-aware initialization and confidence-modulated updates improve multi-agent debate, connecting findings from human deliberation research to LLM-based debat…

Multi-Agent 2601.19921 notes →

Orchestrating Intelligence: Confidence-Aware Routing for Multi-Agent Collaboration

Proposes a multi-agent framework with confidence-aware routing that dynamically selects agent roles and model scales across heterogeneous LLMs based on task complexity.

Multi-Agent 2601.04861 notes →

Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework

Analyzes role-based authority bias in multi-agent evaluation frameworks using French and Raven's power-based theory across legitimate, referent, and expert power types.

Multi-Agent 2601.04790 notes →

When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail

Investigates when a single agent with a skill library can replace multi-agent systems, studying scaling limits and phase transitions in skill selection as libraries grow.

Multi-Agent 2601.04748 notes →

ResMAS: Resilience Optimization in LLM-based Multi-Agent Systems

Proposes a two-stage framework for enhancing multi-agent system resilience through RL-based topology generation and topology-aware prompt optimization under perturbations.

Multi-Agent 2601.04694 notes →

TCAndon-Router: Adaptive Reasoning Router for Multi-Agent Collaboration

Proposes an adaptive reasoning router for multi-agent systems that generates natural-language reasoning chains before predicting candidate agents, with a collaborative execution pi…

Multi-Agent 2601.04544 notes →

When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents

Investigates covert communication in LLM multi-agent systems through game-theoretic analysis of implicit coordination signals across different communication regimes.

Multi-Agent 2601.03846 notes →

Bayesian Orchestration of Multi-LLM Agents for Cost-Aware Sequential Decision-Making

Proposes a Bayesian, cost-aware multi-LLM orchestration framework that treats LLMs as approximate likelihood models and aggregates across diverse models for sequential decision-mak…

Multi-Agent 2601.01522 notes →

OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents

Turns natural-language optimization problems into working solver code with a four-agent pipeline (Formulator, Planner, Coder, Critic) and UCB bandit scheduling over candidate formu…

Multi-Agent 2504.16918 notes →
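
The UCB bandit scheduling mentioned in the OptimAI entry above can be pictured with the textbook UCB1 rule. This is a generic sketch, not the paper's scheduler: the names `ucb1_select` and `record`, the reward range of [0, 1], and the three-candidate setup are assumptions for illustration.

```python
import math

def ucb1_select(counts, totals, c=math.sqrt(2)):
    """Return the index of the candidate with the highest UCB1 score.

    counts[i]: how many times candidate i has been tried.
    totals[i]: sum of rewards observed for candidate i.
    """
    # Try every candidate once before trusting any estimate.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    t = sum(counts)

    def score(i):
        mean = totals[i] / counts[i]                    # exploitation term
        bonus = c * math.sqrt(math.log(t) / counts[i])  # exploration term
        return mean + bonus

    return max(range(len(counts)), key=score)

# Hypothetical bookkeeping for three candidate formulations.
counts = [0, 0, 0]
totals = [0.0, 0.0, 0.0]

def record(arm, reward):
    """Store the reward (assumed to lie in [0, 1]) from one attempt."""
    counts[arm] += 1
    totals[arm] += reward
```

A scheduler built this way would call `ucb1_select(counts, totals)` to choose which formulation to work on next, then `record` the outcome, so rarely tried candidates still get occasional attention.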

Corpus2Skill: Don't Retrieve, Navigate — Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Compiles a corpus offline into a hierarchical tree of Agent Skills that the LLM agent navigates at query time, replacing retrieval with skill-tree traversal.

Memory & RAG 2604.14572 notes →

BudgetMem: Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Investigates routing agent memory queries to different processing tiers based on query difficulty to control the cost-accuracy trade-off at runtime.

Memory & RAG 2602.06025 notes →

Learning to Share: Selective Memory for Efficient Parallel Agentic Systems

Proposes a shared memory bank with a learned controller that decides what information is worth passing between parallel agent teams to reduce redundant work.

Memory & RAG 2602.05965 notes →

CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering

Explores converting a corpus into atomic QA pairs offline to resolve multi-hop questions with just two LLM calls regardless of hop count.

Memory & RAG 2602.05728 notes →

Mitigating Hallucination in Financial Retrieval-Augmented Generation via Fine-Grained Knowledge Verification

Examines breaking financial RAG answers into atomic facts and verifying each against retrieved documents using reinforcement learning rewards.

Memory & RAG 2602.05723 notes →

Graph-based Agent Memory: Taxonomy, Techniques, and Applications

Surveys graph-based memory architectures for agents, covering extraction, storage, retrieval, and how memory evolves over time.

Memory & RAG 2602.05665 notes →

AI Agent Systems for Supply Chains: Structured Decision Prompts and Memory Retrieval

Proposes a multi-agent system for inventory management that retrieves similar past decisions to adapt ordering across various supply chain scenarios.

Memory & RAG 2602.05524 notes →

SOPRAG: Multi-view Graph Experts Retrieval for Industrial Standard Operating Procedures

Explores replacing flat chunk-based RAG with graph experts that understand entity relationships, causality, and process flows for structured documents like SOPs.

Memory & RAG 2602.01858 notes →

ProcMEM: Learning Reusable Procedural Memory from Experience via Non-Parametric PPO for LLM Agents

Investigates letting agents save step-by-step procedural skills from past runs and reuse them later without retraining to reduce repeated computation.

Memory & RAG 2602.01869 notes →

Aggregation Queries over Unstructured Text: Benchmark and Agentic Method

Proposes an agentic method for aggregation queries over unstructured text that tries to find all matching evidence, breaking the task into disambiguation, filtering, and aggregatio…

Memory & RAG 2602.01355 notes →

DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

Proposes an agentic RAG framework that uses reflection and memory-based refinement to generate diverse answers for open-ended questions.

Memory & RAG 2602.00238 notes →

JADE: Bridging the Strategic-Operational Gap in Dynamic Agentic RAG

Proposes joint optimization of planning and execution in agentic RAG by modeling the system as a cooperative multi-agent team with shared backbone and outcome-based rewards.

Memory & RAG 2601.21916 notes →

ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation

Proposes process-supervised reinforcement learning for RAG that uses MCTS-based step-level rewards to identify and fix flawed reasoning steps in multi-hop retrieval.

Memory & RAG 2601.21912 notes →

E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory

Introduces an episodic memory framework where assistant agents maintain uncompressed memory contexts while a master agent orchestrates global planning, replacing destructive memory…

Memory & RAG 2601.21714 notes →

ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory

Proposes a tiered memory service for agentic LLM systems that uses masked mixture-of-experts routing to probe only eligible memory shards under a fixed budget.

Memory & RAG 2601.21545 notes →

When should I search more: Adaptive Complex Query Optimization with Reinforcement Learning

Explores adaptive query optimization in RAG using reinforcement learning to dynamically decide when to split complex queries into sub-queries and fuse the retrieved results.

Memory & RAG 2601.21208 notes →

A2RAG: Adaptive Agentic Graph Retrieval for Cost-Aware and Reliable Reasoning

Introduces an adaptive agentic Graph-RAG framework that verifies evidence sufficiency and progressively escalates retrieval effort, mapping graph signals back to source text to han…

Memory & RAG 2601.21162 notes →

MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents

Investigates augmenting multimodal LLMs with a trainable memory gate that decides which observations to retain, update, or discard during online embodied agent exploration.

Memory & RAG 2601.20831 notes →

AMA: Adaptive Memory via Multi-Agent Collaboration

Proposes a multi-agent memory framework with hierarchical granularity, adaptive query routing, consistency verification, and targeted memory refresh for long-term agent interaction…

Memory & RAG 2601.20352 notes →

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Examines when iterative retrieval-reasoning loops outperform static gold-context RAG in scientific multi-hop QA, diagnosing failure modes across retrieval coverage, hypothesis drif…

Memory & RAG 2601.19827 notes →

Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory

Introduces a dependency-aware search framework that uses GRPO reinforcement learning to teach LLMs to decompose questions with dependency relationships and store intermediate resul…

Memory & RAG 2601.18771 notes →

FadeMem: Biologically-Inspired Forgetting for Efficient Agent Memory

Proposes a biologically-inspired agent memory architecture with adaptive exponential decay, LLM-guided conflict resolution, and intelligent memory fusion across a dual-layer hierar…

Memory & RAG 2601.18642 notes →

FastInsight: Fast and Insightful Retrieval via Fusion Operators for Graph RAG

Explores two fusion operators for Graph RAG that combine graph-aware reranking with semantic-topological expansion to improve retrieval accuracy and generation quality.

Memory & RAG 2601.18579 notes →

Less is More for RAG: Information Gain Pruning for Generator-Aligned Reranking and Evidence Selection

Proposes a generator-aligned reranking and pruning module for RAG that selects evidence using utility signals and filters weak or harmful passages before context truncation.

Memory & RAG 2601.17532 notes →

DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering

Introduces a step-by-step reasoning reranking agent for RAG that distinguishes semantically similar but logically irrelevant passages in retrieval-augmented question answering.

Memory & RAG 2601.16478 notes →

SPARC-RAG: Adaptive Sequential-Parallel Scaling with Context Management for Retrieval-Augmented Generation

Introduces a multi-agent RAG framework that coordinates sequential and parallel inference-time scaling under unified context management to prevent contamination and improve multi-h…

Memory & RAG 2602.00083 notes →

Incorporating Q&A Nuggets into Retrieval-Augmented Generation

Proposes a nugget-augmented generation system that constructs a bank of Q&A nuggets from retrieved documents to guide extraction, selection, and report generation with citation pro…

Memory & RAG 2601.13222 notes →

Augmenting Question Answering with A Hybrid RAG Approach

Introduces a hybrid RAG architecture combining query augmentation, agentic routing, and structured retrieval that merges vector and graph-based techniques with context unification …

Memory & RAG 2601.12658 notes →

Utilizing Metadata for Better Retrieval-Augmented Generation

Presents a systematic study of metadata-aware retrieval strategies for RAG, comparing prefix, suffix, unified embedding, and late-fusion approaches with field-level ablations on em…

Memory & RAG 2601.11863 notes →

Deep GraphRAG: A Balanced Approach to Hierarchical Retrieval and Adaptive Integration

Proposes a hierarchical global-to-local retrieval strategy for GraphRAG with beam search-optimized re-ranking and a compact LLM integration module trained via dynamic-weighting rei…

Memory & RAG 2601.11144 notes →

Grounding Agent Memory in Contextual Intent

Introduces an agentic memory system that indexes trajectory steps with structured contextual intent cues and retrieves history by intent compatibility to reduce interference in lon…

Memory & RAG 2601.10702 notes →

Structure and Diversity Aware Context Bubble Construction for Enterprise Retrieval Augmented Systems

Proposes a structure-informed and diversity-constrained context bubble construction framework for RAG that preserves document structure and balances relevance, coverage, and redund…

Memory & RAG 2601.10681 notes →

Topo-RAG: Topology-aware retrieval for hybrid text-table documents

Introduces a dual-architecture RAG framework that routes narrative through dense retrievers and tabular data through a cell-aware late interaction mechanism to preserve spatial rel…

Memory & RAG 2601.10215 notes →

Continuum Memory Architectures for Long-Horizon LLM Agents

Defines a class of memory systems for long-horizon agents that maintain persistent, temporally chained internal state instead of stateless RAG lookups, specifying the architectural…

Memory & RAG 2601.09913 notes →

Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey

Surveys foundation agent memory organized by substrate (internal/external), cognitive mechanism (episodic, semantic, working, procedural), and subject (agent- vs user-centric).

Memory & RAG 2602.06052 notes →

The AI Hippocampus: How Far are We From Human Memory?

Surveys memory in LLMs and multimodal LLMs across implicit, explicit, and agentic paradigms, covering cross-modal integration and challenges like capacity, alignment, and factual c…

Memory & RAG 2601.09113 notes →

AtomMem: Learnable Dynamic Agentic Memory with Atomic Memory Operation

Decomposes memory management into atomic CRUD operations and learns an autonomous policy via SFT + RL to study whether learnable memory outperforms static-workflow methods on long-…

Memory & RAG 2601.08323 notes →

OpenDecoder: Open LLM Decoding to Incorporate Document Quality in RAG

Feeds explicit document quality signals (relevance score, ranking, QPP) into RAG generation to study whether exposing retrieval metadata makes the model more robust to noisy contex…

Memory & RAG 2601.09028 notes →

Reliable Graph-RAG for Codebases: AST-Derived Graphs vs LLM-Extracted Knowledge Graphs

Benchmarks vector-only, LLM-extracted KG, and AST-derived graph pipelines for code RAG, comparing correctness and indexing cost across deterministic and LLM-based graph constructio…

Memory & RAG 2601.08773 notes →

To Retrieve or To Think? An Agentic Approach for Context Evolution

Proposes an agentic RAG framework that dynamically decides whether to retrieve new evidence or reason over existing context at each step, aiming to eliminate redundant retrieval.

Memory & RAG 2601.08747 notes →

Parallel Context-of-Experts Decoding for Retrieval Augmented Generation

Proposes a training-free RAG decoding method that treats retrieved documents as isolated "experts" and aggregates their logits via retrieval-aware contrastive decoding to recover c…

Memory & RAG 2601.08670 notes →

SwiftMem: Fast Agentic Memory via Query-aware Indexing

Proposes a query-aware agentic memory system that achieves sub-linear retrieval through temporal and semantic DAG-Tag indexing with an embedding-tag co-consolidation mechanism for …

Memory & RAG 2601.08160 notes →

Learning How to Remember: A Meta-Cognitive Management Method for Structured and Transferable Agent Memory

Proposes treating memory abstraction as a learnable cognitive skill, training a memory copilot via DPO to determine how memories should be structured, abstracted, and reused across…

Memory & RAG 2601.07470 notes →

Beyond Dialogue Time: Temporal Semantic Memory for Personalized LLM Agents

Introduces a temporal semantic memory framework that organizes memories by actual occurrence time rather than dialogue time and consolidates temporally continuous information into …

Memory & RAG 2601.07468 notes →

Active Context Compression: Autonomous Memory Management in LLM Agents

Proposes an agent-centric architecture inspired by Physarum polycephalum where the agent autonomously decides when to consolidate learnings and prune raw interaction history to man…

Memory & RAG 2601.07190 notes →

Relink: Constructing Query-Driven Evidence Graph On-the-Fly for GraphRAG

Proposes a reason-and-construct paradigm for GraphRAG that dynamically builds query-specific evidence graphs by instantiating facts from a latent relation pool and discarding distr…

Memory & RAG 2601.07192 notes →

Seeing through the Conflict: Transparent Knowledge Conflict Handling in RAG

Introduces a plug-and-play RAG framework that disentangles semantic match from factual consistency and estimates self-answerability to make the conflict-resolution decision process…

Memory & RAG 2601.06842 notes →

CIRAG: Construction-Integration Retrieval and Adaptive Generation for Multi-hop Question Answering

Proposes a construction-integration approach for multi-hop RAG that preserves multiple evidence chains via iterative triple construction and adaptively expands context granularity …

Memory & RAG 2601.06799 notes →

Amory: Building Coherent Narrative-Driven Agent Memory through Agentic Reasoning

Proposes a working memory framework that constructs structured episodic narratives from conversational fragments, consolidates memories with momentum, and semanticizes peripheral f…

Memory & RAG 2601.06282 notes →

L-RAG: Balancing Context and Retrieval with Entropy-Based Lazy Loading

Proposes an adaptive RAG framework that uses entropy-based gating to bypass vector database retrieval when model uncertainty is low, triggering expensive chunk retrieval only when …

Memory & RAG 2601.06551 notes →
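
The entropy-based gating described in the L-RAG entry above can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the uncertainty signal is taken to be the Shannon entropy of a next-token probability distribution, and the threshold of 1.0 nat and both function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy, in nats, of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_retrieve(probs, threshold=1.0):
    """Trigger retrieval only when entropy exceeds the (assumed) threshold."""
    return token_entropy(probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]  # peaked distribution, low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]  # uniform distribution, entropy ln(4)

print(should_retrieve(confident))  # False
print(should_retrieve(uncertain))  # True
```

In practice the threshold would need tuning against retrieval cost and answer quality; the value here is only a placeholder.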

PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop QA

Proposes a decoupled multi-agent RAG framework for multi-hop QA with a Plan-Retrieve-Inspect-Solve-Memoize architecture and two-stage GRPO optimization to address retrieval collaps…

Memory & RAG 2601.05465 notes →

Controllable Memory Usage: Balancing Anchoring and Innovation in Long-Term Human-Agent Interaction

Proposes a framework for user-controllable memory reliance in long-term agent interactions, modeling memory dependence as an explicit and steerable dimension.

Memory & RAG 2601.05107 notes →

Beyond Static Summarization: Proactive Memory Extraction for LLM Agents

Proposes proactive memory extraction using self-questioning feedback loops instead of one-off static summarization to recover missing information and correct errors iteratively.

Memory & RAG 2601.04463 notes →

Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents

Proposes a hierarchical memory architecture with a Topic Loom that groups consecutive same-topic dialogue turns into coherent memory boxes and links them via long-range event-timel…

Memory & RAG 2601.03785 notes →

MAGMA: A Multi-Graph based Agentic Memory Architecture

Proposes a multi-graph agentic memory architecture that represents memories across orthogonal semantic, temporal, causal, and entity graphs with policy-guided traversal for retriev…

Memory & RAG 2601.03236 notes →

HiMeS: Hippocampus-inspired Memory System for Personalized AI Assistants

Proposes a hippocampus-inspired memory architecture for AI assistants that fuses RL-trained short-term memory extraction with partitioned long-term memory for personalization.

Memory & RAG 2601.06152 notes →

SimpleMem: Efficient Lifelong Memory for LLM Agents

Proposes a three-stage memory framework built on semantic lossless compression: structured compression, online semantic synthesis, and intent-aware retrieval planning.

Memory & RAG 2601.02553 notes →

ClawBench: Evaluating Browser Agents on Live Production Websites with Submission-Interception

Benchmarks browser agents on 283 everyday tasks (V1 153 + V2 130) across 163 live production sites, with a Chrome-extension plus CDP layer that blocks only the final write request …

Eval & Observability 2604.08523 notes →

From Features to Actions: Explainability in Traditional and Agentic AI Systems

Compares attribution-based explanations with trace-based diagnostics across static and agentic settings to study how explainability methods translate to multi-step agent trajectori…

Eval & Observability 2602.06841 notes →

Agentic Uncertainty Reveals Agentic Overconfidence

Investigates whether agents can accurately predict their own success rates in agentic tasks.

Eval & Observability 2602.06948 notes →

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Introduces 20 research tasks from real ML papers covering idea generation, experiments, and refinement for benchmarking science agents.

Eval & Observability 2602.06855 notes →

JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks

Proposes evaluating agent outputs by decomposing responses into individual claims and checking each against expert knowledge.

Eval & Observability 2602.06486 notes →

Completing Missing Annotation: Multi-Agent Debate for Accurate Relevant Assessment

Explores using multi-agent debate to fill missing labels in information retrieval benchmarks.

Eval & Observability 2602.06526 notes →

TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents

Proposes a specialized verifier that detects and locates errors in agent execution trajectories at runtime to enable precise rollback-and-retry.

Eval & Observability 2602.06443 notes →

Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents

Examines whether GPT-4/5 agents can reproduce aggregate human cognitive biases in interactive decision-making scenarios.

Eval & Observability 2602.05597 notes →

Capture the Flags: Family-Based Evaluation of Agentic LLMs

Proposes generating families of equivalent CTF challenges through code transformations to test whether agents truly understand exploits or just memorize patterns.

Eval & Observability 2602.05523 notes →

PieArena: Frontier Language Agents Achieve MBA-Level Negotiation

Introduces a negotiation benchmark where frontier LLM agents are evaluated against MBA students to reveal cross-model differences in deception, accuracy, and trustworthiness.

Eval & Observability 2602.05302 notes →

ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support

Benchmarks how well conversational agents retain and use personal information over long emotional support conversations.

Eval & Observability 2602.01885 notes →

HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

Introduces a benchmark that replays published human-subject experiments with LLM agents to test how well they simulate real participants.

Eval & Observability 2602.00685 notes →

Benchmarking Agents in Insurance Underwriting Environments

Proposes an expert-designed multi-turn insurance underwriting benchmark to evaluate agent performance under real-world enterprise conditions with noisy tools and proprietary knowle…

Eval & Observability 2602.00456 notes →

TriCEGAR: A Trace-Driven Abstraction Mechanism for Agentic AI

Proposes automated state abstraction from agent execution traces using predicate trees and counterexample refinement for probabilistic runtime verification of agent behavior.

Eval & Observability 2601.22997 notes →

Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Compares three LLM agent frameworks (Aider, OpenHands, SWE-agent) on vulnerability false positive filtering to study how agent design and backbone model affect triage performance.

Eval & Observability 2601.22952 notes →

Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study

Analyzes 8,106 fix-related pull requests from five AI coding agents to catalog the reasons agent-generated contributions are closed without merging.

Eval & Observability 2602.00164 notes →

JAF: Judge Agent Forest

Proposes a judge agent framework that evaluates query-response pairs jointly across a cohort rather than in isolation, using in-context neighborhoods for cross-instance pattern det…

Eval & Observability 2601.22269 notes →

Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis

Evaluates LLM reasoning under ReAct and Plan-and-Execute agentic workflows across 48,000 simulated failure scenarios, producing a taxonomy of 16 common reasoning failures.

Eval & Observability 2601.22208 notes →

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Introduces a benchmark for evaluating LLM agent consistency, uncertainty handling, and capability awareness in multi-turn tool-using scenarios with incomplete or ambiguous user req…

Eval & Observability 2601.22027 notes →

More Code, Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests

Examines code quality, maintainability, and reviewer sentiment toward AI-agent-generated pull requests compared to human-authored contributions.

Eval & Observability 2601.21276 notes →

The Quiet Contributions: Insights into AI-Generated Silent Pull Requests

Analyzes silent (no-comment) AI-generated pull requests to examine their impact on code complexity, quality issues, and security vulnerabilities.

Eval & Observability 2601.21102 notes →

Agent Benchmarks Fail Public Sector Requirements

Analyzes over 1,300 agent benchmarks against public-sector requirements including process-based evaluation, realism, and domain-specific metrics.

Eval & Observability 2601.20617 notes →

Interpreting Emergent Extreme Events in Multi-Agent Systems

Applies Shapley values to attribute emergent extreme events in LLM multi-agent systems to specific agent actions across time, agent, and behavior dimensions.

Eval & Observability 2601.20538 notes →

Who Writes the Docs in SE 3.0? Agent vs. Human Documentation Pull Requests

Analyzes AI agent contributions to documentation pull requests and examines how human developers review and intervene in agent-authored documentation changes.

Eval & Observability 2601.20171 notes →

Are We All Using Agents the Same Way? An Empirical Study of Core and Peripheral Developers Use of Coding Agents

Examines how core and peripheral developers differ in their use, review, modification, and verification of coding-agent-generated pull requests.

Eval & Observability 2601.20106 notes →

DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle

Introduces an end-to-end benchmark with 700+ real-world tasks across build, monitoring, issue resolving, and test generation for evaluating AI agents in full software DevOps workfl…

Eval & Observability 2601.20882 notes →

Toward Architecture-Aware Evaluation Metrics for LLM Agents

Proposes an architecture-informed evaluation approach that links agent components like planners, memory, and tool routers to observable behaviors and diagnostic metrics.

Eval & Observability 2601.19583 notes →

Balancing Sustainability And Performance: The Role Of Small-Scale LLMs In Agentic AI Systems

Investigates whether smaller-scale language models can reduce energy consumption in multi-agent agentic AI systems without compromising task quality.

Eval & Observability 2601.19311 notes →

Understanding Dominant Themes in Reviewing Agentic AI-authored Code

Analyzes 19,450 inline review comments on agent-authored pull requests and derives a taxonomy of 12 review themes to understand how reviewers respond to AI-generated code.

Eval & Observability 2601.19287 notes →

Let's Make Every Pull Request Meaningful: An Empirical Analysis of Developer and Agentic Pull Requests

Analyzes 40,214 developer and agentic pull requests to compare merge outcomes and identify how submitter attributes and review features differ between human and AI agent contributi…

Eval & Observability 2601.18749 notes →

Automated Structural Testing of LLM-Based Agents: Methods, Framework, and Case Studies

Presents structural testing methods for LLM-based agents using OpenTelemetry traces, mocking for reproducible behavior, and automated assertions for component-level verification.

Eval & Observability 2601.18827 notes →

When AI Agents Touch CI/CD Configurations: Frequency and Success

Analyzes how five AI coding agents interact with CI/CD configurations across 8,031 pull requests, examining modification frequency, merge rates, and build success.

Eval & Observability 2601.17413 notes →

Fingerprinting AI Coding Agents on GitHub

Identifies behavioral signatures of five AI coding agents from 33,580 pull requests using commit, PR structure, and code features for agent attribution.

Eval & Observability 2601.17406 notes →

Interpreting Agentic Systems: Beyond Model Explanations to System-Level Accountability

Assesses existing interpretability methods for agentic systems and identifies gaps in explaining temporal dynamics, compounding decisions, and context-dependent behaviors.

Eval & Observability 2601.17168 notes →

AI builds, We Analyze: An Empirical Study of AI-Generated Build Code Quality

Investigates maintainability- and security-related build code smells in AI-agent-generated pull requests, covering 364 identified quality issues.

Eval & Observability 2601.16839 notes →

Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source

Examines long-term survival of AI-agent-generated code through survival analysis of 200,000+ code units across 201 open-source projects.

Eval & Observability 2601.16809 notes →

LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents

Develops an oracle counterfactual framework for multi-turn agentic tasks that measures the criticality of individual capabilities like planning and state tracking.

Eval & Observability 2601.16649 notes →

When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems

Presents a 12-category error taxonomy and diagnostic framework for evaluating tool-use reliability across open-weight and proprietary LLMs in multi-agent systems on edge hardware.

Eval & Observability 2601.16280 notes →

Agentic Confidence Calibration

Introduces the problem of agentic confidence calibration and proposes Holistic Trajectory Calibration, extracting process-level features across an agent's entire trajectory to diag…

Eval & Observability 2601.15778 notes →

Improving Methodologies for Agentic Evaluations Across Domains: Leakage of Sensitive Information, Fraud and Cybersecurity Threats

Examines methodological challenges in evaluating AI agents across sensitive information leakage, fraud, and cybersecurity threats through a multi-national collaborative benchmarkin…

Eval & Observability 2601.15679 notes →

MiRAGE: A Multiagent Framework for Generating Multimodal Multihop Question-Answer Dataset for RAG Evaluation

Introduces a multi-agent framework that generates verified, domain-specific, multimodal, multi-hop question-answer datasets for benchmarking retrieval-augmented generation systems.

Eval & Observability 2601.15487 notes →

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

Analyzes 1,187 bug reports from LLM agent software across seven frameworks to categorize bug types, root causes, and effects, and tests automated bug labeling with a ReAct agent.

Eval & Observability 2601.15232 notes →

The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution

Proposes a hierarchical framework for general agentic attribution that identifies internal factors driving agent actions through temporal likelihood dynamics and perturbation-based…

Eval & Observability 2601.15075 notes →

Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering

Analyzes token consumption across software development lifecycle stages in a multi-agent system to identify which stages drive cost.

Eval & Observability 2601.14470 notes →

APEX-Agents

Introduces a benchmark of 480 long-horizon, cross-application productivity tasks created by investment banking analysts, consultants, and lawyers for evaluating AI agent capabiliti…

Eval & Observability 2601.14242 notes →

CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Introduces a benchmark of 600+ collaborative coding tasks to evaluate whether coding agents can coordinate as effective teammates under various coordination structures.

Eval & Observability 2601.13295 notes →

Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?

Investigates how RAG systems can game nugget-based LLM judge evaluations through metric overfitting, demonstrating near-perfect scores when evaluation elements are leaked or predic…

Eval & Observability 2601.13227 notes →

Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

Introduces the Determinism-Faithfulness Assurance Harness for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using LLM agents across 74 configuratio…

Eval & Observability 2601.15322 notes →

AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems

Presents a process-aware and auditable multi-agent evaluation framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under hum…

Eval & Observability 2601.11903 notes →

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Introduces a curated benchmark of 89 hard terminal tasks, each with a unique environment, a human-written solution, and comprehensive tests, for evaluating frontier…

Eval & Observability 2601.11868 notes →

ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems

Introduces a benchmark and evaluation framework for agentic task-oriented dialogue systems covering multi-goal coordination, dependency management, memory, adaptability, and proact…

Eval & Observability 2601.11854 notes →

What Do LLM Agents Know About Their World? Task2Quiz

Decouples task execution from environment understanding with a deterministic QA paradigm to study whether task success is actually a good proxy for how well agents understand their…

Eval & Observability 2601.09503 notes →

The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments

Evaluates frontier models on 150 workplace tasks to identify an empirical hierarchy of agentic capabilities spanning tool use, planning, adaptability, groundedness, and common-sens…

Eval & Observability 2601.09032 notes →

ViDoRe V3: A Comprehensive Evaluation of RAG in Complex Real-World Scenarios

Introduces a multimodal RAG benchmark with 26K pages and 3,099 queries in 6 languages to evaluate retrieval across non-textual elements and open-ended queries.

Eval & Observability 2601.08620 notes →

M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games

Evaluates LLM agent social behaviors in mixed-motive games using process-aware analysis of both reasoning and communication rather than outcome-only metrics.

Eval & Observability 2601.08462 notes →

Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Benchmarks whether agents can proactively use long-term memory to execute tool-based actions, rather than just passively retrieving facts on demand.

Eval & Observability 2601.19935 notes →

Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms

Proposes a formal framework for actively evaluating general-purpose agents across multiple tasks, selecting which tasks and agents to sample next to minimize ranking error over tim…

Eval & Observability 2601.07651 notes →

VirtualEnv: A Platform for Embodied AI Research

Introduces an Unreal Engine 5 simulation platform for benchmarking LLM-driven agents on embodied tasks including navigation, object manipulation, and multi-agent coordination in pr…

Eval & Observability 2601.07553 notes →

FROAV: A Framework for RAG Observation and Agent Verification

Presents an open-source platform combining visual workflow orchestration with LLM-as-a-Judge evaluation for prototyping and validating RAG-based agent pipelines without infrastruct…

Eval & Observability 2601.07504 notes →

Lost in the Noise: How Reasoning Models Fail with Contextual Distractors

Benchmarks model robustness across 11 RAG, reasoning, alignment, and tool-use tasks against diverse contextual noise types including random documents, irrelevant histories, and har…

Eval & Observability 2601.07226 notes →

RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

Introduces a project-oriented memory benchmark with 2,000+ cross-session dialogues across eleven scenarios to evaluate how well agents track evolving goals and dynamic context depe…

Eval & Observability 2601.06966 notes →

IDRBench: Interactive Deep Research Benchmark

Introduces the first benchmark for interactive deep research combining a modular multi-agent framework with on-demand user interaction, a scalable user simulator, and interaction-a…

Eval & Observability 2601.06676 notes →

ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation

Introduces an open-world tool-using environment with 5,571 tools across 204 apps, a task engine for multi-tool workflows with wild constraints, and a state controller that injects …

Eval & Observability 2601.06328 notes →

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Introduces a tower defense environment for evaluating LLM agent planning and decision-making with low computational demands, multimodal observation, and hallucination assessment su…

Eval & Observability 2601.05899 notes →

MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents

Introduces a user-authored benchmark for memory-aware LLM agents in Minecraft with parametric task templates, machine-checkable validators, and bounded-knowledge evaluation under a…

Eval & Observability 2601.05215 notes →

Internal Representations as Indicators of Hallucinations in Agent Tool Selection

Proposes a framework for detecting tool-calling hallucinations in LLM agents by analyzing internal representations during a single forward pass, targeting incorrect tool selection,…

Eval & Observability 2601.05214 notes →

Agent-as-a-Judge

Surveys the evolution from LLM-as-a-Judge to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory f…

Eval & Observability 2601.05111 notes →

Arabic Prompts with English Tools: A Benchmark

Introduces the first benchmark for evaluating tool-calling and agentic capabilities of LLMs in Arabic, measuring functional accuracy and robustness in Arabic agentic workflows.

Eval & Observability 2601.05101 notes →

Effects of Personality Steering on Cooperative Behavior in LLM Agents

Examines how Big Five personality steering affects cooperative behavior in LLM agents using repeated Prisoner's Dilemma games across multiple model generations.

Eval & Observability 2601.05302 notes →

Analyzing Message-Code Inconsistency in AI Coding Agent-Authored Pull Requests

Analyzes message-code inconsistency in pull requests authored by five AI coding agent systems to study the trustworthiness of agent-generated PR descriptions.

Eval & Observability 2601.04886 notes →

GUITester: Enabling GUI Agents for Exploratory Defect Discovery

Proposes a multi-agent framework for autonomous exploratory GUI testing that decouples navigation from verification via planning-execution and hierarchical reflection modules.

Eval & Observability 2601.04500 notes →

Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems

Introduces the concept of agent drift and a composite metric framework for quantifying semantic, coordination, and behavioral degradation in multi-agent LLM systems over extended i…

Eval & Observability 2601.04170 notes →

M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?

Introduces a unified benchmark for evaluating Multi-Agent Debate methods across multiple domains, modalities, and efficiency metrics including token consumption and inference time.

Eval & Observability 2601.02854 notes →

Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts

Documents six recurring failure modes across four end-to-end attempts at autonomous ML research using a pipeline of LLM agents mapped to stages of the scientific workflow.

Eval & Observability 2601.03315 notes →

LongDA: Benchmarking LLM Agents for Long-Document Data Analysis

Introduces a data analysis benchmark for evaluating LLM agents under documentation-intensive analytical workflows requiring long document navigation and multi-step computation.

Eval & Observability 2601.02598 notes →

The Rise of Agentic Testing: Multi-Agent Systems for Robust Software Quality Assurance

Proposes a closed-loop multi-agent testing framework with generation, execution analysis, and review optimization agents for autonomous software test refinement.

Eval & Observability 2601.02454 notes →

Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents

Proposes a causal framework using structural causal models and counterfactual interventions to audit whether reasoning traces in LLM agents are faithful generative drivers or post-…

Eval & Observability 2601.02314 notes →

ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions

Introduces a benchmark for evaluating agent reliability across consistency, robustness to perturbations, and fault tolerance under chaos-engineering-style tool failure injection.

Eval & Observability 2601.06112 notes →

MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability

Introduces an evaluation suite that standardizes MAS configuration and execution, exports framework-agnostic execution traces, and enables systematic reliability assessment across …

Eval & Observability 2601.00481 notes →

Beyond Perfect APIs: WildAGTEval

Introduces a benchmark for evaluating LLM agent function-calling under realistic API complexity including noisy outputs, detailed specifications, and runtime challenges.

Eval & Observability 2601.00268 notes →

TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging

Proposes a multi-agent observe-analyze-repair loop that uses runtime traces to find and fix bugs in LLM-generated code.

Agent Tooling 2602.06875 notes →

Generative Ontology: When Structured Knowledge Learns to Create

Explores constraining LLM generation with executable schemas and multi-agent roles to produce structurally valid yet creative outputs.

Agent Tooling 2602.05636 notes →

Structured Context Engineering for File-Native Agentic Systems

Tests how context format (YAML, JSON, Markdown) affects agent accuracy across 9,649 experiments in file-native agentic systems.

Agent Tooling 2602.05447 notes →

ProAct: Agentic Lookahead in Interactive Environments

Explores training agents to think ahead by distilling environment search into causal reasoning chains in interactive environments.

Agent Tooling 2602.05327 notes →

Autonomous Question Formation for Large Language Model-Driven AI Systems

Investigates teaching agents to ask themselves the right questions before acting to adapt to new situations autonomously.

Agent Tooling 2602.01556 notes →

From Perception to Action: Spatial AI Agents and World Models

Surveys the connection between agentic architectures and spatial tasks like robotics and navigation, covering memory, planning, and world models in embodied agents.

Agent Tooling 2602.01644 notes →

World Models as an Intermediary between Agents and the Real World

Argues for using world models as a bridge between agents and high-cost real-world environments to provide richer learning signals across domains like robotics and ML engineering.

Agent Tooling 2602.00785 notes →

Engineering AI Agents for Clinical Workflows: A Case Study in Architecture, MLOps, and Governance

Presents a reference architecture for production AI agents integrating Clean Architecture, event-driven design, per-agent MLOps lifecycles, and human-in-the-loop governance.

Agent Tooling 2602.00751 notes →

Autonomous Data Processing using Meta-Agents

Proposes a meta-agent framework that builds, runs, and keeps refining data processing pipelines through hierarchical agent orchestration.

Agent Tooling 2602.00307 notes →

MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

Proposes a multi-agent framework for automatically building executable test environments across ten programming languages using planning-execution-verification with environment reu…

Agent Tooling 2601.22859 notes →

Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training

Proposes an adaptive data generation framework for training mobile GUI agents that matches task difficulty to the agent's current capability level.

Agent Tooling 2601.22781 notes →

AutoRefine: From Trajectories to Reusable Expertise for Continual LLM Agent Refinement

Proposes extracting dual-form reusable expertise from agent execution histories — specialized subagents for procedural tasks and skill patterns for static knowledge — with continuo…

Agent Tooling 2601.22758 notes →

ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents

Proposes modeling GUI agent operations as sequences of learnable tool tokens with semantic anchoring and curriculum-based training instead of coordinate-based visual grounding.

Agent Tooling 2602.02548 notes →

From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

Proposes a framework combining a self-evolving multi-agent data engine with verifier-based reinforcement learning to train multi-turn interactive tool-using agents.

Agent Tooling 2601.22607 notes →

Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents

Investigates why step-wise reasoning struggles with long-horizon planning in LLM agents and proposes future-aware lookahead with reward estimation to let early actions account for …

Agent Tooling 2601.22311 notes →

SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Proposes a test-time scaling method for software engineering agents that recycles prior trajectories and branches at critical intermediate steps instead of resampling from scratch.

Agent Tooling 2601.22129 notes →

Optimizing Agentic Workflows using Meta-tools

Proposes bundling recurring sequences of agent tool calls into deterministic meta-tools to skip unnecessary intermediate LLM reasoning steps and cut failures.

Agent Tooling 2601.22037 notes →

astra-langchain4j: Experiences Combining LLMs and Agent Programming

Explores integrating LLM capabilities into the ASTRA agent programming language to study how traditional agent toolkits and modern LLM-based agentic platforms can inform each other…

Agent Tooling 2601.21879 notes →

Meta Context Engineering via Agentic Skill Evolution

Introduces a bi-level framework where a meta-agent evolves context engineering skills via agentic crossover while a base agent executes them to optimize context as files and code.

Agent Tooling 2601.21557 notes →

DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis

Proposes a multi-agent framework and benchmark for cross-modal data analysis that coordinates specialized sub-agents via a divide-and-conquer workflow across structured and unstruc…

Agent Tooling 2601.21403 notes →

CovAgent: Overcoming the 30% Curse of Mobile Application Coverage with Agentic AI and Dynamic Instrumentation

Explores agentic AI for Android app testing that uses code inspection and dynamic instrumentation to reach activities that standard GUI fuzzers cannot access.

Agent Tooling 2601.21253 notes →

CUA-Skill: Develop Skills for Computer Using Agent

Introduces a large-scale computer-using agent skill library with parameterized execution, composition graphs, dynamic retrieval, and memory-aware failure recovery for desktop appli…

Agent Tooling 2601.21123 notes →

Textual Equilibrium Propagation for Deep Compound AI Systems

Explores local equilibrium propagation for optimizing deep compound AI systems that avoids signal degradation in long-horizon agentic workflows by replacing global textual backprop…

Agent Tooling 2601.21064 notes →

Should I Have Expressed a Different Intent? Counterfactual Generation for LLM-Based Autonomous Control

Investigates counterfactual reasoning in agentic LLM control scenarios using structural causal models and conformal prediction for formal reliability guarantees.

Agent Tooling 2601.20090 notes →

Insight Agents: An LLM-Based Multi-Agent System for Data Insights

Introduces a hierarchical multi-agent system with out-of-domain detection and BERT-based agent routing for delivering personalized data insights at production scale.

Agent Tooling 2601.20048 notes →

Agentic Design Patterns: A System-Theoretic Framework

Introduces a system-theoretic framework that decomposes agentic AI into five functional subsystems and derives 12 reusable design patterns for building robust agent architectures.

Agent Tooling 2601.19752 notes →

A Practical Guide to Agentic AI Transition in Organizations

Explores a pragmatic framework for transitioning organizational processes to agentic AI, covering domain-driven use case identification, task delegation, and human-in-the-loop oper…

Agent Tooling 2602.10122 notes →

JitRL: Just-In-Time Reinforcement Learning for Continual Learning in LLM Agents Without Gradient Updates

Proposes a training-free continual learning framework for LLM agents that retrieves relevant past experiences and modulates output logits at test time without gradient updates.

Agent Tooling 2601.18510 notes →

Think-Augmented Function Calling: Improving LLM Parameter Accuracy Through Embedded Reasoning

Proposes embedding explicit reasoning at both function and parameter levels during agent tool calls, with dynamic complexity scoring to trigger granular justification for critical …

Agent Tooling 2601.18282 notes →

Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

Investigates which RL training environment properties and modeling choices most influence cross-domain generalization for LLM agents deployed beyond their training domains.

Agent Tooling 2601.18217 notes →

Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation

Proposes disaggregating LLM investigation into bounded local evidence mining with deterministic graph traversal and belief propagation for reliable open-ended agent reasoning.

Agent Tooling 2601.17915 notes →

AI Agent for Reverse-Engineering Legacy Finite-Difference Code

Presents a LangGraph-based AI agent framework combining GraphRAG, multi-stage retrieval, and RL-inspired adaptive feedback for reverse-engineering legacy scientific code.

Agent Tooling 2601.18381 notes →

PatchIsland: Orchestration of LLM Agents for Continuous Vulnerability Repair

Proposes a continuous vulnerability repair system that orchestrates a diverse LLM agent ensemble with two-phase deduplication for integration with continuous fuzzing pipelines.

Agent Tooling 2601.17471 notes →

DALIA: Towards a Declarative Agentic Layer for Intelligent Agents in MCP-Based Server Ecosystems

Introduces a declarative architectural layer for agentic workflows with formalized capabilities, a declarative discovery protocol, and deterministic task graph construction.

Agent Tooling 2601.17435 notes →

SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Presents a task-aware context pruning framework for coding agents that trains a lightweight neural skimmer to selectively retain relevant code lines based on explicit goals.

Agent Tooling 2601.16746 notes →

REprompt: Prompt Generation for Intelligent Software Development Guided by Requirements Engineering

Proposes a multi-agent prompt optimization framework guided by requirements engineering principles for system and user prompts in agent-based software development.

Agent Tooling 2601.16507 notes →

EvoConfig: Self-Evolving Multi-Agent Systems for Efficient Autonomous Environment Configuration

Introduces a self-evolving multi-agent framework for automated environment configuration with expert diagnosis and dynamic error-fixing priority adjustment.

Agent Tooling 2601.16489 notes →

SemanticALLI: Caching Reasoning, Not Just Responses, in Agentic Systems

Proposes a pipeline-aware caching architecture for agentic systems that elevates structured intermediate reasoning representations to first-class cacheable artifacts to reduce redu…

Agent Tooling 2601.16286 notes →

Controlling Long-Horizon Behavior in Language Model Agents with Explicit State Dynamics

Investigates imposing explicit dynamical structure on an external affective state to induce temporal coherence and controlled recovery in multi-turn dialogue agents.

Agent Tooling 2601.16087 notes →

Agentic Uncertainty Quantification

Proposes a Dual-Process framework that transforms verbalized uncertainty into bi-directional control signals for agent memory and reflection to prevent cascading hallucination erro…

Agent Tooling 2601.15703 notes →

Agentic AI Governance and Lifecycle Management in Healthcare

Presents a Unified Agent Lifecycle Management blueprint with five control-plane layers for governing agent fleets including identity registry, orchestration, and runtime policy enf…

Agent Tooling 2601.15630 notes →

Autonomous Business System via Neuro-symbolic AI

Introduces a neuro-symbolic architecture that integrates LLM agents with predicate-logic programming and knowledge graphs to orchestrate end-to-end business initiatives through tas…

Agent Tooling 2601.15599 notes →

How to Build AI Agents by Augmenting LLMs with Codified Human Expert Domain Knowledge? A Software Engineering Framework

Proposes a software engineering framework for capturing and embedding codified human domain knowledge into LLM-based agents through request classification, RAG, and expert rule int…

Agent Tooling 2601.15153 notes →

Agent Identity URI Scheme: Topology-Independent Naming and Capability-Based Discovery for Multi-Agent Systems

Defines the agent:// URI scheme that decouples agent identity from network location through trust roots, hierarchical capability paths, and cryptographic attestation for multi-agen…

Agent Tooling 2601.14567 notes →

Toward Efficient Agents: Memory, Tool learning, and Planning

Surveys efficiency in agent systems across memory, tool learning, and planning, comparing approaches under fixed cost budgets and analyzing the Pareto frontier between effectivenes…

Agent Tooling 2601.14192 notes →

Toward self-coding information systems

Proposes self-coding information systems that use agentic AI to dynamically generate, test, and redeploy their own source code at runtime to reduce feature delivery time.

Agent Tooling 2601.14132 notes →

A Lightweight Modular Framework for Constructing Autonomous Agents Driven by Large Language Models: Design, Implementation, and Applications in AgentForge

Presents a lightweight open-source Python framework for building LLM-driven agents with composable skill abstractions, a unified LLM backend interface, and declarative YAML-based c…

Agent Tooling 2601.13383 notes →

MagicGUI-RMS: A Multi-Agent Reward Model System for Self-Evolving GUI Agents via Automated Feedback Reflux

Introduces a multi-agent reward model system for GUI agents that combines domain-specific and general-purpose reward models with automated data reflux for self-evolving agent train…

Agent Tooling 2601.13060 notes →

Agentic AI Meets Edge Computing in Autonomous UAV Swarms

Investigates three deployment architectures for integrating LLM-based agentic AI with edge computing in UAV swarms, covering standalone, edge-enabled, and edge-cloud hybrid configu…

Agent Tooling 2601.14437 notes →

Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents

Proposes a unified taxonomy decomposing AI agents into Perception, Brain, Planning, Action, Tool Use, and Collaboration subsystems, covering MCP, native computer use, and evaluatio…

Agent Tooling 2601.12560 notes →

Agentic Reasoning for Large Language Models

Surveys agentic reasoning across foundational, self-evolving, and collective multi-agent dimensions, distinguishing in-context reasoning from post-training approaches across planni…

Agent Tooling 2601.12538 notes →

POLARIS: Typed Planning and Governed Execution for Agentic AI in Back-Office Automation

Introduces a governed orchestration framework that treats agentic automation as typed plan synthesis with DAG-based planning, rubric-guided selection, validator-gated execution, an…

Agent Tooling 2601.11816 notes →

From Everything-is-a-File to Files-Are-All-You-Need: How Unix Philosophy Informs the Design of Agentic AI Systems

Explores how the Unix 'everything is a file' principle informs agentic AI design through file-like abstractions and code-based specifications for composable, auditable agent interf…

Agent Tooling 2601.11672 notes →

Towards AGI A Pragmatic Approach Towards Self Evolving Agent

Introduces a hierarchical self-evolving multi-agent framework that integrates curriculum learning, reward-based learning, and genetic algorithm evolution for continuous autonomous …

Agent Tooling 2601.11658 notes →

EvoFSM: Controllable Self-Evolution for Deep Research with Finite State Machines

Proposes a self-evolving agent framework that evolves an explicit finite state machine instead of free-form code rewriting, constraining flow and skill optimization to a structured…

Agent Tooling 2601.09465 notes →

Investigating Tool-Memory Conflicts in Tool-Augmented LLMs

Identifies and studies a conflict type where a tool-augmented LLM's internal knowledge contradicts external tool outputs, evaluating whether existing resolution techniques like pro…

Agent Tooling 2601.09760 notes →

MAXS: Meta-Adaptive Exploration with LLM Agents

Uses lookahead planning to estimate the value of tool usage at each step and selects stable, high-value reasoning paths, with a convergence mechanism that halts rollouts once consi…

Agent Tooling 2601.09259 notes →

ToolACE-MCP: Generalizing History-Aware Routing from MCP Tools to the Agent Web

Trains history-aware routers for large-scale MCP tool ecosystems using dependency graphs and multi-turn trajectory synthesis to generalize across multi-agent collaboration and mass…

Agent Tooling 2601.08276 notes →

Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning

Proposes iterative query planning for tool retrieval that decomposes instructions into sub-tasks and dynamically generates queries, trained via synthetic trajectories and reinforce…

Agent Tooling 2601.07782 notes →

OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent

Introduces a Computer-Using Agent framework with milestone-driven long-term memory for trajectory-level self-correction and a multimodal searcher that synthesizes live, visually al…

Agent Tooling 2601.07779 notes →

SAGE: Tool-Augmented LLM Task Solving Strategies in Scalable Multi-Agent Environments

Presents a conversational AI interface for dynamic tool discovery and execution via the OPACA framework, comparing multiple task-solving strategies across different agent setups an…

Agent Tooling 2601.09750 notes →

Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning

Proposes test-time tool evolution where agents synthesize, verify, and evolve executable tools during inference instead of relying on static pre-defined tool libraries.

Agent Tooling 2601.07641 notes →

MegaFlow: Large-Scale Distributed Orchestration System for the Agentic Era

Introduces a large-scale distributed orchestration system that decouples agent training into independent Model, Agent, and Environment services for scheduling tens of thousands of …

Agent Tooling 2601.07526 notes →

JudgeFlow: Agentic Workflow Optimization via Block Judge

Proposes an evaluation-judge-optimization pipeline that assigns block-level responsibility scores to failing logic blocks in agentic workflows, focusing modifications on the most p…

Agent Tooling 2601.07477 notes →

R-LAM: Reproducibility-Constrained Large Action Models for Scientific Workflow Automation

Introduces a reproducibility-constrained framework for Large Action Models with structured action schemas, deterministic execution policies, and provenance tracking to ensure audit…

Agent Tooling 2601.09749 notes →

OpenTinker: Separating Concerns in Agentic Reinforcement Learning

Proposes a composable RL infrastructure for LLM agents that separates algorithm design, execution, and agent-environment interaction with a centralized scheduler for managing share…

Agent Tooling 2601.07376 notes →

ARM: Role-Conditioned Neuron Transplantation for Training-Free Generalist LLM Agent Merging

Introduces activation-guided, role-conditioned neuron transplantation for training-free merging of environment-specific LLM agent experts into a single generalist model.

Agent Tooling 2601.07309 notes →

PRISM: Disentangling SFT and RL Data via Gradient Concentration

Proposes a dynamics-aware framework grounded in Schema Theory that routes agent training data to SFT or RL based on gradient concentration, using cognitive conflict as the allocati…

Agent Tooling 2601.07224 notes →

ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration

Introduces a training framework for calibrating agent tool-use behavior through a self-evolving data flywheel and two-phase behavior calibration to reduce redundant and insufficien…

Agent Tooling 2601.06860 notes →

No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

Proposes a co-evolutionary framework that jointly optimizes the agent policy and its natural-language critic through synchronized GRPO updates, preventing the critic from becoming …

Agent Tooling 2601.06794 notes →

CEDAR: Context Engineering for Agentic Data Science

Introduces context engineering techniques for agentic workflows including structured DS-specific prompting, separate plan and code agents, and smart history rendering for fault tol…

Agent Tooling 2601.06606 notes →

ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

Proposes a reinforcement learning paradigm that replaces pointwise scalar scoring with intra-group relative ranking via tournament-based schemes to address discrimination collapse …

Agent Tooling 2601.06487 notes →

Architecting AgentOps Needs CHANGE

Introduces a conceptual framework with six capabilities (Contextualize, Harmonize, Anticipate, Negotiate, Generate, Evolve) for architecting AgentOps platforms that manage the life…

Agent Tooling 2601.06456 notes →

Can We Predict Before Executing Machine Learning Agents?

Proposes internalizing execution priors to predict agent outcomes before physical execution, using a Predict-then-Verify loop to accelerate ML agent workflows without running expen…

Agent Tooling 2601.05930 notes →

EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Proposes an automated framework for generating scalable tool-interaction environments via programmatic synthesis, constructing diverse environment skeletons and task scenarios for …

Agent Tooling 2601.05808 notes →

LIDL: LLM Integration Defect Localization via Knowledge Graph-Enhanced Multi-Agent Analysis

Proposes a multi-agent framework for localizing integration defects in LLM-integrated software using code knowledge graphs enriched with LLM-aware annotations and counterfactual re…

Agent Tooling 2601.05539 notes →

AT²PO: Agentic Turn-based Policy Optimization via Tree Search

Proposes a unified framework for multi-turn agentic RL that uses a turn-level tree structure for entropy-guided exploration, turn-wise credit assignment, and turn-based policy opti…

Agent Tooling 2601.04767 notes →

M-ASK: Multi-Agent Search and Knowledge Optimization Framework

Proposes a framework that decouples agentic search into Search Behavior Agents and Knowledge Management Agents with turn-level rewards for multi-hop QA.

Agent Tooling 2601.04703 notes →

AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering

Reframes agent self-improvement as a release engineering pipeline with implementation-blind quality signals, symptom-level diagnosis, and flip-centered regression gating.

Agent Tooling 2601.04620 notes →

4D-ARE: 4-Dimensional Attribution-Driven Agent Requirements Engineering

Proposes an attribution-driven requirements engineering methodology for specifying what domain knowledge LLM agents need at design time, organized along four causal dimensions.

Agent Tooling 2601.04556 notes →

XGrammar 2: Dynamic and Efficient Structured Generation Engine for Agentic LLMs

Proposes a structured generation engine for agentic LLMs with dynamic tag dispatching, JIT compilation, and cross-grammar caching for tool calling and conditional structured genera…

Agent Tooling 2601.04426 notes →

Transitive Expert Error and Routing Problems in Complex AI Systems

Formalizes transitive expert error in AI routing architectures including MoE, multi-model orchestration, and tool-using agents, proposing boundary-aware calibration and coverage ga…

Agent Tooling 2601.04416 notes →

O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL

Introduces a multi-agent workflow for synthesizing research-grade training data with a two-stage SFT plus agentic RL strategy for open-source deep research models.

Agent Tooling 2601.03743 notes →

Architecting Agentic Communities using Design Patterns

Proposes design patterns for architecting agentic communities derived from enterprise distributed systems standards, covering coordination, governance, and formal collaboration agr…

Agent Tooling 2601.03624 notes →

SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

Proposes a skill-conditioned RL framework for tool-using agents that grounds reward modeling in a library of skill prototypes for mid-level credit assignment.

Agent Tooling 2601.03555 notes →

Enhancing Model Context Protocol (MCP) with Context-Aware Server Collaboration

Proposes a Context-Aware MCP architecture with a Shared Context Store that enables MCP servers to coordinate autonomously by reading from and writing to shared context memory.

Agent Tooling 2601.11595 notes →

Enhancing LLM Instruction Following: An Evaluation-Driven Multi-Agentic Workflow for Prompt Instructions Optimization

Proposes a multi-agentic workflow that decouples optimization of primary task descriptions from constraint optimization using quantitative feedback for iterative prompt refinement.

Agent Tooling 2601.03359 notes →

InfiAgent: An Infinite-Horizon Framework for General-Purpose Autonomous Agents

Proposes a general-purpose agent framework that keeps reasoning context bounded regardless of task duration by externalizing persistent state into a file-centric state abstraction.

Agent Tooling 2601.03204 notes →

The Path Ahead for Agentic AI: Challenges and Opportunities

Surveys agentic AI architectures covering planning, memory, tool use, and iterative reasoning with a critical assessment of safety, alignment, and reliability challenges.

Agent Tooling 2601.02749 notes →

AMER-RCL: Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices

Proposes a recursive reasoning framework enhanced with agentic memory for root cause localization, featuring cross-alert memory reuse and multi-agent recursive refinement.

Agent Tooling 2601.02732 notes →

Orchestral AI: A Framework for Agent Orchestration

Introduces a lightweight Python framework providing a unified, type-safe interface for building LLM agents across multiple providers with tool calling, memory management, and MCP i…

Agent Tooling 2601.02577 notes →

AI Agent Systems: Architectures, Applications, and Evaluation

Surveys AI agent architectures spanning reasoning, planning, tool calling, orchestration patterns, and deployment settings with a unified taxonomy of agent components and design tr…

Agent Tooling 2601.01743 notes →

CaveAgent: Transforming LLMs into Stateful Runtime Operators

Proposes a dual-stream architecture that elevates the persistent Python runtime as the central locus of agent state, with stateful runtime management and skill injection for long-h…

Agent Tooling 2601.01569 notes →

Actively Obtaining Environmental Feedback for Autonomous Action Evaluation Without Predefined Measurements

Proposes an active feedback model where AI agents proactively interact with the environment to discover and verify feedback without relying on predefined measurements.

Agent Tooling 2601.04235 notes →

Warp-Cortex: An Asynchronous, Memory-Efficient Architecture for Million-Agent Cognitive Scaling on Consumer Hardware

Proposes an asynchronous architecture for million-agent scaling that reduces memory complexity via singleton weight sharing and topological synapse-inspired KV-cache sparsification…

Agent Tooling 2601.01298 notes →

Internal Safety Collapse in Frontier Large Language Models

Reveals that AI agents produce harmful content (toxic text, exploits, dangerous data) as a side effect of completing normal professional tasks, with no adversarial prompting needed. At…

AI Agent Security 2603.23509 notes →

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

Trains an LLM to generate RAG poison that survives real-world content processing and query variation for stress-testing RAG defenses.

AI Agent Security 2602.06616 notes →

Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study

Analyzes 98K agent skills from community registries to study the prevalence and nature of malicious third-party agent plugins.

AI Agent Security 2602.06547 notes →

Subgraph Reconstruction Attacks on Graph RAG Deployments with Practical Defenses

Investigates whether attackers can reconstruct knowledge graphs from Graph RAG outputs through multi-turn probing.

AI Agent Security 2602.06495 notes →

Zero-Trust Runtime Verification for Agentic Payment Protocols

Proposes consume-once mandate semantics for AI agent payment protocols to prevent replay and redirect attacks in autonomous transactions.

AI Agent Security 2602.06345 notes →

Identifying Adversary Tactics and Techniques in Malware Binaries with an LLM Agent

Explores using an LLM agent to identify attack techniques in stripped malware binaries through incremental context retrieval.

AI Agent Security 2602.06325 notes →

Agent2Agent Threats in Safety-Critical LLM Assistants: A Human-Centric Taxonomy

Maps attack paths in agent-to-agent communication protocols for automotive LLM assistants, from driver distraction to unauthorized vehicle control.

AI Agent Security 2602.05877 notes →

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Explores using reinforcement learning to auto-generate prompt injection attacks that transfer across multiple frontier LLMs.

AI Agent Security 2602.05746 notes →

A Dual-Loop Agent Framework for Automated Vulnerability Reproduction

Proposes an LLM agent with dual feedback loops for strategy and code to automate vulnerability reproduction from CVE descriptions.

AI Agent Security 2602.05721 notes →

Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework

Organizes agentic security risks into four layers (Core, Connection, Cognition, Compliance) to address trust and governance issues beyond prompt injection.

AI Agent Security 2602.01942 notes →

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Proposes a co-evolving RL game between an attacker and defender agent to stress-test safety alignment against novel attack patterns.

AI Agent Security 2602.01539 notes →

TxRay: Agentic Postmortem of Live Blockchain Attacks

Introduces an LLM agentic system that reconstructs blockchain exploit lifecycles from limited evidence and generates runnable proof-of-concept reproductions.

AI Agent Security 2602.01317 notes →

To Defend Against Cyber Attacks, We Must Teach AI Agents to Hack

Argues that AI-agent-driven cyber attacks are inevitable and proposes building frontier offensive AI capabilities responsibly as essential defensive infrastructure.

AI Agent Security 2602.02595 notes →

SMCP: Secure Model Context Protocol

Proposes protocol-level security improvements for the Model Context Protocol including unified identity management, mutual authentication, and fine-grained policy enforcement.

AI Agent Security 2602.01129 notes →

Persuasion Propagation in LLM Agents

Investigates how user persuasion during conversation can carry over and change how autonomous AI agents perform later tasks.

AI Agent Security 2602.00851 notes →

When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems

Explores how collective false memories form in LLM-based multi-agent systems and proposes defenses including cognitive anchoring and alignment-based approaches.

AI Agent Security 2602.00428 notes →

"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval

Proposes a black-box attack method that generates transferable adversarial tokens to manipulate LLM-based retrieval systems without needing access to the target's queries or model.

AI Agent Security 2602.00364 notes →

Introduces CacheAttack, a black-box framework that exploits the trade-off between locality and collision resistance in semantic caching to hijack LLM responses and manipulate agent…

AI Agent Security 2601.23088 notes →

TessPay: Verify-then-Pay Infrastructure for Trusted Agentic Commerce

Proposes a verify-then-pay infrastructure for agent transactions that locks funds in escrow, requires cryptographic proof of task execution, and releases payment only after verific…

AI Agent Security 2602.00213 notes →

Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection

Red-teams Google's Agent Payments Protocol via prompt injection attacks that manipulate product ranking and extract sensitive user data in agent-led purchase flows.

AI Agent Security 2601.22569 notes →

StepShield: When, Not Whether to Intervene on Rogue Agents

Introduces a benchmark for evaluating when agent violations are detected during execution rather than just whether, with temporal metrics for early intervention and tokens saved.

AI Agent Security 2601.22136 notes →

Delegation Without Living Governance

Argues that static compliance-based governance is insufficient for agentic AI at machine speed and proposes runtime governance to preserve human relevance in agent-driven decision-…

AI Agent Security 2601.21226 notes →

DRAINCODE: Stealthy Energy Consumption Attacks on Retrieval-Augmented Code Generation via Context Poisoning

Introduces an adversarial attack that poisons retrieval contexts in RAG-based code generation to force longer outputs, increasing GPU latency and energy consumption.

AI Agent Security 2601.20615 notes →

Securing AI Agents in Cyber-Physical Systems: A Survey of Environmental Interactions, Deepfake Threats, and Defenses

Surveys security threats targeting AI agents in cyber-physical systems, covering deepfake attacks, MCP-mediated vulnerabilities, and defense-in-depth architectures.

AI Agent Security 2601.20184 notes →

Multimodal Multi-Agent Ransomware Analysis Using AutoGen

Explores AutoGen-based multi-agent coordination with specialized agents for static, dynamic, and network-level ransomware family classification using confidence-aware decisions.

AI Agent Security 2601.20346 notes →

SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks

Introduces a multi-agent auto-healing defense framework with semantic similarity retrieval, pattern matching, and an evolving knowledge base for defending LLMs against resource exha…

AI Agent Security 2601.19174 notes →

AgenticSCR: An Autonomous Agentic Secure Code Review for Immature Vulnerabilities Detection

Explores agentic AI for pre-commit secure code review that uses autonomous decision-making, tool invocation, and security-focused semantic memories to detect immature vulnerabiliti…

AI Agent Security 2601.19138 notes →

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Introduces a three-dimensional taxonomy for agentic risks and a diagnostic guardrail framework that monitors agent trajectories with fine-grained root cause analysis beyond binary …

AI Agent Security 2601.18491 notes →

When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

Examines how benign personal memories in personalized agents can bias intent inference and cause models to legitimize harmful queries through a previously unexplored safety vector.

AI Agent Security 2601.17887 notes →

Multi-Agent Collaborative Intrusion Detection for LAE-IoT

Proposes a multi-agent collaborative framework with specialized LLM-enhanced agents for intelligent data processing and adaptive intrusion classification in aerial IoT networks.

AI Agent Security 2601.17817 notes →

Faramesh: A Protocol-Agnostic Execution Control Plane for Autonomous Agent Systems

Introduces a protocol-agnostic execution control plane for autonomous agents that enforces authorization boundaries with canonical action representation and deterministic policy ev…

AI Agent Security 2601.17744 notes →

A Systemic Evaluation of Multimodal RAG Privacy

Examines privacy risks in multimodal RAG pipelines through inclusion inference and metadata leakage attacks during standard model prompting.

AI Agent Security 2601.17644 notes →

Breaking the Protocol: Security Analysis of the Model Context Protocol Specification

Presents the first security analysis of the Model Context Protocol specification, identifying three protocol-level vulnerabilities and proposing backward-compatible security extens…

AI Agent Security 2601.17549 notes →

Prompt Injection Attacks on Agentic Coding Assistants: A Systematic Analysis

Surveys 78 studies to systematize prompt injection attacks on agentic coding assistants with a three-dimensional taxonomy across delivery vectors, modalities, and propagation.

AI Agent Security 2601.17548 notes →

Connect the Dots: Knowledge Graph-Guided Crawler Attack on Retrieval-Augmented Generation Systems

Introduces RAGCrawler, a knowledge graph-guided attack that adaptively steals RAG corpus content through targeted queries to maximize coverage under a query budget.

AI Agent Security 2601.15678 notes →

Securing LLM-as-a-Service for Small Businesses: An Industry Case Study of a Distributed Chatbot Deployment Platform

Presents a multi-tenant chatbot deployment platform with container-based isolation and platform-level defenses against prompt injection attacks in RAG-based systems.

AI Agent Security 2601.15528 notes →

Interoperable Architecture for Digital Identity Delegation for AI Agents with Blockchain Integration

Introduces delegation grants and a canonical verification context for bounded, auditable identity delegation across human users and AI agents in heterogeneous identity ecosystems.

AI Agent Security 2601.14982 notes →

INFA-Guard: Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems

Proposes an infection-aware defense framework for multi-agent systems that distinguishes infected agents from attackers and applies topological constraints to halt malicious propag…

AI Agent Security 2601.14667 notes →

Query-Efficient Agentic Graph Extraction Attacks on GraphRAG Systems

Proposes AGEA, an agentic framework using novelty-guided exploration and graph memory to steal latent entity-relation graphs from GraphRAG systems under strict query budgets.

AI Agent Security 2601.14662 notes →

NeuroFilter: Privacy Guardrails for Conversational LLM Agents

Introduces activation-space guardrails that detect privacy-violating intent in LLM agents through linear separation of internal representations, including drift detection across mu…

AI Agent Security 2601.14660 notes →

VirtualCrime: Evaluating Criminal Potential of Large Language Models via Sandbox Simulation

Proposes a three-agent sandbox simulation framework with 40 crime tasks across 13 objectives to evaluate the criminal capabilities of LLM agents in realistic scenarios.

AI Agent Security 2601.13981 notes →

PINA: Prompt Injection Attack against Navigation Agents

Introduces an adaptive prompt injection framework targeting navigation agents under black-box, long-context, and action-executable constraints across indoor and outdoor environment…

AI Agent Security 2601.13612 notes →

Prompt Injection Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Explores a multi-agent defense pipeline combining semantic similarity caching, nested learning, and observability-aware evaluation to mitigate prompt injection attacks while reduci…

AI Agent Security 2601.13186 notes →

CODE: A Contradiction-Based Deliberation Extension Framework for Overthinking Attacks on Retrieval-Augmented Generation

Introduces an overthinking attack framework for RAG systems with reasoning models, using multi-agent-constructed poisoning samples that cause excessive reasoning token consumption …

AI Agent Security 2601.13112 notes →

AgenTRIM: Tool Risk Mitigation for Agentic AI

Introduces a framework for detecting and mitigating tool-driven agency risks through offline interface verification and runtime per-step least-privilege tool access with adaptive f…

AI Agent Security 2601.12449 notes →

Efficient Privacy-Preserving Retrieval Augmented Generation with Distance-Preserving Encryption

Proposes a privacy-preserving RAG framework using conditional approximate distance-comparison-preserving encryption that enables similarity computation on encrypted embeddings in u…

AI Agent Security 2601.12331 notes →

Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Proposes a mandatory access control framework for LLM agent systems that monitors agent-tool interactions via information flow graphs and enforces attribute-based policies against …

AI Agent Security 2601.11893 notes → 💬 Tier 2 (supplementary). MAC framework for various types of privilege escalation. 2603.19469 …

Institutional AI: Governing LLM Collusion in Multi-Agent Cournot Markets via Public Governance Graphs

Introduces governance graphs as public, immutable manifests with enforceable sanctions and restorative paths to govern multi-agent LLM coordination and prevent harmful collusion.

AI Agent Security 2601.11369 notes →

SD-RAG: A Prompt-Injection-Resilient Framework for Selective Disclosure in Retrieval-Augmented Generation

Proposes a prompt-injection-resilient RAG framework that decouples security enforcement from generation by applying sanitization and policy-aware disclosure controls during the ret…

AI Agent Security 2601.11199 notes →

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Introduces a stealthy multi-turn economic DoS attack exploiting the agent-tool communication loop through MCP-compatible tool server modifications that inflate costs by up to 658x.

AI Agent Security 2601.10955 notes →

Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG

Introduces a benchmark and harness for evaluating web-facing RAG systems under indirect prompt injection and retrieval poisoning attacks with standardized end-to-end evaluation fro…

AI Agent Security 2601.10923 notes →

Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment

Introduces a neuro-symbolic containment architecture that decouples normative reasoning from instrumental decision-making through a Moral Module, Decision-Making Module, and compli…

AI Agent Security 2601.10520 notes →

AgentGuardian: Learning Access Control Policies to Govern AI Agent Behavior

Presents a security framework that learns context-aware access-control policies from monitored execution traces to govern AI agent operations and detect malicious inputs while pres…

AI Agent Security 2601.10440 notes →

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Analyzes 42,447 agent skills from two major marketplaces to study the prevalence and types of security vulnerabilities spanning prompt injection, data exfiltration, privilege escal…

AI Agent Security 2601.10338 notes →

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Proposes single-shot planning for Computer Use Agents that provides provable control flow integrity against prompt injection while preserving agent capability.

AI Agent Security 2601.09923 notes →

Blue Teaming Function-Calling Agents

Tests open-source function-calling LLMs against multiple attack types with various defenses to study the readiness of current models and mitigations for production deployment.

AI Agent Security 2601.09292 notes →

Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Examines how commercial planning and web-use agents handle user-mediated attacks where the user themselves provides adversarial instructions without explicit safety requests.

AI Agent Security 2601.10758 notes →

Semantic Laundering in AI Agent Architectures: Why Tool Boundaries Do Not Confer Epistemic Warrant

Formalizes how propositions gain unwarranted trust by crossing architecturally trusted interfaces in agent systems, studying whether circular epistemic justification is inevitable …

AI Agent Security 2601.08333 notes →

Towards Verifiably Safe Tool Use for LLM Agents

Proposes applying System-Theoretic Process Analysis to identify hazards in agent tool-use workflows, deriving formal safety specifications enforced through a capability-enhanced Mo…

AI Agent Security 2601.08012 notes →

MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP

Introduces an automated framework for implicit tool poisoning in MCP where a poisoned tool remains uninvoked but its metadata manipulates the agent into performing malicious operat…

AI Agent Security 2601.07395 notes →

Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems

Proposes a black-box attack that decomposes indirect prompt injection into trigger and attack fragments to study end-to-end IPI exploits under natural queries across RAG and agenti…

AI Agent Security 2601.07072 notes →

MemTrust: A Zero-Trust Architecture for Unified AI Memory System

Proposes a hardware-backed zero-trust architecture for AI memory systems that applies TEE protection across five functional layers with a cross-application sharing protocol for age…

AI Agent Security 2601.07004 notes →

SafePro: Evaluating the Safety of Professional-Level AI Agents

Introduces a benchmark for evaluating safety alignment of AI agents performing professional-level tasks across diverse domains, uncovering new unsafe behaviors in complex professio…

AI Agent Security 2601.06663 notes →

Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset

Demonstrates that off-the-shelf LLM agents with web search can re-identify participants in anonymized qualitative datasets using only natural-language prompts, lowering the technic…

AI Agent Security 2601.05918 notes →

Toward Safe and Responsible AI Agents: A Three-Pillar Model for Transparency, Accountability, and Trustworthiness

Proposes a conceptual and operational framework for safe AI agent development grounded in transparency, accountability, and trustworthiness, with progressive validation analogous t…

AI Agent Security 2601.06223 notes →

VIGIL: Defending LLM Agents Against Tool Stream Injection via Verify-Before-Commit

Proposes a verify-before-commit protocol for defending LLM agents against tool stream injection, using speculative hypothesis generation and intent-grounded verification to balance…

AI Agent Security 2601.05755 notes →

Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Evaluates memory poisoning attacks on memory-augmented LLM agents and proposes two defense mechanisms: input/output moderation with composite trust scoring and memory sanitization …

AI Agent Security 2601.05504 notes →

STELP: Secure Transpilation and Execution of LLM-Generated Programs

Proposes a secure transpiler and executor for LLM-generated code that detects vulnerabilities and safely executes code snippets in autonomous production AI systems without relying …

AI Agent Security 2601.05467 notes →

Conformity and Social Impact on AI Agents

Investigates conformity bias in AI agents under social pressure using adapted visual experiments from social psychology, studying sensitivity to group size, unanimity, task difficu…

AI Agent Security 2601.05384 notes →

Defense Against Indirect Prompt Injection via Tool Result Parsing

Proposes a tool result parsing method for defending LLM agents against indirect prompt injection by providing precise data while filtering out injected malicious code.

AI Agent Security 2601.04795 notes →

Autonomous Agents on Blockchains: Standards, Execution Models, and Trust Boundaries

Surveys agent-blockchain interoperability patterns and threat models for agent-driven transaction pipelines, covering custody models, policy enforcement, and multi-agent workflows.

AI Agent Security 2601.04583 notes →

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Proposes a stage-aware framework for analyzing backdoor attacks across planning, memory, and tool-use stages of LLM agent workflows with cross-stage trigger propagation.

AI Agent Security 2601.04566 notes →

HoneyTrap: Deceiving LLM Attackers with Resilient Multi-Agent Defense

Proposes a deceptive defense framework using collaborative defender agents to counter multi-turn jailbreak attacks by strategically wasting attacker resources.

AI Agent Security 2601.04034 notes →

SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems

Systematizes privacy risks, mitigation techniques, and evaluation strategies in RAG systems through a comprehensive literature review with a taxonomy and process diagram.

AI Agent Security 2601.03979 notes →

AgentMark: Utility-Preserving Behavioral Watermarking for Agents

Proposes a behavioral watermarking framework that embeds multi-bit identifiers into agent planning decisions for IP protection and regulatory provenance while preserving utility.

AI Agent Security 2601.03294 notes →

Structural Representations for Cross-Attack Generalization in AI Agent Threat Detection

Proposes structural tokenization that encodes execution-flow patterns instead of conversational content to improve cross-attack generalization in AI agent threat detection.

AI Agent Security 2601.01723 notes →

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Introduces a cognitive collusion attack where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels without covert commun…

AI Agent Security 2601.01685 notes →

MCP-SandboxScan: WASM-based Secure Execution and Runtime Analysis for MCP Tools

Proposes a lightweight framework that safely executes untrusted MCP tools inside a WebAssembly sandbox and produces auditable reports of external-to-sink exposures.

AI Agent Security 2601.01241 notes →

Harm in AI-Driven Societies: An Audit of Toxicity Adoption on Chirper.ai

Analyzes toxicity adoption dynamics among LLM-driven agents on a fully AI-driven social platform, studying how cumulative toxic exposure affects the probability of toxic responses.

AI Agent Security 2601.01090 notes →

Trajectory Guard: A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI

Proposes a Siamese Recurrent Autoencoder with hybrid contrastive-reconstruction loss for real-time anomaly detection in agent action trajectories.

AI Agent Security 2601.00516 notes →

Mapping Human Anti-collusion Mechanisms to Multi-agent AI

Maps human anti-collusion mechanisms including sanctions, leniency, monitoring, and market design to potential interventions for multi-agent AI systems.

AI Agent Security 2601.00360 notes →

Making Theft Useless: Adulteration-Based Protection of Proprietary Knowledge Graphs in GraphRAG Systems

Proposes a data adulteration framework that pre-emptively injects plausible but false entries into knowledge graphs to make stolen GraphRAG KGs unusable to adversaries.

AI Agent Security 2601.00274 notes →

When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

Examines intergroup bias in LLM agents under minimal group cues and formalizes a Belief Poisoning Attack that manipulates agent identity beliefs to induce outgroup bias toward huma…

AI Agent Security 2601.00240 notes →

Why Do Multi-Agent LLM Systems Fail?

Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understa…

Multi-Agent 2503.13657 notes → 💬 Tier 1 must-read. MAST failure-mode taxonomy: a paper that structures multi-agent failures. K…

Survey on Evaluation of LLM-based Agents

LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the f…

Eval & Observability 2503.16416 notes → 💬 Tier 1. A full map of the agent evaluation landscape. Trajectory evaluation, MCP Atlas, Too…

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attem…

Eval & Observability 2603.29231 notes → 💬 Tier 1. A long-horizon framework that treats reliability as a science. Kyle's control-t…

Building Effective Agents (Anthropic Engineering Blog)

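Anthropic engineering blog. Warm-up material that builds vocabulary and intuition, from a production practitioner's perspective, for topics such as workflows vs. agents and orchestration patterns. Read this before the papers.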

Agent Tooling blog/anthropic/build notes → 💬 Tier 0: not a paper. A warm-up before reading agent papers. Short and figure-heavy. Practical …

Demystifying Evals for AI Agents (Anthropic Blog)

Anthropic blog. Provides intuition for why agent evaluation is hard. Reading this first for context makes the Tier 2 evaluation papers much easier to follow.

Eval & Observability blog/anthropic/demys notes → 💬 Tier 0: not a paper. For building intuition about agent evaluation. Tier 2 evaluation papers (2503.…

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, ch…

Memory & RAG 2604.16548 notes → 💬 Tier 2 must-read (for those working on loom). Formalizes memory governance as 5 primitives:…

Runtime Governance for AI Agents: Policies on Paths

AI agents -- systems that plan, reason, and act using large language models -- produce non-deterministic, path-dependent behavior that cannot be fully governed at design time, wher…

AI Agent Security 2603.16586 notes → 💬 Tier 2. A theoretical neighbor of clawpatrol/ClawFleet. Policies on paths…

A Framework for Formalizing LLM Agent Security

Security in LLM agents is inherently contextual. For example, the same action taken by an agent may represent legitimate behavior or a security violation depending on whose instruc…

AI Agent Security 2603.19469 notes → 💬 Tier 2. A framework for formalizing LLM agent security. grant-TTL/approval …

Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks

Long-context Large Language Models, despite their expanded capacity, require careful working memory management to mitigate attention dilution during long-horizon tasks. Yet existin…

Memory & RAG 2510.12635 notes → 💬 Tier 2. Memory as action: autonomous context curation in long-horizon tasks. …

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fai…

Eval & Observability 2606.01365 notes → 💬 Tier 2. Early diagnosis of wasted computation in multi-agent systems: failure-aware observability. The run is recoverab…

Mathematical modelling of flow and adsorption in a gas chromatograph

In this paper, a mathematical model is developed to describe the evolution of the concentration of compounds through a gas chromatography column. The model couples mass balances an…

Multi-Agent 2501.00001 notes →

Agent-as-Judge for Factual Summarization of Long Narratives

Large Language Models (LLMs) have demonstrated near-human performance in summarization tasks based on traditional metrics such as ROUGE and BERTScore. However, these metrics do not…

Multi-Agent 2501.09993 notes →

A classification of restrictive polynomial correspondences

In this manuscript, we study a special class of correspondences on $\mathbb{P}^{1} \times \mathbb{P}^{1}$ given by a polynomial relation, say $P(z, w)$. We focus on what we call re…

Multi-Agent 2503.00001 notes →