Techniques to improve the performance of AI on complex tasks
January 10th, 2026
LLM-enabled performance improvement efforts have decidedly moved from a focus on “prompt engineering” to agentic AI systems: AI that can autonomously plan, act and iterate towards user-defined goals.
An Andreessen Horowitz analysis (Dec 2025) of 100+ trillion tokens of production data from OpenRouter shows that "agentic inference" is the fastest-growing behaviour: developers build workflows in which models act in extended sequences rather than responding to single prompts. "The competitive frontier is no longer only about accuracy or benchmarks. It is about orchestration, control, and a model's ability to operate as a reliable agent".
Whilst definitions vary in terms of whether agentic AI refers to single or multi-agent systems (IBM’s definition allows for either), here we follow the taxonomy provided by the paper “AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges” (May 2025):
AI Agents are modular, single-entity systems optimised for bounded task execution via tool augmentation with three core characteristics:
Autonomy within specific task scope.
Task-specificity / narrow, well-defined operations.
Reactivity with limited adaptation.
Agentic AI refers to multi-agent architectures characterised by:
Dynamic task decomposition: goals parsed and distributed across agent networks.
Persistent memory: episodic, semantic and vector-based memory that spans task cycles.
Coordinated autonomy: centralised or decentralised protocols.
Inter-agent communication: async messaging queues, shared memory buffers, intermediate output exchanges.
Single AI agents suffer from hallucination, prompt sensitivity, difficulty with multi-step reasoning and causal understanding, and limited context windows, so agentic systems and multi-step workflows are needed to apply AI to hard problems reliably or to conduct deep research over a long time horizon.
Multi-agent systems face their own challenges. Below we explore issues and techniques being adopted to address them as reported in the literature.
Owing to the rapid pace of development, some findings below may not apply to current and future models, but provide a guide as to what to explore and test for.
As with any experimental findings, the fact that a given technique was found to be better under some conditions does not mean that it will be so under other conditions, and no approach should be adopted blindly without proper evaluation.
A survey of multi-agent collaboration (Jan 2025) provides an overview of techniques in setting up agentic systems. The paper introduces a five-dimensional framework that characterises the key aspects in agent collaboration:
Actors: the specific agents involved in collaboration.
Types: 3 primary interaction modes:
Cooperation: shared goals where agents work together.
Competition: opposing objectives.
Coopetition: mixed dynamics combining both.
Structures: 3 main organisational topologies.
Centralised: “supervisor-based”, collaboration decision is concentrated in a central agent.
Decentralised: “peer-to-peer” / distributed, collaboration decision is distributed amongst multiple agents.
Hierarchical: “layered authority”, agents are arranged in a layered system with distinct roles and levels of authority.
Strategies: ways of defining agent behaviour and cooperation, which are not mutually exclusive and may well all be used in a single system:
Role-based: agents are assigned specialised roles (e.g. “coder”, “reviewer”), and the coordination / handoff between agents is based on this division of labour and role boundaries.
Rule-based: a predefined set of rules dictates how and when agents can collaborate; for example, an agent must always escalate to human approval if its confidence is below 70% (sketched in code after this list).
Model-based: agents maintain internal models of the world and of other agents, and make probabilistic decisions based on their perception of the environment, common objectives, and inherent uncertainties.
Coordination Protocols: communication mechanisms and interaction patterns.
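As an illustration of a rule-based strategy, here is a minimal sketch in Python of the escalation rule mentioned above; the confidence threshold, the `AgentResult` structure and the routing labels are illustrative assumptions rather than anything prescribed in the survey.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.70  # illustrative rule: escalate below 70% confidence

@dataclass
class AgentResult:
    answer: str
    confidence: float  # assumed to be self-reported or estimated by a verifier

def route_result(result: AgentResult) -> str:
    """Rule-based coordination: a predefined rule decides when a human gets involved."""
    if result.confidence < CONFIDENCE_THRESHOLD:
        return "escalate_to_human"   # hypothetical downstream handler
    return "pass_to_next_agent"      # e.g. hand off to a reviewer agent

print(route_result(AgentResult(answer="deploy v2", confidence=0.55)))  # escalate_to_human
```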
Setting up agents with distinct personas (e.g. planner, coder, reviewer) outperforms single monolithic models and is commonly implemented in practice, with examples cited in the literature including (a minimal pipeline sketch follows this list):
“Multi-Agent Drug Discovery Orchestra” (Nov 2025), a recent paper in AI-assisted drug discovery, found that coordinating 4 agents to handle key subtasks "significantly outperforms single models" in hit identification.
MAKER framework (Nov 2025) takes a different approach from conventional role-based agents, using "microagents" that each handle a single atomic subtask and whose role is defined by the specific step being executed rather than by a defined expertise (explored in more detail under Task decomposition below).
DS-STAR (Sep 2025) automates data science workflows with a multi-agent system (planner, coder, verifier, router), showing a 4% accuracy improvement over previous state-of-the-art benchmark results.
MetaGPT (Nov 2024) formalises role-based protocols by encoding Standard Operating Procedures (SOPs), where each agent's role is defined by expert-level knowledge, allowing agents to act as specialised operators that verify each other's results. The paper states the protocol reduces cascading errors by modularising task distribution and outputs coherent results even in complex projects.
ChatDev (Jun 2024) is a "chat-based" end-to-end software development framework with specialised agents and a "chat-chain" architecture that decomposes each phase (design, code, test, document) into atomic subtasks, showing promising results in reducing code hallucinations.
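A minimal sketch of such a role-based pipeline (planner, coder, reviewer) is shown below. The `call_llm` helper is a placeholder for whichever model API is in use, and the role prompts, handoff format and review loop are illustrative assumptions rather than the design of any of the frameworks cited above.

```python
# Role-based pipeline sketch: planner -> coder -> reviewer, with a bounded
# review loop. All prompts and the APPROVED convention are assumptions.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider")

ROLES = {
    "planner": "You are a planner. Break the task into numbered implementation steps.",
    "coder": "You are a coder. Implement the plan you are given. Output code only.",
    "reviewer": "You are a reviewer. List defects in the code, or reply APPROVED.",
}

def run_pipeline(task: str, max_review_rounds: int = 2) -> str:
    plan = call_llm(ROLES["planner"], task)
    code = call_llm(ROLES["coder"], f"Task: {task}\nPlan:\n{plan}")
    for _ in range(max_review_rounds):
        review = call_llm(ROLES["reviewer"], f"Task: {task}\nCode:\n{code}")
        if review.strip().upper().startswith("APPROVED"):
            break
        # Hand the reviewer's feedback back to the coder for another pass.
        code = call_llm(ROLES["coder"],
                        f"Task: {task}\nCode:\n{code}\nFix these issues:\n{review}")
    return code
```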
A growing area of research in agentic AI is dynamic agent generation as the coordination architecture, i.e. generating custom agents based on the nature of the task rather than relying on static predefined teams, so that the agent assigned to the task is uniquely tailored to the problem.
Dynamic Real-Time Agent Generation (DRTAG) (Sep 2025) uses a "conversation manager" agent that dynamically generates new LLM agents with prompt engineering techniques as conversations with users evolve. The paper directly compares this dynamic approach against existing static methods and reports that it outperforms them on all 4 evaluation metrics.
MegaAgent (May 2025) achieved a large-scale multi-agent system with no pre-defined worker agents by combining task decomposition and recursive agent generation. The framework has a "boss" agent that recursively spawns and coordinates task-specific agents when the work needs to be decomposed into sub-tasks. The agents' roles, descriptions and workflows are generated based on task requirements rather than pre-defined.
AutoAgents (Apr 2024) is a framework that generates and coordinates specialised agents to build a team based on incoming tasks. Some key findings from the paper include:
The need for collaborative discussion during team formation: including feedback from an “observer” agent (rather than just a “planner”) produces better team compositions (e.g. designers, UI specialists and testers, rather than just coders, for a game development task).
Two refinement mechanisms were used: self-refinement within the agent itself and collaborative refinement through knowledge exchange.
Context limitations are overcome by implementing 3 memory types: short (individual intermediate results), long (cross-agent records) and dynamic (supplementary information extracted on demand).
Dynamic agent generation shows promising results in the literature, though static agent teams remain easier to debug and have lower latency (avoiding the need to generate agents or team compositions at runtime).
Generating agents on demand also depends on the generator itself producing effective agents, and delivers the most benefit when the required specialisations cannot be fully anticipated (if they can, you might as well generate and optimise the agents ahead of time).
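The sketch below illustrates the general idea of dynamic agent generation under some simplifying assumptions: a generator model emits a JSON agent spec (role, system prompt, tools) for the incoming task, and a worker agent is then run from that spec. `call_llm` and the spec format are placeholders, not the mechanism of any specific paper above.

```python
import json

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider")

GENERATOR_PROMPT = (
    "Given a task, output JSON with keys 'role', 'system_prompt' and 'tools' "
    "(a list of tool names) describing the single agent best suited to it."
)

def generate_agent(task: str) -> dict:
    spec = json.loads(call_llm(GENERATOR_PROMPT, task))
    # Guard against the generator producing an unusable spec.
    for key in ("role", "system_prompt", "tools"):
        if key not in spec:
            raise ValueError(f"generator omitted '{key}'")
    return spec

def run_generated_agent(task: str) -> str:
    spec = generate_agent(task)
    # The freshly generated system prompt defines the tailored worker agent.
    return call_llm(spec["system_prompt"], task)
```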
Long-running agents remain one of the hardest challenges in agentic AI systems. A recent post by Anthropic (Nov 2025) highlights 2 common failure modes:
The agent tries to do too much at once, runs out of context mid-task, then has to guess what happened.
The agent considers the project "complete" prematurely, one-shotting an app without achieving production quality.
The solution adopted by Anthropic is to mimic what effective human software engineers already do: structured handoff documentation, incremental commits and progress tracking, using a two-agent approach (the coding agent's startup routine is sketched after this list):
Initialiser agent (first context window only): creates a comprehensive feature requirements file based on the user's high-level prompt, initialises a git repository and a progress tracking file.
Coding agent (subsequent sessions): makes incremental progress in each session whilst leaving clear artifacts for the next session. Each session begins with the same startup process: confirm location with pwd, read progress file, review feature list and run existing tests before starting implementation.
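A minimal sketch of that start-of-session routine is shown below. The file names (`PROGRESS.md`, `FEATURES.md`) and the pytest test command are assumptions; the point is that each session re-establishes state from artefacts on disk before doing any new work, and leaves updated artefacts behind when it ends.

```python
import subprocess
from pathlib import Path

def start_session() -> dict:
    state = {"cwd": str(Path.cwd())}                      # confirm location (pwd)
    state["progress"] = Path("PROGRESS.md").read_text()   # what previous sessions did
    state["features"] = Path("FEATURES.md").read_text()   # the full requirements list
    # Run the existing test suite so regressions are visible before coding starts.
    tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    state["tests_passing"] = tests.returncode == 0
    state["test_output"] = tests.stdout[-2000:]           # keep only the tail for context
    return state

def end_session(summary: str) -> None:
    # Leave clear artefacts for the next session: append progress, commit work.
    with open("PROGRESS.md", "a") as f:
        f.write(f"\n{summary}\n")
    subprocess.run(["git", "add", "-A"])
    subprocess.run(["git", "commit", "-m", summary])
```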
One potential solution for tackling long-horizon failures is self-correction, whereby the model attempts to review and correct its own answers via various mechanisms. Its effectiveness today is limited and depends heavily on the choice of model, the methods used, and whether external feedback is available.
Recent benchmarks (CorrectBench, Oct 2025) found that some self-correction methods improved performance for instruction-based LLMs, whilst others hurt performance significantly. Reasoning models like DeepSeek R1 achieved the best baseline results but showed only marginal improvement from self-correction, likely because they already incorporate internal verification during their extended thinking time.
Research by Tsui (Oct 2025) suggests LLMs possess "blind spots" when carrying out intrinsic self-correction. Using its proposed evaluation framework ("Self-Correction Bench"), the paper found LLMs could successfully correct errors when presented as external inputs but failed to correct identical errors in their own outputs. Across the 14 non-reasoning models tested, the average blind spot rate was 64.5%.
Google DeepMind (Mar 2024) found that self-correction without external feedback can actually hurt performance: when models were asked to review their answers, a large proportion of originally correct answers became incorrect, with the extent varying by model.
A comprehensive review (Dec 2024) on self-correction LLMs concluded that "no prior work demonstrates successful self-correction with feedback from prompted LLMs alone, except for tasks exceptionally suited for self-correction."
Poor self-correction performance in LLMs stems from an inability to find logical mistakes rather than an inability to correct a known mistake. A paper by the University of Cambridge and Google Research (Jun 2024) showed that the models then available achieved only 53% accuracy overall at finding logical errors, and this held true even for objective, unambiguous mistakes that human raters without prior expertise can identify with high agreement. Conversely, corrections were highly effective once the LLMs were told where the errors were.
Xu et al. (Jun 2024) found that models exhibit self-bias, systematically favouring their own outputs regardless of quality, which undermines their ability to objectively evaluate and improve their responses.
There is broad agreement that achieving production-grade results in long-horizon tasks requires an effective harness and architecture, not simply bigger models.
Task decomposition is emerging as a fundamental strategy for tackling long-running tasks. Breaking a problem down into a logical sequence of sub-problems helps to manage context and assure quality by defining boundaries such that (a minimal sketch follows this list):
Each sub-task only sees what it needs, not the entire problem.
The context and requirements for the task are more focused.
Agents require fewer tools and face fewer decisions (if any), and can be highly specialised.
It is easier to evaluate the output.
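The sketch below shows one way to enforce these boundaries: each sub-task declares which earlier outputs it needs, and the executor passes it only those rather than the whole history. `call_llm` and the `SubTask` structure are illustrative assumptions.

```python
from dataclasses import dataclass, field

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider")

@dataclass
class SubTask:
    name: str
    instruction: str
    needs: list[str] = field(default_factory=list)  # names of prior outputs it may see

def run_plan(subtasks: list[SubTask]) -> dict[str, str]:
    outputs: dict[str, str] = {}
    for sub in subtasks:
        # Only the declared dependencies enter the prompt, not the whole history.
        context = "\n\n".join(f"{n}:\n{outputs[n]}" for n in sub.needs)
        outputs[sub.name] = call_llm(
            "Complete only the sub-task you are given.",
            f"{sub.instruction}\n\nRelevant prior outputs:\n{context}",
        )
    return outputs
```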
The MAKER framework (Nov 2025) demonstrated promising results in achieving a million-step task with zero errors through applying task decomposition with verification mechanisms and red flagging:
Maximal Agentic Decomposition (MAD): an agent architecture principle of breaking tasks down into the smallest possible subtask such that there is only one step per agent.
“First-to-ahead-by-k” voting: for each step, the agent generates multiple candidate answers, each counting as one vote for that answer, and continues generating candidates until one answer leads all alternatives by k votes. Combined with MAD, this scales the expected cost of solving the puzzle (measured in LLM calls) log-linearly with the number of steps in a task, making million-step tasks feasible (a sketch of the voting loop, combined with red-flagging, follows the list below).
Red-flagging: automatically discard responses that indicate model confusion, on the premise that a model about to make a logic error often makes a syntax error first or starts rambling. The “red flags” are:
Overly long responses: model is rambling/confused.
Incorrect formatting: as syntax errors correlate with logic errors.
Exceeds token limit: indicates confusion (assuming tasks are properly decomposed and don’t require lengthy output).
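Below is a sketch of the voting and red-flagging loop under some assumptions: `generate_candidate` stands in for a single LLM call for the current step, the expected response format is a Hanoi-style move instruction, and the length threshold is arbitrary.

```python
import re
from collections import Counter
from typing import Callable

MAX_RESPONSE_CHARS = 2_000   # overly long responses suggest rambling/confusion

def red_flagged(response: str) -> bool:
    if len(response) > MAX_RESPONSE_CHARS:
        return True                          # rambling
    # Assumed step format for a Hanoi-style puzzle; syntax errors correlate with logic errors.
    if not re.fullmatch(r"move \d+ from \w+ to \w+", response.strip()):
        return True
    return False

def first_to_ahead_by_k(generate_candidate: Callable[[], str], k: int = 3,
                        max_calls: int = 100) -> str:
    votes: Counter[str] = Counter()
    for _ in range(max_calls):
        response = generate_candidate()
        if red_flagged(response):
            continue                         # discard, don't count as a vote
        votes[response.strip()] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead >= k:
            return ranked[0][0]              # one answer is k votes ahead of all others
    raise RuntimeError("no answer reached a lead of k votes within the call budget")
```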
The paper shows the effectiveness of breaking a problem down into subtasks, but focuses on execution only, assuming the task decomposition is already given and correct. The verification method is also limited to outputs that are deterministic and definitive (e.g. Towers of Hanoi), rather than more open-ended tasks where the correct answer is ambiguous.
So how should we optimally decompose a problem? The AWS Machine Learning Blog (Dec 2025) suggests that an explicit decomposition step should occur before anything else, using the staged architecture of ReWOO (Reasoning Without Observation, May 2023) with 3 specialised agents (sketched after the list):
Planner: produces a strictly formatted program describing tool usage.
Worker: parses the plan, resolves arguments, calls tools and accumulates evidence. Can only execute what the plan authorises.
Solver: reads evidence and synthesises the final answer without calling any tools.
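The sketch below illustrates this staged flow under some assumptions: `call_llm` is a placeholder, the plan syntax (`#E1 = tool[argument]`) is modelled loosely on ReWOO's evidence variables, and the tools are trivial stand-ins. The key property is that the worker only executes what the plan authorises and the solver never calls tools.

```python
import re

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider")

TOOLS = {
    "search": lambda q: f"<results for {q}>",    # stand-in tool implementations
    "calculator": lambda expr: str(eval(expr)),  # illustrative only; never eval untrusted input
}

PLAN_LINE = re.compile(r"#E(\d+)\s*=\s*(\w+)\[(.*)\]")

def run_rewoo(question: str) -> str:
    # Planner: a strictly formatted program describing tool usage.
    plan = call_llm(
        "Plan tool calls as lines of the form '#E1 = tool[arg]'. "
        "Available tools: search, calculator.",
        question,
    )
    # Worker: parse the plan, resolve arguments, call only the authorised tools.
    evidence: dict[str, str] = {}
    for num, tool, arg in PLAN_LINE.findall(plan):
        for ref, value in evidence.items():
            arg = arg.replace(ref, value)        # substitute earlier evidence references
        evidence[f"#E{num}"] = TOOLS[tool](arg)
    # Solver: synthesise the answer from evidence, with no tool access.
    evidence_text = "\n".join(f"{k}: {v}" for k, v in evidence.items())
    return call_llm("Answer using only the evidence provided. Do not call tools.",
                    f"Question: {question}\nEvidence:\n{evidence_text}")
```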
Hori et al. (Feb 2025) highlight that decomposition is essentially an information accumulation problem requiring progressive refinement rather than one-shot attempts. Ambiguous high-level instructions like "make scrambled eggs" lack the details needed for executable action sequences and require further human input. The paper proposes 2 feedback mechanisms (the active mode is sketched below):
Active: model asks users clarifying questions when it's uncertain (e.g. "What temperature should I set the pan to?").
Passive: human spots mistakes and corrects them (e.g. "No, you forgot to crack the eggs first").
Together, these modes increased plan detail by over 50% in the experiments, and having the LLM assess its own ambiguities in active mode reduces the burden on humans to anticipate all the necessary details upfront.
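A minimal sketch of the active mode is shown below: the model lists clarifying questions, a human answers them, and the clarifications are folded back into the context before planning. `call_llm`, `ask_human` and the prompts are illustrative assumptions.

```python
def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider")

def ask_human(question: str) -> str:
    return input(f"{question}\n> ")   # in practice this might be a UI or ticket queue

def plan_with_active_feedback(instruction: str, max_rounds: int = 2) -> str:
    context = instruction
    for _ in range(max_rounds):
        questions = call_llm(
            "List any clarifying questions you need answered before planning, "
            "one per line. Reply NONE if the instruction is fully specified.",
            context,
        )
        if questions.strip().upper() == "NONE":
            break
        # Fold the human's answers back into the working context.
        answers = "\n".join(
            f"Q: {q}\nA: {ask_human(q)}" for q in questions.splitlines() if q.strip()
        )
        context = f"{context}\n\nClarifications:\n{answers}"
    return call_llm("Produce a detailed, executable step-by-step plan.", context)
```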
Challenges in long-running tasks are further compounded when the goal is open-ended research that doesn't follow a predictable path. Here a multi-agent research system pattern can help, such as the one proposed by Anthropic (Jun 2025), which significantly outperformed a single-agent model and includes (sketched after the list):
Lead researcher agent: analyses the query, creates a research plan, saves the plan to memory (to persist across context truncation), then spawns specialised subagents.
Sub-agents: each receives a focused prompt and works in parallel on specific aspects of the research using web search and other tools.
Citation agent: processes all findings to properly attribute claims to sources.
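The sketch below captures the orchestrator-worker shape of this pattern rather than Anthropic's implementation: a lead agent decomposes the query, sub-agents research their sub-questions in parallel with their own focused context, and a final pass adds citations. `call_llm`, `web_search` and the plan format are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider")

def web_search(query: str) -> str:
    raise NotImplementedError("stand-in for whatever search tool is available")

def research(query: str) -> str:
    # Lead researcher: break the query into focused sub-questions.
    plan = call_llm("Break this research query into 3-5 focused sub-questions, one per line.",
                    query)
    sub_questions = [line.strip() for line in plan.splitlines() if line.strip()]

    def run_subagent(sub_q: str) -> str:
        evidence = web_search(sub_q)
        return call_llm("Summarise findings for this sub-question, quoting sources.",
                        f"{sub_q}\n\nSearch results:\n{evidence}")

    # Sub-agents work in parallel, each with its own focused context.
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(run_subagent, sub_questions))

    draft = call_llm("Synthesise these findings into a report.", "\n\n".join(findings))
    # Citation pass: attribute claims to their sources.
    return call_llm("Add citations attributing each claim to its source.", draft)
```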
A key aspect to achieving production-grade agentic AI systems is the ability to assess whether the agents are producing results that satisfy the task’s requirements and goals.
There are various techniques for doing so. They can be used both while developing the system, as fixed tests that evaluate its performance over time against pre-determined tasks (run via frameworks such as OpenAI Evals), and at runtime when evaluating its output on novel tasks.
Human evaluation: remains the most trusted option for assessing qualities that automated methods fail to capture, though costly and subjective.
LLM-as-Judge: uses models to score AI outputs based on some grading criteria, e.g. pointwise scoring or ranking of multiple candidates.
Execution-based verifiers: check outputs for behavioural correctness (e.g. via unit tests).
Static analysis: cheap and fast checks such as type checking, linting, and security scanning (with a long-term dream of stronger and more costly formal verification).
Evaluation stack: combines these approaches into a hierarchy of checks to ensure high-quality outputs.
During development, additional benchmarking techniques such as multiple-choice quizzes can also be used, although these are of no use for novel tasks where the answer is not known, and are not discussed further below.
Human review is still considered the most trusted option for evaluating AI outputs, but it is limited by scalability and inherent subjectiveness. There is also an "oversight paradox" noted by Harvard Business School (Aug 2024) whereby humans tend to over-rely on AI explanations. Any human-in-the-loop interface must therefore encourage users not just to approve but to actively critique AI-generated outputs.
LLM-as-Judge is a widely adopted method for assessing open-ended outputs, typically using a stronger (or at least equal) model to grade outputs via pointwise scoring or pairwise comparison. The choice of protocol matters: Pairwise or Pointwise? (Aug 2025) suggests that pointwise scoring is more reliable, producing more consistent results.
As with human judges, LLMs have biases. Surveys in this area (Dec 2024, Oct 2025) categorise these biases as follows:
Presentation-based:
Position: preferring first or last response.
Verbosity/length: preferring longer responses.
Format: favouring Markdown, bullet points.
Content-based:
Token/lexical: sensitivity to specific words, phrases, or lexical choices.
Sentiment: may prefer positive tones.
Contextual: influence from surrounding context, anchoring.
Cognitive:
Self-enhancement: rating own outputs (or similar model families) higher than others.
Overconfidence: excessive certainty in judgments, poor calibration.
Social:
Scoring-based:
Rubric order: ascending (e.g. 1 to 5) vs descending (e.g. 5 to 1) affects judgment.
Score IDs: format of score labels.
Inherent:
Knowledge recency: outdated training data leads to errors on recent topics.
Hallucination: fabricating reasoning or citing non-existent criteria.
Mitigation strategies, several of which are combined in the sketch after this list, include:
Structured prompting with a scoring guide/rubric and chain-of-thought reasoning (G-EVAL, May 2023).
Provide a detailed framework for how to evaluate rather than just asking “is this output good?”, with explicit criteria to assess (e.g. accuracy, completeness, tone), score definitions (e.g. what a 3 vs a 5 means) and output format requirements (e.g. JSON, specific fields).
Ask the model to explain its reasoning before giving a score, so that it “thinks” through the evaluation and produces more consistent scores.
Randomise positions for pairwise comparisons.
Use more than 1 model as a judge.
A recent NeurIPS conference paper (Oct 2025) proposes a "multi-agent debate judge" framework in which models discuss and iteratively refine reasoning rather than voting independently. The paper formalises the debate through probabilistic modelling of the consensus dynamics and a statistical stopping criterion. This aims to address a limitation of static voting: models can share the same blind spot or bias, so the majority view can still be wrong.
Another method, proposed by MAJ-EVAL (Jul 2025), is to automatically extract evaluator personas from domain documents (e.g. research papers, guidelines) rather than manually defining them. This helps to address the difficulty of designing evaluator personas and of generalising them to other tasks.
Verga et al. (May 2024) propose using a panel of model families (e.g. GPT, Claude, Mistral) instead of a single large model judge, aggregating scores via max or average voting. The paper found that a panel of smaller models helped to reduce intra-model bias, latency and cost.
Use a descending score order (5 to 1), per Evaluating Scoring Bias in LLM-as-a-Judge (Jun 2025).
Adopt non-standard score formats (e.g. letter grades rather than roman numerals).
Include reference answers for each score.
Human review for high-stakes decisions, noting that humans are subjective and that reviewers' standards have been observed to evolve during assessment (EvalGen, Apr 2024).
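Several of these mitigations can be combined in a simple judge harness, sketched below: an explicit rubric in descending score order, reasoning requested before the score, structured JSON output and randomised positions for pairwise comparison. `call_llm` and the rubric wording are illustrative assumptions.

```python
import json
import random

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this up to your model provider")

RUBRIC = """Score the response for accuracy, completeness and tone.
5: fully correct, complete and appropriate in tone.
4: minor omissions or tone issues.
3: partially correct or notably incomplete.
2: largely incorrect or off-task.
1: incorrect and unusable.
First write your reasoning, then output JSON: {"reasoning": "...", "score": <1-5>}."""

def judge_pointwise(task: str, response: str) -> dict:
    raw = call_llm(RUBRIC, f"Task:\n{task}\n\nResponse:\n{response}")
    return json.loads(raw[raw.index("{"):])   # tolerate reasoning text before the JSON

def judge_pairwise(task: str, a: str, b: str) -> str:
    # Randomise presentation order to counter position bias.
    first, second = (a, b) if random.random() < 0.5 else (b, a)
    verdict = call_llm("Say which response better completes the task: FIRST or SECOND.",
                       f"Task:\n{task}\n\nFIRST:\n{first}\n\nSECOND:\n{second}")
    chosen = first if "FIRST" in verdict.upper() else second
    return "A" if chosen is a else "B"
```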
An effective way to test code is to run it. Early benchmarks in AI-assisted software development focused on single-function tasks and adopted metrics such as Pass@k (Jul 2021), which checks whether at least 1 of k generated code samples passes all tests.
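For reference, the unbiased pass@k estimator from that paper can be computed as follows: given n samples per problem of which c pass all tests, the probability that at least one of k randomly drawn samples passes is 1 - C(n-c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c pass all tests."""
    if n - c < k:
        return 1.0              # not enough failing samples to fill a draw of k
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 13 passing, k = 10
print(round(pass_at_k(200, 13, 10), 3))
```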
More recently, there has been a shift towards benchmarks based on production code and entire codebases, designed to be contamination-resistant (i.e. robust to models having seen the solution during training). SWE-bench is an example of this kind of real-world benchmark, where the evaluation dataset is built from real GitHub issues that require patches across complex, multi-file repositories.
Established benchmarks include:
| Benchmarks | Focus |
|---|---|
| LiveCodeBench | Addresses "benchmark saturation" with fresh problems |
| SWE-bench | Real GitHub issue resolution |
| SWE-bench Live | Evolves with repositories, so more contamination-resistant |
| Context-Bench | Long-horizon context management |
| SWT-Bench | Automated software testing |
| Terminal-Bench | Command-line workflows |
| Cline Bench | Local-first repo workflows |
Using production code introduces new evaluation challenges. OpenAI's collaboration (Aug 2024) with the SWE-bench authors found that 68% of the sampled issues were underspecified, had unfair (overly specific) unit tests, or suffered from other problems, causing models to be systematically penalised.
Established benchmarks provide a way of comparing system performance against other systems, but they do not help in evaluating runtime outputs on novel tasks, nor in evaluating performance on tasks that are more complex than, or have special properties (e.g. mathematical complexity) not covered by, the existing suites.
It will therefore be important both to create custom benchmarks that track a system's performance on long-range, complex tasks, and, during live execution, to generate tests that encode the desired behaviour of the task at hand. This generation step can itself be part of the overall process, and the tests should be continuously checked and refined as the work evolves.
Static validation such as schema and type checking gives a cheap and reproducible way to assess outputs before more time-consuming tests, for example (sketched after the list):
Schema/format validation, type checking, linting.
Cyclomatic complexity, measuring code path complexity.
Security scanning tools such as Semgrep and CodeQL.
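A cheap static gate along these lines might look like the sketch below, which assumes ruff, mypy and semgrep are installed on the PATH; substitute whatever linters, type checkers and scanners the stack already uses.

```python
import subprocess

# Each check is a CLI invocation; a non-zero exit code means findings or errors.
CHECKS = {
    "lint": ["ruff", "check", "."],
    "types": ["mypy", "."],
    "security": ["semgrep", "--config", "auto", "."],
}

def static_gate(workdir: str) -> dict[str, bool]:
    results = {}
    for name, cmd in CHECKS.items():
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        results[name] = proc.returncode == 0
    return results

if __name__ == "__main__":
    outcome = static_gate(".")
    print(outcome)
    if not all(outcome.values()):
        raise SystemExit("static checks failed; skip the more expensive evaluation stages")
```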
At the strong and costly end of the static analysis spectrum lies formal verification. This currently requires a great deal of time and expertise for even small programs, let alone complex, distributed and ever-evolving software, though advances in AI may help to make it a viable option for output evaluation.
The aforementioned approaches are not mutually exclusive and should be combined into a layered evaluation stack: cheap static checks first, then execution-based tests, with LLM-as-judge or human review reserved for the outputs that pass the earlier stages.
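In code, the stack can be as simple as a gated cascade, sketched below; the three stage functions are placeholders for the static, execution-based and LLM-as-judge checks described in the preceding sections.

```python
from typing import Callable

# Placeholder stages; wire these up to the checks built earlier.
def static_checks(artifact: str) -> bool: raise NotImplementedError
def execution_tests(artifact: str) -> bool: raise NotImplementedError
def llm_judge_approves(artifact: str) -> bool: raise NotImplementedError

STACK: list[tuple[str, Callable[[str], bool]]] = [
    ("static analysis", static_checks),       # schema/type/lint/security, seconds
    ("execution tests", execution_tests),     # unit/integration tests, minutes
    ("LLM-as-judge", llm_judge_approves),     # open-ended quality, most expensive
]

def evaluate(artifact: str) -> tuple[bool, str]:
    # Cheap checks run first and gate the expensive ones.
    for stage_name, check in STACK:
        if not check(artifact):
            return False, f"failed at: {stage_name}"
    return True, "passed all stages (consider human review for high-stakes changes)"
```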