Skip to content

Research Foundations

Amelia is a multi-agent pipeline that takes a GitHub issue and produces a working fix — an Architect plans, a Developer implements, and a Reviewer gates quality. Each stage traces back to specific research: role specialization from MetaGPT, iterative refinement from AgentCoder and Reflexion, human-in-the-loop gating from HULA, and retrieval-augmented memory from RAPTOR. This document maps those connections.

Research papers come first, followed by industry commentary and the production framework that ties them together.

Research Papers

Multi-Agent Software Engineering

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Hong et al., ICLR 2024

Encodes human Standardized Operating Procedures (SOPs) into LLM-based multi-agent workflows, assigning specialized roles (Product Manager, Architect, Engineer, QA Engineer) that collaborate through an assembly line paradigm. Agents generate structured intermediate outputs — requirements documents, design artifacts, interface specifications — that reduce hallucinations and improve code generation success rates.

Key influence: Amelia's Architect → Developer → Reviewer pipeline mirrors MetaGPT's SOP-encoded role specialization. Structured plan output from the Architect serves as the contract between agents, just as MetaGPT's intermediate artifacts constrain downstream work.

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Huang et al., 2023

Implements a three-agent iterative refinement loop: programmer agent (code generation), test designer agent (test case generation), and test executor agent (execution and feedback). The programmer iteratively refines code based on test execution feedback, achieving 96.3% pass@1 on HumanEval.

Key influence: Direct parallel to the Developer-Reviewer iteration loop. AgentCoder validates that separating generation from evaluation and cycling between them outperforms single-pass generation.

HULA: Human-In-the-Loop Software Development Agents

Atlassian, ICSE SEIP 2025

Industrial framework deployed in JIRA with a three-agent architecture (Planner, Coder, Human). Evaluated on 663 real JIRA issues, achieving 79% plan generation, 82% human approval, and 59% PR merge rate. Engineers review and refine both plans and code before execution.

Key influence: Closest industry validation of Amelia's full flow — plan generation with human approval gates before execution. HULA's results on real issues confirm that human-in-the-loop gating is worth the friction.

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Yang et al., NeurIPS 2024

Introduces a custom agent-computer interface (ACI) designed for LLM agents to navigate repositories, create/edit code, and execute tests. LLM agents benefit from specialized interfaces tailored to their capabilities, not raw terminal access.

Key influence: Amelia's driver abstraction (api vs claude vs codex) reflects the same principle — the interface between agent and environment matters as much as agent capability. Profile-based tool configuration lets each agent get the interface it needs.

Agent Reasoning & Evaluation

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, 2025

Shows that reasoning capabilities emerge from reinforcement learning alone. Introduces GRPO (Group Relative Policy Optimization), which eliminates the critic network by computing advantages relative to group statistics. Shows emergent self-reflection, backtracking, and "aha moment" behaviors from pure RL.

Key influence: Self-verification patterns, rejection sampling for quality filtering (applies to Reviewer), GRPO's group comparison parallels competitive review strategy, multi-stage training pipeline mirrors Architect-Developer-Reviewer flow.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, Chiang et al., NeurIPS 2023

Establishes the framework for using LLMs as evaluators. Strong LLMs achieve over 80% agreement with human experts — on par with inter-expert agreement. Systematically examines position bias, verbosity bias, and self-enhancement bias, proposing mitigations for each.

Key influence: Amelia's Reviewer agent implements LLM-as-a-Judge for automated code review. The paper's bias analysis informs how we structure review prompts — avoiding position-dependent evaluation and calibrating verbosity expectations.

Reflexion: Language Agents with Verbal Reinforcement Learning

Shinn et al., NeurIPS 2023

Introduces verbal reinforcement through self-reflection rather than weight updates. Agents maintain episodic memory of reflective feedback across attempts, achieving 91% on HumanEval. The reflection signal converts binary success/fail into natural language diagnosis that improves the next attempt.

Key influence: The Developer-Reviewer loop is Reflexion in practice — review feedback becomes the verbal reinforcement signal that steers the next development iteration. Each rejection carries structured reasoning, not just pass/fail.

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al., ICLR 2023

Interleaves reasoning traces with task-specific actions, allowing agents to call external tools as they reason. Outperforms baselines on interactive decision-making benchmarks using only 1-2 in-context examples.

Key influence: Foundation for how all Amelia agents interleave planning with tool execution. The reasoning trace pattern appears in the Architect's planning phase and the Developer's implementation steps.

Context Management for Long-Horizon Agents

AgentFold: Long-Horizon Web Agents with Proactive Context Management

Tongji Lab / Alibaba Group, 2025

Introduces multi-scale folding for context management: dual-mode condensation (fine-grained) and consolidation (coarse abstraction). Treats context as a "dynamic cognitive workspace" rather than a passive log. Achieves ~7k tokens after 100 turns and scales to 500+ turns.

Key influence: Dynamic state compression for long sessions, proactive context budgeting before saturation, multi-scale state summaries where recent actions stay detailed while older iterations compress.

Context-Folding: Scaling Long-Horizon LLM Agent via Context-Folding

Sun et al., 2025

Proposes branch/return primitives for hierarchical task decomposition, achieving 10x context reduction (32K vs 327K tokens). The FoldGRPO training system provides dense, token-level process rewards for learning effective decomposition.

Key influence: Branch/return semantics for recursive agent decomposition, each review iteration as a "branch" that folds after completion, strategic context compression preserving decision-critical information.

ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

Wu et al., 2025

Enables indefinite exploration through periodic context summarization. ReSum-GRPO integrates segmented trajectory training with advantage broadcasting to train agents on summary-conditioned reasoning.

Key influence: Periodic state compression between orchestration cycles, compact reasoning states instead of full interaction histories, configuration-driven summarization frequency.

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Zhang et al., ICLR 2026

Addresses "brevity bias" (domain insights dropped for concise summaries) and "context collapse" (iterative rewriting erodes details over time). Treats contexts as evolving "playbooks" that accumulate, refine, and organize strategies through a generate-reflect-curate cycle. Uses natural execution feedback for self-improvement without labeled data, achieving +10.6% improvement on agent benchmarks.

Key influence: Directly applicable to the Oracle system and state compression between Developer-Reviewer iterations. The generate-reflect-curate cycle maps to Amelia's iterative refinement loop, and ACE's incremental updates prevent the information loss from naive context summarization.

Recursive Language Models

Zhang, Khattab, Kraska (MIT CSAIL), 2025

Treats long prompts as external environment variables rather than direct inputs. The LLM can call itself on subsets of the context via an llm_query() function within a REPL sandbox. Achieves 10x cost reduction and handles 100x beyond context windows.

Key influence: Treat filesystem as environment variable agents navigate programmatically, recursive sub-agent calls for specific subtasks, sandbox isolation patterns for safe execution.

Benchmarks & Retrieval

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez et al., ICLR 2024

The defining benchmark for evaluating software engineering agents, with 2,294 real-world GitHub issues from 12 Python repositories. Models receive a codebase and issue description, then must edit code to resolve it. SWE-bench Verified later refined this to a human-validated subset of 500 problems confirmed to be solvable, providing a more reliable evaluation target.

Key influence: SWE-bench frames the task that Amelia's pipeline is designed to solve — given an issue and a codebase, produce a working fix. The benchmark's emphasis on real repositories over synthetic tasks validates building for production codebases.

SWE-Bench+: Enhanced Coding Benchmark for LLMs

Xin et al., 2024

A critical analysis of SWE-bench revealing that 32.67% of successful patches involve "solution leakage" where fixes are provided directly in issue descriptions or comments, and 31.08% of passed patches are suspicious due to weak test cases. SWE-Bench+ filters these issues to produce a more rigorous evaluation.

Key influence: Reinforces that benchmark results need scrutiny — an agent passing tests doesn't mean it understood the problem. Amelia's Reviewer agent serves a similar role: catching superficial fixes that pass tests but miss the underlying issue.

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Tsinghua University / THUDM, 2024

A long-context benchmark covering multiple task types shows reasoning-enhanced models (o1-preview, DeepSeek-R1) significantly outperform standard models. Human experts achieve only 53.7% accuracy, validating benchmark difficulty. Identifies code repository understanding as a distinct skill category.

Key influence: Invest in extended reasoning for Architect/Reviewer agents, specialized prompting for codebase navigation, use reasoning-enhanced models for planning stages.

Long Context vs. RAG for LLMs: An Evaluation and Revisits

Li et al., 2025

A systematic comparison shows RAPTOR (summarization-based retrieval) achieves 38.5% vs 20-22% for chunk-based methods. Self-contained narratives favor Long Context while fragmented sources favor RAG. Context relevance is the most overlooked factor.

Key influence: Single-file analysis uses Long Context, multi-file codebase search uses RAG with summarization. The Knowledge Library's semantic search implements the RAPTOR pattern for hierarchical code understanding. Context quality over quantity drives retrieval design.

ONERULER: Benchmarking Multilingual Long-Context Language Models

Kim et al., 2025

A multilingual extension of RULER shows performance degrades significantly at 128K tokens, models struggle to recognize absent answers, and language mismatch causes up to 20% fluctuation.

Key influence: Focused context extraction over raw context length, handle "issue already resolved" scenarios, consistent language in prompts and analyzed code.

Blog Posts & Talks

How to Build Agents with Filesystems and Bash

Vercel

Argues that the filesystem is the most underrated tool for agent memory and coordination. Files provide persistent state, human-readable artifacts, and natural checkpoints. Bash gives agents the same power that developers already use — composable commands over a shared filesystem.

Key influence: Amelia treats the working directory as the agent's primary workspace. Plan files, code changes, and review feedback all persist as filesystem artifacts. The Oracle stores structured notes across sessions, and the Knowledge Library indexes project files for semantic search.

Claude Code SDK and HaaS

by Vikram Trivedy

Introduced Harness as a Service (HaaS), arguing that agent infrastructure is commoditizing. A harness provides complete runtime environments (context management, tool invocation, permissions, loop control) so developers can focus on domain specialization rather than building infrastructure from scratch.

Key influence: Amelia's driver abstraction (api vs claude vs codex), profile-based configuration, and multi-agent architecture (Architect, Developer, Reviewer as specialized subagents).

Ralph Wiggum as a Software Engineer

by Geoffrey Huntley

Frames LLMs as "deterministically bad in an undeterministic world." Success comes from iteration, not expecting perfection. The Ralph technique represents continuous refinement through an iterative loop where each failure teaches you about gaps in your instructions.

Key influence: Validates the Developer-Reviewer iteration loop. Each review rejection "tunes" the developer like tuning a guitar. Eventual consistency over immediate correctness.

Software is Changing (Again)

by Andrej Karpathy

Introduces Software 3.0: prompts are programs, English is the programming language, and LLMs are the new CPUs. Karpathy frames LLMs as operating systems with context windows as working memory, notes their "jagged intelligence," and advocates partial autonomy ("Iron Man suit") over full autonomy.

Key influence: Human-in-the-loop approval gates, treating prompts as source code (version controlled in profiles), building for LLM consumption.

How to Build an Agent

by Thorsten Ball, Amp

Demonstrates that a functional code-editing AI agent requires under 400 lines of Go — an LLM, a loop, and enough tokens. The agent uses just three tools (read_file, list_files, edit_file) and lets the model autonomously decide when and how to use them. The fundamental intelligence comes from the models themselves; polished products add engineering around that core.

Key influence: Reinforces Amelia's minimal-loop architecture — the orchestrator is a thin state machine around capable models. Agent complexity lives in prompt design and tool selection, not framework overhead.

GPT-5 Oracle

Amp

Amp made GPT-5 its permanent "oracle" model for complex reasoning tasks like architecture review and bug analysis, while keeping a more proactive model (Sonnet) as the primary agent. Users can invoke the oracle at any point in a thread when they need deeper reasoning.

Key influence: Validates Amelia's Oracle agent pattern — a dedicated high-capability model for planning and analysis, separate from the execution agents that do the hands-on coding work.

Methodologies & Frameworks

12-Factor Agents (talk)

HumanLayer

Production-grade patterns for building reliable agentic systems. Amelia's architecture references this more than any other external source.

Key factors implemented:

  • Stateless Reducer Pattern (F12): Frozen models, append-only fields, dict_merge reducers
  • Prompt Templating (F2): Profile-based configuration, externalized prompts
  • Error Self-Healing (F9): Automatic replan on agent failure
  • Immutable State (F12): All state updates return new objects

How These Influenced Amelia

Research PatternAmelia Implementation
MetaGPT SOP RolesArchitect → Developer ↔ Reviewer pipeline
AgentCoder IterationDeveloper-Reviewer cycle until approval
HULA Plan ApprovalHuman-in-the-loop approval gates
SWE-agent ACI DesignDriver abstraction (api vs claude vs codex)
LLM-as-a-JudgeReviewer agent as automated code critic
Reflexion Verbal RLReview feedback as reinforcement signal
ReAct Reasoning+ActingInterleaved planning and tool execution
GRPO Group ComparisonCompetitive review strategy
Agentic Context EngineeringOracle memory and anti-collapse state compression
Context FoldingState compression between iterations
Branch/Return PrimitivesRecursive agent decomposition
RAPTOR RetrievalKnowledge Library semantic search
Filesystem as MemoryOracle and working directory as agent state
HaaS CustomizationProfile-based prompts, tools, context, subagents
12-Factor AgentsStateless, immutable, observable design
SWE-benchTarget problem framing for the pipeline