Build an AI Coding Agent in Python
The promise of AI coding assistants has moved well past autocomplete. A coding agent can accept a natural language task, write the implementation, generate tests, execute those tests in a sandbox, and iterate on failures until the tests pass — all without manual intervention. This tutorial builds exactly that system.
The agent you will create handles the full coding loop: it generates Python code from a specification, writes a pytest test suite, executes the tests in an isolated subprocess, parses the failures, and repairs the code. It also includes a code review step that catches quality issues before tests even run.
Before building, understand how agent sandboxes work and why code execution isolation is non-negotiable for safety.
What You'll Learn
- How to define tools for code generation, file writing, and test execution
- How to build a repair loop that feeds test failures back to the LLM
- How to sandbox code execution to prevent malicious or runaway code
- How to add a code review step before execution
- How to integrate the agent with a simple CLI interface
Prerequisites
- Python 3.10+
- OpenAI API key
- Familiarity with LangChain agent patterns
- Understanding of AI agent concepts
Architecture Overview
The coding agent follows a five-step loop:

1. Specification Parser — extracts function signatures, requirements, and constraints from the user prompt
2. Code Generator — writes the implementation file
3. Code Reviewer — analyzes the code for quality issues before any tests run
4. Test Generator — writes a pytest test file with edge cases
5. Test Runner + Repair Loop — executes tests, parses failures, sends error context back to the generator, and retries up to N times
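Before diving into the implementation, the control flow can be sketched with stubs. Every name below is an illustrative placeholder, not the tutorial's real API (the LLM-backed versions are built in the following steps):

```python
MAX_REPAIRS = 3

# Stubbed steps so the loop shape is visible without any LLM calls.
def generate(spec: str) -> str:
    return f"# solution for: {spec}"

def review(code: str) -> str:
    return "LGTM"  # pretend the reviewer found nothing

def make_tests(code: str) -> str:
    return "def test_placeholder(): pass"

def run_tests_stub(code: str, tests: str) -> bool:
    return True  # pretend the suite passed

def agent_loop(spec: str) -> str:
    code = generate(spec)              # parse spec + generate implementation
    if "LGTM" not in review(code):     # review before any execution
        code = generate(spec)          # repair from review findings
    tests = make_tests(code)           # generate the test suite
    for _ in range(MAX_REPAIRS + 1):   # test + repair loop, bounded
        if run_tests_stub(code, tests):
            break
        code = generate(spec)          # repair from failures
    return code
```

The real agent in Step 4 has exactly this shape, with each stub replaced by a LangChain tool call.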
Step 1: Setup

```bash
pip install langchain==0.3.0 langchain-openai==0.2.0 python-dotenv==1.0.1 \
  pytest==8.3.0 black==24.8.0 ruff==0.6.9
```

```bash
# .env
OPENAI_API_KEY=sk-proj-...
MAX_REPAIR_ITERATIONS=5
EXECUTION_TIMEOUT_SECONDS=30
```
Step 2: Sandboxed Code Execution Tool
Never execute LLM-generated code directly in your main process. Use a subprocess with strict resource limits.
```python
# tools/executor.py
import subprocess
import tempfile
import os
import sys
from pathlib import Path


class SandboxedExecutor:
    """Run Python code in an isolated subprocess with a timeout."""

    def __init__(self, timeout: int = 30):
        self.timeout = timeout

    def run_pytest(self, code: str, test_code: str) -> dict:
        """Write code and tests to temp files, run pytest, return results."""
        with tempfile.TemporaryDirectory() as tmpdir:
            tmp_path = Path(tmpdir)

            # Write implementation (tests import it as 'implementation')
            impl_file = tmp_path / "implementation.py"
            impl_file.write_text(code)

            # Write tests
            test_file = tmp_path / "test_implementation.py"
            test_file.write_text(test_code)

            # Run pytest in a subprocess with a hard timeout
            try:
                result = subprocess.run(
                    [sys.executable, "-m", "pytest", str(test_file), "-v",
                     "--tb=short", "--no-header", "-q"],
                    capture_output=True,
                    text=True,
                    timeout=self.timeout,
                    cwd=tmpdir,
                    env={**os.environ, "PYTHONPATH": tmpdir},
                )
                return {
                    "returncode": result.returncode,
                    "stdout": result.stdout,
                    "stderr": result.stderr,
                    "passed": result.returncode == 0,
                }
            except subprocess.TimeoutExpired:
                return {
                    "returncode": -1,
                    "stdout": "",
                    "stderr": f"Execution timed out after {self.timeout}s",
                    "passed": False,
                }
```
Step 3: LangChain Tools
Define the tools the agent can call:
```python
# tools/coding_tools.py
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

from tools.executor import SandboxedExecutor

executor = SandboxedExecutor(timeout=30)
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)


class CodeOutput(BaseModel):
    code: str = Field(description="The complete Python implementation code")
    explanation: str = Field(description="Brief explanation of the implementation approach")


class TestOutput(BaseModel):
    test_code: str = Field(description="Complete pytest test file content")
    test_count: int = Field(description="Number of test cases written")


@tool
def generate_code(specification: str) -> dict:
    """Generate Python code from a natural language specification."""
    chain = ChatPromptTemplate.from_messages([
        ("system", """You are an expert Python developer.
Write clean, well-typed Python code following the specification.
Include type hints, docstrings, and handle edge cases.
Return only the implementation — no test code."""),
        ("human", "Specification:\n{spec}"),
    ]) | llm.with_structured_output(CodeOutput)
    result = chain.invoke({"spec": specification})
    return {"code": result.code, "explanation": result.explanation}


@tool
def generate_tests(code: str, specification: str) -> dict:
    """Generate pytest tests for the given implementation."""
    chain = ChatPromptTemplate.from_messages([
        ("system", """You are a Python test engineer.
Write comprehensive pytest tests for this code.
Cover: happy path, edge cases, error cases, boundary values.
Import the implementation from the 'implementation' module."""),
        ("human", "Code:\n{code}\n\nOriginal specification:\n{spec}"),
    ]) | llm.with_structured_output(TestOutput)
    result = chain.invoke({"code": code, "spec": specification})
    return {"test_code": result.test_code, "test_count": result.test_count}


@tool
def run_tests(code: str, test_code: str) -> dict:
    """Run pytest tests against the implementation and return results."""
    return executor.run_pytest(code, test_code)


@tool
def review_code(code: str) -> str:
    """Review Python code for quality issues, bugs, and style problems."""
    chain = ChatPromptTemplate.from_messages([
        ("system", """You are a senior Python code reviewer.
Identify: bugs, security issues, performance problems, missing error handling.
Be concise — list specific issues with line references if possible.
If the code is clean, say 'LGTM'."""),
        ("human", "{code}"),
    ]) | llm
    result = chain.invoke({"code": code})
    return result.content


@tool
def repair_code(code: str, test_output: str, specification: str) -> str:
    """Fix Python code based on failing test output."""
    chain = ChatPromptTemplate.from_messages([
        ("system", """You are debugging Python code.
Read the failing test output carefully.
Fix the implementation to make all tests pass.
Return only the corrected implementation code, no explanations."""),
        ("human", "Original specification:\n{spec}\n\nFailing code:\n{code}\n\nTest failures:\n{failures}"),
    ]) | llm
    result = chain.invoke({
        "spec": specification,
        "code": code,
        "failures": test_output,
    })
    return result.content
```
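One practical wrinkle: `repair_code` and `review_code` pipe to the raw chat model rather than structured output, so despite the "no explanations" instruction, the reply sometimes arrives wrapped in a markdown code fence, which would break the next pytest run. A small helper like the one below (hypothetical, not part of the tutorial's tools so far) can normalize the reply before re-testing:

```python
import re

def strip_code_fences(text: str) -> str:
    """Strip a wrapping markdown code fence from an LLM reply, if present.

    Hypothetical helper: raw chat output (unlike structured output)
    often wraps code in a fence that breaks import/exec downstream.
    """
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1).rstrip() if match else text.strip()
```

You could wire this into `repair_code` by returning `strip_code_fences(result.content)` instead of the raw content.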
Step 4: The Coding Agent Loop
```python
# agent.py
import os
from dotenv import load_dotenv

from tools.coding_tools import (
    generate_code, generate_tests, run_tests, review_code, repair_code,
)

load_dotenv()

MAX_REPAIRS = int(os.getenv("MAX_REPAIR_ITERATIONS", "5"))


def run_coding_agent(specification: str) -> dict:
    """
    Full coding agent loop: generate → review → test → repair.
    Returns the final code, tests, and execution report.
    """
    print(f"\n[Coding Agent] Starting for: {specification[:80]}...")

    # Step 1: Generate initial implementation
    print("[1/5] Generating implementation...")
    gen_result = generate_code.invoke({"specification": specification})
    code = gen_result["code"]
    print(f"  {gen_result['explanation']}")

    # Step 2: Code review
    print("[2/5] Running code review...")
    review = review_code.invoke({"code": code})
    if "LGTM" not in review:  # the reviewer may add text around its verdict
        print(f"  Review findings: {review[:200]}")
        # Repair based on review before even running tests
        code = repair_code.invoke({
            "code": code,
            "test_output": f"Code review findings:\n{review}",
            "specification": specification,
        })

    # Step 3: Generate tests
    print("[3/5] Generating test suite...")
    test_result = generate_tests.invoke({"code": code, "specification": specification})
    test_code = test_result["test_code"]
    print(f"  Generated {test_result['test_count']} test cases")

    # Step 4: Run tests with repair loop
    print("[4/5] Running tests...")
    for iteration in range(MAX_REPAIRS + 1):
        test_run = run_tests.invoke({"code": code, "test_code": test_code})
        if test_run["passed"]:
            print(f"  All tests passed on iteration {iteration + 1}")
            break
        if iteration == MAX_REPAIRS:
            print("  Max repair iterations reached. Returning best effort.")
            break
        print(f"  Tests failed (iteration {iteration + 1}), repairing...")
        failure_context = test_run["stdout"] + "\n" + test_run["stderr"]
        code = repair_code.invoke({
            "code": code,
            "test_output": failure_context,
            "specification": specification,
        })

    return {
        "specification": specification,
        "final_code": code,
        "test_code": test_code,
        "tests_passed": test_run.get("passed", False),
        "test_output": test_run.get("stdout", ""),
        "iterations": iteration + 1,
    }
```
Step 5: CLI Interface
```python
# cli.py
import argparse
import json

from agent import run_coding_agent


def main():
    parser = argparse.ArgumentParser(description="AI Coding Agent")
    parser.add_argument("spec", help="Coding specification (natural language)")
    parser.add_argument("--output", help="Write final code to this file")
    parser.add_argument("--json", action="store_true", help="Output JSON report")
    args = parser.parse_args()

    result = run_coding_agent(args.spec)

    if args.json:
        print(json.dumps(result, indent=2))
    else:
        print("\n" + "=" * 60)
        print("FINAL IMPLEMENTATION")
        print("=" * 60)
        print(result["final_code"])
        print(f"\nTests {'PASSED' if result['tests_passed'] else 'FAILED'} "
              f"after {result['iterations']} iteration(s)")

    if args.output and result["final_code"]:
        with open(args.output, "w") as f:
            f.write(result["final_code"])
        print(f"\nCode written to {args.output}")


if __name__ == "__main__":
    main()
```
Run it:

```bash
python cli.py "Write a Python function that takes a list of integers and returns \
the top-k most frequent elements. Handle edge cases for empty lists and k > len(list)."
```
Production Considerations

For production deployments, review the AI agent security best practices guide — code execution agents carry significant risk if not properly sandboxed. Key hardening steps:

- Run the subprocess executor inside a Docker container with `--network none` and read-only filesystem mounts
- Set CPU and memory limits on the execution subprocess using `resource.setrlimit`
- Block dangerous imports (`os`, `sys`, `subprocess`) by scanning the generated code before execution
- Add Langfuse observability to track repair loop counts and identify specifications that consistently fail
- Use human-in-the-loop approval before the generated code is merged into a real repository
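The import-blocking step can be sketched with the standard library's `ast` module. The blocklist below is illustrative; tune it to your own threat model, and remember that a static scan will not catch `__import__` or `importlib` tricks, so treat it as one defense layer, not the whole sandbox:

```python
import ast

# Example blocklist; adjust to your threat model.
BLOCKED = {"os", "sys", "subprocess", "socket", "shutil"}

def find_blocked_imports(code: str) -> set[str]:
    """Return the blocked module names that the generated code imports.

    Walks the AST and collects top-level module names from both
    'import x' and 'from x import y' statements.
    """
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            found |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & BLOCKED

# Example: reject generated code that touches subprocess
snippet = "import subprocess\nsubprocess.run(['rm', '-rf', '/'])"
print(find_blocked_imports(snippet))  # {'subprocess'}
```

A natural place to call this is inside `run_pytest`, before the implementation file is ever written to disk.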
What's Next
- Add this agent to a larger LangGraph multi-agent system
- Deploy the coding agent as a service using the Docker deployment guide
- Review AI agent testing patterns to test the agent itself
- Explore engineering AI agent use cases for real-world applications
- Read about agent sandboxes for deeper security design patterns