Build an AI Coding Agent in Python
The promise of AI coding assistants has moved well past autocomplete. A coding agent can accept a natural language task, write the implementation, generate tests, execute those tests in a sandbox, and iterate on failures until the tests pass — all without manual intervention. This tutorial builds exactly that system.
The agent you will create handles the full coding loop: it generates Python code from a specification, writes a pytest test suite, executes the tests in an isolated subprocess, parses the failures, and repairs the code. It also includes a code review step that catches quality issues before tests even run.
Before building, understand how agent sandboxes work and why code execution isolation is non-negotiable for safety.
What You'll Learn
- How to define tools for code generation, file writing, and test execution
- How to build a repair loop that feeds test failures back to the LLM
- How to sandbox code execution to prevent malicious or runaway code
- How to add a code review step before execution
- How to integrate the agent with a simple CLI interface
Prerequisites
- Python 3.10+
- OpenAI API key
- Familiarity with LangChain agent patterns
- Understanding of AI agent concepts
Architecture Overview
The coding agent follows a five-step loop:

1. Specification Parser — extracts function signatures, requirements, and constraints from the user prompt
2. Code Generator — writes the implementation file
3. Code Reviewer — analyzes the code for quality issues before any tests run
4. Test Generator — writes a pytest test file with edge cases
5. Test Runner + Repair Loop — executes tests, parses failures, sends error context back to the generator, and retries up to N times
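Before diving into the implementation, the control flow can be sketched with stubs. Every name below is an illustrative placeholder, not the tutorial's real API (the LLM-backed versions are built in the following steps):

```python
MAX_REPAIRS = 3

# Stubbed steps so the loop shape is visible without any LLM calls.
def generate(spec: str) -> str:
    return f"# solution for: {spec}"

def review(code: str) -> str:
    return "LGTM"  # pretend the reviewer found nothing

def make_tests(code: str) -> str:
    return "def test_placeholder(): pass"

def run_tests_stub(code: str, tests: str) -> bool:
    return True  # pretend the suite passed

def agent_loop(spec: str) -> str:
    code = generate(spec)              # parse spec + generate implementation
    if "LGTM" not in review(code):     # review before any execution
        code = generate(spec)          # repair from review findings
    tests = make_tests(code)           # generate the test suite
    for _ in range(MAX_REPAIRS + 1):   # test + repair loop, bounded
        if run_tests_stub(code, tests):
            break
        code = generate(spec)          # repair from failures
    return code
```

The real agent in Step 4 has exactly this shape, with each stub replaced by a LangChain tool call.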
Step 1: Setup

```bash
pip install langchain==0.3.0 langchain-openai==0.2.0 python-dotenv==1.0.1 \
  pytest==8.3.0 black==24.8.0 ruff==0.6.9
```

```bash
# .env
OPENAI_API_KEY=sk-proj-...
MAX_REPAIR_ITERATIONS=5
EXECUTION_TIMEOUT_SECONDS=30
```
Step 2: Sandboxed Code Execution Tool
Never execute LLM-generated code directly in your main process. Use a subprocess with strict resource limits.
```python
# tools/executor.py
import subprocess
import tempfile
import os
import sys
from pathlib import Path


class SandboxedExecutor:
    """Run Python code in an isolated subprocess with a timeout."""

    def __init__(self, timeout: int = 30):
        self.timeout = timeout

    def run_pytest(self, code: str, test_code: str) -> dict:
        """Write code and tests to temp files, run pytest, return results."""
        with tempfile.TemporaryDirectory() as tmpdir:
            tmp_path = Path(tmpdir)

            # Write implementation (tests import it as 'implementation')
            impl_file = tmp_path / "implementation.py"
            impl_file.write_text(code)

            # Write tests
            test_file = tmp_path / "test_implementation.py"
            test_file.write_text(test_code)

            # Run pytest in a subprocess with a hard timeout
            try:
                result = subprocess.run(
                    [sys.executable, "-m", "pytest", str(test_file), "-v",
                     "--tb=short", "--no-header", "-q"],
                    capture_output=True,
                    text=True,
                    timeout=self.timeout,
                    cwd=tmpdir,
                    env={**os.environ, "PYTHONPATH": tmpdir},
                )
                return {
                    "returncode": result.returncode,
                    "stdout": result.stdout,
                    "stderr": result.stderr,
                    "passed": result.returncode == 0,
                }
            except subprocess.TimeoutExpired:
                return {
                    "returncode": -1,
                    "stdout": "",
                    "stderr": f"Execution timed out after {self.timeout}s",
                    "passed": False,
                }
```
Step 3: LangChain Tools
Define the tools the agent can call:
```python
# tools/coding_tools.py
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

from tools.executor import SandboxedExecutor

executor = SandboxedExecutor(timeout=30)
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)


class CodeOutput(BaseModel):
    code: str = Field(description="The complete Python implementation code")
    explanation: str = Field(description="Brief explanation of the implementation approach")


class TestOutput(BaseModel):
    test_code: str = Field(description="Complete pytest test file content")
    test_count: int = Field(description="Number of test cases written")


@tool
def generate_code(specification: str) -> dict:
    """Generate Python code from a natural language specification."""
    chain = ChatPromptTemplate.from_messages([
        ("system", """You are an expert Python developer.
Write clean, well-typed Python code following the specification.
Include type hints, docstrings, and handle edge cases.
Return only the implementation — no test code."""),
        ("human", "Specification:\n{spec}"),
    ]) | llm.with_structured_output(CodeOutput)
    result = chain.invoke({"spec": specification})
    return {"code": result.code, "explanation": result.explanation}


@tool
def generate_tests(code: str, specification: str) -> dict:
    """Generate pytest tests for the given implementation."""
    chain = ChatPromptTemplate.from_messages([
        ("system", """You are a Python test engineer.
Write comprehensive pytest tests for this code.
Cover: happy path, edge cases, error cases, boundary values.
Import the implementation from the 'implementation' module."""),
        ("human", "Code:\n{code}\n\nOriginal specification:\n{spec}"),
    ]) | llm.with_structured_output(TestOutput)
    result = chain.invoke({"code": code, "spec": specification})
    return {"test_code": result.test_code, "test_count": result.test_count}


@tool
def run_tests(code: str, test_code: str) -> dict:
    """Run pytest tests against the implementation and return results."""
    return executor.run_pytest(code, test_code)


@tool
def review_code(code: str) -> str:
    """Review Python code for quality issues, bugs, and style problems."""
    chain = ChatPromptTemplate.from_messages([
        ("system", """You are a senior Python code reviewer.
Identify: bugs, security issues, performance problems, missing error handling.
Be concise — list specific issues with line references if possible.
If the code is clean, say 'LGTM'."""),
        ("human", "{code}"),
    ]) | llm
    result = chain.invoke({"code": code})
    return result.content


@tool
def repair_code(code: str, test_output: str, specification: str) -> str:
    """Fix Python code based on failing test output."""
    chain = ChatPromptTemplate.from_messages([
        ("system", """You are debugging Python code.
Read the failing test output carefully.
Fix the implementation to make all tests pass.
Return only the corrected implementation code, no explanations."""),
        ("human", "Original specification:\n{spec}\n\nFailing code:\n{code}\n\nTest failures:\n{failures}"),
    ]) | llm
    result = chain.invoke({
        "spec": specification,
        "code": code,
        "failures": test_output,
    })
    return result.content
```
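One practical wrinkle: `repair_code` and `review_code` pipe to the raw chat model rather than structured output, so despite the "no explanations" instruction, the reply sometimes arrives wrapped in a markdown code fence, which would break the next pytest run. A small helper like the one below (hypothetical, not part of the tutorial's tools so far) can normalize the reply before re-testing:

```python
import re

def strip_code_fences(text: str) -> str:
    """Strip a wrapping markdown code fence from an LLM reply, if present.

    Hypothetical helper: raw chat output (unlike structured output)
    often wraps code in a fence that breaks import/exec downstream.
    """
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1).rstrip() if match else text.strip()
```

You could wire this into `repair_code` by returning `strip_code_fences(result.content)` instead of the raw content.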
Step 4: The Coding Agent Loop
```python
# agent.py
import os
from dotenv import load_dotenv

from tools.coding_tools import (
    generate_code, generate_tests, run_tests, review_code, repair_code,
)

load_dotenv()

MAX_REPAIRS = int(os.getenv("MAX_REPAIR_ITERATIONS", "5"))


def run_coding_agent(specification: str) -> dict:
    """
    Full coding agent loop: generate → review → test → repair.
    Returns the final code, tests, and execution report.
    """
    print(f"\n[Coding Agent] Starting for: {specification[:80]}...")

    # Step 1: Generate initial implementation
    print("[1/5] Generating implementation...")
    gen_result = generate_code.invoke({"specification": specification})
    code = gen_result["code"]
    print(f"  {gen_result['explanation']}")

    # Step 2: Code review
    print("[2/5] Running code review...")
    review = review_code.invoke({"code": code})
    if "LGTM" not in review:  # the reviewer may add text around its verdict
        print(f"  Review findings: {review[:200]}")
        # Repair based on review before even running tests
        code = repair_code.invoke({
            "code": code,
            "test_output": f"Code review findings:\n{review}",
            "specification": specification,
        })

    # Step 3: Generate tests
    print("[3/5] Generating test suite...")
    test_result = generate_tests.invoke({"code": code, "specification": specification})
    test_code = test_result["test_code"]
    print(f"  Generated {test_result['test_count']} test cases")

    # Step 4: Run tests with repair loop
    print("[4/5] Running tests...")
    for iteration in range(MAX_REPAIRS + 1):
        test_run = run_tests.invoke({"code": code, "test_code": test_code})
        if test_run["passed"]:
            print(f"  All tests passed on iteration {iteration + 1}")
            break
        if iteration == MAX_REPAIRS:
            print("  Max repair iterations reached. Returning best effort.")
            break
        print(f"  Tests failed (iteration {iteration + 1}), repairing...")
        failure_context = test_run["stdout"] + "\n" + test_run["stderr"]
        code = repair_code.invoke({
            "code": code,
            "test_output": failure_context,
            "specification": specification,
        })

    return {
        "specification": specification,
        "final_code": code,
        "test_code": test_code,
        "tests_passed": test_run.get("passed", False),
        "test_output": test_run.get("stdout", ""),
        "iterations": iteration + 1,
    }
```
Step 5: CLI Interface
```python
# cli.py
import argparse
import json

from agent import run_coding_agent


def main():
    parser = argparse.ArgumentParser(description="AI Coding Agent")
    parser.add_argument("spec", help="Coding specification (natural language)")
    parser.add_argument("--output", help="Write final code to this file")
    parser.add_argument("--json", action="store_true", help="Output JSON report")
    args = parser.parse_args()

    result = run_coding_agent(args.spec)

    if args.json:
        print(json.dumps(result, indent=2))
    else:
        print("\n" + "=" * 60)
        print("FINAL IMPLEMENTATION")
        print("=" * 60)
        print(result["final_code"])
        print(f"\nTests {'PASSED' if result['tests_passed'] else 'FAILED'} "
              f"after {result['iterations']} iteration(s)")

    if args.output and result["final_code"]:
        with open(args.output, "w") as f:
            f.write(result["final_code"])
        print(f"\nCode written to {args.output}")


if __name__ == "__main__":
    main()
```
Run it:

```bash
python cli.py "Write a Python function that takes a list of integers and returns \
the top-k most frequent elements. Handle edge cases for empty lists and k > len(list)."
```
Production Considerations

For production deployments, review the AI agent security best practices guide — code execution agents carry significant risk if not properly sandboxed. Key hardening steps:

- Run the subprocess executor inside a Docker container with `--network none` and read-only filesystem mounts
- Set CPU and memory limits on the execution subprocess using `resource.setrlimit`
- Block dangerous imports (`os`, `sys`, `subprocess`) by scanning the generated code before execution
- Add Langfuse observability to track repair loop counts and identify specifications that consistently fail
- Use human-in-the-loop approval before the generated code is merged into a real repository
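The import-blocking step can be sketched with the standard library's `ast` module. The blocklist below is illustrative; tune it to your own threat model, and remember that a static scan will not catch `__import__` or `importlib` tricks, so treat it as one defense layer, not the whole sandbox:

```python
import ast

# Example blocklist; adjust to your threat model.
BLOCKED = {"os", "sys", "subprocess", "socket", "shutil"}

def find_blocked_imports(code: str) -> set[str]:
    """Return the blocked module names that the generated code imports.

    Walks the AST and collects top-level module names from both
    'import x' and 'from x import y' statements.
    """
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            found |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & BLOCKED

# Example: reject generated code that touches subprocess
snippet = "import subprocess\nsubprocess.run(['rm', '-rf', '/'])"
print(find_blocked_imports(snippet))  # {'subprocess'}
```

A natural place to call this is inside `run_pytest`, before the implementation file is ever written to disk.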
What's Next
- Add this agent to a larger LangGraph multi-agent system
- Deploy the coding agent as a service using the Docker deployment guide
- Review AI agent testing patterns to test the agent itself
- Explore engineering AI agent use cases for real-world applications
- Read about agent sandboxes for deeper security design patterns