Vellum is the LLM development platform for teams who have moved past prototype and are dealing with production realities — inconsistent outputs, prompt regression, evaluation bottlenecks, and the challenge of managing multiple LLM workflows across a growing application. Where LangChain and similar frameworks address the code architecture of LLM applications, Vellum addresses the operational discipline: systematic prompt versioning, structured evaluation, and deployable workflow orchestration.
For product and engineering teams that have learned the hard way that prompts need to be managed like code and outputs need to be evaluated like software, Vellum provides the infrastructure to make that discipline sustainable at scale.
What Vellum Actually Is#
Vellum is a platform, not a framework. It sits above LLM APIs and provides:
Prompt Management: A centralized registry for prompts with version control, environment promotion (dev → staging → production), and a visual diff view for comparing prompt versions. Prompts are called through Vellum's API rather than hardcoded in application code.
Evaluation: Test case management, automated metric scoring, and human review workflows for systematically measuring LLM output quality. Evaluations run on specific prompt versions before production deployment.
Workflow Builder: A visual editor for multi-step LLM pipelines that combines prompt nodes, document search, conditional logic, API calls, and code execution. Workflows are deployable through a single API endpoint.
Document Search: Built-in vector storage and retrieval for RAG workflows. Upload documents, and Vellum handles chunking, embedding, and retrieval — no separate vector database required for standard use cases.
Deployments and Monitoring: Production deployment management with usage metrics, cost tracking, latency monitoring, and output logging across all deployed prompts and workflows.
Prompt Management and Versioning#
Vellum's prompt versioning works like Git for prompts. Each prompt has a deployment label that application code calls:
import vellum
client = vellum.Vellum(api_key="your-vellum-api-key")
# Call a deployed prompt by its deployment name
# Vellum resolves "customer-support-classifier" to the
# currently active production version
response = client.execute_prompt(
    prompt_deployment_name="customer-support-classifier",
    inputs=[
        vellum.PromptRequestStringInput(
            name="customer_message",
            value="I'm having trouble with my order from last week"
        ),
        vellum.PromptRequestStringInput(
            name="available_categories",
            value="shipping, billing, product_quality, account_issues"
        )
    ]
)
print(response.outputs[0].value)
# Output: "shipping"
The application code never changes when prompts evolve. Rolling back a bad prompt deployment is a Vellum UI operation — no code deployment required:
# Illustrative sketch: promote a previous version to production.
# Promotion is typically done through Vellum's UI and is also
# available programmatically; the command name and flags below are
# assumptions, not documented Vellum tooling.
vellum promotions create \
    --prompt-deployment customer-support-classifier \
    --from-version v12 \
    --to-environment production
Evaluation Framework#
Vellum's evaluation system structures the quality assurance process that teams often do informally:
# Programmatic test case creation and evaluation run
import vellum
client = vellum.Vellum(api_key="your-vellum-api-key")
# Create test cases for a prompt deployment
test_cases = [
    {
        "input": {
            "customer_message": "My package hasn't arrived yet",
            "available_categories": "shipping, billing, product_quality, account_issues"
        },
        "expected_output": "shipping",
        "label": "Late delivery complaint — should classify as shipping"
    },
    {
        "input": {
            "customer_message": "I was charged twice for my order",
            "available_categories": "shipping, billing, product_quality, account_issues"
        },
        "expected_output": "billing",
        "label": "Double charge — should classify as billing"
    },
    {
        "input": {
            "customer_message": "The item arrived broken",
            "available_categories": "shipping, billing, product_quality, account_issues"
        },
        "expected_output": "product_quality",
        "label": "Damaged product — should classify as product_quality"
    }
]
# Run evaluation on a specific prompt version before promoting to production
# Vellum runs each test case and scores against defined metrics
evaluation_run = client.evaluations.create(
    prompt_deployment_name="customer-support-classifier",
    prompt_version="v15",  # Version being evaluated before promotion
    test_suite_name="classification-accuracy-v2",
    evaluators=[
        {"type": "exact_match", "weight": 1.0},  # Category must match exactly
    ]
)
print(f"Evaluation ID: {evaluation_run.id}")
print(f"Status: {evaluation_run.status}")
# Use Vellum UI or poll API for results
For more nuanced evaluation — assessing response quality, coherence, or helpfulness — Vellum supports LLM-as-judge evaluation and human review workflows where team members rate outputs on custom rubrics.
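To make the rubric idea concrete, here is a local, framework-independent sketch of a rubric-style evaluator. The `judge` function is a stand-in for an LLM-as-judge call (in practice a model would score each output against the rubric); all names here are illustrative, not part of Vellum's SDK.

```python
# Local sketch of a rubric-style evaluator, independent of Vellum's SDK.
# judge() is a stand-in for an LLM-as-judge call.

def judge(output: str, rubric: dict) -> float:
    """Toy judge: fraction of required rubric keywords present in the output."""
    keywords = rubric["required_keywords"]
    hits = sum(1 for kw in keywords if kw in output.lower())
    return hits / len(keywords)

def run_rubric_eval(outputs, rubric, threshold=0.5):
    """Score every output and aggregate into mean score and pass rate."""
    scores = [judge(o, rubric) for o in outputs]
    passed = sum(s >= threshold for s in scores)
    return {"mean_score": sum(scores) / len(scores),
            "pass_rate": passed / len(scores)}

rubric = {"required_keywords": ["refund", "apologize"]}
outputs = [
    "We apologize for the delay and have issued a refund.",
    "Your order will ship soon.",
]
print(run_rubric_eval(outputs, rubric))
# {'mean_score': 0.5, 'pass_rate': 0.5}
```

Swapping `judge` for a real model call is the essence of LLM-as-judge; the aggregation logic stays the same.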
Workflow Builder for Agent Architectures#
Vellum's workflow builder supports complex multi-step AI pipelines through a visual interface:
# Calling a Vellum workflow from application code
# The workflow is built visually in Vellum Studio and deployed as an API
client = vellum.Vellum(api_key="your-vellum-api-key")
# Multi-step agent workflow: research → analyze → respond
response = client.execute_workflow(
    workflow_deployment_name="customer-research-agent",
    inputs=[
        vellum.WorkflowRequestStringInput(
            name="customer_id",
            value="cust_12345"
        ),
        vellum.WorkflowRequestStringInput(
            name="inquiry",
            value="Customer asking about upgrading their plan"
        )
    ]
)
)
# Workflow internally:
# 1. Document search: retrieves customer history from vector store
# 2. LLM node: analyzes customer profile and inquiry
# 3. Conditional: routes to upgrade or retention workflow path
# 4. LLM node: generates personalized response
for output in response.outputs:
    if output.name == "agent_response":
        print(output.value)
The visual workflow builder supports:
- LLM Nodes: Call any supported model with a configured prompt
- Search Nodes: Query document collections for RAG context
- Conditional Nodes: Branch based on LLM output or input values
- Subworkflow Nodes: Nest workflows for reusability
- API Nodes: HTTP calls to external services
- Code Nodes: Execute Python logic inline
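The four-step workflow described above can be sketched in plain Python to show the control flow the visual builder encodes. Every function here is a stub standing in for a Vellum node; none of these names come from Vellum's SDK.

```python
# Plain-Python sketch of the customer-research workflow above.
# Each function is a stand-in for a Vellum node.

def search_customer_history(customer_id: str) -> str:
    # Stub for the document-search node
    return f"history for {customer_id}: 14 months tenure, mid-tier plan"

def analyze(history: str, inquiry: str) -> str:
    # Stub for the analysis LLM node: classify intent from the inquiry
    return "upgrade" if "upgrad" in inquiry.lower() else "retention"

def respond(path: str, history: str) -> str:
    # Stub for the response LLM node on the chosen branch
    return f"[{path}] Based on {history}, here is a tailored offer."

def run_workflow(customer_id: str, inquiry: str) -> str:
    history = search_customer_history(customer_id)  # 1. document search
    path = analyze(history, inquiry)                # 2. analysis + 3. conditional
    return respond(path, history)                   # 4. response generation

print(run_workflow("cust_12345", "Customer asking about upgrading their plan"))
```

The value of the platform version is that each stub becomes a configurable node with its own prompt, model, and version history, deployed behind one endpoint.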
Document Search and RAG#
Vellum provides built-in document indexing and retrieval for RAG workflows:
# Upload documents to Vellum's vector store
client.documents.upload(
    label="Product Knowledge Base",
    contents="[document content here]",
    add_to_index_names=["product-kb-index"],
    metadata={"category": "product_specs", "version": "2026-Q1"}
)
# Search in a workflow or directly
search_results = client.search(
    index_name="product-kb-index",
    query="What are the storage limits for the Pro plan?",
    options=vellum.SearchRequestOptionsRequest(
        limit=5,
        weights=vellum.SearchWeightsRequest(
            semantic_similarity=0.8,
            keywords=0.2
        ),
        filters=vellum.SearchFiltersRequest(
            metadata={"category": "product_specs"}
        )
    )
)
For teams without specialized retrieval requirements, Vellum's managed vector store eliminates the need for separate Pinecone, Weaviate, or pgvector infrastructure.
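To clarify what the `semantic_similarity`/`keywords` weights in the search call above mean, here is a toy hybrid-scoring illustration. The candidate scores are made up; in reality Vellum computes embedding and keyword scores server-side.

```python
# Toy illustration of hybrid search scoring with 0.8 semantic / 0.2
# keyword weights, as in the search options above. Scores are invented.

def hybrid_score(semantic: float, keyword: float,
                 w_semantic: float = 0.8, w_keyword: float = 0.2) -> float:
    """Blend a semantic-similarity score and a keyword score."""
    return w_semantic * semantic + w_keyword * keyword

candidates = {
    "pro-plan-storage.md":   {"semantic": 0.91, "keyword": 0.40},
    "enterprise-pricing.md": {"semantic": 0.55, "keyword": 0.80},
}
ranked = sorted(candidates,
                key=lambda doc: hybrid_score(**candidates[doc]),
                reverse=True)
print(ranked[0])  # the semantically closer document wins under 0.8/0.2
```

With the weights flipped toward keywords, the keyword-heavy document would rank first instead, which is the trade-off the weights expose.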
Multi-Model Support and A/B Testing#
Vellum's unified model layer supports testing the same prompt across multiple providers:
| Provider | Models Available |
|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4-turbo |
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Haiku |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash |
| Mistral | Mistral Large, Mistral Medium |
| Meta | Llama 3.1 70B, Llama 3.1 8B |
| Custom | Any OpenAI-compatible endpoint |
A/B testing in Vellum means creating two prompt deployments (e.g., "classifier-gpt4o" and "classifier-claude") and comparing evaluation results on the same test suite before making a model switch decision.
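That comparison reduces to running one test suite against both deployments and comparing accuracy. A minimal sketch, with hard-coded outputs standing in for real deployment calls:

```python
# Sketch of the A/B comparison described above. The per-variant
# outputs are hard-coded stand-ins for real deployment executions.

test_suite = [
    ("My package hasn't arrived yet", "shipping"),
    ("I was charged twice for my order", "billing"),
    ("The item arrived broken", "product_quality"),
]

# Pretend outputs from the two prompt deployments
variant_outputs = {
    "classifier-gpt4o":  ["shipping", "billing", "shipping"],
    "classifier-claude": ["shipping", "billing", "product_quality"],
}

def accuracy(outputs, suite):
    """Fraction of outputs matching the expected label, position-wise."""
    return sum(o == exp for o, (_, exp) in zip(outputs, suite)) / len(suite)

scores = {name: accuracy(out, test_suite)
          for name, out in variant_outputs.items()}
winner = max(scores, key=scores.get)
print(scores, "->", winner)
```

The platform version adds what this sketch lacks: stored test suites, per-version result history, and richer metrics than exact match.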
Pricing Breakdown#
| Tier | Cost | Includes |
|---|---|---|
| Sandbox | Free | Development/testing use |
| Starter | $99/month | 100K prompt executions, basic evaluations |
| Growth | $599/month | 1M+ executions, advanced evaluations, team features |
| Enterprise | Custom | Unlimited, SSO, custom SLAs |
Pricing scales with prompt execution volume. Teams should model expected monthly execution counts carefully — the gap between Starter and Growth is significant, and teams with moderate production traffic can quickly reach Growth tier thresholds.
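A back-of-envelope tier check using the figures in the table above can make that modeling concrete. The included execution counts come from the table; overage behavior is not modeled, so verify against Vellum's current pricing before deciding.

```python
# Rough tier-fit check using the table above. Overage pricing is
# intentionally not modeled; these numbers are from the review's table.

TIERS = [
    ("Starter", 99, 100_000),      # $99/mo, 100K executions included
    ("Growth", 599, 1_000_000),    # $599/mo, 1M executions listed
]

def cheapest_fitting_tier(monthly_executions: int):
    """Return the first listed tier whose included volume covers the load."""
    for name, price, included in TIERS:
        if monthly_executions <= included:
            return name, price
    return "Enterprise", None  # custom pricing beyond the listed volumes

print(cheapest_fitting_tier(80_000))   # fits Starter
print(cheapest_fitting_tier(250_000))  # already requires Growth
```

Note how quickly moderate traffic (a quarter-million executions) crosses the Starter threshold, which is the pricing cliff the paragraph above warns about.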
Pros#
Prompt lifecycle management: Version control, environment promotion, and rollback for prompts provide the operational discipline that prevents "which version of this prompt is actually running in production?" problems.
Systematic evaluation: Test case management with automated scoring and human review workflows makes evaluation a repeatable process rather than ad-hoc manual checking before deploys.
Workflow versatility: Visual workflow builder supports complex agent architectures with conditional logic, RAG, external APIs, and code nodes — deployable through a single API without custom orchestration code.
Multi-model flexibility: Unified API across OpenAI, Anthropic, Google, Mistral, and others simplifies model switching and A/B testing without application code changes.
Cons#
Pricing scale: Growth tier at $599/month requires genuine volume and team adoption to justify. Smaller teams may find direct API usage with lightweight tooling more cost-effective.
Platform dependency: Building critical prompt logic and workflows in Vellum's system creates vendor dependency. Migration requires re-implementing versioning, evaluation, and deployment infrastructure.
Community size: Smaller ecosystem than LangChain/LangGraph means fewer third-party tutorials, less community troubleshooting history, and a smaller pool of engineers with Vellum-specific expertise.
Who Should Use Vellum#
Strong fit:
- Product teams with deployed LLM applications where prompt regressions are causing production quality issues
- Teams with multiple AI workflows that need systematic evaluation before deployment
- Engineering teams that want prompt management without building custom internal tooling
- Organizations with multiple LLM use cases where model-agnostic infrastructure has value
Poor fit:
- Teams in early prototype phase where the production discipline overhead isn't justified yet
- Developers building tightly code-integrated LangChain pipelines where LangSmith's native integration is more natural
- Organizations with strict on-premises requirements where cloud-hosted prompt storage is not acceptable
- Very high-volume teams where Vellum's execution-based pricing makes direct API management more economical
Verdict#
Vellum earns a 4.1/5 rating. For teams who have shipped LLM applications and are dealing with the operational realities — prompt drift, inconsistent quality, no systematic testing before deploys — Vellum provides exactly the infrastructure needed. The prompt versioning model, evaluation framework, and visual workflow builder address real production pain points.
The platform dependency and pricing scale are legitimate concerns. Teams should evaluate whether the operational benefits justify the cost relative to lightweight alternatives. But for teams running multiple LLM products with real stakes on quality consistency, Vellum's ROI case is strong.
Related Resources#
- LangGraph Review — Code-first alternative for complex agent workflows
- n8n Review — Open-source automation for agent workflows
- Dify Review — Open-source LLM app development platform
- Vellum in the AI Agent Directory
- Agentic RAG Glossary Term — RAG architecture Vellum implements
- Agent Tracing Glossary Term — Observability concepts in Vellum
Frequently Asked Questions#
What is Vellum and what problem does it solve?#
Vellum is an LLM development platform for production AI applications. It solves prompt management (versioning and deployment), evaluation (systematic quality testing before deploys), and workflow orchestration (multi-step AI pipelines). Teams typically discover Vellum after shipping LLM applications and realizing they need more discipline than ad-hoc prompt management in application code.
How does Vellum prompt versioning work?#
Prompts are stored in Vellum's registry as versioned objects. Application code calls a deployment alias (e.g., "my-prompt-production") that resolves to the current production version. Changes create new versions that can be evaluated before promotion. Rollback is a Vellum UI operation — no code deployment required.
How does Vellum evaluation work?#
Vellum's evaluation system lets teams define test cases (input + expected output) and run automated scoring on prompt versions. Metrics include exact match, semantic similarity, LLM-as-judge, and custom Python evaluators. Human review workflows support subjective quality assessment. Evaluations can run in CI/CD pipelines before production deployment.
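A CI/CD gate on evaluation results can be as simple as a threshold check on the fetched run. A minimal sketch, where the `results` dict is a stand-in for whatever your pipeline fetches from Vellum's evaluation API after a run completes:

```python
# Sketch of a CI gate on evaluation results. The results dict is a
# stubbed stand-in for a fetched evaluation run; field names are
# assumptions, not Vellum's documented response shape.
import sys

def gate(results: dict, min_pass_rate: float = 0.9) -> bool:
    """Return True if the evaluation run clears the quality bar."""
    return results["status"] == "COMPLETED" and results["pass_rate"] >= min_pass_rate

results = {"status": "COMPLETED", "pass_rate": 0.95}  # stubbed API response
if not gate(results):
    sys.exit("Evaluation below threshold, blocking deployment")
print("Evaluation passed, safe to promote")
```

Wired into a pipeline step before promotion, a failed gate blocks the deploy the same way a failed unit-test stage would.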
Can Vellum build full AI agent workflows?#
Yes — Vellum's workflow builder supports multi-step pipelines with LLM nodes, document search (RAG), conditional branching, API calls, and code execution. Complex agent patterns (plan-and-execute, iterative refinement) can be implemented visually and deployed through a single API endpoint.
How does Vellum compare to LangSmith?#
LangSmith integrates deeply with LangChain for tracing, logging, and evaluation of LangChain-based applications. Vellum is framework-agnostic, works with any LLM SDK, and emphasizes prompt versioning and management more strongly. Teams building with LangChain often find LangSmith natural; teams with custom pipelines or multi-framework environments often prefer Vellum.