Vellum: Complete Platform Profile
Vellum is a platform for building, testing, evaluating, and deploying LLM-powered applications in production. Founded in 2023 and backed by Y Combinator, Vellum addresses a specific and important problem: the gap between prototyping an AI application in a notebook or playground and running that application reliably at production scale.
The platform is designed for software engineering teams that are already building with LLMs and need infrastructure for prompt versioning, systematic evaluation, workflow orchestration, and performance monitoring. Vellum occupies the LLMOps category — a discipline focused on the operational challenges of running large language models in production, analogous to how DevOps and MLOps address operational challenges in traditional software and machine learning. For broader context on the AI agent tooling landscape, see the AI Agents profiles directory.
Overview
Vellum's founding premise was that shipping LLM-powered features to production is fundamentally different from shipping traditional software — and that the tooling available in 2023 was inadequate to address the difference. The specific challenges Vellum targets are: prompt brittleness (small prompt changes causing unexpected regressions), model selection complexity (choosing among dozens of models with different cost, latency, and capability tradeoffs), evaluation difficulty (defining and measuring "better" for generative outputs), and observability gaps (understanding what your production LLM application is actually doing).
The platform has evolved from a prompt management tool into a broader AI application development platform. The current product covers four interconnected areas: Prompts (versioning and deployment), Documents (RAG and knowledge management), Workflows (multi-step AI pipeline orchestration), and Evaluations (testing and quality assurance).
Vellum targets product engineering teams at companies actively shipping AI features — not researchers, not data scientists running experiments, but engineers responsible for production reliability of AI-powered products. This focus on the production engineering persona shapes every product decision: the emphasis on versioning, CI/CD integration, regression testing, and observability.
The platform is available via API and SDK (Python and TypeScript) in addition to a web dashboard. Most serious users interact primarily through the SDK and use the dashboard for monitoring and evaluation review rather than day-to-day workflow.
Core Features
Prompt Management and Versioning
Vellum treats prompts as versioned software artifacts rather than ad hoc strings. Each prompt template is a named entity with a revision history, deployment labels (development, staging, production), and a rich editor that supports Jinja2 templating, variable injection, and multi-message (system, user, assistant) structures.
This versioning model allows teams to: roll back a prompt that caused a regression, A/B test two prompt variants in production, audit which prompt version was active at the time of a specific production incident, and approve prompt changes through a review workflow before deployment. For teams that have experienced the chaos of prompt changes breaking production AI features, this infrastructure is directly valuable.
Prompt deployment uses labels rather than fixed versions — so updating the prompt tagged "production" instantly affects all callers without code changes, enabling prompt-level continuous delivery.
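The label-based deployment model described above can be sketched in plain Python. This is an illustrative toy, not the Vellum SDK: labels point at immutable versions, so promoting or rolling back a version is a single pointer update that never requires a code change in callers.

```python
from dataclasses import dataclass, field

@dataclass
class PromptRegistry:
    """Toy model of label-based prompt deployment (not the Vellum API):
    labels resolve to immutable versions, enabling instant cutover/rollback."""
    versions: dict = field(default_factory=dict)  # version id -> template
    labels: dict = field(default_factory=dict)    # label -> version id

    def publish(self, version_id: str, template: str) -> None:
        self.versions[version_id] = template

    def promote(self, label: str, version_id: str) -> None:
        self.labels[label] = version_id           # instant cutover or rollback

    def resolve(self, label: str) -> str:
        return self.versions[self.labels[label]]

registry = PromptRegistry()
registry.publish("v1", "Summarize: {input}")
registry.publish("v2", "Summarize in one sentence: {input}")
registry.promote("production", "v1")
registry.promote("production", "v2")  # every caller now gets v2, no deploy
```

Rolling back is the same operation in reverse: re-promote "v1" and all production traffic reverts immediately.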
Evaluation Framework
Vellum's evaluation system is the platform's most mature and differentiated feature. Evaluations allow teams to define test cases (input/expected output pairs) and evaluate prompt performance across those test cases automatically. The evaluation engine supports both quantitative metrics (exact match, ROUGE, BLEU for generation tasks; accuracy for classification tasks) and AI-graded metrics (using an LLM evaluator to score outputs on rubrics like helpfulness, accuracy, and tone).
Evaluation runs can be triggered manually, on a schedule, or as part of a CI/CD pipeline on every code push. The platform stores evaluation history so teams can track performance trends over time and detect regressions before they reach production.
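The core of any such evaluation run is scoring a prompt against a suite of test cases. A minimal sketch of the idea, using exact match as the metric and a stand-in function in place of a real LLM call (all names here are illustrative, not Vellum's API):

```python
def run_evaluation(prompt_fn, test_cases):
    """Score a callable against input/expected pairs with exact match."""
    passed = [prompt_fn(case["input"]) == case["expected"] for case in test_cases]
    return sum(passed) / len(passed)

# Hypothetical prompt under test: str.upper stands in for an LLM call.
cases = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "world", "expected": "WORLD"},
    {"input": "vellum", "expected": "vellum"},  # deliberately failing case
]
score = run_evaluation(str.upper, cases)  # 2 of 3 cases pass
```

Storing each run's score over time is what makes trend tracking and regression detection possible: a sudden drop relative to prior runs flags a prompt change for review.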
This systematic evaluation infrastructure is what separates Vellum from simple prompt playgrounds. Teams that use evaluation seriously tend to catch prompt regressions before users do — a significant reliability improvement. See our agent evaluation glossary entry for context on why evaluation is foundational to production AI.
Workflow Builder
Vellum's workflow builder enables multi-step AI pipelines — chains of LLM calls, conditional branching, retrieval steps, code execution, and API calls — designed visually and deployed via API. Workflows support complex patterns like retrieval-augmented generation, chain-of-thought decomposition, multi-model routing, and structured output extraction.
Workflows are versioned and evaluated with the same rigor as individual prompts. Teams can compare two workflow configurations against the same test suite, measure latency and cost differences, and gate workflow promotion to production on evaluation pass rates.
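The shape of such a multi-step pipeline, with conditional branching and a retrieval step, can be sketched in plain Python. The step functions here are stubs standing in for LLM and retrieval calls; this illustrates the pattern, not Vellum's workflow API:

```python
def classify(text: str) -> str:
    """Step 1: cheap routing decision (stub for a classifier call)."""
    return "question" if text.strip().endswith("?") else "statement"

def retrieve(text: str) -> list:
    """Step 2: fetch supporting documents (stub for a retrieval step)."""
    return ["doc-1", "doc-2"]

def answer(text: str, docs: list) -> str:
    """Step 3: grounded generation (stub for an LLM call)."""
    return f"answer({text}) using {len(docs)} docs"

def workflow(text: str) -> str:
    """Conditional pipeline: only questions trigger retrieval + generation."""
    if classify(text) == "question":
        return answer(text, retrieve(text))
    return "acknowledged"
```

Because the whole pipeline is a single callable, it can be run against the same test suites as an individual prompt, which is exactly what makes workflow-level evaluation possible.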
The workflow builder is lower-level than Relevance AI's no-code agent builder — it exposes the mechanics of LLM pipeline design rather than abstracting them. This is appropriate for engineering teams that want control and transparency but may be excessive complexity for non-technical users.
Document Management and RAG
Vellum includes a document management system for building retrieval-augmented generation pipelines. Teams upload documents, configure chunking strategies (by paragraph, sentence, or fixed token count), choose embedding models, and query the resulting index from prompts or workflows.
Document indexing supports multiple chunking and embedding configurations simultaneously, allowing teams to compare retrieval quality across configurations using Vellum's evaluation framework. This is a meaningful advantage over platforms that expose a single fixed RAG pipeline — teams can make evidence-based choices about chunking and embedding strategy rather than guessing.
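To make the chunking-strategy comparison concrete, here is a minimal fixed-size chunker with optional overlap. Real pipelines chunk by tokens rather than characters; characters keep this sketch dependency-free, and the function name is illustrative:

```python
def chunk_fixed(text: str, size: int, overlap: int = 0) -> list:
    """Split text into fixed-size character chunks with optional overlap.
    Overlapping chunks trade index size for better boundary recall."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "abcdefghij"
no_overlap = chunk_fixed(doc, 4)             # ["abcd", "efgh", "ij"]
with_overlap = chunk_fixed(doc, 4, overlap=2)  # 5 chunks, each sharing 2 chars
```

Indexing the same corpus under both configurations and scoring retrieval against a shared test suite is the evidence-based comparison the paragraph above describes.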
Model Routing and Provider Management
Vellum provides a unified API that abstracts over multiple LLM providers — OpenAI, Anthropic, Google Gemini, Cohere, Mistral, Meta Llama, and others. Teams configure which model handles each prompt or workflow step and can implement fallback chains (if the primary model is unavailable or exceeds latency thresholds, route to a backup).
This abstraction layer also simplifies API key management: one set of Vellum credentials rather than separately managing API keys for each provider. Usage tracking and cost attribution are consolidated across providers in the Vellum dashboard.
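The fallback-chain pattern is straightforward to sketch: try providers in priority order and return the first successful response. The callables below are stubs, not real provider clients:

```python
def call_with_fallback(providers: list, prompt: str):
    """Try each (name, callable) in order; return the first success.
    Callables raise on failure (timeout, rate limit, outage)."""
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors[name] = str(exc)  # record and fall through to backup
    raise RuntimeError(f"all providers failed: {errors}")

def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary exceeded latency threshold")

used, reply = call_with_fallback(
    [("primary", flaky_primary), ("backup", lambda p: f"echo: {p}")],
    "hi",
)
```

A production router would also consider latency budgets and cost, but the control flow is the same: failures cascade down the chain instead of surfacing to users.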
Observability and Monitoring
Every production call through Vellum is logged with full request and response capture, latency metrics, token counts, cost attribution, and model metadata. The observability dashboard surfaces aggregate metrics (p50/p95/p99 latency, error rates, cost trends) and enables drilling into individual traces for debugging.
Custom metadata can be attached to requests (user ID, session ID, feature name) for filtering and segmentation. Alert rules can trigger on metric thresholds — latency above X, error rate above Y, cost above Z — with notification to Slack or PagerDuty.
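The threshold-alert logic described above amounts to comparing current metrics against named rules. A minimal sketch with hypothetical metric names and limits:

```python
def check_alerts(metrics: dict, rules: dict) -> list:
    """Return the names of alert rules whose thresholds are breached.
    Each rule maps a name to (metric key, upper limit)."""
    return [name for name, (key, limit) in rules.items() if metrics[key] > limit]

# Illustrative snapshot and rules, not Vellum defaults:
metrics = {"p95_latency_ms": 2400, "error_rate": 0.01, "daily_cost_usd": 80}
rules = {
    "latency": ("p95_latency_ms", 2000),
    "errors": ("error_rate", 0.05),
    "cost": ("daily_cost_usd", 100),
}
breached = check_alerts(metrics, rules)  # only the latency rule fires
```

In a real deployment the breached list would fan out to Slack or PagerDuty rather than being returned to the caller.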
Pricing & Plans
Starter Plan (Free): Includes limited API call logging, prompt management for a small number of prompts, and access to the evaluation framework with limited test case storage. Suitable for individual developers evaluating the platform.
Growth Plan: Approximately $200/month. Expands logging retention, increases prompt and workflow limits, adds team collaboration features, and unlocks higher evaluation run volumes. This is the standard plan for startup teams actively shipping AI features.
Business Plan: Approximately $600–$1,000/month (varies by usage). Adds SSO, audit logging, higher rate limits, dedicated support, and advanced evaluation features. Designed for engineering teams at Series A and beyond.
Enterprise Plan: Custom pricing. Adds VPC deployment, dedicated infrastructure, custom data retention policies, SLA guarantees, and professional services. Enterprise customers with strict data residency requirements often use VPC deployment to keep production traces out of Vellum's shared infrastructure.
LLM API costs (calls to OpenAI, Anthropic, etc.) are passed through to the customer. Vellum adds a small markup (typically a percentage of API spend) on some tiers, while enterprise plans pass costs through at parity with no markup.
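The resulting cost structure is simple arithmetic: platform fee plus LLM spend with an optional markup. The figures below are hypothetical, chosen only to illustrate why high-volume teams should model totals before scaling:

```python
def monthly_total(platform_fee: float, llm_api_spend: float, markup_rate: float) -> float:
    """Illustrative monthly cost: platform fee + LLM spend with markup.
    All inputs are hypothetical, not published Vellum rates."""
    return platform_fee + llm_api_spend * (1 + markup_rate)

growth_tier = monthly_total(200, 1000, 0.05)   # fee-plus-markup tier
enterprise = monthly_total(0, 1000, 0.0)       # at-cost pass-through
```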
Strengths
Systematic evaluation is category-leading. Vellum's evaluation framework — with CI/CD integration, AI-graded rubrics, and performance trend tracking — is among the most mature in the LLMOps space. Teams that use it rigorously can significantly reduce regression risk in AI-powered features.
Prompt versioning and deployment discipline. Treating prompts as versioned, deployable artifacts with environment labels solves a real operational problem that engineering teams running LLMs in production consistently encounter.
Model-agnostic abstraction. The unified API over multiple providers simplifies model experimentation and provider redundancy. Switching a production prompt from GPT-4o to Claude does not require code changes — only a configuration update in Vellum.
Engineering-native design. SDK-first, version-control-friendly, CI/CD-integrated. Vellum is designed to fit into existing engineering workflows rather than requiring adoption of a new development paradigm.
RAG pipeline evaluation. The ability to systematically evaluate retrieval quality across different chunking and embedding configurations is a capability that most teams building RAG applications need but rarely have access to.
Limitations
Less suited to non-technical users. Vellum has a dashboard, but it is not a no-code tool, and non-engineers will find the platform difficult to use effectively. For teams that need business users to configure and manage AI workflows, Relevance AI or similar no-code platforms are more appropriate.
Workflow builder is not a full agent framework. Vellum's workflows handle multi-step pipelines well but do not provide the full agentic loop infrastructure — tool calling, memory management, autonomous goal pursuit — that platforms like Relevance AI or frameworks like LangChain provide.
Community and ecosystem are smaller than older alternatives. Vellum was founded in 2023 and its community, integration ecosystem, and available templates are smaller than more established platforms. Teams benefit from first-party documentation but have less community-generated content to draw from.
Cost model at scale requires management. For teams running very high API call volumes, the combination of Vellum platform fees plus passed-through LLM API costs requires active cost management. High-volume applications should model total cost carefully before scaling on Vellum.
Ideal Use Cases
Product engineering teams shipping LLM features. Vellum is specifically designed for software engineers building and maintaining AI-powered product features — summarization, classification, generation, extraction — who need production reliability and observability.
Teams that have experienced prompt regression pain. Organizations where prompt changes have caused production incidents, or where multiple team members are editing prompts without coordination, will find Vellum's versioning and evaluation infrastructure directly addresses their pain.
AI applications requiring multi-model evaluation. Teams that want to choose intelligently among competing LLMs — comparing GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on their specific tasks — benefit from Vellum's evaluation framework and unified API.
RAG applications requiring retrieval quality management. Document-grounded AI applications (support knowledge bases, internal search, document Q&A) where retrieval quality directly affects user experience benefit from Vellum's structured approach to RAG pipeline evaluation.
Getting Started
Vellum's recommended starting point is the prompt management feature. Sign up, create a prompt template, and make a test call via the Python or TypeScript SDK. The SDK documentation is the primary onboarding path — the web dashboard is most useful once you have a working integration.
The next step is setting up an evaluation: define test cases against your prompt, run an evaluation, and use the results to establish a performance baseline. From this baseline, you can start making prompt improvements with confidence that regressions will be detected.
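The "improve with confidence" step is usually enforced as a gate: a prompt change is promoted only if its evaluation score holds up against the recorded baseline. A minimal sketch of that CI-style check (names and thresholds are illustrative):

```python
def gate_deployment(new_score: float, baseline: float, tolerance: float = 0.0) -> bool:
    """CI-style gate: allow promotion only when the new evaluation score
    does not regress below the baseline (minus an optional tolerance)."""
    return new_score >= baseline - tolerance

# Hypothetical scores: block a 5-point regression, allow within tolerance.
assert gate_deployment(0.92, baseline=0.90)
assert not gate_deployment(0.85, baseline=0.90)
assert gate_deployment(0.89, baseline=0.90, tolerance=0.02)
```

Wiring this check into a CI pipeline means a failing evaluation blocks the merge, which is how regressions get caught before they reach production.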
Workflow integration follows: identify a multi-step LLM pipeline in your product, recreate it in Vellum's workflow builder, and evaluate the workflow output the same way you evaluate individual prompts.
Teams building knowledge bases should review the how to train an AI agent on your own data tutorial for foundational concepts on document processing, chunking, and retrieval evaluation that apply directly to Vellum's document management features.
How It Compares
Vellum vs LangSmith. LangSmith (by LangChain) offers similar LLM observability and evaluation capabilities with deep LangChain integration. Vellum is model-agnostic and does not require LangChain adoption. Teams building with LangChain should evaluate LangSmith closely. Teams building with multiple frameworks or raw API calls will likely prefer Vellum's model-agnostic approach.
Vellum vs Promptfoo. Promptfoo is an open-source prompt testing tool — narrower in scope (evaluation only, no deployment or observability) but free and self-hosted. Vellum is a more complete platform with deployment, monitoring, and workflow features. For teams that only need evaluation and want open-source, Promptfoo is a viable alternative.
Vellum vs Relevance AI. These platforms serve different needs. Vellum is LLMOps infrastructure for engineers building AI features. Relevance AI is a no-code platform for building AI agents and automating business workflows. They can coexist: Relevance AI for business user-facing workflows, Vellum for engineering-owned AI feature infrastructure.
Bottom Line
Vellum is the right platform for engineering teams that are serious about production reliability in their LLM-powered products. Its evaluation framework, prompt versioning, and observability capabilities address real operational pain that teams running AI in production consistently encounter.
The platform is not for non-technical users, not a replacement for full agent orchestration frameworks, and not the lowest-cost path to simple AI integration. But for engineering teams who understand why production AI needs the same rigor as production software — versioning, testing, monitoring, rollback — Vellum provides the infrastructure to get there.
Browse more platform profiles in the AI Agents directory. Understand how agents handle observability in the agent observability glossary entry.