LLM observability tools are exploding in popularity—LangSmith, Helicone, Langfuse, and dozens more. But they all share a problem: they get expensive fast, especially at scale. If you're running thousands of LLM calls per day, you might be better off building your own observability stack. Here's exactly how to do it.
What You Actually Need to Track
Before building anything, define what you need to observe:
Token Usage: Input and output tokens per request. This directly maps to cost.
Latency: Time to first token (TTFT) and total completion time. Critical for user experience.
Error Rates: Failed requests, rate limits, malformed responses.
Context: What triggered the request? User ID, session, agent type, etc.
Quality Signals: User feedback, automated evaluations, detected hallucinations.
Most teams start by tracking everything, then realize they only look at 20% of the data. Start minimal.
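As a concrete starting point, the minimal set above fits in one flat record per call. Here's a sketch in Python; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json
import time

@dataclass
class LLMCallRecord:
    """One row per LLM call: token usage, latency, errors, and context."""
    timestamp: float                  # epoch seconds when the call started
    model: str                        # e.g. "gpt-4o-mini"
    input_tokens: int
    output_tokens: int
    latency_ms: float                 # total completion time
    ttft_ms: Optional[float] = None   # time to first token, if streaming
    status: str = "ok"                # "ok", "rate_limited", "error"
    user_id: Optional[str] = None
    agent_type: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self))

record = LLMCallRecord(
    timestamp=time.time(), model="gpt-4o-mini",
    input_tokens=412, output_tokens=128, latency_ms=950.0,
    user_id="u_123", agent_type="support",
)
```

A flat record like this serializes cleanly to JSONL, which matters later: it's what the batch job will parse.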
The Architecture
Our recommended stack for self-hosted LLM observability:
Logging Layer: Wrap your LLM calls in a logging function that captures request/response metadata. Store raw logs in S3 (cheap, and the data never leaves your infrastructure, which helps with compliance).
Processing Layer: A batch job (Lambda, Airflow, or Mage) that runs every 15 minutes. It reads raw logs, extracts metrics, and writes to your analytics database.
Storage Layer: ClickHouse for time-series metrics. It handles billions of rows and returns aggregate queries in milliseconds. Postgres works fine at smaller scale.
Visualization: Grafana dashboards. Open source, battle-tested, and your team probably already knows it.
Implementation Details
The logging wrapper is the key piece. Here's the concept:
Before every LLM call, capture: timestamp, model, prompt length, and context (user ID, agent type, etc.).
After the call, capture: completion tokens, total latency, response status, and any errors.
Write this to a log file or send directly to S3. Don't process it inline—that adds latency to your application.
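A sketch of that wrapper as a Python decorator, assuming a JSONL log file as the sink (the `call_llm` function, log path, and `completion_tokens` attribute are placeholders; a production version would batch writes or ship to S3 asynchronously, and adapt token extraction to your client library):

```python
import functools
import json
import time

LOG_PATH = "/tmp/llm_calls.jsonl"  # placeholder; swap for an async S3 uploader in production

def observe_llm(model: str, **context):
    """Wrap an LLM call, logging metadata before and after, without inline processing."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(prompt, *args, **kwargs):
            # Captured BEFORE the call: timestamp, model, prompt length, context.
            record = {
                "timestamp": time.time(),
                "model": model,
                "prompt_chars": len(prompt),
                **context,  # user_id, agent_type, etc.
            }
            start = time.perf_counter()
            try:
                response = fn(prompt, *args, **kwargs)
                record["status"] = "ok"
                # Assumes the response object exposes a token count; adapt to your client.
                record["completion_tokens"] = getattr(response, "completion_tokens", None)
                return response
            except Exception as exc:
                record["status"] = "error"
                record["error"] = repr(exc)
                raise
            finally:
                # Captured AFTER the call: latency, status, errors.
                # Append-only JSONL write; cheap enough to stay out of the hot path.
                record["latency_ms"] = (time.perf_counter() - start) * 1000
                with open(LOG_PATH, "a") as f:
                    f.write(json.dumps(record) + "\n")
        return wrapper
    return decorator

@observe_llm(model="gpt-4o-mini", agent_type="support")
def call_llm(prompt):
    ...  # your actual client call goes here
```

The decorator pattern keeps observability out of your business logic: any function making an LLM call gets instrumented with one line.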
The batch job picks up these logs, parses them, and computes aggregations: tokens per minute, p99 latency, error rates by model, costs by user. These aggregates go into ClickHouse.
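The core of that batch job can be sketched in pure-stdlib Python (a real job would read the JSONL from S3 and write rows to ClickHouse, e.g. via a client library, rather than returning dicts):

```python
import json
from collections import defaultdict

def aggregate(log_lines):
    """Compute per-model aggregates (requests, error rate, p99 latency, tokens)
    from raw JSONL log lines."""
    by_model = defaultdict(lambda: {"latencies": [], "errors": 0, "tokens": 0, "requests": 0})
    for line in log_lines:
        rec = json.loads(line)
        m = by_model[rec["model"]]
        m["requests"] += 1
        m["latencies"].append(rec["latency_ms"])
        m["tokens"] += rec.get("completion_tokens") or 0
        if rec["status"] != "ok":
            m["errors"] += 1
    rows = []
    for model, m in by_model.items():
        lat = sorted(m["latencies"])
        # p99 via nearest-rank; fine for a batch job over thousands of samples
        p99 = lat[min(len(lat) - 1, int(0.99 * len(lat)))]
        rows.append({
            "model": model,
            "requests": m["requests"],
            "error_rate": m["errors"] / m["requests"],
            "p99_latency_ms": p99,
            "completion_tokens": m["tokens"],
        })
    return rows
```

Because the job runs every 15 minutes over a bounded window of logs, simple in-memory aggregation like this goes a long way before you need anything fancier.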
The Dashboard
Your Grafana dashboard should answer these questions at a glance:
Health: Are error rates normal? Any latency spikes?
Costs: What's our daily/weekly spend? Which agents cost the most?
Usage: Request volume trends. Any unexpected spikes?
Deep Dives: Ability to filter by user, agent, time range.
We typically build 15-20 panels covering these areas. The key is making the default view useful—you shouldn't have to click around to see if something's broken.
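As an example of what backs the health panel, a Grafana panel pointed at ClickHouse might run a query like the one below. The table and column names here are assumptions matching the aggregates described above, not a fixed schema:

```python
# ClickHouse query a Grafana "health" panel might run: per-minute error rate
# and p99 latency over the last hour. countIf and quantile are ClickHouse
# aggregate functions; llm_metrics is an assumed table name.
HEALTH_PANEL_SQL = """
SELECT
    toStartOfMinute(ts) AS minute,
    countIf(status != 'ok') / count() AS error_rate,
    quantile(0.99)(latency_ms) AS p99_latency_ms
FROM llm_metrics
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute
"""
```

One query like this per panel, with Grafana template variables for user, agent, and time range, covers the deep-dive filtering as well.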
When to Build vs. Buy
Build your own if:
- You're making 10K+ LLM calls per day (tools get expensive)
- You have compliance/security requirements (data stays in your infra)
- You need custom metrics specific to your use case
- You have engineering capacity to maintain it
Buy a tool if:
- You're early stage and moving fast
- Volume is low (<5K calls/day)
- You need built-in features like prompt versioning and A/B testing
- Your team doesn't have data engineering expertise
Key Takeaways
- Self-hosted LLM observability can cost $50-100/month vs. $500+/month for tools
- Start with token usage, latency, and errors—add more metrics as needed
- ClickHouse + Grafana is a proven stack for time-series analytics
- Build if you're at scale or have compliance needs; buy if you're early stage