LLM observability tools are exploding in popularity—LangSmith, Helicone, Langfuse, and dozens more. But they all share a problem: they get expensive fast, especially at scale. If you're running thousands of LLM calls per day, you might be better off building your own observability stack. Here's exactly how to do it.
What You Actually Need to Track
Before building anything, define what you need to observe:
Token Usage: Input and output tokens per request. This directly maps to cost.
Latency: Time to first token (TTFT) and total completion time. Critical for user experience.
Error Rates: Failed requests, rate limits, malformed responses.
Context: What triggered the request? User ID, session, agent type, etc.
Quality Signals: User feedback, automated evaluations, detected hallucinations.
Most teams start by tracking everything, then realize they only look at 20% of the data. Start minimal.
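As a concrete starting point, the minimal set above fits in one flat record per call. Here's a sketch in Python; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json
import time

@dataclass
class LLMCallRecord:
    """One row per LLM call: token usage, latency, errors, and context."""
    timestamp: float                  # epoch seconds when the call started
    model: str                        # e.g. "gpt-4o-mini"
    input_tokens: int
    output_tokens: int
    latency_ms: float                 # total completion time
    ttft_ms: Optional[float] = None   # time to first token, if streaming
    status: str = "ok"                # "ok", "rate_limited", "error"
    user_id: Optional[str] = None
    agent_type: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self))

record = LLMCallRecord(
    timestamp=time.time(), model="gpt-4o-mini",
    input_tokens=412, output_tokens=128, latency_ms=950.0,
    user_id="u_123", agent_type="support",
)
```

A flat record like this serializes cleanly to JSONL, which matters later: it's what the batch job will parse.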
The Architecture
Our recommended stack for self-hosted LLM observability:
Logging Layer: Wrap your LLM calls in a logging function that captures request/response metadata. Store raw logs in S3 (cheap, and the data never leaves your infrastructure, which helps with compliance).
Processing Layer: A batch job (Lambda, Airflow, or Mage) that runs every 15 minutes. It reads raw logs, extracts metrics, and writes to your analytics database.
Storage Layer: ClickHouse for time-series metrics. It handles billions of rows and returns aggregate queries in milliseconds. Postgres works fine at smaller scale.
Visualization: Grafana dashboards. Open source, battle-tested, and your team probably already knows it.
Implementation Details
The logging wrapper is the key piece. Here's the concept:
Before every LLM call, capture: timestamp, model, prompt length, and context (user ID, agent type, etc.).
After the call, capture: completion tokens, total latency, response status, and any errors.
Write this to a log file or send directly to S3. Don't process it inline—that adds latency to your application.
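A sketch of that wrapper as a Python decorator, assuming a JSONL log file as the sink (the `call_llm` function, log path, and `completion_tokens` attribute are placeholders; a production version would batch writes or ship to S3 asynchronously, and adapt token extraction to your client library):

```python
import functools
import json
import time

LOG_PATH = "/tmp/llm_calls.jsonl"  # placeholder; swap for an async S3 uploader in production

def observe_llm(model: str, **context):
    """Wrap an LLM call, logging metadata before and after, without inline processing."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(prompt, *args, **kwargs):
            # Captured BEFORE the call: timestamp, model, prompt length, context.
            record = {
                "timestamp": time.time(),
                "model": model,
                "prompt_chars": len(prompt),
                **context,  # user_id, agent_type, etc.
            }
            start = time.perf_counter()
            try:
                response = fn(prompt, *args, **kwargs)
                record["status"] = "ok"
                # Assumes the response object exposes a token count; adapt to your client.
                record["completion_tokens"] = getattr(response, "completion_tokens", None)
                return response
            except Exception as exc:
                record["status"] = "error"
                record["error"] = repr(exc)
                raise
            finally:
                # Captured AFTER the call: latency, status, errors.
                # Append-only JSONL write; cheap enough to stay out of the hot path.
                record["latency_ms"] = (time.perf_counter() - start) * 1000
                with open(LOG_PATH, "a") as f:
                    f.write(json.dumps(record) + "\n")
        return wrapper
    return decorator

@observe_llm(model="gpt-4o-mini", agent_type="support")
def call_llm(prompt):
    ...  # your actual client call goes here
```

The decorator pattern keeps observability out of your business logic: any function making an LLM call gets instrumented with one line.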
The batch job picks up these logs, parses them, and computes aggregations: tokens per minute, p99 latency, error rates by model, costs by user. These aggregates go into ClickHouse.
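The core of that batch job can be sketched in pure-stdlib Python (a real job would read the JSONL from S3 and write rows to ClickHouse, e.g. via a client library, rather than returning dicts):

```python
import json
from collections import defaultdict

def aggregate(log_lines):
    """Compute per-model aggregates (requests, error rate, p99 latency, tokens)
    from raw JSONL log lines."""
    by_model = defaultdict(lambda: {"latencies": [], "errors": 0, "tokens": 0, "requests": 0})
    for line in log_lines:
        rec = json.loads(line)
        m = by_model[rec["model"]]
        m["requests"] += 1
        m["latencies"].append(rec["latency_ms"])
        m["tokens"] += rec.get("completion_tokens") or 0
        if rec["status"] != "ok":
            m["errors"] += 1
    rows = []
    for model, m in by_model.items():
        lat = sorted(m["latencies"])
        # p99 via nearest-rank; fine for a batch job over thousands of samples
        p99 = lat[min(len(lat) - 1, int(0.99 * len(lat)))]
        rows.append({
            "model": model,
            "requests": m["requests"],
            "error_rate": m["errors"] / m["requests"],
            "p99_latency_ms": p99,
            "completion_tokens": m["tokens"],
        })
    return rows
```

Because the job runs every 15 minutes over a bounded window of logs, simple in-memory aggregation like this goes a long way before you need anything fancier.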
The Dashboard
Your Grafana dashboard should answer these questions at a glance:
Health: Are error rates normal? Any latency spikes?
Costs: What's our daily/weekly spend? Which agents cost the most?
Usage: Request volume trends. Any unexpected spikes?
Deep Dives: Ability to filter by user, agent, time range.
We typically build 15-20 panels covering these areas. The key is making the default view useful—you shouldn't have to click around to see if something's broken.
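As an example of what backs the health panel, a Grafana panel pointed at ClickHouse might run a query like the one below. The table and column names here are assumptions matching the aggregates described above, not a fixed schema:

```python
# ClickHouse query a Grafana "health" panel might run: per-minute error rate
# and p99 latency over the last hour. countIf and quantile are ClickHouse
# aggregate functions; llm_metrics is an assumed table name.
HEALTH_PANEL_SQL = """
SELECT
    toStartOfMinute(ts) AS minute,
    countIf(status != 'ok') / count() AS error_rate,
    quantile(0.99)(latency_ms) AS p99_latency_ms
FROM llm_metrics
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute
"""
```

One query like this per panel, with Grafana template variables for user, agent, and time range, covers the deep-dive filtering as well.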
When to Build vs. Buy
Build your own if:
- You're making 10K+ LLM calls per day (tools get expensive)
- You have compliance/security requirements (data stays in your infra)
- You need custom metrics specific to your use case
- You have engineering capacity to maintain it
Buy a tool if:
- You're early stage and moving fast
- Volume is low (<5K calls/day)
- You need built-in features like prompt versioning and A/B testing
- Your team doesn't have data engineering expertise
Key Takeaways
- Self-hosted LLM observability can cost $50-100/month vs. $500+/month for tools
- Start with token usage, latency, and errors—add more metrics as needed
- ClickHouse + Grafana is a proven stack for time-series analytics
- Build if you're at scale or have compliance needs; buy if you're early stage