AI/ML Platform · AI Startup · 6 weeks

Building Data Infrastructure for an AI Agent Platform

How we helped an AI company gain visibility into their agent performance and reduce debugging time by 80%

Data Pipeline Engineering · Database Architecture · Observability Dashboards
80% Faster Debugging: Reduced time to identify and fix agent issues from hours to minutes
3x Cost Visibility: Full breakdown of token usage, API costs, and compute spend per agent
15 min Real-time Alerts: Issues detected and flagged within 15 minutes of occurrence
$0 Ongoing Tool Costs: Self-hosted infrastructure with no per-seat or usage-based fees

Overview

A fast-growing AI startup was running multiple autonomous agents in production but had zero visibility into what was happening inside them. Logs were scattered, costs were unpredictable, and debugging took hours. We built a unified data pipeline that transformed their raw LangSmith traces into actionable insights.

The Challenge

The client had built a sophisticated multi-agent system handling customer support workflows. Each agent made decisions, called APIs, and processed documents autonomously. The problem? They had no idea what was actually happening.

LangSmith captured traces, but they were just raw JSON dumps nobody looked at. When an agent misbehaved—wrong responses, infinite loops, or unexpected costs—the engineering team would spend 3-4 hours manually digging through logs to understand what went wrong.

Their monthly AI spend was climbing unpredictably. They suspected certain agents were inefficient, but couldn't prove it. Leadership wanted cost attribution per customer, but the data wasn't structured for that.

Key Points

No unified view of agent behavior across the system
Debugging required manual log diving (3-4 hours per incident)
Unpredictable monthly AI costs with no attribution
LangSmith data sitting unused in raw JSON format

Our Approach

We started with a 2-day discovery sprint to understand their agent architecture, existing data sources, and what questions they actually needed answered. From there, we designed a pipeline that would flow data from LangSmith into a queryable analytics layer. The key insight was that they didn't need real-time streaming for everything—most questions could be answered with 15-minute latency. This let us build a simpler, more reliable batch pipeline rather than over-engineering with Kafka.

Key Points

Discovery sprint to map data sources and requirements
Batch pipeline design prioritizing reliability over complexity
15-minute latency SLA—fast enough for alerts, simple enough to maintain
Clear data model designed around their actual questions
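The 15-minute batch cadence reduces to a simple incremental-pull pattern: each run covers the window from the last successful watermark up to the most recent completed interval boundary. A minimal sketch of that window math (the function name and watermark handling are our illustration, not the client's actual code):

```python
from datetime import datetime, timedelta, timezone

def next_pull_window(last_watermark: datetime, now: datetime,
                     interval: timedelta = timedelta(minutes=15)) -> tuple[datetime, datetime]:
    """Return the (start, end) window for the next incremental pull.

    The window starts at the last successful watermark and ends at the
    most recent *completed* interval boundary, so a late or retried run
    simply covers a larger window instead of dropping data.
    """
    # Truncate `now` down to the last completed 15-minute boundary.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    completed_intervals = (now - epoch) // interval
    end = epoch + completed_intervals * interval
    return last_watermark, end
```

Because the window always starts at the stored watermark, a failed run needs no special recovery: the next run picks up everything the failed one missed.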

The Solution

We built a three-layer architecture:

Data Ingestion: a Python-based ETL pipeline pulls trace data from LangSmith every 15 minutes, normalizes the nested JSON structures, and extracts key metrics including token counts, latencies, tool calls, and error rates.

Storage: split by purpose. Raw traces land in S3 for long-term retention and compliance, while aggregated metrics go into ClickHouse for fast analytical queries. This dual approach keeps storage costs low while enabling sub-second dashboard queries.

Visualization: Grafana dashboards provide real-time visibility into agent performance. Custom panels show token usage trends, error rates by agent type, latency distributions, and cost breakdowns by customer.

Key Points

LangSmith → S3 (raw) + ClickHouse (aggregated)
Python ETL with robust error handling and retry logic
Grafana dashboards with 20+ panels covering all key metrics
Automated alerts via Slack for anomalies and cost spikes

Technical Implementation

The pipeline runs on their existing AWS infrastructure with minimal additional resources. We used Lambda for the ETL jobs (cost-effective for their volume), S3 for raw storage, and a small ClickHouse cluster for analytics. We made several key technical decisions:

ClickHouse over Postgres: their query patterns (time-series aggregations, high-cardinality metrics) are exactly what ClickHouse excels at. Queries that took 30+ seconds in Postgres now complete in under 1 second.

S3 for raw data: compliance required 2-year retention. Storing raw JSON in S3 costs ~$5/month for their volume vs. $200+/month in a database.

Grafana over custom dashboards: it's battle-tested, self-hosted, and their team already knew it. No point reinventing the wheel.

Key Points

AWS Lambda for serverless ETL (pay-per-execution)
ClickHouse for sub-second analytical queries
S3 for compliant long-term storage at $5/month
Grafana with custom panels and Slack alerting
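A cost-spike check like the one behind the Slack alerts can be as simple as comparing the current 15-minute bucket against its historical baseline. A minimal sketch (the threshold logic and parameter values are our assumptions, not the production alerting rules):

```python
from statistics import mean, pstdev

def is_cost_spike(history: list[float], current: float,
                  sigma: float = 3.0, min_points: int = 8) -> bool:
    """Flag the current cost bucket if it exceeds the historical mean
    by `sigma` standard deviations.

    Returns False until enough history has accumulated, so a freshly
    deployed agent doesn't page anyone on its first few buckets.
    """
    if len(history) < min_points:
        return False
    mu = mean(history)
    sd = pstdev(history)
    # Floor the deviation at 5% of the mean so a perfectly flat
    # baseline (sd == 0) doesn't alert on trivial noise.
    threshold = mu + sigma * max(sd, 0.05 * mu)
    return current > threshold
```

The same shape works for error-rate and latency anomalies; only the metric fed into `history` changes.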

The Results

Within two weeks of going live, the client identified two agents that were using 4x more tokens than necessary due to inefficient prompts. Fixing these saved them $3,000/month in API costs—more than the entire project cost. Debugging time dropped dramatically. When an agent started behaving unexpectedly, engineers could now open a dashboard, see exactly what happened (which tools were called, what the LLM returned, where it went wrong), and fix it in minutes instead of hours. Leadership finally got the cost attribution they needed. They could now see AI spend per customer, per agent, per day—enabling better pricing decisions and identifying their most expensive (and profitable) use cases.

Key Points

$3,000/month saved by identifying inefficient agents
Debugging reduced from 3-4 hours to 15-20 minutes
Full cost attribution by customer, agent, and time period
Zero ongoing tool costs (self-hosted infrastructure)
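The cost-attribution rollup behind these numbers is conceptually a group-by over the flattened metric rows. A sketch of the idea, with illustrative per-1K-token prices (a real rollup would use the provider's actual rate card per model):

```python
from collections import defaultdict
from datetime import date

# Illustrative prices per 1,000 tokens; assumptions for this sketch only.
PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}

def attribute_costs(rows: list[dict]) -> dict[tuple[str, str, date], float]:
    """Roll token usage up into cost per (customer, agent, day)."""
    totals: dict[tuple[str, str, date], float] = defaultdict(float)
    for row in rows:
        key = (row["customer"], row["agent"], row["day"])
        cost = (row["prompt_tokens"] / 1000 * PRICE_PER_1K["prompt"]
                + row["completion_tokens"] / 1000 * PRICE_PER_1K["completion"])
        totals[key] += cost
    return dict(totals)
```

In production this lives as a ClickHouse aggregation feeding a Grafana panel; the Python version just makes the grouping explicit.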

Tech Stack

Python · AWS Lambda · S3 · ClickHouse · Grafana · LangSmith API

We went from flying blind to having complete visibility into our AI systems. The dashboard is now the first thing we check every morning.

Head of Engineering

AI Startup

Want Similar Results?

Let's discuss how we can build the data infrastructure your AI team needs. No sales pitch—just a technical conversation about your challenges.

Book a Call
Response within 24 hours
Primastat | Data Infrastructure & Observability for AI Companies