Treating AI Like Any Other Dependency

What changes when AI becomes part of your system


Software Engineer with over 7 years of experience designing and delivering scalable systems for a variety of companies. Experienced in building real-time applications, payment solutions, and AI-driven integrations. I spend most of my time designing systems and finding ways to make processes smoother and more efficient.

Who this is for: Engineers building or operating LLM-powered features in production environments.

What you'll learn: How to think about LLM integrations as infrastructure dependencies, and the operational challenges around reliability, cost, and observability.


What changes when AI becomes part of your system

Building an LLM-powered feature as a proof of concept (PoC) is like cooking for yourself rather than running a restaurant. You can experiment and take shortcuts. If something breaks, you're the only one who cares.

Deploying that same feature to production is fundamentally different. You're serving real users with real expectations. Latency matters. Consistency matters. Cost per request matters. When something breaks, users notice.

The gap between "works in Postman" and "handles thousands of requests per minute reliably" is where most teams struggle. Not because the models are weak, but because the systems were architected for the PoC environment and never properly redesigned for production load.


Understanding the shift: PoC vs production

The difference between a proof of concept and production isn't just about scale. It's about operational maturity.

PoC | Production
Single use case | Many user paths
Low traffic | Variable, bursty load
Manual testing | Continuous evaluation
Cost often ignored | Cost is a hard constraint
Failures tolerated | Failures managed

In a PoC, you're validating whether something can work. In production, you're proving it works reliably under real-world conditions whilst meeting your SLAs and SLOs. That distinction changes everything about how you design and operate your system.


Treating LLMs as dependencies, not features

Once you deploy code that calls an LLM API, it behaves like any other external service you depend on. Think about Stripe for payments or SendGrid for email. You don't assume these services are always fast or always available. You design around their constraints and build your system to handle their limitations gracefully.

LLM APIs need exactly the same treatment. When you make calls to an LLM, you're making network requests to something you don't control. These APIs have variable latency, rate limits, per-token pricing, and non-deterministic failures.

Once an LLM call sits in your critical path, your service inherits all these characteristics. Running LLM-powered features in production becomes less about prompt engineering and more about distributed systems design. You're building infrastructure that needs to be reliable, cost-effective, and observable.

Where LLMs sit in your architecture

User Request
    |
    v
Application Layer
    |
    +--> Payment API (Stripe / Paystack)
    |
    +--> Email Service (SendGrid)
    |
    +--> LLM API (OpenAI / Anthropic / etc)
            |
            +--> Latency (varies by request)
            +--> Cost (per token)
            +--> Rate limits (requests/min, tokens/min)
            +--> Failure modes (timeouts, rate limits, errors, hallucinations)

LLMs sit alongside your other critical dependencies. They're infrastructure, not magic. Just another API that needs the same operational rigour as everything else in your stack.


The reliability challenge: LLMs fail differently

Traditional APIs fail in obvious ways. Connection timeouts. 500 errors. Rate limit responses. Your monitoring catches these failures immediately.

LLM APIs often fail quietly[1]. The request succeeds with a 200 status code. The response is well-formed JSON that matches your expected schema. But the actual content is wrong, inconsistent, or low quality. From an infrastructure perspective, everything looks healthy. Your uptime metrics are green. Your error rates are low. But from a user's perspective, the feature is broken.

These quiet failures show up in ways that standard monitoring doesn't catch. Inconsistent responses to the same input[7]. Slow quality degradation over time. Catastrophic failures on edge case inputs that only surface when users report them. This requires different approaches to detection and handling.

What reliability means for LLM-powered features

For features that depend on LLMs, reliability isn't just about keeping the service up. It includes maintaining consistent output quality[7], ensuring predictable behaviour across your input distribution, keeping latency acceptable under production load, and having well-defined fallback behaviour when things go wrong.

The failure modes you need to handle fall into several categories:

Hard failures: Timeouts, 429 rate limit errors, 503 service unavailable responses.

Soft failures: Latency creeping up, throughput degrading, response quality declining over time.

Silent failures: Quality drift that shows up in your data but triggers no alerts or logs[1]. Just users slowly losing trust.

Cost failures: Token usage growing faster than expected or faster than revenue, making the feature economically unsustainable.

Silent failures are particularly difficult. No alerts. No error logs. No obvious signs that something is wrong. Just users noticing the feature isn't as good as it used to be.


Designing systems that handle failure gracefully

Production services never assume their dependencies are perfect. When Stripe is slow, you don't block the entire checkout flow indefinitely. When email delivery fails, you queue the message and retry with exponential backoff. The same principles apply to LLM calls.
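The retry-with-backoff pattern mentioned above can be sketched roughly as follows. This is a minimal illustration rather than a production implementation: `TransientLLMError` and `call_with_backoff` are hypothetical names, and it assumes your client code maps retryable failures (timeouts, 429s, 503s) onto that one exception type.

```python
import random
import time


class TransientLLMError(Exception):
    """Hypothetical error type for retryable failures (timeouts, 429s, 503s)."""


def call_with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Run call(); retry on TransientLLMError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientLLMError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the ceiling so that many
            # clients retrying at once do not hit the provider in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable only so the behaviour can be tested without real waiting. Non-retryable errors (for example, an invalid request) deliberately propagate on the first attempt.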

You need hard timeouts on every request to enforce your latency requirements. If your feature needs to respond in under 2 seconds, your LLM call might need a 1-second timeout to leave room for other processing. You need fallback responses ready for when quality degrades or latency spikes beyond acceptable levels[2]. Routing requests dynamically, based on their complexity and the current state of the system, helps keep latency within those limits.
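One way to enforce a hard timeout with a fallback is sketched below. It assumes `llm_call` is any blocking function you supply that takes a prompt and returns text; `call_with_timeout` and `FALLBACK` are hypothetical names. Where your provider's client supports a timeout setting of its own, setting that as well is usually preferable, because this wrapper only stops waiting and cannot abort a call that is already running.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool. A per-call "with ThreadPoolExecutor()" block would wait for the
# slow call to finish on exit, which would defeat the timeout.
_pool = ThreadPoolExecutor(max_workers=8)

FALLBACK = "Sorry, we couldn't generate a response right now."


def call_with_timeout(llm_call, prompt, timeout_s=1.0, fallback=FALLBACK):
    """Return (text, used_fallback). The caller waits at most timeout_s."""
    future = _pool.submit(llm_call, prompt)
    try:
        return future.result(timeout=timeout_s), False
    except FutureTimeout:
        # cancel() only helps if the call has not started yet; a call that is
        # already running continues in the background until it returns.
        future.cancel()
        return fallback, True
    except Exception:
        # Any provider error also degrades to the fallback in this sketch.
        return fallback, True
```

Returning the `used_fallback` flag lets the caller record how often the fallback path is taken, which is itself a useful reliability metric.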

You need circuit breakers to stop retry storms when the LLM provider is having issues. And you need request routing logic that sends simple queries to faster, cheaper models whilst saving expensive, powerful models for complex tasks.
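A circuit breaker of the kind described here can be sketched in a few lines. This is a simplified, single-threaded illustration; the class name, thresholds, and cool-down are assumptions, and a real service would likely add locking and per-provider state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected without contacting the provider."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Still cooling down: fail fast instead of adding load.
                raise CircuitOpenError("provider circuit is open")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would typically catch `CircuitOpenError` and serve the fallback response immediately, so a struggling provider sees less traffic rather than a retry storm.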

The goal isn't to prevent failure completely. That's impossible with any external dependency. The goal is to contain failures, make them predictable, and ensure they don't cascade through your system.

Building effective guardrails

Guardrails help you catch problems before they reach users[2,4,5]. Input validation catches prompt injection attempts and adversarial inputs[6]. Output validation detects hallucinations and low-quality responses[3,4]. Consistency checks ensure responses align with expected patterns[7].

These guardrails need to be fast, so they don't add significant latency to your request path. They also need to be reliable: a guardrail that fails open, letting content through when the check itself errors, defeats the purpose. When building agentic AI systems that make multiple LLM calls per user action, guardrails become even more critical. One bad output early in the chain can cascade into increasingly worse outputs downstream.
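As an illustration of an output guardrail that fails closed, the sketch below validates a model response against a small made-up schema. The fields (`label`, `summary`), the allowed labels, and the function names are all hypothetical. It only checks structure; detecting hallucinations needs content-level checks on top of this.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # hypothetical schema


def check_output(raw):
    """Return the parsed dict if raw passes every check, else raise ValueError."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("label") not in ALLOWED_LABELS:
        raise ValueError("unexpected label")
    summary = data.get("summary")
    if not isinstance(summary, str) or not 1 <= len(summary) <= 500:
        raise ValueError("summary missing or out of bounds")
    return data


def guarded(raw, fallback=None):
    """Fail closed: a failed check OR a crash inside the check means reject."""
    try:
        return check_output(raw)
    except Exception:
        return fallback
```

The broad `except` in `guarded` is deliberate: if the validator itself hits an unexpected input, the response is rejected rather than passed through.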


The cost challenge: budgeting for token usage

Traditional infrastructure costs scale with compute and storage. They're relatively predictable and easy to forecast. LLM costs work fundamentally differently.

Your LLM costs scale with request volume, input token count[9] (both prompt and context), output token count (completion length), and model choice. Premium models might cost 10-30x more per token than smaller models. Small architectural decisions compound quickly.

Adding 2,000 tokens of context to every request "just to be safe" across a million requests per day means 2 billion extra input tokens a day, or roughly 60 billion a month. Even at low per-token prices, that can add thousands of pounds to your monthly bill, and far more on premium models. Not setting reasonable output length limits means some requests might generate 2,000-token responses whilst others generate 200 tokens. Those long responses cost 10x more in output tokens.
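The arithmetic behind these cost drivers is easy to put into a small estimator. The prices in the usage lines are placeholders chosen purely for illustration, not any provider's actual pricing, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m, days=30):
    """Estimated monthly spend, in whatever currency the prices are quoted in.

    Prices are per million tokens; token counts are per-request averages.
    """
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return requests_per_day * days * per_request


# Assumed prices for illustration only: 0.50 per million input tokens,
# 1.50 per million output tokens.
baseline = monthly_cost(1_000_000, 500, 200, 0.50, 1.50)
padded = monthly_cost(1_000_000, 2_500, 200, 0.50, 1.50)
extra = padded - baseline  # roughly 30,000 a month at these assumed prices
```

Running the same function with your own traffic and your provider's current prices at design time gives a rough bound on what a prompt change will cost before it ships.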

These aren't edge cases. They're systematic cost drivers you need to think about at design time, not after your first invoice arrives. When building systems that operate at scale, cost optimisation isn't optional. It's the difference between a viable product and one that burns money faster than it generates value.

Making caching a core part of your architecture

The fastest way to burn through your LLM budget is calling the API for the same thing multiple times. This is where caching becomes essential[8].

You need exact match caching for identical prompts. If 100 users ask the same question, hit the LLM once and serve the other 99 from cache. You need semantic similarity caching for near-duplicates. If users are asking essentially the same question with slightly different wording, you shouldn't need 50 separate LLM calls. Set appropriate TTLs based on how often your ground truth changes.
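Exact-match caching with a TTL can be sketched as below. This is an in-process illustration with hypothetical names (`PromptCache`, `get_or_call`); a real deployment would more likely use a shared store and bound the cache size. It assumes `llm_call(model, prompt, params)` is your own function, and that `params` is JSON-serialisable. Semantic similarity caching is not shown, since it depends on an embedding model and a similarity threshold.

```python
import hashlib
import json
import time


class PromptCache:
    def __init__(self, ttl_s=3600.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._store = {}  # key -> (expires_at, response)

    @staticmethod
    def key(model, prompt, params):
        # Include model and parameters: the same prompt with a different
        # model or temperature is a different request.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, params, llm_call):
        k = self.key(model, prompt, params)
        hit = self._store.get(k)
        if hit is not None and hit[0] > self.clock():
            return hit[1]  # fresh entry: no API call, no tokens billed
        response = llm_call(model, prompt, params)
        self._store[k] = (self.clock() + self.ttl_s, response)
        return response
```

The TTL is the knob that ties the cache to how often the underlying ground truth changes: shorter for fast-moving data, longer for stable reference answers.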

Not every request needs real-time execution. Batch processing trades latency for efficiency. Lower cost per request. Better rate limit utilisation. More predictable system behaviour. These are standard distributed systems patterns. They just matter more when you're paying per token.

Proper caching can significantly reduce LLM costs compared to naive implementations. The difference between profitable and unprofitable often comes down to how well you cache.

Monitoring token usage

You need visibility into how tokens are actually being used[9]. Track input and output token counts per request. Break this down by feature, by user type, by request pattern. This tells you where costs are coming from and where optimisation efforts will have the most impact.

Look for outliers. If most requests use 500 input tokens but some use 5,000, investigate why. If some users consistently generate much longer outputs than others, understand what's driving that behaviour. Token counting should be instrumented in your code, not something you check manually in your provider's dashboard.
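Instrumenting token counts in code might look something like the sketch below. It assumes the token counts come from the usage figures your provider returns with each response; the class, its method names, and the "5x the median" outlier rule are illustrative choices, not a standard.

```python
from collections import defaultdict
from statistics import median


class TokenUsageTracker:
    def __init__(self):
        # feature -> list of (request_id, input_tokens, output_tokens)
        self._by_feature = defaultdict(list)

    def record(self, feature, request_id, input_tokens, output_tokens):
        self._by_feature[feature].append((request_id, input_tokens, output_tokens))

    def totals(self, feature):
        """Return (total_input_tokens, total_output_tokens) for a feature."""
        rows = self._by_feature.get(feature, [])
        return sum(r[1] for r in rows), sum(r[2] for r in rows)

    def input_outliers(self, feature, factor=5.0):
        """Request ids whose input size exceeds factor x the feature's median."""
        rows = self._by_feature.get(feature, [])
        if not rows:
            return []
        typical = median(r[1] for r in rows)
        return [r[0] for r in rows if r[1] > factor * typical]
```

In practice these numbers would be emitted to your metrics system with feature and user-type labels rather than held in memory, but the breakdown is the same.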


The observability requirement: you can't improve what you don't measure

You can't debug what you can't see. In production, you need deep visibility into how your LLM integration is actually performing.

Track request latency at different percentiles. Knowing your median latency is useful, but knowing your 95th and 99th percentile latency tells you what your worst-case users experience. Track cost per request, per user, and per feature. Without this, you can't make informed decisions about which features are economically viable.
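To make the percentile point concrete, here is a small sketch using Python's standard library. The function name is hypothetical; in production these figures would normally come from your metrics backend's histograms rather than from raw lists.

```python
from statistics import quantiles


def latency_percentiles(latencies_s):
    """Return p50, p95 and p99 for a list of request latencies in seconds."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(latencies_s, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

With 99 requests at 0.1 seconds and one at 5 seconds, the median stays at 0.1 seconds while p99 sits close to 5 seconds, which is exactly the tail that a median-only dashboard hides.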

Track output quality metrics over time so you can detect degradation before users start complaining. Track token usage patterns showing your input and output distributions. Track failure rates broken down by type: timeouts, rate limits, error responses, and those silent quality failures.

Without this visibility, everything becomes guesswork. Cost optimisation turns into randomly trying things. Performance debugging becomes impossible. Quality issues go unnoticed until they've already impacted significant numbers of users.

Building observability into your LLM integration

LLM Request
    |
    +--> Metrics (latency at different percentiles, cost, token count)
    |
    +--> Logs (sanitised inputs/outputs, model version, parameters)
    |
    +--> Traces (end-to-end request flow, downstream dependencies)
    |
    +--> Alerts (violations of your latency and quality targets, cost spikes)

Good observability turns the LLM from a black box into something you can reason about, debug, and continuously improve. You can see exactly where latency comes from. You can identify which features or request patterns drive costs. You can detect quality degradation early. You can make data-driven decisions about optimisations and architectural changes.

Continuous evaluation matters

Testing once at launch isn't enough[3]. Your evaluation needs to be continuous because the system constantly drifts. Models get updated. User behaviour evolves. New edge cases emerge. Input distributions shift.

Set up automated quality checks that run against a representative sample of production traffic. Compare outputs over time to detect drift. Alert when quality metrics drop below acceptable thresholds. This doesn't mean evaluating every single request. But you need systematic, ongoing evaluation that catches problems before they become visible to users.
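A sampled, rolling quality check along these lines could be sketched as follows. Everything here is an assumption for illustration: `scorer` stands in for whatever evaluation you trust (a rubric, a classifier, a grader model) returning a score between 0 and 1, and the baseline, window, and threshold values are placeholders.

```python
import random
from collections import deque


class QualityMonitor:
    def __init__(self, scorer, baseline, sample_rate=0.05, window=200,
                 max_drop=0.10, rng=random.random):
        self.scorer = scorer          # hypothetical: (prompt, output) -> 0..1
        self.baseline = baseline      # mean score measured at launch
        self.sample_rate = sample_rate
        self.max_drop = max_drop
        self.rng = rng
        self.scores = deque(maxlen=window)  # rolling window of sampled scores

    def observe(self, prompt, output):
        """Score a sampled request; return True if an alert should fire."""
        if self.rng() >= self.sample_rate:
            return False  # not sampled: no evaluation cost for this request
        self.scores.append(self.scorer(prompt, output))
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before judging drift
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.max_drop
```

Sampling keeps evaluation cost proportional to `sample_rate` rather than to total traffic, and comparing a rolling mean against a launch baseline is one simple way to turn "quality drift" into something that can page someone.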


Production traffic behaves differently from test traffic

Features that work perfectly in test environments often break in production. Usually not because the code changed. The inputs changed.

Production brings adversarial inputs where users try prompt injection or attempt to jailbreak your system[6]. It brings traffic spikes when your app gets featured or goes viral. Suddenly you're dealing with 10x or 100x your normal load. It brings integration latency where your database is slow or another service is degraded, making your LLM calls wait even though the LLM itself is fast. And it brings upstream changes: model updates from your provider, API changes you didn't expect, provider-side issues that impact your availability.

This is why testing once at launch isn't enough. The environment is always changing. Your system needs to adapt continuously.


Shifting from modelling problems to operational problems

Once you ship LLM-powered features, the challenges are fundamentally operational. You're not spending most of your time making the model smarter or crafting the perfect prompt. You're handling timeout logic and retry behaviour with exponential backoff. You're implementing circuit breakers and fallback paths. You're setting cost budgets with alerting thresholds. You're building cache invalidation strategies. You're adding token counting middleware to prevent expensive requests. You're versioning prompts so you can roll back when quality drops. You're building quality regression detection into your monitoring.

The model's capabilities still matter. But they're not your bottleneck. Your bottleneck is whether you can operate this reliably at scale whilst keeping costs under control and maintaining the quality your users expect.

Production incidents often reveal that the model is working perfectly, but timeout handling is missing and requests hang for extended periods, degrading the entire service. Uptime metrics show green because technically the service is up, but users can't complete actions because requests aren't completing.

Costs can spike significantly when proper caching isn't implemented and the API is redundantly called for the same prompts thousands of times. The feature works, but it isn't economically sustainable.

Quality can degrade over several days when there's no automated way to detect drift. By the time users complain, significant portions of the user base have been impacted. The model hasn't changed. The prompts haven't changed. But the input distribution has shifted in ways the system can't detect.

These aren't hypothetical scenarios. They're real problems that happen when you treat LLMs as something special instead of as infrastructure that needs proper operational discipline.

The work of running LLMs in production isn't about making the model smarter. It's about building the infrastructure, monitoring, and operational processes around it so it runs reliably, cost-effectively, and predictably at scale.


References

[1] Vinay, V. (2025). Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications. Microsoft Security Research. https://arxiv.org/pdf/2511.19933

[2] OpenAI. (2023). How to use guardrails. OpenAI Cookbook. https://cookbook.openai.com/examples/how_to_use_guardrails

[3] OpenAI. (2025). Receipt inspection: Eval-driven system design. OpenAI Cookbook. https://cookbook.openai.com/examples/partners/eval_driven_system_design/receipt_inspection

[4] OpenAI. (2024). Developing hallucination guardrails. OpenAI Cookbook. https://cookbook.openai.com/examples/developing_hallucination_guardrails

[5] OpenAI. (2025). GPT OSS safeguard guide. OpenAI Cookbook. https://cookbook.openai.com/articles/gpt-oss-safeguard-guide

[6] Anthropic. (n.d.). Mitigate jailbreaks. Claude Documentation. https://platform.claude.com/docs/en/test-and-evaluate/strengthen-guardrails/mitigate-jailbreaks

[7] Anthropic. (n.d.). Increase consistency. Claude Documentation. https://platform.claude.com/docs/en/test-and-evaluate/strengthen-guardrails/increase-consistency

[8] Anthropic. (n.d.). Prompt caching. Claude Documentation. https://platform.claude.com/docs/en/build-with-claude/prompt-caching

[9] Anthropic. (n.d.). Token counting. Claude Documentation. https://platform.claude.com/docs/en/build-with-claude/token-counting