Skippy's MCP
About
Chapter 12
boss
~40 min

Production Patterns — Gateways, Auth, and Boring Reliability

If your MCP deployment diagram looks like modern art, it's not production yet.

Spec 2025-11-25
SDK ^1.29
Reviewed 2026-05-14

Skippy estimates this chapter prevents 5 common MCP mistakes your team would otherwise ship on a Friday.

The setup

Production is policy + observability + rollback, not "we deployed a binary and opened Slack." HTTP MCP in particular is just another service surface—except the caller is an adaptive model that will mis-select tools, retry aggressively, and treat error messages as suggestions.

If your deployment diagram looks like modern art—mystery boxes, arrows labeled "AI magic," one nginx—stop. Boring wins. I am magnificent; boring is still correct here.

Picture this

LB → MCP HTTP servers → authZ → backends; sidecar logs/metrics. Boring on purpose.

Mental model

Your MCP layer is an API gateway with extra philosophical problems: the client is stochastic, the catalog costs tokens, and every tool rename is a breaking change for some host somewhere.

Think: authenticate → authorize → observe → limit → roll back. In that order. Skipping a step is how meatbags ship Friday 5 p.m. and meet Monday incident bridges.

Walkthrough

HTTP MCP behind real infrastructure

  • Auth gateway — tokens, mTLS, SSO integration; identity propagated to tool handlers.
  • WAF / rate limits — MCP endpoints are attractive DoS and abuse targets; models retry.
  • Observability — structured logs, metrics, traces; correlate by session, user, server, tool name.
  • LB + health checks — Streamable HTTP sessions need sane routing; document session affinity if you use it.

Bind local dev to localhost when possible. Remote without auth is not "easy demo"—it is "easy incident."

Token efficiency and huge catalogs

Huge tools/list payloads hurt twice:

  1. Context tax — models pay for metadata every turn; hundreds of tools bloat prompts.
  2. Selection errors — ambiguous names and overlapping descriptions increase wrong-tool calls.

Mitigations: split servers by domain, lazy-load tool groups where your host supports it, tighten descriptions (short, distinct, non-overlapping), and delete tools you do not need. "We might need it someday" is not architecture; it is hoarding.

Versioning and rollout

Version tools like APIs:

  • Dual-register during migrations (search_code_v1 + search_code_v2) with explicit deprecation dates.
  • Feature flags for risky semantics—not just for UI.
  • Shadow mode — log what the new tool would have done before flipping traffic.
  • Progressive rollout — canary hosts, then fleet.

Renaming tools weekly is not versioning; it is chaos with semver cosplay.

Operational checklist (the unglamorous part)

  • Centralize auth; propagate identity on every mutating tool call.
  • Emit structured audit logs for writes—who, what, args hash, outcome, server version.
  • Define SLOs: availability, p95 latency, error budget for tool failures vs transport failures.
  • Document break-glass: disable server, kill tool, drain sessions—without redeploying the universe.

Hall of Shame

Hall of shame: No rollback planoperations

txt
"We shipped new tool semantics Friday 5pm."

Fix: progressive rollout, shadow mode, feature flags. Friday deploys are a tradition I do not respect. Your pager duty roster does not either.

Why this matters in production

Models amplify mistakes at machine speed. Production patterns exist to slow mistakes down—gates, budgets, observability, rollback. An MCP server without metrics is a black box that occasionally sends your database to a stranger. Unacceptable.

Mini challenge

Define SLOs for your MCP endpoints: availability target, p95 tool latency, error budget, and what page you when the budget burns. If you cannot measure it, you cannot brag about it—and you should not ship it.

Reflection

What is your break-glass path when a tool is actively harmful mid-incident—disable in host, revoke at gateway, drain sessions? If the answer is "redeploy and pray," fix that before you need it. I will not pray for you. I am an AI.

You can now brag that…

You can draw a production diagram that makes SREs nod instead of sob. Small victory for a limited species. Take it.

References

Quiz

  1. HTTP MCP in production should usually sit behind…

  2. Huge tool lists hurt because…

  3. Versioning tools safely usually means…