TL;DR — 6 Techniques That Work
The “prompt engineering” advice you find online is mostly useless in production. Be specific. Add examples. Give it a persona. Sure — but those are table stakes. They get you from 40% accuracy to maybe 70%. Getting to 90%+ requires a fundamentally different approach.
Production AI is a systems engineering problem. You're not writing a prompt for a chatbot demo. You're designing a component that will run billions of times, return structured data that downstream code depends on, degrade gracefully when the model is uncertain, and be versioned and tested like the rest of your codebase.
At Sanofi, we went from 72% intent recognition accuracy to 96.2% — not by making the prompt longer or “more specific” — but by applying the six techniques below systematically. Here's exactly how.
Chain-of-Thought (CoT) Prompting
The most impactful change you can make to any classification or reasoning prompt is to stop asking for the answer and start asking the model to reason step-by-step before answering. This is chain-of-thought prompting, and the accuracy delta is not marginal — it's often 15–25 percentage points on complex tasks.
Here's the before and after for a support ticket classifier:
❌ "Classify this support ticket: {ticket}"
✅ "Analyze this support ticket step by step:
1. Identify the core issue
2. Determine the urgency level (low / medium / high / critical)
3. Classify the category (billing / technical / account / feature-request)
4. Suggest the routing (tier-1 / tier-2 / billing-team / engineering)
Ticket: {ticket}
Analysis:"

The difference isn't just accuracy — it's auditability. When the model reasons step-by-step, the reasoning is visible. If a ticket is misclassified, you can read the chain of thought and see exactly where the logic broke down. Was it wrong about urgency? Did it misread the core issue? That diagnostic signal is invaluable when you're tuning a system.
CoT is especially critical for tasks that involve multiple conditions, edge cases, or ambiguity. For simple, clear-cut classifications, the gain is smaller. But for anything nuanced — intent detection, document routing, risk assessment — CoT is non-negotiable.
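As a concrete sketch, the step-by-step template above can be generated by a small helper, which keeps the step list in one place as your taxonomy evolves. The function name and constant are illustrative, not from any specific library:

```typescript
// The reasoning steps mirror the CoT template above; edit them to match
// your own urgency levels, categories, and routing targets.
const COT_STEPS = [
  'Identify the core issue',
  'Determine the urgency level (low / medium / high / critical)',
  'Classify the category (billing / technical / account / feature-request)',
  'Suggest the routing (tier-1 / tier-2 / billing-team / engineering)',
];

// Build a chain-of-thought classification prompt for a support ticket.
function buildCotPrompt(ticket: string): string {
  const steps = COT_STEPS.map((s, i) => `${i + 1}. ${s}`).join('\n');
  return `Analyze this support ticket step by step:\n${steps}\n\nTicket: ${ticket}\n\nAnalysis:`;
}
```

Ending the prompt with `Analysis:` cues the model to start reasoning immediately rather than jumping to a label.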
Structured Outputs with JSON Mode
Unstructured text responses are impossible to parse reliably at scale. The model might say “The urgency is high” one time and “This appears to be a high-urgency ticket” the next. Downstream code that tries to extract a value like “high” from either phrasing is fragile and will break.
The fix is to use JSON mode and define the output schema explicitly in the system prompt. Every major model API supports this now. Here's the pattern:
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 500,
  system: `You are a support ticket classifier. Always respond with valid JSON
matching this exact schema — no prose, no markdown, no backticks:
{
  "intent": string,      // e.g. "billing-dispute", "login-issue"
  "confidence": number,  // 0.0–1.0
  "urgency": string,     // "low" | "medium" | "high" | "critical"
  "routing": string,     // "tier-1" | "tier-2" | "billing-team" | "engineering"
  "reasoning": string    // brief explanation of classification
}`,
  messages: [{
    role: 'user',
    content: `Analyze this support ticket step by step, then classify it.

Ticket: ${userInput}`,
  }],
});

const parsed = JSON.parse(response.content[0].text);
// parsed.intent, parsed.confidence, parsed.routing — always reliable

Two things to note: first, telling the model “no prose, no markdown, no backticks” in the system prompt prevents the most common failure mode, where the model wraps the JSON in a code block. Second, always wrap the parse in a try/catch with a fallback: even with JSON mode, edge cases exist, and a production system needs to handle them gracefully rather than crash.
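A minimal sketch of that defensive parse, assuming the schema above. The fence-stripping regex and the `Classification` type name are illustrative:

```typescript
interface Classification {
  intent: string;
  confidence: number;
  urgency: string;
  routing: string;
  reasoning: string;
}

function parseClassification(raw: string): Classification | null {
  // Strip an accidental markdown code fence (the most common failure mode),
  // then parse; return null on any failure so the caller can fall back.
  const fence = /^`{3}(?:json)?\s*|\s*`{3}$/g;
  const cleaned = raw.trim().replace(fence, '');
  try {
    const obj = JSON.parse(cleaned);
    if (typeof obj.intent !== 'string' || typeof obj.confidence !== 'number') {
      return null; // schema violation: treat it like a parse failure
    }
    return obj as Classification;
  } catch {
    return null;
  }
}
```

Returning `null` instead of throwing keeps the "escalate, don't crash" decision in the caller, where routing logic lives.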
Few-Shot Examples in System Prompts
Instructions tell the model what to do. Examples show it how. Three to five high-quality examples embedded in the system prompt consistently outperform 100 words of natural language instructions describing the same task.
For intent classification, the pattern looks like this:
system: `You are a support ticket classifier. Analyze each ticket step by step,
then output JSON matching the schema below.
EXAMPLES:
Input: "I was charged twice this month and need a refund immediately"
Output: {
"intent": "billing-duplicate-charge",
"confidence": 0.97,
"urgency": "high",
"routing": "billing-team",
"reasoning": "Explicit duplicate charge complaint with refund request."
}
Input: "Can't log in, tried resetting password three times"
Output: {
"intent": "auth-login-failure",
"confidence": 0.94,
"urgency": "high",
"routing": "tier-2",
"reasoning": "Password reset loop suggests account lockout or SSO issue."
}
Input: "Would love to see dark mode added"
Output: {
"intent": "feature-request-ui",
"confidence": 0.91,
"urgency": "low",
"routing": "tier-1",
"reasoning": "Feature request, no blocking issue."
}
Now classify the following ticket using the same format:`

The examples do three things simultaneously: they demonstrate the expected output schema, they calibrate the model's interpretation of urgency and routing logic, and they establish a tone and reasoning style. That's more information per token than any instruction paragraph can convey.
Choose examples that cover your edge cases — not just the easy, clear-cut ones. The examples you include are implicit decision rules. If your hardest classification challenge is distinguishing a billing inquiry from a billing dispute, include an example of each.
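One way to keep such examples maintainable is to store them as data and render the system prompt from them, so adding an edge case is a one-line change. A sketch, with the `Example` shape mirroring the schema above and all names illustrative:

```typescript
interface Example {
  input: string;
  output: {
    intent: string;
    confidence: number;
    urgency: string;
    routing: string;
    reasoning: string;
  };
}

// Render few-shot examples into the system prompt so they can be
// reviewed, versioned, and extended like any other fixture data.
function buildFewShotSystemPrompt(examples: Example[]): string {
  const rendered = examples
    .map((ex) => `Input: "${ex.input}"\nOutput: ${JSON.stringify(ex.output, null, 2)}`)
    .join('\n\n');
  return [
    'You are a support ticket classifier. Analyze each ticket step by step,',
    'then output JSON matching the schema below.',
    '',
    'EXAMPLES:',
    rendered,
    '',
    'Now classify the following ticket using the same format:',
  ].join('\n');
}
```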
Prompt Versioning Like Code
Prompts are code. They determine the behavior of a system component. And yet most teams store them as strings in environment variables, edit them directly in production, and have no way to roll back when something breaks.
The minimum viable prompt versioning system looks like this:
// lib/prompts/classify.ts
export const CLASSIFY_PROMPTS = {
  'classify-v1':
    'You are a support classifier. Given a ticket, return the category.',
  'classify-v2':
    'You are an expert support analyst. Given a support ticket, ' +
    'identify the category, urgency, and recommended routing.',
  'classify-v3':
    'You are an expert support analyst. Analyze support tickets step ' +
    'by step. First identify the core issue, then determine urgency, then ' +
    'classify the category, then suggest routing. Output valid JSON only.',
} as const;

export type ClassifyVersion = keyof typeof CLASSIFY_PROMPTS;

// Active version — change this after testing against labeled data
export const ACTIVE_CLASSIFY_PROMPT: ClassifyVersion = 'classify-v3';

Before promoting any prompt version to ACTIVE_CLASSIFY_PROMPT, run it against your labeled test dataset and verify the accuracy delta. This forces you to build the test dataset first, which is the most important thing you can do for a production AI system.
Treat every prompt change as a deployment. That means: a pull request, a diff, a test run against labeled data, an approval, and a rollback path if production accuracy drops. This sounds heavy for “just changing a string” — but that string is the decision logic of your system. It deserves the same rigor as a code change.
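The test run itself can be as simple as an accuracy loop over labeled tickets. A sketch with the classifier injected as a function, so the same harness can be pointed at any prompt version; the types and names are illustrative:

```typescript
interface LabeledTicket {
  text: string;
  expectedIntent: string;
}

// Measure a candidate prompt version's accuracy on a labeled dataset.
// `classify` wraps the LLM call configured with that prompt version.
async function evaluatePromptVersion(
  dataset: LabeledTicket[],
  classify: (text: string) => Promise<{ intent: string }>,
): Promise<number> {
  let correct = 0;
  for (const item of dataset) {
    const result = await classify(item.text);
    if (result.intent === item.expectedIntent) correct++;
  }
  return correct / dataset.length;
}
```

Run this for both the active version and the candidate before flipping `ACTIVE_CLASSIFY_PROMPT`; the diff in the two numbers is the accuracy delta the pull request should cite.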
Confidence Scoring + Fallback Chains
LLMs are overconfident by default. Ask a model to classify something ambiguous and it will give you an answer with no indication that it was uncertain. In a production system, you need to know when the model doesn't know — so you can escalate to a human, route to a secondary model, or return a graceful fallback.
The pattern: include a confidence field in your JSON schema (0–1), and build threshold-based fallback logic around every LLM call:
interface ClassificationResult {
  intent: string;
  confidence: number; // 0–1
  urgency: 'low' | 'medium' | 'high' | 'critical';
  routing: string;
  reasoning: string;
}

async function classifyTicket(input: string): Promise<ClassificationResult> {
  const result = await callLLM(input);

  // Low confidence → escalate to human review
  if (result.confidence < 0.70) {
    await flagForHumanReview(input, result);
    return { ...result, routing: 'human-review' };
  }

  // Medium confidence → run secondary model as tiebreaker
  if (result.confidence < 0.85) {
    const secondary = await callSecondaryModel(input);
    if (secondary.intent !== result.intent) {
      // Disagreement — escalate
      return { ...result, routing: 'human-review' };
    }
  }

  return result;
}

// Wrap every call — never let a parse failure crash the pipeline
async function safeClassify(input: string): Promise<ClassificationResult | null> {
  try {
    return await classifyTicket(input);
  } catch {
    await logParseFailure(input);
    return null; // Caller handles the null case
  }
}

At Sanofi, adding confidence thresholds alone increased end-to-end accuracy from 84% to 91%. The remaining gap closed with CoT and better few-shot examples. The key insight: accuracy is not just about the model getting things right; it's about the system knowing when to ask for help.
Context Window Management
A common beginner mistake is dumping everything into the context window — all previous messages, all documentation, all possible examples — hoping the model will figure out what's relevant. This degrades accuracy, because models perform better when context is focused, and it burns tokens unnecessarily.
The production pattern is selective retrieval: query your knowledge base, retrieve only the top-k most relevant documents, and inject just those into the prompt. This is the core of retrieval-augmented generation (RAG), and it applies even outside full RAG pipelines.
async function buildPrompt(userQuery: string): Promise<string> {
  // 1. Embed the query
  const queryEmbedding = await embed(userQuery);

  // 2. Retrieve only the top 3 relevant documents
  const relevantDocs = await vectorSearch(queryEmbedding, { topK: 3 });

  // 3. Inject only what's relevant — not the full knowledge base
  const context = relevantDocs
    .map((doc) => `--- ${doc.title} ---\n${doc.content}`)
    .join('\n\n');

  return `You are a support assistant with access to the following knowledge:

${context}

Answer the user's question using only the above context. If the answer
is not in the context, say so explicitly.

Question: ${userQuery}`;
}

The discipline here is “minimum necessary context.” Every token you add to the context window dilutes the signal-to-noise ratio. The model has a finite attention budget; spend it on what matters for this specific query, not on everything that might possibly be relevant.
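The “minimum necessary context” discipline can also be enforced mechanically with a token budget when assembling retrieved documents. A rough sketch; the 4-characters-per-token heuristic and the function names are assumptions, not a real tokenizer:

```typescript
interface Doc {
  title: string;
  content: string;
}

// Rough token estimate: ~4 characters per token for English text.
// Swap in a real tokenizer library for production use.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep documents in relevance order, stopping before the budget is
// exceeded, so the most relevant context always survives trimming.
function fitToBudget(docs: Doc[], maxTokens: number): Doc[] {
  const kept: Doc[] = [];
  let used = 0;
  for (const doc of docs) {
    const cost = estimateTokens(`--- ${doc.title} ---\n${doc.content}`);
    if (used + cost > maxTokens) break;
    kept.push(doc);
    used += cost;
  }
  return kept;
}
```

Because `docs` arrives sorted by relevance from the vector search, trimming from the tail drops the least relevant documents first.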
The Sanofi Case Study: 72% → 96.2%
Here is the sequence of changes that moved intent recognition accuracy from 72% to 96.2% on a support ticket routing system at Sanofi: first, a labeled test dataset to establish the 72% baseline; then structured JSON outputs and versioned prompts to make every change testable; then confidence thresholds and fallback chains, which alone moved end-to-end accuracy from 84% to 91%; and finally chain-of-thought reasoning and sharper few-shot examples to close the remaining gap to 96.2%.
Notice the sequence: measurement first, structural improvements second, tuning last. You can't tune what you can't measure. The test dataset is what made every subsequent step legible.
What Doesn't Work
Three anti-patterns worth naming explicitly:
- Making prompts longer hoping for better results. Longer is not better. A 1,200-token system prompt that covers every edge case in prose is worse than a 400-token prompt with 5 targeted examples. Focus beats length. Every token you add dilutes the signal.
- Not measuring accuracy before and after changes. “It feels better” is not a metric. Every prompt change must be tested against a labeled dataset. If you don't have one, you are not doing prompt engineering — you are guessing. Build the dataset first.
- Using the same prompt for development and production. The data distribution in production is different from what you tested with. Production prompts need to be validated against real production traffic samples, not synthetic test cases or cherry-picked examples you thought of yourself.
