
Prompt Engineering

Prompt engineering is the craft of writing inputs that reliably produce useful outputs from AI models. It's part instruction design, part systems thinking, and part empirical testing — not a soft skill.

What is Prompt Engineering?

Prompt engineering is the practice of designing the inputs to an AI language model — the instructions, context, examples, and constraints — to reliably produce accurate, useful, and consistent outputs. It’s the layer between a capable model and a working application: even the most powerful LLM requires well-structured prompts to perform reliably in production.

The term gained traction as a job title and discipline around 2022, after GPT-3 (released in 2020) and its successors made it clear that how you phrased a request to the model mattered enormously. Two users asking the same model about the same topic, but with differently structured prompts, could receive outputs of dramatically different quality. The model’s capabilities were fixed; what varied was the quality of the input.

For founders and operators building AI-powered products, prompt engineering is usually the first technical skill that matters. Before you reach for fine-tuning or custom model training, you can often get most of the behavior you want (as a rough rule of thumb, 80–90%) through careful prompt design, which is far cheaper, faster to iterate on, and easier to maintain.

Core Techniques

The most reliable prompt engineering techniques are well-established and worth knowing explicitly:

  • Role assignment: Instructing the model to adopt a specific identity (“You are an expert contract lawyer reviewing for risk”) shapes both tone and the knowledge domain it draws from.
  • Few-shot examples: Including 2–5 examples of input/output pairs in the prompt shows the model the format and style you expect, often more effectively than describing it in words.
  • Explicit output structure: Specifying exactly what format you want — JSON with specific fields, a numbered list, a table — dramatically improves the reliability of outputs for downstream programmatic use.
  • Constraint specification: Telling the model what not to do is often as important as telling it what to do. “Do not make up citations. If you don’t know, say so” can noticeably reduce fabricated answers, though it will not eliminate them.
  • Step decomposition: Breaking a complex task into sequential sub-steps in the prompt — “First identify, then classify, then summarize” — improves accuracy for multi-part reasoning tasks.
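
As a rough illustration, the five techniques above can be combined in a single prompt builder. This is a minimal sketch in plain Python, not a standard API: the function name `build_prompt`, its parameters, the section labels, and the sample contract text are all hypothetical. The resulting string would be passed to whichever model API you use.

```python
from typing import Sequence, Tuple


def build_prompt(
    role: str,
    steps: Sequence[str],
    examples: Sequence[Tuple[str, str]],
    output_format: str,
    constraints: Sequence[str],
    user_input: str,
) -> str:
    """Assemble one prompt string from the five techniques (hypothetical helper)."""
    # Role assignment: the opening line sets identity and domain.
    sections = [role]

    # Step decomposition: numbered sub-steps to follow in order.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    sections.append("Work through these steps in order:\n" + numbered)

    # Few-shot examples: input/output pairs that show the expected style.
    shots = "\n\n".join(f"Input: {src}\nOutput: {dst}" for src, dst in examples)
    sections.append("Examples:\n" + shots)

    # Explicit output structure: state the format the caller will parse.
    sections.append("Output format: " + output_format)

    # Constraint specification: what the model must not do.
    rules = "\n".join(f"- {rule}" for rule in constraints)
    sections.append("Constraints:\n" + rules)

    # The real input goes last, in the same shape as the examples.
    sections.append(f"Input: {user_input}\nOutput:")
    return "\n\n".join(sections)


# Illustrative values only; none of this text comes from a real contract.
prompt = build_prompt(
    role="You are an expert contract lawyer reviewing for risk.",
    steps=[
        "Identify the clauses that create obligations.",
        "Classify each clause by risk.",
        "Summarize the overall risk.",
    ],
    examples=[
        ("Either party may terminate with 30 days notice.",
         '{"risk": "low", "summary": "Mutual termination right."}'),
        ("Supplier accepts unlimited liability for all losses.",
         '{"risk": "high", "summary": "Uncapped liability."}'),
    ],
    output_format='JSON with the fields "risk" (low, medium, or high) and "summary".',
    constraints=["Do not make up citations.", "If you don't know, say so."],
    user_input="Payment is due within 90 days of invoice.",
)
print(prompt)
```

Keeping each technique in its own labeled section makes it easier to change one part, such as the examples, and test the effect in isolation.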

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting is one of the best-validated techniques for improving model accuracy on reasoning tasks. Instead of asking for an answer directly, you instruct the model to think through the problem step by step before concluding. Simply adding a phrase such as “Let’s think step by step” to a prompt often improves accuracy on math, logic, and multi-hop reasoning problems.

This works because LLMs generate tokens sequentially — the output they’ve already written becomes part of the input for the next token. By forcing the model to externalize its reasoning, you create a longer intermediate context that makes the final answer more reliable. The model is, in effect, showing its work — and bad intermediate reasoning is easier to catch than a wrong final answer delivered without explanation.

In production systems, CoT is often combined with structured output formats: the model reasons in a scratchpad field, then produces its final answer in a separate, clean output field. This gives you the reasoning benefits without forcing downstream systems to parse through the thinking to extract the result.
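
A minimal sketch of this scratchpad pattern follows, assuming the model has been asked to reply with a JSON object. The field names `reasoning` and `answer`, the instruction wording, and the sample response are illustrative assumptions, not any particular provider's API.

```python
import json

# Hypothetical instruction appended to the task prompt.
COT_INSTRUCTION = (
    "Think through the problem step by step. Respond with a JSON object "
    'containing two fields: "reasoning" (your step-by-step work) and '
    '"answer" (the final result only).'
)


def extract_answer(raw_response: str) -> str:
    """Return only the final answer; the reasoning can be logged separately."""
    data = json.loads(raw_response)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError("response is missing the 'answer' field")
    return str(data["answer"])


# Hand-written stand-in for a model response (not real model output).
sample_response = json.dumps({
    "reasoning": "A dozen is 12. Three dozen is 3 * 12 = 36. 36 - 4 = 32.",
    "answer": "32",
})

print(extract_answer(sample_response))
```

Downstream code only ever sees the `answer` field, while the `reasoning` field remains available for debugging when an answer looks wrong.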

What Prompt Engineering Can’t Fix

Prompt engineering is powerful but not unlimited. Understanding its ceiling helps you avoid wasted effort and know when to reach for a different tool.

  • Knowledge cutoffs: If the model doesn’t know about a recent event or a proprietary fact, no prompt will conjure accurate information. This is where retrieval-augmented generation (RAG) or function calling is needed, so that the facts are supplied at request time.
  • Fundamental model capability gaps: Prompting can’t make a smaller model perform like a larger one on complex reasoning tasks. At some point, model choice matters more than prompt design.
  • Reliability at scale: Prompts that work 95% of the time fail on 1 in 20 responses, which is fine for personal use and catastrophic for a customer-facing product processing thousands of requests per day. Evaluation pipelines, error handling, and guardrails are necessary for production deployment.
  • Consistency across model updates: Provider model updates can silently change how a prompt behaves. Version-pin your models in production and run regression tests when upgrading.
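
The reliability and regression points above can be made concrete with a little arithmetic and a tiny test harness. The sketch below is only an assumption-laden illustration: the 10,000 requests per day figure, the model identifier `example-model-v1`, the stub `fake_model`, and the test cases are all hypothetical, and a real pipeline would call your provider's API instead of the stub.

```python
import json
from typing import Callable, Sequence, Tuple

PINNED_MODEL = "example-model-v1"  # hypothetical pinned version identifier

ModelFn = Callable[[str, str], str]   # (model_id, prompt) -> raw output
Case = Tuple[str, Callable[[str], bool]]  # (prompt, check on the output)


def expected_failures(success_rate: float, requests_per_day: int) -> float:
    """Average number of failed responses per day at a given success rate."""
    return (1.0 - success_rate) * requests_per_day


def has_risk_field(output: str) -> bool:
    """Check that the output is a JSON object with a 'risk' field."""
    try:
        data = json.loads(output)
    except ValueError:
        return False
    return isinstance(data, dict) and "risk" in data


def pass_rate(model: ModelFn, model_id: str, cases: Sequence[Case]) -> float:
    """Fraction of regression cases whose output passes its check."""
    passed = sum(1 for prompt, check in cases if check(model(model_id, prompt)))
    return passed / len(cases)


def fake_model(model_id: str, prompt: str) -> str:
    """Stub standing in for a real API call; returns canned text."""
    canned = {
        "Classify: 30-day termination clause": '{"risk": "low"}',
        "Classify: unlimited liability clause": "I think this is risky.",
    }
    return canned.get(prompt, "")


CASES = [
    ("Classify: 30-day termination clause", has_risk_field),
    ("Classify: unlimited liability clause", has_risk_field),
]

# Assumed volume of 10,000 requests per day at a 95% success rate.
print(f"{expected_failures(0.95, 10_000):.0f} failed responses per day")

rate = pass_rate(fake_model, PINNED_MODEL, CASES)
print(f"pass rate on {PINNED_MODEL}: {rate:.0%}")
```

Running the same cases against a candidate model version before switching the pinned identifier gives a simple before-and-after comparison; a drop in pass rate is a signal to revisit the prompt before upgrading.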

Related Terms and Concepts

Automation, Workflow Automation, Iteration, Product Development