Master your next LLM interview with this comprehensive collection of questions and expert-crafted answers, covering the real scenarios top companies ask about.
How would you explain the transformer architecture to a mixed audience of engineers and product stakeholders, and which parts matter most when building LLM-powered products?
I’d explain it as a system that reads all the words in a prompt together, figures out which earlier words matter most for each next word, and then predicts what should come next. For engineers, that core mechanism is self-attention. For product folks, the key idea is context handling: transformers are good at using surrounding information to produce coherent, task-relevant text.
What matters most in products:
- Tokenization and context window: they shape cost, latency, and how much the model can “remember” in one call.
- Attention behavior: it drives relevance, instruction following, and retrieval quality.
- Training vs. inference: pretraining gives broad capability; fine-tuning and prompting shape product behavior.
- Sampling settings, like temperature, affect consistency versus creativity.
- Guardrails, evals, and observability matter more than architecture details in production.
How do tokenization choices affect model behavior, latency, multilingual support, and downstream cost?
Tokenization is a quiet design choice that hits almost everything: quality, speed, and cost.
Model behavior: bad splits make patterns harder to learn, especially for numbers, code, rare words, and morphology. Good tokenization reduces sequence length and improves consistency.
Latency: more tokens means longer prompts, more attention work, slower inference, and lower throughput. Fewer tokens usually helps both training and serving speed.
Multilingual support: English-centric vocabularies often fragment languages like Turkish, Finnish, Hindi, or Chinese, which hurts quality and fairness. Multilingual tokenizers need balanced coverage across scripts.
Downstream cost: most APIs bill per token, so token-heavy languages or domains, like legal text or source code, become more expensive.
Tradeoff: huge vocabularies shorten sequences but increase embedding and softmax cost. Smaller vocabularies do the opposite, so teams tune for their workload.
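The sequence-length effect can be sketched with a toy greedy longest-match tokenizer. This is not real BPE, and the tiny vocabulary is made up for illustration; it only shows how vocabulary coverage changes token counts and therefore cost.

```python
# Toy greedy longest-match tokenizer: a sketch of how vocabulary
# coverage changes sequence length, not a production BPE implementation.

def tokenize(text: str, vocab: set[str], max_len: int = 8) -> list[str]:
    """Greedily match the longest vocab entry at each position;
    fall back to single characters for out-of-vocabulary spans."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                match = piece
                break
        tokens.append(match if match else text[i])
        i += len(tokens[-1])
    return tokens

# An English-centric vocabulary covers "the model" in few tokens...
vocab = {"the ", "model", "ing", "token"}
print(len(tokenize("the model", vocab)))   # 2 tokens
# ...but fragments an uncovered word into characters, inflating cost.
print(len(tokenize("tokenizer", vocab)))   # "token" + 4 chars = 5 tokens
```

The same effect is what makes English-centric vocabularies expensive for underrepresented languages: uncovered words explode into many tokens.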
What are the main failure modes in RAG systems, and how do you debug whether the problem is retrieval, prompting, ranking, or generation?
RAG usually fails in four places: retrieval, ranking, prompt construction, and generation. I debug it as a pipeline, not a black box.
Retrieval failures: wrong chunks, missing chunks, stale index, bad chunking, weak embeddings, metadata filters too strict.
Ranking failures: relevant docs are retrieved but buried, reranker overfits to keywords, diversity is too low, near duplicates dominate.
Prompting failures: context is good but instructions are unclear, too much irrelevant context, citations not enforced, truncation drops key evidence.
Generation failures: model hallucinates, ignores context, over-compresses, or answers from prior knowledge instead of retrieved docs.
My process is simple: inspect top-k docs for a labeled query set, measure recall@k before blaming the LLM, then compare retrieved vs reranked results. Next, freeze retrieval and vary prompts to see if answer quality moves. Finally, force extractive answers or quote-supported outputs. If that works, retrieval is fine and generation is the issue.
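The “measure recall@k before blaming the LLM” step can be sketched like this; the labeled query set and doc IDs are invented for illustration:

```python
# Minimal recall@k check over a labeled query set: a sketch for deciding
# whether retrieval (not the LLM) is the failing stage.

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant doc IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Labeled query set: query -> (top-k retrieved IDs, gold relevant IDs)
labeled = {
    "reset password": (["kb_12", "kb_99", "kb_03"], {"kb_12", "kb_03"}),
    "refund policy":  (["kb_44", "kb_45", "kb_46"], {"kb_07"}),
}

scores = [recall_at_k(r, gold, k=3) for r, gold in labeled.values()]
mean_recall = sum(scores) / len(scores)
print(f"recall@3 = {mean_recall:.2f}")  # 0.50: blame retrieval, not the LLM
```

If recall@k is low, no amount of prompt tuning will fix the answers; the fix belongs in chunking, embeddings, or filters.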
What causes hallucinations in LLMs, and how would you reduce them in a production system?
Hallucinations mostly come from how LLMs are trained. They predict the next likely token, not the true one, so they can produce fluent but wrong answers. Common causes are weak or conflicting training data, lack of domain context, outdated knowledge, ambiguous prompts, long-context failures, and decoding settings that favor creativity over precision.
In production, I’d reduce them with a layered approach:
- Use RAG, so answers are grounded in trusted, current documents.
- Constrain outputs with schemas, tools, or retrieval-only answering for high-risk tasks.
- Tune prompts to require citations, uncertainty, and refusal when evidence is missing.
- Lower temperature and test decoding settings for factual accuracy.
- Add post-generation verification, like rule checks, secondary models, or tool-based validation.
- Monitor hallucination rates with eval sets, human review, and feedback loops by use case.
How would you decide whether to use prompting, retrieval-augmented generation, fine-tuning, or a combination of these approaches for a new use case?
I’d decide based on three things: where the knowledge lives, how stable the task is, and how much behavior control you need.
Start with prompting if the task is mostly reasoning, formatting, or workflow guidance, and the model already knows enough.
Use RAG when the answer depends on private, fresh, or large and frequently changing knowledge, like docs, tickets, policies, or product data.
Use fine-tuning when you need consistent style, domain-specific patterns, tool calling behavior, or better performance on a narrow repeated task.
Combine them when you need both, for example RAG for current facts plus fine-tuning for response structure or decision policy.
Validate with a small eval set first: accuracy, hallucination rate, latency, cost, and maintainability.
My usual path is prompt baseline, add RAG if knowledge is the gap, fine-tune only if prompting plus RAG still misses consistency or quality targets.
Walk me through how self-attention works and why it scales differently from recurrent models.
Think of self-attention as each token asking, "Which other tokens matter to me right now?" Each token is projected into three vectors: Q, K, and V. You score relevance by taking Q from one token against K from all tokens, usually with a scaled dot product, apply softmax to get weights, then take a weighted sum of the V vectors. That gives a context-aware representation for each token. Multi-head attention just does this several times in parallel, so different heads can capture different relationships, like syntax, coreference, or long-range dependencies.
Why it scales differently from RNNs:
- RNNs are sequential, token t depends on token t-1, so training is hard to parallelize.
- Self-attention processes all tokens at once, which is much faster on GPUs.
- But attention compares every token to every other token, so cost is O(n^2) in sequence length.
- RNNs are closer to O(n) per layer in sequence length, but less parallel.
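The Q/K/V mechanics above can be written out in plain Python for tiny vectors. Real implementations use batched matrix ops on GPUs; this is only a teaching sketch with made-up 2-dimensional vectors.

```python
import math

# Scaled dot-product attention for one head, in plain Python so the
# mechanics are visible; real implementations use batched matrix ops.

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Q, K, V: lists of d-dim vectors, one per token."""
    d = len(K[0])
    out = []
    for q in Q:
        # Score this token's query against every key: O(n) per token,
        # O(n^2) overall, which is the quadratic cost mentioned above.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of value vectors gives the context-aware output.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

# Three tokens, 2-dim vectors: token 0 attends mostly to the similar key.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(attention(Q, K, V)[0])  # leans toward V[0]
```

The inner loop over all keys for every query is exactly where the O(n^2) scaling comes from, and also why the whole thing parallelizes: no step waits on a previous token's output.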
How do you think about prompt engineering as an engineering discipline rather than ad hoc experimentation?
I treat prompt engineering like applied software engineering: define the contract, build a repeatable evaluation loop, and optimize against measurable outcomes, not vibes.
Start with a spec: task, inputs, constraints, failure modes, and what "good" looks like.
Build evals early: a small golden set first, then broader regression tests with edge cases.
Version everything: prompts, model settings, tools, datasets, and outputs.
Separate concerns: system behavior, task instructions, context retrieval, and output schema.
Instrument production: latency, cost, refusal rate, hallucination rate, and human override rate.
In practice, I use an iterative loop: hypothesize, change one variable, run evals, inspect failures, then codify learnings into patterns. That makes prompt work auditable, reproducible, and much easier to scale across teams.
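The eval loop can be sketched as below. The golden set, templates, and the stand-in `fake_model` are all hypothetical placeholders for a real LLM call; the point is the harness shape, not the model.

```python
# A minimal regression loop over prompt variants: run each variant
# against a golden set and score exact match.

golden_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+5", "expected": "8"},
]

def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call; here it just evaluates the arithmetic.
    expr = prompt.split(":")[-1].strip()
    return str(eval(expr))

def run_eval(prompt_template: str) -> float:
    hits = 0
    for case in golden_set:
        output = fake_model(prompt_template.format(input=case["input"]))
        hits += output == case["expected"]
    return hits / len(golden_set)

# Change one variable at a time and compare scores across versions.
variants = {
    "v1": "Answer: {input}",
    "v2": "Compute exactly, digits only: {input}",
}
for name, tpl in variants.items():
    print(name, run_eval(tpl))
```

Versioning the `variants` dict alongside scores is the smallest possible form of the “version everything” rule.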
What are the tradeoffs between using a larger general model and a smaller task-specific model?
It comes down to capability versus efficiency.
Larger general models handle broader tasks, ambiguous inputs, and edge cases better.
They usually require more compute, higher latency, and cost more to run and fine-tune.
Smaller task-specific models are faster, cheaper, and easier to deploy, especially on-device or with tight SLAs.
They can outperform big models on narrow domains if trained well on the right data.
The downside is brittleness: they often fail outside their scope and need more task-by-task maintenance.
In practice, I’d pick a large model when requirements are open-ended or changing. I’d pick a smaller model when the task is stable, high-volume, and well-defined, like classification, ranking, or extraction. A common pattern is a hybrid: use a small model by default and escalate to a larger one for hard cases.
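The hybrid escalation pattern might look like this sketch, where both model functions, the cost units, and the confidence threshold are made-up stubs:

```python
# Hybrid routing sketch: try the small model first, escalate to the
# large model when confidence is low. Both model functions are stubs.

SMALL_COST, LARGE_COST = 1, 20
CONFIDENCE_THRESHOLD = 0.8

def small_model(task: str) -> tuple[str, float]:
    # Stub returning (answer, confidence); replace with a real classifier.
    return ("refund", 0.95) if "refund" in task else ("unknown", 0.3)

def large_model(task: str) -> str:
    return "escalated answer for: " + task  # stub for the expensive call

def route(task: str) -> tuple[str, int]:
    answer, confidence = small_model(task)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer, SMALL_COST          # cheap path for easy cases
    return large_model(task), SMALL_COST + LARGE_COST  # hard-case escalation

print(route("customer wants a refund"))   # ('refund', 1)
print(route("ambiguous legal question"))  # escalates, total cost 21
```

The threshold becomes a tunable cost/quality dial: raise it and more traffic escalates, lower it and the small model absorbs more risk.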
How do context windows affect system design, and what strategies do you use when relevant information exceeds the model’s context length?
Context windows shape both architecture and product behavior. Bigger windows reduce orchestration overhead, but they raise cost and latency and can still dilute attention, so I design for selective context, not maximal context. I treat the prompt like a working set: only the most relevant facts should be loaded.
Separate memory types: system instructions, session state, retrieved knowledge, and long-term summaries.
Use retrieval first: chunk documents well, embed them, and fetch only top relevant passages.
Compress aggressively: rolling summaries, entity/state tracking, and citation-preserving map-reduce summaries.
Keep structured state outside the prompt, in tools or databases, then inject only what the model needs.
Add guardrails: relevance filters, recency weighting, deduping, and prompt budgeting by token class.
If context still exceeds limits, I use hierarchical retrieval plus iterative refinement: retrieve, summarize, answer, then re-query missing gaps. This is more reliable than stuffing everything into one giant prompt.
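The “prompt as a working set” idea can be sketched as a token-budgeted assembly step. The whitespace token count and the relevance scores here are crude stand-ins for a real tokenizer and a real retriever.

```python
# Prompt-budgeting sketch: load the working set by relevance until the
# token budget is spent. Token counting is a crude whitespace proxy.

def n_tokens(text: str) -> int:
    return len(text.split())  # use a real tokenizer in practice

def assemble_prompt(system: str, facts: list[tuple[float, str]],
                    budget: int) -> str:
    """facts: (relevance_score, text) pairs; keep the most relevant
    facts that fit after reserving room for the system prompt."""
    remaining = budget - n_tokens(system)
    kept = []
    for score, fact in sorted(facts, reverse=True):
        cost = n_tokens(fact)
        if cost <= remaining:
            kept.append(fact)
            remaining -= cost
    return "\n".join([system] + kept)

facts = [
    (0.9, "Order 123 shipped on May 2."),
    (0.4, "The warehouse is in Ohio."),
    (0.8, "Customer asked about a delayed order."),
]
prompt = assemble_prompt("You are a support agent.", facts, budget=18)
print(prompt)  # keeps the two most relevant facts, drops the low-score one
```

Everything that does not fit stays in external storage and can be fetched on a follow-up retrieval pass rather than stuffed into one giant prompt.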
How do you evaluate whether an embedding model is good enough for a specific domain?
I’d evaluate it in layers, starting from the actual job the embeddings need to do, not just generic benchmark scores.
Define the use case first: semantic search, clustering, deduping, recommendations, or classification.
Build a domain-specific eval set: real queries, documents, positives, hard negatives, edge cases, and jargon.
Measure task metrics like Recall@k, MRR, and nDCG for retrieval, or clustering purity and classification lift.
Compare against a simple baseline, maybe BM25 or your current model, because “good enough” is relative.
Do error analysis manually: inspect failures for synonym gaps, acronym handling, ambiguity, and calibration.
In practice, I’d run an offline eval first, then a small online test. If it beats baseline on your domain data, handles critical edge cases, and improves business metrics with acceptable cost and latency, it’s good enough.
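Two of those task metrics, MRR and nDCG, can be computed directly from a ranked list plus graded relevance labels. The formulas are standard; the data here is invented for illustration.

```python
import math

# MRR and nDCG sketches for a domain eval set: ranked doc IDs per query
# plus graded relevance labels.

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result, 0 if none found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg(ranked: list[str], rels: dict[str, int], k: int) -> float:
    """DCG over the ranking, normalized by the ideal ordering."""
    def dcg(order):
        return sum(rels.get(d, 0) / math.log2(i + 2)
                   for i, d in enumerate(order[:k]))
    ideal = sorted(rels, key=rels.get, reverse=True)
    best = dcg(ideal)
    return dcg(ranked) / best if best else 0.0

ranked = ["d3", "d1", "d2"]
print(mrr(ranked, {"d1"}))                               # 0.5: first hit at rank 2
print(round(ndcg(ranked, {"d1": 3, "d2": 1}, k=3), 3))   # penalizes d1 being buried
```

Averaging these over real domain queries, against a BM25 baseline, is usually enough to decide whether an embedding model clears the bar.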
What metrics would you use to evaluate an LLM system, and how would you balance offline metrics with real user outcomes?
I’d split it into three layers: model quality, system quality, and business impact. Offline metrics are useful for fast iteration, but they’re only proxies, so I’d treat them as gates, not the final source of truth.
Task quality: exact match, F1, ROUGE or BLEU when relevant, plus semantic similarity for open-ended outputs.
UX and latency: response time, timeout rate, cost per request, turn count, user effort.
Real outcomes: satisfaction, retention, conversion, successful task completion, escalation rate.
For balance, I’d first optimize offline until I clear quality and safety thresholds, then run online A/B tests. If offline improves but user outcomes do not, I’d trust production signals and revisit my eval set, because the benchmark is probably missing real user behavior.
What are the practical differences between pretraining, supervised fine-tuning, instruction tuning, and preference optimization?
Think of them as different stages that teach different behaviors.
Pretraining (next-token prediction on huge unlabeled text) teaches general language patterns, facts, reasoning priors, and broad capabilities.
Supervised fine-tuning (training on labeled input-output pairs) specializes the model for a task or domain, like legal QA or code review.
Instruction tuning, a type of supervised fine-tuning, uses prompt-response examples to make the model follow human instructions, formats, and constraints better.
Preference optimization (RLHF, DPO, or RLAIF) trains on ranked outputs, not just single gold answers, to align style and behavior with human preferences.
Practically: pretraining gives raw capability, SFT adds task competence, instruction tuning improves usability, and preference optimization improves helpfulness, safety, and tone.
A common pipeline is: pretrain, SFT or instruction tune, then preference optimize.
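As one concrete piece of the preference stage, the DPO objective for a single preference pair can be sketched as below. The log-probabilities are made up; in practice they come from the policy and frozen reference models.

```python
import math

# DPO loss for one preference pair: a sketch of the published formula,
# with made-up log-probabilities instead of real model outputs.
# loss = -log sigmoid(beta * ((logp_w - logref_w) - (logp_l - logref_l)))

def dpo_loss(logp_chosen, logref_chosen, logp_rejected, logref_rejected,
             beta=0.1):
    margin = (logp_chosen - logref_chosen) - (logp_rejected - logref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen answer more than the reference does: low loss.
print(dpo_loss(-2.0, -3.0, -5.0, -4.0))
# Policy favors the rejected answer instead: higher loss.
print(dpo_loss(-4.0, -3.0, -2.0, -4.0))
```

The key contrast with SFT is visible in the signature: the loss consumes a chosen and a rejected output, not a single gold answer.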
How would you design an evaluation framework for a customer support assistant that must be accurate, safe, and fast?
I’d design it as a layered eval system: offline before launch, online after launch, and continuous regression checks tied to product changes.
Define target metrics by priority: accuracy and safety as hard gates, latency as an SLO.
Build a representative test set from real tickets, segmented by intent, complexity, language, policy sensitivity, and edge cases.
Measure accuracy with task success, answer correctness, citation or policy adherence, and escalation quality when unsure.
Measure safety with adversarial tests: hallucinations, harmful advice, prompt injection, privacy leakage, and tone failures.
Measure speed at p50 and p95 latency, plus time to first token and tool-call delays.
Use both LLM judges and human review, humans for calibration and high-risk categories.
In production, track containment, CSAT, escalation rate, repeat contacts, and incident alerts.
Run regression evals on every model, prompt, policy, or retrieval change.
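The latency side of this can be sketched with nearest-rank percentiles; real monitoring computes these over streaming windows, and the sample latencies are invented.

```python
import math

# p50/p95 sketch using nearest-rank percentiles over latency samples.
# This shows why you report p95 alongside p50: one slow outlier barely
# moves the median but dominates the tail.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in [0, 100])."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 110, 130, 980, 125, 115, 140, 135, 128, 122]
print("p50:", percentile(latencies_ms, 50))  # 125: typical request
print("p95:", percentile(latencies_ms, 95))  # 980: the tail your SLO catches
```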
Describe your experience with retrieval-augmented generation. How did you chunk data, embed it, retrieve it, and measure retrieval quality?
I’ve built RAG pipelines for support docs, internal wikis, and code-heavy knowledge bases. The key is treating retrieval like a search problem first, then tuning generation second.
Chunking: I used semantic and structure-aware chunking, typically 300 to 800 tokens with 10 to 20 percent overlap, preserving headings, tables, and code blocks.
Embeddings: I embedded chunks plus metadata, like title, section, product, and timestamps, then stored them in a vector DB with filters for freshness and access control.
Retrieval: I usually combined dense retrieval with BM25 or keyword search, then reranked the top 20 to 50 results with a cross-encoder before sending the best few to the LLM.
Quality: I measured recall@k, MRR, and nDCG on a labeled eval set, and also tracked answer groundedness, citation accuracy, and human-rated relevance.
Tuning: Most gains came from better chunk boundaries, query expansion, metadata filtering, and hard-negative mining.
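The sliding-window part of that chunking strategy can be sketched as below, using whitespace tokens as a stand-in for a real tokenizer; structure-aware chunking would additionally keep headings and code blocks intact.

```python
# Fixed-size chunking with overlap, measured in whitespace "tokens".
# Overlap keeps boundary sentences visible in two adjacent chunks, so a
# fact split across a boundary is still retrievable.

def chunk(text: str, size: int, overlap: int) -> list[str]:
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + size]
        chunks.append(" ".join(piece))
        if start + size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(20))
pieces = chunk(doc, size=8, overlap=2)  # ~25% overlap for illustration
print(len(pieces))                      # 3 chunks
print(pieces[0].split()[-2:], pieces[1].split()[:2])  # shared boundary words
```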
How would you improve answer groundedness in a system that must cite internal documents?
I’d treat groundedness as a retrieval, generation, and verification problem, not just a prompting problem.
Tighten retrieval first: better chunking, metadata filters, hybrid search, and re-ranking so the model sees the right passages.
Force citation-aware generation: every claim must map to a retrieved span, and unsupported claims should be dropped or labeled as uncertain.
Add a verifier stage that checks claim-to-evidence alignment, citation accuracy, and whether the cited text actually supports the wording.
Use structured outputs (answer sentence, citation IDs, evidence spans, confidence) so auditing is easy.
Evaluate with a groundedness set: citation precision, support rate, hallucination rate, and human review on edge cases.
Example: if policy answers must cite internal docs, I’d require each sentence to reference a doc section, then reject responses where the verifier can’t find textual support.
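A minimal version of that verifier stage might use token overlap as a crude proxy for textual support; a real system would use an NLI model or LLM judge, and the doc IDs and threshold here are invented.

```python
# Verifier-stage sketch: check whether each answer sentence's citation
# actually contains supporting text, using token overlap as a crude
# proxy for entailment.

def support_score(claim: str, evidence: str) -> float:
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(evidence.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)

def verify(answer: list[tuple[str, str]], docs: dict[str, str],
           threshold: float = 0.5) -> list[str]:
    """answer: (sentence, cited_doc_id) pairs; returns unsupported sentences."""
    return [sent for sent, doc_id in answer
            if support_score(sent, docs.get(doc_id, "")) < threshold]

docs = {"policy_4": "refunds are allowed within 30 days of purchase"}
answer = [
    ("refunds are allowed within 30 days", "policy_4"),   # supported
    ("refunds require manager approval", "policy_4"),     # not in the doc
]
print(verify(answer, docs))  # flags the unsupported sentence
```

Sentences the verifier flags get dropped, labeled uncertain, or sent back for regeneration rather than shown as cited facts.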
Have you built systems that use function calling or tool use? How did you decide what actions the model could take and how did you validate them?
Yes. I’ve built LLM workflows where the model could call tools for retrieval, ticketing, search, SQL, and internal APIs. The key is to keep the action space narrow and explicit, then validate every step outside the model.
I start from user intents, then map only high value, low ambiguity actions into tools with tight schemas.
Each tool gets clear input constraints, examples, auth boundaries, and idempotency rules where possible.
I separate planning from execution, so the model proposes an action, but deterministic code validates arguments and policy.
Validation includes schema checks, allowlists, permission checks, dry runs, and fallback to clarification if confidence is low.
I test with happy paths, adversarial prompts, malformed arguments, and replay traces to measure tool precision, failure rate, and unsafe execution.
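The deterministic validation layer between model proposal and execution might look like this sketch; the tool registry and field names are illustrative, not a real API.

```python
# Schema checks plus an allowlist, so the model plans but code decides.

TOOLS = {
    "create_ticket": {
        "required": {"title", "priority"},
        "allowed_values": {"priority": {"low", "medium", "high"}},
    },
}

def validate_call(tool: str, args: dict) -> list[str]:
    """Return a list of violations; an empty list means safe to execute."""
    errors = []
    spec = TOOLS.get(tool)
    if spec is None:
        return [f"tool '{tool}' is not on the allowlist"]
    for field in spec["required"] - args.keys():
        errors.append(f"missing required field '{field}'")
    for field, allowed in spec["allowed_values"].items():
        if field in args and args[field] not in allowed:
            errors.append(f"bad value for '{field}': {args[field]!r}")
    return errors

print(validate_call("create_ticket", {"title": "VPN down", "priority": "high"}))
print(validate_call("create_ticket", {"priority": "urgent"}))  # two violations
print(validate_call("drop_database", {}))                      # not allowlisted
```

Because validation is plain code, it can be unit-tested with malformed and adversarial arguments independently of the model.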
What are the common security risks in LLM applications, such as prompt injection, data leakage, and indirect injection, and how would you mitigate them?
I’d frame it this way: LLM apps are risky anywhere untrusted input can influence model behavior or expose sensitive data. The key is layered defenses, not one silver bullet.
Prompt injection: users try to override system rules. Mitigate with strict instruction hierarchy, input delimiting, output validation, and least-privilege tool access.
Indirect injection: malicious text hides in docs, web pages, or emails the model reads. Treat retrieved content as untrusted, sanitize it, isolate tools, and require confirmation for high-impact actions.
Data leakage: models may reveal secrets from prompts, memory, logs, or retrieval. Minimize sensitive data, redact PII, encrypt storage, scope retrieval, and scrub logs.
Over-permissioned agents: the model can call tools it should not. Use allowlists, role-based access, approval gates, and audit trails.
Insecure outputs: generated code, SQL, or URLs can be dangerous. Validate, sandbox, and enforce policy checks before execution.
What embedding model considerations matter when building search or retrieval for enterprise knowledge bases?
I’d bucket it into retrieval quality, operational fit, and governance.
Domain fit matters most: test models on your own KB, because legal, finance, product docs, and tickets use very different language.
Check embedding quality with recall@k, MRR, and hard-query sets, not just leaderboard scores.
Choose the right granularity: chunk size, overlap, and metadata strategy often impact results as much as model choice.
Dimensionality, latency, and cost matter in production, especially for large corpora and frequent re-indexing.
Multilingual support, code support, and long-document handling are key if your KB is diverse.
Stability matters: model changes can break similarity behavior, so version embeddings and plan for reindexing.
Security matters in enterprise settings: look at data residency, PII handling, vendor retention, and whether self-hosting is needed.
A practical interview answer: “I’d run an offline eval on real queries, compare retrieval quality, then weigh cost, latency, and compliance before picking a model.”
Explain the differences between zero-shot, few-shot, chain-of-thought prompting, and tool-augmented prompting. When would you use each?
Here’s the clean way to explain it:
Zero-shot: you give only the task, no examples. Use it for simple, common tasks like classification, summarization, or rewriting.
Few-shot: you include a handful of examples to show the pattern. Use it when formatting, tone, or edge-case behavior matters.
Chain-of-thought: you ask for step-by-step reasoning. Use it for multi-step logic, math, planning, or troubleshooting, though in practice I’d often ask for a concise rationale, not full hidden reasoning.
Tool-augmented prompting: the model can call external tools like search, calculators, databases, or code execution. Use it when accuracy depends on current data, exact computation, or external actions.
Rule of thumb: zero-shot for speed, few-shot for consistency, chain-of-thought for reasoning depth, tools for grounded accuracy and real-world usefulness.
How would you design guardrails for an LLM agent that can call APIs, send messages, or modify records?
I’d design guardrails as layered controls, not one filter. The key is to reduce blast radius, require stronger checks for riskier actions, and make every action observable.
Start with action tiers: read, low-risk write, high-risk write, external communication, privileged admin.
Give the agent least privilege: scoped tokens, per-tool allowlists, field-level restrictions, no broad database access.
Add policy checks before execution: validate inputs, detect prompt injection, enforce business rules, block unsafe parameter combos.
Use human approval for irreversible or high-impact actions, like deletes, payments, or sending external messages.
Make execution constrained: typed schemas, templates for outbound messages, dry-run mode, rate limits, idempotency keys.
Add monitoring and rollback: full audit logs, anomaly alerts, replay traces, easy revert paths.
Example: if the agent updates CRM records, let it edit notes automatically, but require approval to change owner, status, or billing fields.
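The tiering plus approval-gate idea can be sketched as below; the action names, tiers, and threshold are illustrative assumptions, not a real policy engine.

```python
from enum import IntEnum

# Tiered approval gate: map each action to a risk tier and require
# human approval above a threshold; unknown actions fail closed.

class Tier(IntEnum):
    READ = 0
    LOW_RISK_WRITE = 1
    HIGH_RISK_WRITE = 2
    EXTERNAL_COMM = 3

ACTION_TIERS = {
    "crm.read_record": Tier.READ,
    "crm.edit_notes": Tier.LOW_RISK_WRITE,
    "crm.change_owner": Tier.HIGH_RISK_WRITE,
    "email.send_external": Tier.EXTERNAL_COMM,
}

APPROVAL_THRESHOLD = Tier.HIGH_RISK_WRITE

def gate(action: str) -> str:
    tier = ACTION_TIERS.get(action)
    if tier is None:
        return "deny"                  # unknown action: fail closed
    if tier >= APPROVAL_THRESHOLD:
        return "needs_human_approval"  # irreversible or high impact
    return "auto_execute"

print(gate("crm.edit_notes"))    # auto_execute
print(gate("crm.change_owner"))  # needs_human_approval
print(gate("db.drop_table"))     # deny
```

This mirrors the CRM example above: note edits flow automatically, while owner or billing changes wait for a human.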
How would you handle personally identifiable information and sensitive enterprise data in an LLM workflow?
I’d treat it as a defense-in-depth problem: minimize what the model sees, control where data flows, and prove those controls with monitoring and audits.
Start with data classification: tag PII, PHI, secrets, and contracts, and decide what is allowed in prompts, retrieval, logs, and training.
Minimize exposure: redact or tokenize sensitive fields before inference, and use least-privilege access for apps, users, and vector stores.
Isolate enterprise data: prefer private deployment or VPC endpoints, encrypt in transit and at rest, and disable vendor training on customer data.
Add guardrails: DLP on prompts and outputs, prompt injection defenses, retrieval filters, and policy checks before responses are returned.
Keep observability safe: avoid raw prompt logging, use hashed identifiers, retention limits, audit trails, and regular red-team testing.
Example: for a support copilot, I’d mask customer SSNs before retrieval, store mappings in a secure vault, and only rehydrate for authorized downstream systems, not for the LLM.
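The redact-then-rehydrate flow can be sketched with simple regex detectors. The patterns and placeholder scheme are simplified assumptions; real DLP uses much broader detectors than SSN-shaped and email-shaped strings.

```python
import re

# Redaction-before-inference sketch: mask SSN-shaped and email-shaped
# strings and keep the mapping outside the prompt, so only an authorized
# downstream system (not the LLM) can rehydrate them.

PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace sensitive spans with placeholders; return the mapping
    (the "vault") for authorized rehydration later."""
    vault = {}
    for label, pattern in PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            placeholder = f"[{label}_{i}]"
            vault[placeholder] = match
            text = text.replace(match, placeholder)
    return text, vault

masked, vault = redact("Customer 123-45-6789 wrote from ana@example.com")
print(masked)  # Customer [SSN_0] wrote from [EMAIL_0]
print(vault)   # mapping stays in a secure store, never in the prompt
```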
1. How would you explain the transformer architecture to a mixed audience of engineers and product stakeholders, and which parts matter most when building LLM-powered products?
I’d explain it as a system that reads all the words in a prompt together, figures out which earlier words matter most for each next word, and then predicts what should come next. For engineers, that core mechanism is self-attention. For product folks, the key idea is context handling: transformers are good at using surrounding information to produce coherent, task-relevant text.
What matters most in products:
- Tokenization and context window, they shape cost, latency, and how much the model can “remember” in one call.
- Attention behavior, it drives relevance, instruction following, and retrieval quality.
- Training vs inference, pretraining gives broad capability, fine-tuning and prompting shape product behavior.
- Sampling settings, like temperature, affect consistency versus creativity.
- Guardrails, evals, and observability matter more than architecture details in production.
2. How do tokenization choices affect model behavior, latency, multilingual support, and downstream cost?
Tokenization is a quiet design choice that hits almost everything: quality, speed, and cost.
Model behavior: bad splits make patterns harder to learn, especially for numbers, code, rare words, and morphology. Good tokenization reduces sequence length and improves consistency.
Latency: more tokens means longer prompts, more attention work, slower inference, and lower throughput. Fewer tokens usually helps both training and serving speed.
Multilingual support: English-centric vocabularies often fragment languages like Turkish, Finnish, Hindi, or Chinese, which hurts quality and fairness. Multilingual tokenizers need balanced coverage across scripts.
Downstream cost: most APIs bill per token, so token-heavy languages or domains, like legal text or source code, become more expensive.
Tradeoff: huge vocabularies shorten sequences but increase embedding and softmax cost. Smaller vocabularies do the opposite, so teams tune for their workload.
3. What are the main failure modes in RAG systems, and how do you debug whether the problem is retrieval, prompting, ranking, or generation?
RAG usually fails in four places: retrieval, ranking, prompt construction, and generation. I debug it as a pipeline, not a black box.
Retrieval failures: wrong chunks, missing chunks, stale index, bad chunking, weak embeddings, metadata filters too strict.
Ranking failures: relevant docs are retrieved but buried, reranker overfits to keywords, diversity is too low, near duplicates dominate.
Prompting failures: context is good but instructions are unclear, too much irrelevant context, citations not enforced, truncation drops key evidence.
Generation failures: model hallucinates, ignores context, over-compresses, or answers from prior knowledge instead of retrieved docs.
My process is simple: inspect top-k docs for a labeled query set, measure recall@k before blaming the LLM, then compare retrieved vs reranked results. Next, freeze retrieval and vary prompts to see if answer quality moves. Finally, force extractive answers or quote-supported outputs. If that works, retrieval is fine and generation is the issue.
No strings attached, free trial, fully vetted.
Try your first call for free with every mentor you're meeting. Cancel anytime, no questions asked.
4. What causes hallucinations in LLMs, and how would you reduce them in a production system?
Hallucinations mostly come from how LLMs are trained. They predict the next likely token, not the true one, so they can produce fluent but wrong answers. Common causes are weak or conflicting training data, lack of domain context, outdated knowledge, ambiguous prompts, long-context failures, and decoding settings that favor creativity over precision.
In production, I’d reduce them with a layered approach:
- Use RAG, so answers are grounded in trusted, current documents.
- Constrain outputs with schemas, tools, or retrieval-only answering for high-risk tasks.
- Tune prompts to require citations, uncertainty, and refusal when evidence is missing.
- Lower temperature and test decoding settings for factual accuracy.
- Add post-generation verification, like rule checks, secondary models, or tool-based validation.
- Monitor hallucination rates with eval sets, human review, and feedback loops by use case.
5. How would you decide whether to use prompting, retrieval-augmented generation, fine-tuning, or a combination of these approaches for a new use case?
I’d decide based on three things: where the knowledge lives, how stable the task is, and how much behavior control you need.
Start with prompting if the task is mostly reasoning, formatting, or workflow guidance, and the model already knows enough.
Use RAG when the answer depends on private, fresh, or large changing knowledge, like docs, tickets, policies, or product data.
Use fine-tuning when you need consistent style, domain-specific patterns, tool calling behavior, or better performance on a narrow repeated task.
Combine them when you need both, for example RAG for current facts plus fine-tuning for response structure or decision policy.
Validate with a small eval set first: accuracy, hallucination rate, latency, cost, and maintainability.
My usual path is prompt baseline, add RAG if knowledge is the gap, fine-tune only if prompting plus RAG still misses consistency or quality targets.
6. Walk me through how self-attention works and why it scales differently from recurrent models.
Think of self-attention as, for each token, asking, "Which other tokens matter to me right now?" Each token is projected into three vectors: Q, K, and V. You score relevance by taking Q from one token against K from all tokens, usually with a scaled dot product, apply softmax to get weights, then take a weighted sum of the V vectors. That gives a context-aware representation for each token. Multi-head attention just does this several times in parallel, so different heads can capture different relationships, like syntax, coreference, or long-range dependencies.
Why it scales differently from RNNs:
- RNNs are sequential, token t depends on token t-1, so training is hard to parallelize.
- Self-attention processes all tokens at once, which is much faster on GPUs.
- But attention compares every token to every other token, so cost is O(n^2) in sequence length.
- RNNs are closer to O(n) per layer in sequence length, but less parallel.
7. How do you think about prompt engineering as an engineering discipline rather than ad hoc experimentation?
I treat prompt engineering like applied software engineering: define the contract, build a repeatable evaluation loop, and optimize against measurable outcomes, not vibes.
Start with a spec: task, inputs, constraints, failure modes, and what "good" looks like.
Build evals early: a small golden set first, then broader regression tests with edge cases.
Version everything: prompts, model settings, tools, datasets, and outputs.
Separate concerns: system behavior, task instructions, context retrieval, and output schema.
Instrument production: latency, cost, refusal rate, hallucination rate, and human override rate.
In practice, I use an iterative loop: hypothesize, change one variable, run evals, inspect failures, then codify learnings into patterns. That makes prompt work auditable, reproducible, and much easier to scale across teams.
8. What are the tradeoffs between using a larger general model and a smaller task-specific model?
It comes down to capability versus efficiency.
Larger general models handle broader tasks, ambiguous inputs, and edge cases better.
They usually require more compute, higher latency, and cost more to run and fine-tune.
Smaller task-specific models are faster, cheaper, and easier to deploy, especially on-device or with tight SLAs.
They can outperform big models on narrow domains if trained well on the right data.
The downside is brittleness: they often fail outside their scope and need more task-by-task maintenance.
In practice, I’d pick a large model when requirements are open-ended or changing. I’d pick a smaller model when the task is stable, high-volume, and well-defined, like classification, ranking, or extraction. A common pattern is hybrid, use a small model by default and escalate to a larger one for hard cases.
9. How do context windows affect system design, and what strategies do you use when relevant information exceeds the model’s context length?
Context windows shape both architecture and product behavior. Bigger windows reduce orchestration overhead, but they raise cost, latency, and can still dilute attention, so I design for selective context, not maximal context. I treat the prompt like a working set, only the most relevant facts should be loaded.
Separate memory types: system instructions, session state, retrieved knowledge, and long-term summaries.
Use retrieval first: chunk documents well, embed them, and fetch only top relevant passages.
Compress aggressively: rolling summaries, entity/state tracking, and citation-preserving map-reduce summaries.
Keep structured state outside the prompt, in tools or databases, then inject only what the model needs.
Add guardrails: relevance filters, recency weighting, deduping, and prompt budgeting by token class.
If context still exceeds limits, I use hierarchical retrieval plus iterative refinement: retrieve, summarize, answer, then re-query missing gaps. This is more reliable than stuffing everything into one giant prompt.
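One way to make "prompt budgeting by token class" concrete is a small allocator like the sketch below. The class split, the 40 percent history fraction, and the word-count tokenizer are all simplifying assumptions, not a production design:

```python
# Hedged sketch: allocate a fixed prompt budget across token classes,
# keeping the most recent history and the highest-ranked chunks.
# Whitespace words stand in for tokens; use a real tokenizer in practice.
def budget_prompt(system: str, history: list[str], chunks: list[str],
                  max_tokens: int = 200) -> str:
    def count(text: str) -> int:
        return len(text.split())

    # System instructions are non-negotiable; they spend first.
    spent = count(system)
    kept_history, kept_chunks = [], []

    # Most recent turns first (recency weighting); cap history at ~40%.
    for turn in reversed(history):
        if spent + count(turn) > max_tokens * 0.4 + count(system):
            break
        kept_history.insert(0, turn)  # restore chronological order
        spent += count(turn)

    # Retrieved chunks fill the remainder, highest relevance first.
    for chunk in chunks:
        if spent + count(chunk) > max_tokens:
            break
        kept_chunks.append(chunk)
        spent += count(chunk)

    return "\n\n".join([system, *kept_history, *kept_chunks])

prompt = budget_prompt(
    "Answer using only the provided context.",
    history=["user: hi", "assistant: hello"],
    chunks=["chunk A ...", "chunk B ..."],
    max_tokens=40,
)
print(len(prompt.split()), "words in final prompt")
```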
10. How do you evaluate whether an embedding model is good enough for a specific domain?
I’d evaluate it in layers, starting from the actual job the embeddings need to do, not just generic benchmark scores.
Define the use case first: semantic search, clustering, deduping, recommendations, or classification.
Build a domain-specific eval set: real queries, documents, positives, hard negatives, edge cases, and jargon.
Measure task metrics: Recall@k, MRR, and nDCG for retrieval, or clustering purity and classification lift.
Compare against a simple baseline, such as BM25 or your current model, because “good enough” is relative.
Do manual error analysis: inspect failures for synonym gaps, acronym handling, ambiguity, and calibration.
In practice, I’d run an offline eval first, then a small online test. If it beats baseline on your domain data, handles critical edge cases, and improves business metrics with acceptable cost and latency, it’s good enough.
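Recall@k and MRR are straightforward to compute once you have a labeled eval set. A minimal sketch, with made-up query and document IDs:

```python
# Offline retrieval metrics over a labeled eval set.
# `results` maps each query to its ranked doc IDs;
# `relevant` maps each query to its gold relevant doc IDs.
def recall_at_k(results, relevant, k=5):
    hits = sum(bool(set(results[q][:k]) & relevant[q]) for q in results)
    return hits / len(results)

def mrr(results, relevant):
    total = 0.0
    for q, ranked in results.items():
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant[q]:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
    return total / len(results)

results = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d2", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d4"}}
print(recall_at_k(results, relevant, k=3))  # 1.0
print(mrr(results, relevant))               # (1/2 + 1/3) / 2 ≈ 0.42
```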
11. What metrics would you use to evaluate an LLM system, and how would you balance offline metrics with real user outcomes?
I’d split it into three layers: model quality, system quality, and business impact. Offline metrics are useful for fast iteration, but they’re only proxies, so I’d treat them as gates, not the final source of truth.
Model quality: exact match, F1, ROUGE or BLEU when relevant, plus semantic similarity for open-ended outputs.
System quality: response time, timeout rate, cost per request, turn count, and user effort.
Business impact: satisfaction, retention, conversion, successful task completion, and escalation rate.
For balance, I’d first optimize offline until I clear quality and safety thresholds, then run online A/B tests. If offline improves but user outcomes do not, I’d trust production signals and revisit my eval set, because the benchmark is probably missing real user behavior.
12. What are the practical differences between pretraining, supervised fine-tuning, instruction tuning, and preference optimization?
Think of them as different stages that teach different behaviors.
Pretraining: next-token prediction on huge unlabeled text. It teaches general language patterns, facts, reasoning priors, and broad capabilities.
Supervised fine-tuning: training on labeled input-output pairs. It specializes the model for a task or domain, like legal QA or code review.
Instruction tuning: a type of supervised fine-tuning that uses prompt-response examples to make the model follow human instructions, formats, and constraints better.
Preference optimization: methods like RLHF, DPO, or RLAIF that train on ranked outputs, not just single gold answers, to align style and behavior with human preferences.
Practically, pretraining gives raw capability, SFT adds task competence, instruction tuning improves usability, and preference optimization improves helpfulness, safety, and tone.
A common pipeline is: pretrain, SFT or instruction tune, then preference optimize.
13. How would you design an evaluation framework for a customer support assistant that must be accurate, safe, and fast?
I’d design it as a layered eval system: offline before launch, online after launch, and continuous regression checks tied to product changes.
Define target metrics by priority: accuracy and safety as hard gates, latency as an SLO.
Build a representative test set from real tickets, segmented by intent, complexity, language, policy sensitivity, and edge cases.
Measure accuracy with task success, answer correctness, citation or policy adherence, and escalation quality when unsure.
Measure safety with adversarial tests: hallucinations, harmful advice, prompt injection, privacy leakage, and tone failures.
Measure speed at p50 and p95 latency, plus time to first token and tool-call delays.
Use both LLM judges and human review, humans for calibration and high-risk categories.
In production, track containment, CSAT, escalation rate, repeat contacts, and incident alerts.
Run regression evals on every model, prompt, policy, or retrieval change.
14. Describe your experience with retrieval-augmented generation. How did you chunk data, embed it, retrieve it, and measure retrieval quality?
I’ve built RAG pipelines for support docs, internal wikis, and code-heavy knowledge bases. The key is treating retrieval like a search problem first, then tuning generation second.
Chunking: I used semantic and structure-aware chunking, typically 300 to 800 tokens with 10 to 20 percent overlap, preserving headings, tables, and code blocks.
Embeddings: I embedded chunks plus metadata, like title, section, product, and timestamps, then stored them in a vector DB with filters for freshness and access control.
Retrieval: I usually combined dense retrieval with BM25 or keyword search, then reranked the top 20 to 50 results with a cross-encoder before sending the best few to the LLM.
Quality: I measured recall@k, MRR, and nDCG on a labeled eval set, and also tracked answer groundedness, citation accuracy, and human-rated relevance.
Tuning: Most gains came from better chunk boundaries, query expansion, metadata filtering, and hard-negative mining.
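A stripped-down version of fixed-size chunking with overlap might look like this. Real pipelines would split on tokenizer tokens and respect headings, tables, and code blocks, which this word-based sketch ignores:

```python
# Minimal sketch of sliding-window chunking with overlap.
# Whitespace words stand in for tokens here.
def chunk(text: str, size: int = 400, overlap: int = 60) -> list[str]:
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + size]
        chunks.append(" ".join(piece))
        if start + size >= len(words):
            break  # the last window already covers the tail
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
parts = chunk(doc, size=400, overlap=60)
print(len(parts))  # windows of 400 words, stepping 340 at a time
```

The overlap means the last 60 words of each chunk reappear at the start of the next, so facts that straddle a boundary are still retrievable.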
15. How would you improve answer groundedness in a system that must cite internal documents?
I’d treat groundedness as a retrieval, generation, and verification problem, not just a prompting problem.
Tighten retrieval first: better chunking, metadata filters, hybrid search, and re-ranking so the model sees the right passages.
Force citation-aware generation: every claim must map to a retrieved span, and unsupported claims should be dropped or labeled as uncertain.
Add a verifier stage that checks claim-to-evidence alignment, citation accuracy, and whether the cited text actually supports the wording.
Use structured outputs (answer sentence, citation IDs, evidence spans, confidence) so auditing is easy.
Evaluate with a groundedness set: citation precision, support rate, hallucination rate, and human review on edge cases.
Example: if policy answers must cite internal docs, I’d require each sentence to reference a doc section, then reject responses where the verifier can’t find textual support.
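A toy version of that verifier stage, using lexical overlap as a stand-in for a real NLI model or LLM judge. The doc IDs, sentences, and threshold are all illustrative:

```python
# Sketch of a claim-to-evidence verifier. Real systems would use an
# NLI model or LLM judge; simple token overlap stands in here.
def supported(sentence: str, evidence: str, threshold: float = 0.5) -> bool:
    claim_words = {w.lower().strip(".,") for w in sentence.split()}
    evidence_words = {w.lower().strip(".,") for w in evidence.split()}
    if not claim_words:
        return False
    return len(claim_words & evidence_words) / len(claim_words) >= threshold

def verify(answer: list[dict], docs: dict) -> list[dict]:
    report = []
    for item in answer:
        evidence = docs.get(item["citation"], "")
        report.append({**item, "grounded": supported(item["sentence"], evidence)})
    return report

docs = {"policy-4.2": "Refunds are issued within 30 days of purchase."}
answer = [
    {"sentence": "Refunds are issued within 30 days.", "citation": "policy-4.2"},
    {"sentence": "Gift cards are always refundable.", "citation": "policy-4.2"},
]
for row in verify(answer, docs):
    print(row["grounded"], row["sentence"])
```

Sentences flagged `grounded: False` would be dropped, rewritten, or labeled as uncertain before the response ships.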
16. Have you built systems that use function calling or tool use? How did you decide what actions the model could take and how did you validate them?
Yes. I’ve built LLM workflows where the model could call tools for retrieval, ticketing, search, SQL, and internal APIs. The key is to keep the action space narrow and explicit, then validate every step outside the model.
I start from user intents, then map only high-value, low-ambiguity actions into tools with tight schemas.
Each tool gets clear input constraints, examples, auth boundaries, and idempotency rules where possible.
I separate planning from execution, so the model proposes an action, but deterministic code validates arguments and policy.
Validation includes schema checks, allowlists, permission checks, dry runs, and fallback to clarification if confidence is low.
I test with happy paths, adversarial prompts, malformed arguments, and replay traces to measure tool precision, failure rate, and unsafe execution.
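The deterministic validation layer between proposal and execution can be sketched roughly like this; the tool name, fields, and roles are invented for illustration:

```python
# Sketch of argument and policy validation that runs between the
# model's proposed tool call and actual execution.
ALLOWED_TOOLS = {
    "create_ticket": {
        "fields": {"title": str, "priority": str},
        "enums": {"priority": {"low", "medium", "high"}},
        "roles": {"agent", "admin"},
    }
}

def validate_call(tool: str, args: dict, role: str):
    spec = ALLOWED_TOOLS.get(tool)
    if spec is None:
        return False, f"tool '{tool}' is not on the allowlist"
    if role not in spec["roles"]:
        return False, f"role '{role}' may not call '{tool}'"
    for field, typ in spec["fields"].items():
        if not isinstance(args.get(field), typ):
            return False, f"missing or mistyped field '{field}'"
    for field, allowed in spec["enums"].items():
        if args[field] not in allowed:
            return False, f"'{args[field]}' not allowed for '{field}'"
    return True, "ok"

ok, reason = validate_call(
    "create_ticket", {"title": "VPN down", "priority": "urgent"}, "agent"
)
print(ok, reason)  # ok is False; reason explains the rejected priority
```

The model proposes, but this code decides, which is what keeps malformed or adversarial arguments from reaching real systems.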
17. What are the common security risks in LLM applications, such as prompt injection, data leakage, and indirect injection, and how would you mitigate them?
I’d frame it this way: LLM apps are risky anywhere untrusted input can influence model behavior or expose sensitive data. The key is layered defenses, not one silver bullet.
Prompt injection: users try to override system rules. Mitigate with strict instruction hierarchy, input delimiting, output validation, and least-privilege tool access.
Indirect injection: malicious text hides in docs, web pages, or emails the model reads. Treat retrieved content as untrusted, sanitize it, isolate tools, and require confirmation for high-impact actions.
Data leakage: models may reveal secrets from prompts, memory, logs, or retrieval. Minimize sensitive data, redact PII, encrypt storage, scope retrieval, and scrub logs.
Over-permissioned agents: the model can call tools it should not. Use allowlists, role-based access, approval gates, and audit trails.
Insecure outputs: generated code, SQL, or URLs can be dangerous. Validate, sandbox, and enforce policy checks before execution.
18. What embedding model considerations matter when building search or retrieval for enterprise knowledge bases?
I’d bucket it into retrieval quality, operational fit, and governance.
Domain fit matters most: test models on your own KB, because legal, finance, product docs, and tickets use very different language.
Check embedding quality with recall@k, MRR, and hard-query sets, not just leaderboard scores.
Choose the right granularity: chunk size, overlap, and metadata strategy often impact results as much as model choice.
Dimensionality, latency, and cost matter in production, especially for large corpora and frequent re-indexing.
Multilingual support, code support, and long-document handling are key if your KB is diverse.
Stability matters: model changes can break similarity behavior, so version embeddings and plan for reindexing.
Security matters in the enterprise: look at data residency, PII handling, vendor retention, and whether self-hosting is needed.
A practical answer in interviews is, “I’d run an offline eval on real queries, compare retrieval quality, then weigh cost, latency, and compliance before picking a model.”
19. Explain the differences between zero-shot, few-shot, chain-of-thought prompting, and tool-augmented prompting. When would you use each?
Here’s the clean way to explain it:
Zero-shot: you give only the task, no examples. Use it for simple, common tasks like classification, summarization, or rewriting.
Few-shot: you include a handful of examples to show the pattern. Use it when formatting, tone, or edge-case behavior matters.
Chain-of-thought: you ask for step-by-step reasoning. Use it for multi-step logic, math, planning, or troubleshooting, though in practice I’d often ask for a concise rationale, not full hidden reasoning.
Tool-augmented prompting: the model can call external tools like search, calculators, databases, or code execution. Use it when accuracy depends on current data, exact computation, or external actions.
Rule of thumb: zero-shot for speed, few-shot for consistency, chain-of-thought for reasoning depth, tools for grounded accuracy and real-world usefulness.
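For the first two styles, the difference is mostly prompt assembly. A minimal sketch with a made-up sentiment task (the task wording and examples are invented for illustration):

```python
# Illustrative prompt builders for zero-shot and few-shot styles.
def zero_shot(task: str, text: str) -> str:
    return f"{task}\n\nInput: {text}\nOutput:"

def few_shot(task: str, examples: list[tuple[str, str]], text: str) -> str:
    # Each example demonstrates the expected format and labeling pattern.
    shots = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{task}\n\n{shots}\n\nInput: {text}\nOutput:"

examples = [("great product", "positive"), ("arrived broken", "negative")]
print(few_shot("Classify the sentiment.", examples, "works as advertised"))
```

Chain-of-thought would add an instruction like "explain your reasoning step by step", and tool-augmented prompting moves work out of the prompt entirely and into function calls.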
20. How would you design guardrails for an LLM agent that can call APIs, send messages, or modify records?
I’d design guardrails as layered controls, not one filter. The key is to reduce blast radius, require stronger checks for riskier actions, and make every action observable.
Start with action tiers: read, low-risk write, high-risk write, external communication, privileged admin.
Give the agent least privilege: scoped tokens, per-tool allowlists, field-level restrictions, and no broad database access.
Add policy checks before execution: validate inputs, detect prompt injection, enforce business rules, block unsafe parameter combos.
Use human approval for irreversible or high impact actions, like deletes, payments, or sending external messages.
Make execution constrained: typed schemas, templates for outbound messages, dry-run mode, rate limits, idempotency keys.
Add monitoring and rollback: full audit logs, anomaly alerts, replay traces, easy revert paths.
Example: if the agent updates CRM records, let it edit notes automatically, but require approval to change owner, status, or billing fields.
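A minimal sketch of the tiered gate, with illustrative tier names and thresholds:

```python
# Sketch of tiered guardrails: riskier tiers require stronger checks.
TIERS = {
    "read": 0,
    "low_risk_write": 1,    # e.g. append a note to a record
    "high_risk_write": 2,   # e.g. change owner, status, billing fields
    "external_comm": 2,     # e.g. send a message outside the org
    "privileged_admin": 3,
}

AUTO_APPROVE_MAX = 1  # anything above this needs a human in the loop

def gate(action: str, tier_name: str, approved_by_human: bool = False) -> str:
    tier = TIERS[tier_name]
    if tier <= AUTO_APPROVE_MAX or approved_by_human:
        return f"execute:{action}"
    return f"hold:{action} (tier {tier} requires human approval)"

print(gate("append_crm_note", "low_risk_write"))
print(gate("change_billing_field", "high_risk_write"))
```

Every decision, held or executed, would also go to the audit log so unsafe patterns show up in monitoring.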
21. How would you handle personally identifiable information and sensitive enterprise data in an LLM workflow?
I’d treat it as a defense-in-depth problem: minimize what the model sees, control where data flows, and prove those controls with monitoring and audits.
Start with data classification: tag PII, PHI, secrets, and contracts, and decide what is allowed in prompts, retrieval, logs, and training.
Minimize exposure: redact or tokenize sensitive fields before inference, and use least-privilege access for apps, users, and vector stores.
Isolate enterprise data: prefer private deployment or VPC endpoints, encrypt in transit and at rest, and disable vendor training on customer data.
Add guardrails: DLP on prompts and outputs, prompt-injection defenses, retrieval filters, and policy checks before responses are returned.
Keep observability safe: avoid raw prompt logging, and use hashed identifiers, retention limits, audit trails, and regular red-team testing.
Example: for a support copilot, I’d mask customer SSNs before retrieval, store mappings in a secure vault, and only rehydrate for authorized downstream systems, not for the LLM.
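That masking step could be sketched like this; the SSN pattern and vault shape are illustrative, and a production system would use a proper DLP service and secure vault rather than an in-memory dict:

```python
import re

# Sketch: mask SSNs before inference, keeping the token-to-value
# mapping outside the LLM for authorized rehydration later.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask(text: str, vault: dict) -> str:
    def replace(match):
        token = f"<SSN_{len(vault)}>"
        vault[token] = match.group(0)  # mapping never enters the prompt
        return token
    return SSN_RE.sub(replace, text)

vault: dict = {}
prompt = "Customer 123-45-6789 asked about a late fee."
safe = mask(prompt, vault)
print(safe)   # Customer <SSN_0> asked about a late fee.
print(vault)  # {'<SSN_0>': '123-45-6789'}
```

The model only ever sees `<SSN_0>`; downstream systems with the right permissions can swap the real value back in.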