AI Engineer Interview Questions UK 2026
Thirty real AI engineer interview questions from UK employers in 2026 — DeepMind, Anthropic, Wayve and the banks — with model-answer outlines and structure.
The Short Answer
The UK AI engineer interview process in 2026 typically runs four to five stages over three to six weeks: a recruiter screen, a coding or take-home assessment, one or two technical deep-dives, an ML system design round, and a behavioural or culture interview. Below we cover thirty questions our candidates have actually faced this year, grouped by stage — coding and implementation, ML system design, ML fundamentals, LLM-specific scenarios, and behavioural. Top UK employers in this space include Google DeepMind, Anthropic's London office, Wayve, Faculty AI and Cohere, alongside the bank AI labs at HSBC, Barclays and Lloyds. A strong candidate typically combines PyTorch fluency, a working knowledge of transformer internals, the ability to reason about latency and cost trade-offs in production LLM systems, and a track record of shipping something — ideally with metrics attached. Successful candidates this year are typically taking packages of £130,000–£250,000 in London, with research roles at frontier labs running higher.
How UK AI Engineer Interview Processes Are Structured in 2026
The typical UK AI engineer loop in 2026 has stabilised around five stages, though the order and weight vary by employer.
Recruiter screen (30 minutes). Motivation, package expectations, visa status, notice period. Frontier labs increasingly ask about safety views here too.
Technical screen or take-home (60–120 minutes live, or 4–8 hours async). Either a live coding call — typically PyTorch or NumPy implementation rather than pure leetcode — or a take-home such as fine-tuning a small model on a provided dataset and writing it up.
ML deep-dive (60 minutes). A senior engineer probes your knowledge of model internals: attention, optimisers, distributed training, evaluation. Expect to be asked to derive things on a shared whiteboard.
ML system design (60 minutes). Design a RAG system, a recommendation pipeline, or an evaluation harness for an agent. The bar is reasoning about latency, cost, failure modes and monitoring — not just drawing boxes.
Behavioural and culture (45–60 minutes). Past projects, conflict, ambiguity, and — at safety-focused employers — your views on responsible deployment.
DeepMind and Anthropic often add a research discussion round; the banks usually add a stakeholder interview. End to end, the process typically takes three to six weeks in London and Cambridge.
The 30 Questions, Grouped by Stage
Coding and Implementation (six questions)
1. Implement scaled dot-product attention in PyTorch from scratch. Model answer: compute QK^T, divide by sqrt(d_k), apply the causal mask before softmax by setting masked positions to negative infinity, then multiply the softmax output by V. Mention numerical stability of softmax and where you would use torch.nn.functional.scaled_dot_product_attention in production.
2. Write a custom PyTorch Dataset and DataLoader for streaming a 200GB JSONL file. Discuss iterable datasets, worker sharding, pin_memory, and how you would avoid replaying the same shard across workers.
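3. Implement top-k and top-p (nucleus) sampling. Walk through converting logits to probabilities with softmax, sorting them in descending order, taking the cumulative sum, masking everything outside the top k tokens or beyond the cumulative threshold p, and renormalising what remains. Mention temperature and why greedy decoding tends to produce flat, repetitive output on creative tasks.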
4. Given a list of token IDs, build a function that batches them into fixed-length sequences with packing. Discuss padding vs packing, attention mask construction, and the impact on throughput.
5. Debug this training loop that produces NaN losses after 200 steps. Typical answer: check learning rate, gradient clipping, mixed-precision loss scaling, division-by-zero in custom losses, and whether the data contains malformed inputs.
6. Implement a simple LRU cache for prompt prefixes. Useful for KV cache discussions later in the loop. Mention collections.OrderedDict or a doubly-linked list plus hashmap.
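For question 3, the filtering steps can be sketched in plain Python. This is a minimal, dependency-free illustration rather than a production sampler; in an interview you would write the same steps with tensor operations. The function names top_k_top_p_filter and sample are hypothetical, chosen for this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax: subtract the max before exponentiating."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(logits, top_k=0, top_p=1.0, temperature=1.0):
    """Return a renormalised distribution restricted by top-k and top-p.

    top_k=0 disables the top-k cut; top_p=1.0 disables the nucleus cut.
    """
    probs = softmax(logits, temperature)
    # Token indices sorted by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Mask everything else to zero and renormalise the survivors.
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample(probs, rng):
    """Draw one token index from the filtered distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

A call such as sample(top_k_top_p_filter(logits, top_p=0.9), random.Random(0)) draws one token. Note that the token which crosses the threshold is kept, so the retained mass is at least p.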
ML System Design (six questions)
7. Design a RAG system for a UK bank's customer support. Cover document ingestion, chunking strategy, embedding model choice, vector store (FAISS, pgvector, or managed), retrieval with hybrid BM25 plus dense, reranking, prompt template, guardrails, and an evaluation set. Discuss FCA traceability requirements.
8. Design an evaluation harness for an LLM agent that books trains. Talk about offline golden sets, LLM-as-judge with bias mitigation, replayable browser traces, and online A/B with safety metrics.
9. Design a feature store for a fraud detection model serving 50,000 requests per second. Cover online vs offline parity, point-in-time correctness, and a sub-50ms latency budget.
10. Design a system to fine-tune a 70B model on customer data without leaking PII. Cover differential privacy, PII redaction pipelines, LoRA vs full fine-tune trade-offs, and per-tenant adapters.
11. Design a recommendation system for a streaming platform with a cold-start problem. Two-tower retrieval, embedding-based content fallback, multi-armed bandit for exploration.
12. Design a monitoring system for a production LLM. Discuss latency percentiles, token cost per request, drift detection on input distribution, hallucination flagging via citation checking, and user feedback loops.
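Two of the monitoring signals in question 12, latency percentiles and token cost per request, are simple enough to sketch. The example below is an illustration under stated assumptions: the request record fields (latency_ms, prompt_tokens, completion_tokens) and the per-1,000-token prices are hypothetical, and a real system would compute these over a streaming window rather than an in-memory list.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_request(prompt_tokens, completion_tokens,
                     price_in_per_1k, price_out_per_1k):
    """Token cost of one request; prices are assumed per 1,000 tokens."""
    return (prompt_tokens / 1000 * price_in_per_1k
            + completion_tokens / 1000 * price_out_per_1k)


def summarise(requests, price_in_per_1k, price_out_per_1k):
    """Aggregate latency percentiles and mean cost over a batch of request logs."""
    latencies = [r["latency_ms"] for r in requests]
    costs = [
        cost_per_request(r["prompt_tokens"], r["completion_tokens"],
                         price_in_per_1k, price_out_per_1k)
        for r in requests
    ]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "mean_cost": sum(costs) / len(costs),
    }
```

In the interview, the point worth making is that p95 and p99 usually matter more than the mean, because long-tail latency is what users notice.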
ML Fundamentals (six questions)
13. Explain gradient checkpointing and when you would use it. Trade compute for memory by recomputing activations during the backward pass. Useful when training large models on limited VRAM; typically a 20–30% slowdown for 60–70% memory saving.
14. Walk through how AdamW differs from Adam. AdamW decouples weight decay: the decay is applied directly to the weights, rather than being added to the gradient as an L2 term where it would be rescaled by the adaptive step size. Mention why this matters for transformers and the typical hyperparameters.
15. What is the difference between layer norm and RMS norm? RMS norm drops the mean centring; it is faster and used in LLaMA-family models. Discuss numerical behaviour.
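16. Explain how rotary position embeddings (RoPE) work. Rotate query and key vectors by a position-dependent angle in pairs of dimensions, so that the query-key dot product depends only on the relative offset between the two positions. Mention why this generalises better than absolute embeddings.
17. What is the difference between data, tensor, pipeline and FSDP parallelism? A short table answer works: data parallelism replicates the model and splits the batch; tensor parallelism splits individual layers across devices; pipeline parallelism splits the model into sequential stages; FSDP is data parallelism with parameters, gradients and optimiser state sharded across ranks. Mention when you would combine them: typically data plus FSDP for most fine-tuning today.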
18. How would you evaluate whether a model is overfitting on a small fine-tuning dataset? Train-validation loss gap, held-out probes, evaluating on adjacent capabilities to check for catastrophic forgetting.
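The layer norm versus RMS norm distinction in question 15 is easy to show numerically. The sketch below is plain Python over a single vector and deliberately omits the learnable gain and bias that real implementations carry; the eps value is an assumption.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; no mean centring."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]
```

On a zero-mean input the two give the same result, which is a useful sanity check to mention; the outputs diverge once the input has a non-zero mean, because only layer norm removes it.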
LLM-Specific Scenarios (six questions)
19. How would you reduce hallucination in a production agent? Retrieval grounding with citations, constrained decoding for structured outputs, self-consistency sampling, post-hoc verification with a second model, and clear refusal training.
20. A customer says the chatbot is "leaking" training data. How do you investigate? Reproduce the prompt, check whether it is regurgitation or confabulation, measure with canaries, and discuss membership inference if relevant.
21. Design a prompt evaluation pipeline that costs less than £500 per release. Sampling strategy, judge-model choice, caching previous evaluations, and use of cheap classifiers as first-pass filters.
22. When would you choose fine-tuning over prompting? Volume of examples available, latency budget, format consistency, cost per token at scale. Typically over 1,000 high-quality examples and clear format requirements.
23. How does speculative decoding work? A small draft model proposes tokens; the larger model verifies in parallel. Discuss acceptance rates and typical 2–3x speedups.
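24. Walk through KV caching and its memory footprint for a 70B model at 8k context. Calculate roughly: 2 (K and V) × num_layers × num_kv_heads × head_dim × seq_len × batch × dtype_size, in bytes. With grouped-query attention, num_kv_heads is smaller than the number of query heads, and the cache shrinks in proportion. Mention paged attention and vLLM.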
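The KV cache estimate in question 24 is a straight product of factors, so it is worth being able to run the numbers quickly. The configuration below (80 layers, 64 query heads, 8 KV heads, head dimension 128, 16-bit values) is an assumed shape loosely in line with published 70B-class models, not a figure from any specific employer's question.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, dtype_bytes):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading 2 accounts for storing both K and V.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * dtype_bytes)


GIB = 1024 ** 3

# Assumed 70B-class shape, 8k context, batch of 1, fp16/bf16 (2 bytes).
full_mha = kv_cache_bytes(80, 64, 128, 8192, 1, 2)  # every head caches K and V
with_gqa = kv_cache_bytes(80, 8, 128, 8192, 1, 2)   # 8 shared KV heads

print(f"multi-head attention: {full_mha / GIB:.1f} GiB per sequence")
print(f"grouped-query attention: {with_gqa / GIB:.1f} GiB per sequence")
```

Under these assumptions the cache is 20 GiB per sequence with full multi-head attention and 2.5 GiB with eight KV heads, which is the gap that makes grouped-query attention and paged allocation worth raising in the answer.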
Behavioural and Culture (six questions)
25. Tell me about a time you shipped an ML system that failed in production. Strong answers name the failure mode (distribution shift, label noise, edge case), describe how you detected it, and what you changed about your process — not just the system.
26. Describe a time you disagreed with a colleague about a modelling approach. Focus on how you ran a cheap experiment to resolve it rather than arguing on priors.
27. Why do you want to work on safety / on capabilities / at a bank? Tailored per employer. Anthropic and DeepMind will probe your views on alignment seriously; banks want to hear about responsible deployment in regulated contexts.
28. Tell me about a paper you read recently that changed how you think. Pick something from the last three months, summarise the claim, and — crucially — say what you would do differently because of it.
29. How do you decide what to work on when given an ambiguous problem? Look for evidence of structured triage: cheapest experiment that disconfirms the riskiest assumption first.
30. Where do you want to be in three years? Concrete is better than grand. "I want to be the person who owns evaluation for a production agent" lands better than "I want to lead an AI org."
What Top UK AI Employers Specifically Look For
Google DeepMind (London, King's Cross). Expect a research-flavoured loop with at least one paper discussion. Engineers are pushed on distributed training, JAX, and the ability to read and critique recent work. Packages typically range £180,000–£400,000+ for senior research engineers.
Anthropic (London). Heavy emphasis on safety thinking, mechanistic interpretability familiarity, and honest reasoning under uncertainty. Strong written-communication bar. Compensation is among the highest in the UK market.
Wayve (London). Self-driving end-to-end models. Expect questions on multimodal architectures, video data pipelines, simulation and real-world evaluation. Strong PyTorch and CUDA bias.
Faculty AI (London). Applied consulting-style work across regulated sectors. The loop emphasises stakeholder reasoning, evaluation rigour and the ability to scope a project under cost constraints.
Cohere (London). Production LLM serving and enterprise RAG. Expect deep questions on inference optimisation, retrieval quality and multilingual evaluation.
HSBC AI Labs, Barclays, Lloyds. Bank AI labs in London, Edinburgh and increasingly Manchester emphasise model risk management, explainability, FCA and PRA compliance, and pragmatic deployment over frontier research. Packages typically £110,000–£180,000 with significant bonus.
Frequently Asked Questions: AI Engineer Interviews UK
How long does interview prep typically take?
Candidates we speak to typically spend four to eight weeks of focused prep alongside a current job, longer if they are pivoting from software engineering into ML. A reasonable split is 40% implementing things in PyTorch, 30% system design practice, 20% paper reading, 10% behavioural rehearsal.
What's the typical take-home assessment?
Usually a small fine-tuning or evaluation task with a write-up: "fine-tune a small model on this dataset, report metrics, discuss what you would do with more compute." Expected effort is typically four to eight hours. The write-up is often weighted more heavily than the code itself.
Do they ask leetcode-style questions?
Less often than at tier-one tech companies, but not never. The banks and Faculty AI are most likely to include a leetcode-style round. DeepMind and Anthropic typically prefer ML-flavoured implementation problems (write attention, write a sampler, write a custom loss) over pure algorithmic puzzles.
How important is paper-reading?
Very important at DeepMind, Anthropic, Cohere and Wayve; less so at the banks. A reasonable cadence is two to three papers per week, with at least one you can discuss in depth in any given interview. Pick papers relevant to the team you are interviewing with.
Do they hire without PhDs?
Yes. DeepMind and Anthropic hire strong engineers without PhDs, particularly for engineering-leaning roles. The signal they look for is equivalent depth — typically shown through serious open-source work, published evaluations, or a track record of shipping production ML. PhDs remain more common in pure research roles.
What's the typical salary outcome?
In London and Cambridge in 2026, successful candidates typically take packages of £130,000–£250,000 for senior AI engineer roles, with frontier labs and staff-level positions reaching £300,000–£500,000+ once equity is included. Bank AI lab roles typically sit at £110,000–£180,000 with cash bonuses. Remote-friendly roles outside London typically pay 10–20% less.
Summary
UK AI engineer interviews in 2026 are demanding but predictable: five stages, a stable set of question patterns, and clear differences between frontier labs, scale-ups and bank AI labs. The candidates who do best treat preparation as a portfolio exercise — implementation fluency, system design reasoning, paper familiarity and honest behavioural stories — rather than grinding any single axis. Tailor your prep to the employer: research framing for DeepMind and Anthropic, production rigour for Cohere and Wayve, regulated-deployment thinking for the banks. Start practising on a shared whiteboard early; the format catches more people out than the content does.
Looking for your next AI engineering role? Browse current openings at artificialintelligencejobs.co.uk.