Data Scientist Interview Questions
Practice data scientist interview questions across statistics, probability, experimentation, causal inference, machine learning, model evaluation, SQL, Python, product analytics, and behavioral communication. Use this as a focused question list alongside the full Data Scientist Interview Guide.
23 questions
8 categories
Data Scientist
Updated May 2026
Statistics and Probability Questions
Statistics questions test whether you understand uncertainty, sampling, inference, distributions, and the assumptions behind conclusions. You do not need to recite formulas blindly; you need to reason correctly.
Framework — Association versus causal effect
Correlation means two variables move together statistically. Causation means changing one variable produces a change in the other, all else equal. Correlation alone does not prove causation because the relationship may be driven by confounding, reverse causality, selection bias, or coincidence. Example: users who receive more push notifications may have higher retention. That does not prove notifications cause retention. More engaged users may naturally receive more notifications or trigger more notification-worthy events. To estimate causality, we need a randomized experiment, natural experiment, instrumental variable, regression discontinuity, difference-in-differences, or a careful causal design with assumptions. A strong data scientist answer also explains the decision risk. If we treat correlation as causation, we might increase notifications and hurt users. The right next step is to design an experiment or causal analysis that isolates the effect of notifications.
Likely follow-ups
What are common sources of confounding?
How would you test whether notifications cause retention?
When is correlation still useful?
Framework — False positive versus false negative
A Type I error is a false positive: rejecting the null hypothesis when it is actually true. In an A/B test, that means shipping a feature because it appears to help when it does not. A Type II error is a false negative: failing to reject the null when a real effect exists. That means missing a feature that actually helps. The significance level alpha controls the Type I error rate. Power, which is 1 - beta, relates to Type II error. Increasing sample size generally improves power. There is a tradeoff: stricter significance thresholds reduce false positives but can increase false negatives if sample size is not adjusted. In product decisions, the acceptable error depends on cost. For a risky checkout change, false positives may be expensive. For a low-risk UI improvement, a slightly higher false positive risk may be acceptable if iteration speed matters.
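A quick power calculation makes the sample-size tradeoff concrete. The sketch below uses statsmodels for a two-proportion test; the baseline rate, target lift, and thresholds are illustrative assumptions, not values from the question.

```python
# Illustrative only: baseline and target conversion rates are assumed numbers.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10          # assumed control conversion rate
target = 0.11            # assumed treatment conversion rate (a 10% relative lift)

effect = proportion_effectsize(target, baseline)   # Cohen's h for two proportions
analysis = NormalIndPower()

n_per_group = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Approx. users needed per group at 80% power: {n_per_group:,.0f}")

# Holding alpha fixed, demanding fewer Type II errors (more power) raises the sample size.
n_high_power = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.95)
print(f"Approx. users needed per group at 95% power: {n_high_power:,.0f}")
```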
Likely follow-ups
Which error is worse for a medical diagnosis model?
How does sample size affect Type II error?
What happens if you run many tests at once?
Framework — Hypothesis -> probability -> uncertainty
The null hypothesis is that the coin is fair, with probability of heads equal to 0.5. Getting 8 or more heads in 10 flips is possible under a fair coin, so we should not immediately conclude bias. For a two-sided test, we would consider outcomes at least as extreme as 8 heads or 2 heads. The probability of 8, 9, or 10 heads is (45 + 10 + 1) / 1024 = 56 / 1024, about 5.5%. Doubling for the lower tail gives about 10.9%. That is not below a 5% significance threshold. The conclusion is that 10 flips is a small sample. The result is suggestive but not strong enough evidence to confidently say the coin is biased. We should collect more data if the decision matters.
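The arithmetic is easy to verify in code. This sketch uses scipy to reproduce the one-sided and two-sided probabilities described above.

```python
# Reproduce the coin-flip calculation: P(X >= 8) and the two-sided p-value.
from scipy.stats import binom, binomtest

n, k, p = 10, 8, 0.5
upper_tail = binom.sf(k - 1, n, p)                                # P(X >= 8) = 56/1024 ~ 0.055
two_sided = binomtest(k, n, p, alternative="two-sided").pvalue    # ~ 0.109

print(f"P(8 or more heads): {upper_tail:.4f}")
print(f"Two-sided p-value:  {two_sided:.4f}")
```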
Likely follow-ups
How many flips would you want?
What if it landed heads 80 times out of 100?
Would you use a one-sided or two-sided test?
Experimentation and Causal Inference
Experimentation questions evaluate whether you can design clean tests, interpret results, avoid false conclusions, and connect experimental evidence to product decisions.
Framework — Hypothesis -> unit -> metrics -> guardrails -> duration -> decision
First define the hypothesis. For example, the new ranking algorithm improves user satisfaction by showing more relevant items without hurting diversity, latency, or long-term retention. Randomization unit matters. If users see ranked content repeatedly, randomize at the user level so each user gets a consistent experience. If there are network effects or marketplace spillovers, simple user-level randomization may not be enough and we may need cluster randomization or careful holdouts. Primary metric should reflect the goal: meaningful engagement, conversion, successful sessions, downstream retention, or revenue depending on the product. Secondary metrics might include click-through rate, dwell time, saves, purchases, or hides. Guardrails should include latency, complaint rate, diversity, creator/seller fairness, unsubscribe or churn, and any quality metric that could be gamed. Run the test long enough to cover weekly seasonality and reach required sample size. Before shipping, check novelty effects, segment differences, guardrail health, and whether the metric lift is practically meaningful, not only statistically significant.
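When results come back, a simple two-proportion readout like the sketch below covers the statistical part; the counts are invented and the ship decision still depends on guardrails and practical significance.

```python
# A minimal readout sketch for a conversion-style primary metric; all counts are hypothetical.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

conversions = np.array([4_310, 4_150])    # treatment, control (invented numbers)
exposures = np.array([50_000, 50_000])

z_stat, p_value = proportions_ztest(conversions, exposures)
rates = conversions / exposures
ci_low, ci_high = proportion_confint(conversions, exposures, alpha=0.05)

print(f"Treatment rate: {rates[0]:.4f}, control rate: {rates[1]:.4f}")
print(f"Absolute lift: {rates[0] - rates[1]:.4f}, p-value: {p_value:.3f}")
print("95% CIs per arm:", list(zip(np.round(ci_low, 4), np.round(ci_high, 4))))
# A significant p-value alone is not a launch decision: check guardrails, segments,
# novelty effects, and whether the lift is practically meaningful.
```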
Likely follow-ups
What if CTR improves but retention drops?
How would you handle interference between users?
What segments would you inspect before launch?
Framework — Validate -> segment reliability -> business impact -> rollout strategy
First validate that the segment result is real. Check sample size, confidence intervals, pre-specified segments, multiple testing risk, and whether new users were properly classified. A noisy subgroup should not override a strong overall result without evidence. If the new-user decline is reliable, understand the mechanism. The feature may benefit experienced users who understand the product but confuse new users. New users often need simpler onboarding, more explanation, or a different default. Recommendation depends on magnitude and strategic importance. If new users are critical to growth, I would not ship globally. I might launch only to existing users, create a new-user-specific variant, or run a follow-up experiment with onboarding changes. If the negative effect is small and short-lived while long-term retention improves, I would investigate further before blocking. The answer should show that you can balance aggregate metrics with heterogeneous treatment effects and business context.
Likely follow-ups
How do you avoid false discoveries in segment analysis?
What if the new-user segment is small?
How would you design the follow-up test?
Framework — Non-randomized treatment with comparable control trend
Difference-in-differences is useful when randomization is not feasible, such as a policy change, regional rollout, pricing change, or operational change that affects one group but not another. It compares the change over time in the treated group to the change over time in a control group. The key assumption is parallel trends: without treatment, the treated and control groups would have moved similarly. We should inspect pre-treatment trends to see whether this assumption is plausible. If the treated group was already trending differently, the estimate may be biased. Example: a feature launches in Canada but not the U.S. We compare Canadian retention before and after launch to U.S. retention before and after the same period. The difference in changes estimates the treatment effect if the control group captures seasonality and external factors. I would communicate the result more cautiously than a randomized experiment because causal validity depends on assumptions that cannot be fully proven.
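In practice the estimate usually comes from a regression with a treatment-by-post interaction. A minimal statsmodels sketch, assuming a tidy panel with hypothetical column names, is below.

```python
# A minimal difference-in-differences sketch; the file and column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

# df has one row per unit-period with columns:
#   outcome  - e.g. a retention rate or per-user metric
#   treated  - 1 for the treated group (Canada), 0 for the control (U.S.)
#   post     - 1 for periods after the launch, 0 before
df = pd.read_csv("did_panel.csv")   # hypothetical panel file

model = smf.ols("outcome ~ treated * post", data=df).fit(cov_type="HC1")
print(model.summary())
print("DiD estimate (treated:post):", round(model.params["treated:post"], 4))

# Before trusting the estimate, plot pre-period trends for both groups
# to check that the parallel trends assumption is plausible.
```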
Likely follow-ups
How do you test the parallel trends assumption?
What could violate difference-in-differences?
How would you choose a control group?
Machine Learning Interview Questions
Machine learning interviews test whether you can frame a prediction problem, choose reasonable baselines, engineer features, evaluate models, and understand why a model may fail in production.
Framework — Define label -> build features -> baseline -> evaluate -> deploy carefully
First define churn. For a subscription product, churn may mean cancellation, payment failure, or no renewal within a time window. The prediction time should be before the churn event, such as predicting whether an active user will churn in the next 30 days. Create a training dataset at a consistent snapshot date. Features could include usage frequency, recency, feature adoption, support tickets, billing issues, plan type, tenure, engagement trend, seat utilization, and prior downgrades. Be careful to avoid leakage: do not include features that occur after the prediction timestamp or directly encode the churn outcome. Start with a simple baseline like logistic regression or gradient boosted trees depending on interpretability and performance needs. Evaluate with AUC, precision/recall, calibration, lift at top deciles, and business impact of interventions. Accuracy alone is often misleading if churn is rare. Deployment requires actionability. A churn score is useful only if the company can intervene. Monitor model drift, fairness across segments, intervention effectiveness, and whether the model identifies users who can actually be saved.
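A minimal baseline sketch with scikit-learn is below; the feature columns, label name, and file path are placeholders, and a time-based split may be more faithful to the snapshot setup than the random split shown.

```python
# A hedged churn baseline sketch; column names and the data file are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

df = pd.read_parquet("churn_snapshot.parquet")   # hypothetical snapshot-dated training table
features = ["sessions_30d", "days_since_last_use", "support_tickets_90d", "tenure_months"]
X, y = df[features], df["churned_next_30d"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

baseline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
baseline.fit(X_train, y_train)

scores = baseline.predict_proba(X_test)[:, 1]
print("AUC:", round(roc_auc_score(y_test, scores), 3))
print("Average precision:", round(average_precision_score(y_test, scores), 3))
# Next steps: compare against gradient boosted trees, check calibration,
# lift by decile, and whether flagged users are actually saveable.
```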
Likely follow-ups
What is label leakage in this problem?
Would you optimize precision or recall?
How would you prove the model creates business value?
Framework — Training fit versus generalization
Overfitting happens when a model learns noise or idiosyncrasies in the training data instead of patterns that generalize. It performs well on training data but poorly on unseen data. Ways to reduce overfitting include using train/validation/test splits, cross-validation, regularization, simpler models, pruning trees, early stopping, dropout for neural networks, more data, feature selection, and proper hyperparameter tuning. Data leakage can look like excellent performance but fail in production, so leakage checks are also essential. The right prevention depends on the model and problem. For a high-dimensional sparse model, regularization may help. For gradient boosted trees, depth, learning rate, number of estimators, and early stopping matter. For time series or user behavior data, validation must respect time order to avoid training on the future.
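For time-ordered data, the key point is that every validation fold must come from later periods than its training data. A small scikit-learn sketch with placeholder data:

```python
# Time-respecting validation guards against training on the future; data here is a placeholder.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

# X, y are assumed sorted by event time, oldest rows first.
rng = np.random.default_rng(0)
X = rng.random((5_000, 10))                        # placeholder features
y = (rng.random(5_000) > 0.7).astype(int)          # placeholder labels

model = GradientBoostingClassifier(max_depth=3, learning_rate=0.05, n_estimators=300)
cv = TimeSeriesSplit(n_splits=5)                   # each fold trains on the past, validates on the future
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Per-fold AUC:", np.round(scores, 3))
# A large gap between training fit and these out-of-time scores is a sign of overfitting.
```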
Likely follow-ups
How do you detect overfitting?
What is the difference between validation and test sets?
Can a model underfit and overfit at the same time?
Framework — Interpretability, nonlinearity, data size, performance, deployment
Logistic regression is a strong baseline for binary classification. It is fast, interpretable, easier to calibrate, and works well when relationships are roughly linear after feature engineering. It is often a good choice when stakeholders need clear explanations or when data is limited. Random forests can capture nonlinear relationships and feature interactions without as much manual specification. They may perform better on complex tabular data but are less interpretable, can be larger to serve, and may not extrapolate well outside the training distribution. I would compare them using the same train/validation split, proper metrics, calibration, inference cost, and business constraints. If logistic regression performs nearly as well and interpretability matters, choose it. If random forest provides a meaningful lift and can be explained and deployed responsibly, use it or compare with gradient boosted trees. The best answer is not that one model is always better. It depends on objective, data, constraints, and actionability.
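A like-for-like comparison can be as simple as the sketch below, run on the same split with the same metrics; the data is synthetic, so the numbers are only illustrative.

```python
# Compare the two models on an identical split with identical metrics; synthetic data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss

X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, min_samples_leaf=20, n_jobs=-1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_test, p):.3f}  Brier={brier_score_loss(y_test, p):.4f}")
# The metric gap is only one input; interpretability, calibration, and serving cost matter too.
```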
Likely follow-ups
How would you explain random forest predictions?
When does interpretability matter more than accuracy?
Why might gradient boosting outperform random forests?
Framework — Data mismatch -> leakage -> metric mismatch -> drift -> implementation -> feedback loops
Several failure modes are possible. The offline dataset may not match production traffic. There may be training-serving skew where features are computed differently online than offline. The validation split may have leaked future information or failed to respect time. The offline metric may not match the product objective. Production data may drift: user behavior, seasonality, acquisition channels, inventory, pricing, or external events can change. The model may also create feedback loops. For example, a recommendation model changes what users see, which changes future training data. Implementation issues are common: missing features, default values, latency timeouts, feature freshness problems, incorrect thresholding, or model version mismatch. Segment performance may also be poor even if the aggregate offline metric looked good. I would compare offline and online feature distributions, prediction distributions, calibration, segment metrics, logs, and business outcomes. Then decide whether to roll back, adjust thresholds, fix feature pipelines, retrain, or redesign the objective.
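A lightweight drift check is to compare offline training feature distributions against recent production features, column by column. The sketch below uses a two-sample KS test; the file names and flagging threshold are assumptions.

```python
# Feature drift / training-serving skew check; file names and the 0.1 threshold are assumptions.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_parquet("training_features.parquet")          # hypothetical offline snapshot
live = pd.read_parquet("serving_features_last_7d.parquet")    # hypothetical production log

for col in train.select_dtypes("number").columns:
    if col not in live.columns:
        print(f"{col:30s} MISSING ONLINE")
        continue
    stat, p = ks_2samp(train[col].dropna(), live[col].dropna())
    flag = "DRIFT?" if stat > 0.1 else ""
    print(f"{col:30s} KS={stat:.3f} p={p:.1e} {flag}")
# Also compare prediction score distributions, calibration, and segment metrics online vs offline.
```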
Likely follow-ups
What is training-serving skew?
How would you monitor model drift?
How do feedback loops affect recommender systems?
Model Evaluation and Metrics
Model evaluation questions test whether you can choose metrics that match the business problem. A model can have impressive accuracy and still be useless if the metric is wrong.
Framework — Cost of false positives versus false negatives
Optimize precision when false positives are expensive. For example, if a fraud model blocks legitimate customers, false positives create customer harm and revenue loss. High precision means that when the model flags something, it is usually correct. Optimize recall when false negatives are expensive. For example, in medical screening or severe fraud detection, missing a true positive can be more costly than investigating extra false positives. High recall means the model catches most actual positives. Most real systems require a tradeoff. The threshold should be chosen based on business costs, operational capacity, user harm, and downstream workflow. I would usually evaluate precision-recall curves, not just a single threshold, especially for imbalanced classes.
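One way to make the tradeoff explicit is to price both error types and sweep thresholds. The sketch below uses synthetic scores and assumed costs purely for illustration.

```python
# Pick a threshold by minimizing total business cost; scores, prevalence, and costs are assumptions.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=20_000)                       # rare positive class
y_score = np.clip(rng.normal(0.10 + 0.50 * y_true, 0.20), 0, 1)   # imperfect model scores

cost_fp = 5.0      # assumed cost of acting on a false positive
cost_fn = 100.0    # assumed cost of missing a true positive

thresholds = np.linspace(0.05, 0.95, 91)
costs = []
for t in thresholds:
    pred = (y_score >= t).astype(int)
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    costs.append(cost_fp * fp + cost_fn * fn)

best = thresholds[int(np.argmin(costs))]
pred_best = (y_score >= best).astype(int)
print(f"Cost-minimizing threshold: {best:.2f}")
print(f"Precision: {precision_score(y_true, pred_best):.2f}, "
      f"Recall: {recall_score(y_true, pred_best):.2f}")
```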
Likely follow-ups
Why can accuracy be misleading for rare events?
How would you choose a threshold?
What metric would you use for fraud detection?
Framework — Probability quality, not just ranking quality
Calibration means predicted probabilities match observed frequencies. If a calibrated model assigns 0.8 probability to 1,000 examples, about 800 should actually be positive. Calibration matters when probabilities drive decisions: risk scoring, pricing, medical triage, fraud review queues, churn interventions, or expected value calculations. A model can rank examples well with high AUC but still produce poorly calibrated probabilities. We can inspect calibration curves or reliability diagrams and metrics like Brier score. Calibration methods include Platt scaling, isotonic regression, and temperature scaling. However, calibration should be checked on validation data that reflects production distribution. In an interview, emphasize that not every application needs perfectly calibrated probabilities. If the model only ranks content, ranking metrics may matter more. If the number is interpreted as risk, calibration becomes critical.
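A quick way to demonstrate the idea is to compare raw and recalibrated probabilities on held-out data. The sketch below applies isotonic regression via scikit-learn to synthetic data, so the exact numbers are illustrative.

```python
# Check calibration (reliability) and apply isotonic recalibration; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

X, y = make_classification(n_samples=30_000, n_features=15, weights=[0.85], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

raw = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=1), method="isotonic", cv=3
).fit(X_train, y_train)

for name, model in [("raw", raw), ("isotonic", calibrated)]:
    p = model.predict_proba(X_test)[:, 1]
    frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10)
    print(f"{name}: AUC={roc_auc_score(y_test, p):.3f}  Brier={brier_score_loss(y_test, p):.4f}")
    # Reliability diagram data: predicted probability per bin vs observed positive rate.
    print("  (predicted, observed):",
          [(round(mp, 2), round(fp, 2)) for mp, fp in zip(mean_pred, frac_pos)])
```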
Likely follow-ups
Can a model have high AUC and poor calibration?
How would you improve calibration?
When does calibration matter for business decisions?
SQL and Python Questions
Data scientist interviews often include SQL and Python because strong modeling work still depends on accurate data extraction, transformation, validation, and exploratory analysis.
Framework — Cohort month -> activity month -> month offset -> aggregate
Create a cohort CTE with each user and their signup month. Create an activity CTE with distinct user_id and activity month for qualifying active events. Join activity to cohort by user_id, then calculate month_number as the difference between activity month and signup month. Group by cohort month and month_number, counting distinct active users. The denominator is the number of users in the original cohort. The numerator for month N is users from that cohort active in month N. Use a left join if you need to preserve months with zero retained users. Important details: define the active event, exclude test users, handle users who signed up near month boundaries, use a consistent timezone, and avoid counting multiple activity events per user-month. The result should be a cohort table where each row is a signup month and each column is the month offset, holding that cohort's retention rate for month N.
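If the interviewer allows Python instead of SQL, the same logic translates to pandas. A sketch with assumed table and column names:

```python
# A pandas sketch of the cohort retention logic; files and column names are assumptions.
import pandas as pd

users = pd.read_csv("users.csv", parse_dates=["signup_date"])      # user_id, signup_date
events = pd.read_csv("events.csv", parse_dates=["event_date"])     # user_id, event_date

users["cohort_month"] = users["signup_date"].dt.to_period("M")
events["activity_month"] = events["event_date"].dt.to_period("M")
activity = events[["user_id", "activity_month"]].drop_duplicates()  # one row per user-month

joined = activity.merge(users[["user_id", "cohort_month"]], on="user_id")
joined["month_number"] = (joined["activity_month"] - joined["cohort_month"]).apply(lambda d: d.n)

cohort_size = users.groupby("cohort_month")["user_id"].nunique()
active = joined.groupby(["cohort_month", "month_number"])["user_id"].nunique()
retention = active.div(cohort_size, level="cohort_month").unstack("month_number")
print(retention.round(3))   # rows: signup cohorts, columns: months since signup
```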
Likely follow-ups
How would you calculate rolling retention instead?
How would you visualize cohort retention?
What if activity data arrives late?
Framework — Quantify -> segment -> diagnose -> decide treatment
I would first quantify missingness by column: count, percentage, and data type. In pandas, df.isna().sum() and df.isna().mean() give a quick profile. Then I would inspect whether missingness is concentrated by time, segment, source, device, geography, or target label. The key question is why values are missing. Missing completely at random is different from missing because a user skipped a field, tracking failed, a device does not support an event, or a value is not applicable. Treatment depends on cause and model needs. Options include keeping missing as its own category, imputing with median or mode, using model-based imputation, excluding rows, or fixing upstream data collection. For modeling, I would fit imputation only on training data and apply it to validation/test to avoid leakage. I would also evaluate whether missingness itself is predictive. For example, missing income in a credit dataset or missing profile fields in a consumer product can carry signal, but using it may raise fairness or compliance concerns.
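A sketch of the profiling step plus leakage-safe imputation inside a scikit-learn pipeline is below; column names are placeholders and the features are assumed to be numeric.

```python
# Profile missingness, then impute inside a pipeline so statistics come from training data only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("dataset.csv")    # hypothetical file with numeric features and a "target" column
profile = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": df.isna().mean().round(3),
    "dtype": df.dtypes,
}).sort_values("missing_pct", ascending=False)
print(profile.head(15))

# Missingness itself can carry signal, so keep indicator columns alongside imputed values.
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)         # imputer medians are learned from the training split only
print("Held-out accuracy:", round(pipe.score(X_test, y_test), 3))
```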
Likely follow-ups
When would missingness be informative?
How do you avoid leakage during imputation?
What would you do if the target label has missing values?
Framework — Detect -> diagnose -> decide by business meaning
Outlier detection methods include summary statistics, histograms, box plots, z-scores, IQR rules, percentile thresholds, and model-based approaches. But detection is only the first step. The important question is whether the outlier is an error, a rare but valid case, or the most important part of the distribution. I would inspect outliers by source, timestamp, segment, and raw records. A negative age is likely a data error. A very large enterprise purchase may be valid and should not be removed from revenue analysis without reason. For modeling, outliers may require transformation, winsorization, robust models, or segment-specific treatment. Remove outliers only when there is a defensible reason: impossible value, duplicate event, instrumentation bug, test account, or records outside analysis scope. If valid outliers affect the conclusion, report sensitivity with and without them.
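A short sketch of the IQR flagging rule and a winsorized sensitivity check, on synthetic skewed data; the 1.5x multiplier and the 1st/99th percentile caps are conventional choices rather than fixed rules.

```python
# Flag outliers with the IQR rule, then report the metric with and without winsorization.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=3.5, sigma=0.6, size=10_000))  # skewed, like order values

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(f"Flagged {len(outliers)} rows ({len(outliers) / len(values):.2%}) for inspection")

# Winsorize at the 1st/99th percentiles as a sensitivity check, not as automatic removal.
winsorized = values.clip(lower=values.quantile(0.01), upper=values.quantile(0.99))
print(f"Mean raw: {values.mean():.1f}  Mean winsorized: {winsorized.mean():.1f}")
```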
Likely follow-ups
What is winsorization?
How can outliers affect linear regression?
When are outliers the signal?
Product Analytics and Business Case Questions
Many data scientist roles are embedded with product teams. These interviews test whether you can translate product ambiguity into metrics, analysis plans, experiments, and decisions.
Framework — Metric quality -> funnel diagnosis -> segment -> mechanism -> action
Clicks may be a weak proxy for value. First verify the result and define the funnel: impressions, clicks, add-to-cart, checkout, purchase, refunds, and repeat behavior. If clicks increased but purchases decreased, the recommendations may be attracting curiosity clicks without purchase intent or distracting users from better paths. Segment by user type, traffic source, product category, device, and recommendation surface. New users might click more because the module is prominent but find irrelevant items. Returning users might be disrupted from their normal purchase flow. Inspect recommendation quality: relevance, price mismatch, availability, delivery time, diversity, and whether recommended items are out of stock or low margin. Also check latency and page layout changes. Recommendation: do not ship based on clicks alone. Either roll back, limit exposure to segments where purchases are healthy, or redesign the objective to optimize downstream purchase quality instead of click-through rate.
Likely follow-ups
What primary metric would you choose?
How would you measure recommendation quality?
What if revenue increased but purchases decreased?
Framework — Liquidity -> quality -> balance -> retention -> unit economics
Marketplace health depends on both sides. The core concept is liquidity: can demand find suitable supply quickly, and can supply find enough demand to stay engaged? Metrics depend on marketplace type. For rideshare: match rate, time to match, ETA, cancellation rate, driver utilization, rider repeat rate, price surge frequency, and geographic coverage. For freelance marketplaces: percentage of jobs receiving qualified bids, time to first bid, hire rate, project completion, dispute rate, repeat hiring, and provider utilization. Segment by geography, category, time of day, user cohort, supply tier, and demand intent. Marketplace averages hide local imbalance. A marketplace can look healthy overall while failing in a specific city or category. Guardrails include trust and safety, fraud, quality complaints, refunds, churn on either side, and unit economics. Recommendations should identify whether the constrained side is supply or demand because growth levers differ completely.
Likely follow-ups
How would you solve a cold-start problem?
How do you know which side is constrained?
What metric would you show leadership weekly?
Framework — User value -> repeated behavior -> learning outcome -> guardrails
A North Star metric should capture durable user value, not just activity. For a language learning app, daily sessions alone may be too shallow because users can open the app without learning. A better candidate might be weekly active learners who complete a meaningful lesson with sufficient accuracy, or weekly learning minutes that meet quality criteria. I would consider the core value: helping users make progress in a language. Input metrics could include lesson starts, lesson completions, streaks, accuracy, review completion, speaking practice, and level progression. Outcome metrics could include retention, subscription conversion, placement improvement, or external proficiency assessments if available. Guardrails: burnout, low-quality rapid completions, cheating, notification opt-outs, churn, and user frustration. If the metric only rewards more time spent, it may encourage grind rather than learning. I would define the North Star with product and learning science stakeholders, then validate whether it predicts retention and user-reported progress.
Likely follow-ups
Why not use DAU as the North Star?
How would you prevent gaming the metric?
How would this differ for casual versus serious learners?
Modeling Case Studies
Modeling case interviews test the full data science workflow: problem framing, labels, features, baseline, evaluation, deployment, monitoring, and business value.
Framework — Objective -> labels -> features -> metrics -> intervention -> monitoring
First define fraud and the action. Are we blocking transactions, sending them to manual review, requiring step-up authentication, or scoring risk? The model objective should match the intervention because false positives can hurt legitimate customers. Labels may come from chargebacks, confirmed fraud investigations, user reports, or rule-based flags. Labels are delayed and imperfect, so account for label latency and noise. Features might include transaction amount, merchant, device, IP/geography mismatch, account age, velocity, payment method history, failed attempts, shipping distance, and prior disputes. Start with rules and a simple baseline, then compare models such as logistic regression, gradient boosted trees, or anomaly detection depending on label quality. Evaluation should emphasize precision-recall, recall at a fixed review capacity, false positive rate for legitimate users, dollar-weighted fraud caught, and calibration. Deployment requires monitoring drift, adversarial adaptation, fairness, latency, manual review capacity, feedback loops, and rollback. A fraud model is not just a prediction problem; it is an operational system.
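Recall at a fixed review capacity is worth being able to compute on the spot. The sketch below uses synthetic scores and an assumed review budget to show count-based and dollar-weighted versions.

```python
# Recall and precision at a fixed review budget; scores, rates, and capacity are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 100_000
is_fraud = rng.binomial(1, 0.002, size=n)                          # rare positives
amount = rng.lognormal(4, 1, size=n)                               # transaction dollars
score = np.clip(rng.normal(0.05 + 0.60 * is_fraud, 0.15), 0, 1)    # imperfect model scores

review_capacity = 500                                              # assumed daily review budget
txns = pd.DataFrame({"fraud": is_fraud, "amount": amount, "score": score})
top_k = txns.nlargest(review_capacity, "score")

recall_at_k = top_k["fraud"].sum() / is_fraud.sum()
dollar_recall = top_k.loc[top_k["fraud"] == 1, "amount"].sum() / amount[is_fraud == 1].sum()
precision_at_k = top_k["fraud"].mean()
print(f"recall@{review_capacity}: {recall_at_k:.1%}, "
      f"dollar-weighted recall: {dollar_recall:.1%}, "
      f"precision@{review_capacity}: {precision_at_k:.1%}")
```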
Likely follow-ups
How would you handle delayed labels?
What is the cost of false positives?
How would fraudsters adapt?
Framework — Target definition -> feature availability -> metric -> calibration -> user impact
Define the target as actual delivery duration from order confirmation to arrival, or break it into components: preparation time, courier assignment, pickup time, travel time, and handoff. Component modeling can be more interpretable and operationally useful. Features available at prediction time may include restaurant, cuisine, time of day, day of week, weather, distance, courier supply, current backlog, historical prep times, traffic, order size, and geographic zone. Avoid using features not known at prediction time, such as actual pickup time if predicting at order placement. Metrics should reflect user experience. MAE is easy to interpret, but underestimation may be worse than overestimation. Calibration matters: if we promise a delivery window, the order should arrive within that window at the expected rate. Deployment concerns include latency, real-time feature freshness, cold-start restaurants, holidays, weather shocks, and feedback loops from quoted ETAs influencing user decisions. Monitor error by restaurant, zone, time, and customer segment.
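If underestimation is costlier, one option is to quote an upper quantile rather than a point estimate. The sketch below compares a squared-error model with a quantile model in scikit-learn; the 0.8 quantile and the synthetic data are assumptions.

```python
# Quote an upper quantile so most orders arrive within the promised window; data is synthetic.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=20_000, n_features=8, noise=10.0, random_state=3)
y = np.abs(y) / 10 + 20                            # stand-in "delivery minutes", always positive
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

point_model = GradientBoostingRegressor(loss="squared_error").fit(X_train, y_train)
quoted_model = GradientBoostingRegressor(loss="quantile", alpha=0.8).fit(X_train, y_train)

for name, model in [("point estimate", point_model), ("quoted ETA (p80)", quoted_model)]:
    pred = model.predict(X_test)
    late_rate = np.mean(pred < y_test)             # share of orders arriving after the quote
    mae = np.mean(np.abs(pred - y_test))
    print(f"{name}: MAE={mae:.1f} min, late rate={late_rate:.1%}")
# The p80 quote trades a larger MAE for far fewer late deliveries, which may match user experience better.
```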
Likely follow-ups
Would you model total time or components?
How would you handle new restaurants?
What is worse: overestimating or underestimating ETA?
Behavioral and Communication Questions
Behavioral data science interviews focus on ambiguity, influence, technical communication, project impact, ethical judgment, and cases where the data did not support the preferred narrative.
Framework — Problem -> method -> decision -> impact -> lesson
Choose a project where your work changed a decision, product, process, or metric. Start with the business problem and why it mattered. Then explain your method at the right level: data sources, analysis or model, validation, and tradeoffs. The strongest answer connects technical work to business action. For example, a churn model helped customer success prioritize outreach, an experiment changed a launch decision, or a pricing analysis improved revenue without hurting conversion. Quantify impact if possible: revenue lift, churn reduction, time saved, cost reduction, better targeting, or improved decision speed. Also explain limitations and what you would improve next. Avoid making the answer a tool walkthrough. The interviewer cares less about which library you used and more about whether your work was correct, trusted, adopted, and useful.
Likely follow-ups
How did you measure impact?
What was the hardest technical challenge?
How did you get stakeholder buy-in?
Framework — Decision -> intuition -> evidence -> limitations -> action
I start with the decision the stakeholder needs to make, not the model architecture. Then I explain the model intuition in plain language: what it predicts, what signals it uses, how accurate it is, and where it should not be trusted. I use examples, feature importance or reason codes when appropriate, calibration plots, and simple tradeoff charts like precision versus recall. I avoid pretending the model is magic. Stakeholders need to understand limitations: data quality, uncertainty, bias, drift, and edge cases. Finally, I connect the model to action. For example, “This model identifies the top 10% of accounts most likely to churn, but the value comes from testing whether outreach to that group improves retention.” This keeps the conversation focused on decisions and outcomes.
Likely follow-ups
What if stakeholders only care about accuracy?
How do you communicate uncertainty?
When would you avoid using a complex model?
Framework — Expectation -> evidence -> communication -> decision -> outcome
Pick a real example where evidence challenged a preferred plan. Explain what stakeholders expected, what data you analyzed, and why the conclusion differed. Be specific about validation because credibility matters in these situations. Then describe how you communicated the finding. Strong answers show diplomacy: acknowledge the stakeholder goal, present evidence clearly, explain uncertainty, and offer alternatives. The goal is not to win an argument; it is to help the team make a better decision. Close with the outcome. Maybe the team delayed launch, changed targeting, ran a smaller test, or chose a different metric. If stakeholders ignored the recommendation, explain what happened and what you learned about influence and communication.
Likely follow-ups
How did you preserve the relationship?
What if leadership disagreed?
How did you make the analysis more trustworthy?
Practice these answers live
Interview Pilot gives you real-time Copilot answer suggestions during live interviews, so you can respond clearly when these questions come up.
