Data Scientist Interview Questions
Practice data scientist interview questions across statistics, probability, experimentation, causal inference, machine learning, model evaluation, SQL, Python, product analytics, and behavioral communication. Use this as a focused question list alongside the full Data Scientist Interview Guide.
23 questions
8 categories
Data Scientist
Updated May 2026
Statistics and Probability Questions
Statistics questions test whether you understand uncertainty, sampling, inference, distributions, and the assumptions behind conclusions. You do not need to recite formulas blindly; you need to reason correctly.
Framework — Association versus causal effect
Correlation means two variables move together statistically. Causation means changing one variable produces a change in the other, all else equal. Correlation alone does not prove causation because the relationship may be driven by confounding, reverse causality, selection bias, or coincidence. Example: users who receive more push notifications may have higher retention. That does not prove notifications cause retention. More engaged users may naturally receive more notifications or trigger more notification-worthy events. To estimate causality, we need a randomized experiment, natural experiment, instrumental variable, regression discontinuity, difference-in-differences, or a careful causal design with assumptions. A strong data scientist answer also explains the decision risk. If we treat correlation as causation, we might increase notifications and hurt users. The right next step is to design an experiment or causal analysis that isolates the effect of notifications.
Likely follow-ups
What are common sources of confounding?
How would you test whether notifications cause retention?
When is correlation still useful?
Framework — False positive versus false negative
A Type I error is a false positive: rejecting the null hypothesis when it is actually true. In an A/B test, that means shipping a feature because it appears to help when it does not. A Type II error is a false negative: failing to reject the null when a real effect exists. That means missing a feature that actually helps. The significance level alpha controls the Type I error rate. Power, which is 1 - beta, relates to Type II error. Increasing sample size generally improves power. There is a tradeoff: stricter significance thresholds reduce false positives but can increase false negatives if sample size is not adjusted. In product decisions, the acceptable error depends on cost. For a risky checkout change, false positives may be expensive. For a low-risk UI improvement, a slightly higher false positive risk may be acceptable if iteration speed matters.
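A quick power calculation makes the sample-size tradeoff concrete. The sketch below uses statsmodels for a two-proportion test; the baseline rate, target lift, and thresholds are illustrative assumptions, not values from the question.

```python
# Illustrative only: baseline and target conversion rates are assumed numbers.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10          # assumed control conversion rate
target = 0.11            # assumed treatment conversion rate (a 10% relative lift)

effect = proportion_effectsize(target, baseline)   # Cohen's h for two proportions
analysis = NormalIndPower()

n_per_group = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Approx. users needed per group at 80% power: {n_per_group:,.0f}")

# Holding alpha fixed, demanding fewer Type II errors (more power) raises the sample size.
n_high_power = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.95)
print(f"Approx. users needed per group at 95% power: {n_high_power:,.0f}")
```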
Likely follow-ups
Which error is worse for a medical diagnosis model?
How does sample size affect Type II error?
What happens if you run many tests at once?
Framework — Hypothesis -> probability -> uncertainty
The null hypothesis is that the coin is fair, with probability of heads equal to 0.5. Getting 8 or more heads in 10 flips is possible under a fair coin, so we should not immediately conclude bias. For a two-sided test, we would consider outcomes at least as extreme as 8 heads or 2 heads. The probability of 8, 9, or 10 heads is (45 + 10 + 1) / 1024 = 56 / 1024, about 5.5%. Doubling for the lower tail gives about 10.9%. That is not below a 5% significance threshold. The conclusion is that 10 flips is a small sample. The result is suggestive but not strong enough evidence to confidently say the coin is biased. We should collect more data if the decision matters.
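The arithmetic is easy to verify in code. This sketch uses scipy to reproduce the one-sided and two-sided probabilities described above.

```python
# Reproduce the coin-flip calculation: P(X >= 8) and the two-sided p-value.
from scipy.stats import binom, binomtest

n, k, p = 10, 8, 0.5
upper_tail = binom.sf(k - 1, n, p)                                # P(X >= 8) = 56/1024 ~ 0.055
two_sided = binomtest(k, n, p, alternative="two-sided").pvalue    # ~ 0.109

print(f"P(8 or more heads): {upper_tail:.4f}")
print(f"Two-sided p-value:  {two_sided:.4f}")
```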
Likely follow-ups
How many flips would you want?
What if it landed heads 80 times out of 100?
Would you use a one-sided or two-sided test?
Experimentation and Causal Inference
Experimentation questions evaluate whether you can design clean tests, interpret results, avoid false conclusions, and connect experimental evidence to product decisions.
Framework — Hypothesis -> unit -> metrics -> guardrails -> duration -> decision
First define the hypothesis. For example, the new ranking algorithm improves user satisfaction by showing more relevant items without hurting diversity, latency, or long-term retention. Randomization unit matters. If users see ranked content repeatedly, randomize at the user level so each user gets a consistent experience. If there are network effects or marketplace spillovers, simple user-level randomization may not be enough and we may need cluster randomization or careful holdouts. Primary metric should reflect the goal: meaningful engagement, conversion, successful sessions, downstream retention, or revenue depending on the product. Secondary metrics might include click-through rate, dwell time, saves, purchases, or hides. Guardrails should include latency, complaint rate, diversity, creator/seller fairness, unsubscribe or churn, and any quality metric that could be gamed. Run the test long enough to cover weekly seasonality and reach required sample size. Before shipping, check novelty effects, segment differences, guardrail health, and whether the metric lift is practically meaningful, not only statistically significant.
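When results come back, a simple two-proportion readout like the sketch below covers the statistical part; the counts are invented and the ship decision still depends on guardrails and practical significance.

```python
# A minimal readout sketch for a conversion-style primary metric; all counts are hypothetical.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

conversions = np.array([4_310, 4_150])    # treatment, control (invented numbers)
exposures = np.array([50_000, 50_000])

z_stat, p_value = proportions_ztest(conversions, exposures)
rates = conversions / exposures
ci_low, ci_high = proportion_confint(conversions, exposures, alpha=0.05)

print(f"Treatment rate: {rates[0]:.4f}, control rate: {rates[1]:.4f}")
print(f"Absolute lift: {rates[0] - rates[1]:.4f}, p-value: {p_value:.3f}")
print("95% CIs per arm:", list(zip(np.round(ci_low, 4), np.round(ci_high, 4))))
# A significant p-value alone is not a launch decision: check guardrails, segments,
# novelty effects, and whether the lift is practically meaningful.
```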
Likely follow-ups
What if CTR improves but retention drops?
How would you handle interference between users?
What segments would you inspect before launch?
Framework — Validate -> segment reliability -> business impact -> rollout strategy
First validate that the segment result is real. Check sample size, confidence intervals, pre-specified segments, multiple testing risk, and whether new users were properly classified. A noisy subgroup should not override a strong overall result without evidence. If the new-user decline is reliable, understand the mechanism. The feature may benefit experienced users who understand the product but confuse new users. New users often need simpler onboarding, more explanation, or a different default. Recommendation depends on magnitude and strategic importance. If new users are critical to growth, I would not ship globally. I might launch only to existing users, create a new-user-specific variant, or run a follow-up experiment with onboarding changes. If the negative effect is small and short-lived while long-term retention improves, I would investigate further before blocking. The answer should show that you can balance aggregate metrics with heterogeneous treatment effects and business context.
Likely follow-ups
How do you avoid false discoveries in segment analysis?
What if the new-user segment is small?
How would you design the follow-up test?
Framework — Non-randomized treatment with comparable control trend
Difference-in-differences is useful when randomization is not feasible, such as a policy change, regional rollout, pricing change, or operational change that affects one group but not another. It compares the change over time in the treated group to the change over time in a control group. The key assumption is parallel trends: without treatment, the treated and control groups would have moved similarly. We should inspect pre-treatment trends to see whether this assumption is plausible. If the treated group was already trending differently, the estimate may be biased. Example: a feature launches in Canada but not the U.S. We compare Canadian retention before and after launch to U.S. retention before and after the same period. The difference in changes estimates the treatment effect if the control group captures seasonality and external factors. I would communicate the result more cautiously than a randomized experiment because causal validity depends on assumptions that cannot be fully proven.
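In practice the estimate usually comes from a regression with a treatment-by-post interaction. A minimal statsmodels sketch, assuming a tidy panel with hypothetical column names, is below.

```python
# A minimal difference-in-differences sketch; the file and column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

# df has one row per unit-period with columns:
#   outcome  - e.g. a retention rate or per-user metric
#   treated  - 1 for the treated group (Canada), 0 for the control (U.S.)
#   post     - 1 for periods after the launch, 0 before
df = pd.read_csv("did_panel.csv")   # hypothetical panel file

model = smf.ols("outcome ~ treated * post", data=df).fit(cov_type="HC1")
print(model.summary())
print("DiD estimate (treated:post):", round(model.params["treated:post"], 4))

# Before trusting the estimate, plot pre-period trends for both groups
# to check that the parallel trends assumption is plausible.
```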
Likely follow-ups
How do you test the parallel trends assumption?
What could violate difference-in-differences?
How would you choose a control group?
Machine Learning Interview Questions
Machine learning interviews test whether you can frame a prediction problem, choose reasonable baselines, engineer features, evaluate models, and understand why a model may fail in production.
Framework — Define label -> build features -> baseline -> evaluate -> deploy carefully
First define churn. For a subscription product, churn may mean cancellation, payment failure, or no renewal within a time window. The prediction time should be before the churn event, such as predicting whether an active user will churn in the next 30 days. Create a training dataset at a consistent snapshot date. Features could include usage frequency, recency, feature adoption, support tickets, billing issues, plan type, tenure, engagement trend, seat utilization, and prior downgrades. Be careful to avoid leakage: do not include features that occur after the prediction timestamp or directly encode the churn outcome. Start with a simple baseline like logistic regression or gradient boosted trees depending on interpretability and performance needs. Evaluate with AUC, precision/recall, calibration, lift at top deciles, and business impact of interventions. Accuracy alone is often misleading if churn is rare. Deployment requires actionability. A churn score is useful only if the company can intervene. Monitor model drift, fairness across segments, intervention effectiveness, and whether the model identifies users who can actually be saved.
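A minimal baseline sketch with scikit-learn is below; the feature columns, label name, and file path are placeholders, and a time-based split may be more faithful to the snapshot setup than the random split shown.

```python
# A hedged churn baseline sketch; column names and the data file are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

df = pd.read_parquet("churn_snapshot.parquet")   # hypothetical snapshot-dated training table
features = ["sessions_30d", "days_since_last_use", "support_tickets_90d", "tenure_months"]
X, y = df[features], df["churned_next_30d"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

baseline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
baseline.fit(X_train, y_train)

scores = baseline.predict_proba(X_test)[:, 1]
print("AUC:", round(roc_auc_score(y_test, scores), 3))
print("Average precision:", round(average_precision_score(y_test, scores), 3))
# Next steps: compare against gradient boosted trees, check calibration,
# lift by decile, and whether flagged users are actually saveable.
```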
Likely follow-ups
What is label leakage in this problem?
Would you optimize precision or recall?
How would you prove the model creates business value?
Framework — Training fit versus generalization
Overfitting happens when a model learns noise or idiosyncrasies in the training data instead of patterns that generalize. It performs well on training data but poorly on unseen data. Ways to reduce overfitting include using train/validation/test splits, cross-validation, regularization, simpler models, pruning trees, early stopping, dropout for neural networks, more data, feature selection, and proper hyperparameter tuning. Data leakage can look like excellent performance but fail in production, so leakage checks are also essential. The right prevention depends on the model and problem. For a high-dimensional sparse model, regularization may help. For gradient boosted trees, depth, learning rate, number of estimators, and early stopping matter. For time series or user behavior data, validation must respect time order to avoid training on the future.
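For time-ordered data, the key point is that every validation fold must come from later periods than its training data. A small scikit-learn sketch with placeholder data:

```python
# Time-respecting validation guards against training on the future; data here is a placeholder.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

# X, y are assumed sorted by event time, oldest rows first.
rng = np.random.default_rng(0)
X = rng.random((5_000, 10))                        # placeholder features
y = (rng.random(5_000) > 0.7).astype(int)          # placeholder labels

model = GradientBoostingClassifier(max_depth=3, learning_rate=0.05, n_estimators=300)
cv = TimeSeriesSplit(n_splits=5)                   # each fold trains on the past, validates on the future
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Per-fold AUC:", np.round(scores, 3))
# A large gap between training fit and these out-of-time scores is a sign of overfitting.
```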
Likely follow-ups
How do you detect overfitting?
What is the difference between validation and test sets?
Can a model underfit and overfit at the same time?
Framework — Interpretability, nonlinearity, data size, performance, deployment
Logistic regression is a strong baseline for binary classification. It is fast, interpretable, easier to calibrate, and works well when relationships are roughly linear after feature engineering. It is often a good choice when stakeholders need clear explanations or when data is limited. Random forests can capture nonlinear relationships and feature interactions without as much manual specification. They may perform better on complex tabular data but are less interpretable, can be larger to serve, and may not extrapolate well outside the training distribution. I would compare them using the same train/validation split, proper metrics, calibration, inference cost, and business constraints. If logistic regression performs nearly as well and interpretability matters, choose it. If random forest provides a meaningful lift and can be explained and deployed responsibly, use it or compare with gradient boosted trees. The best answer is not that one model is always better. It depends on objective, data, constraints, and actionability.
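A like-for-like comparison can be as simple as the sketch below, run on the same split with the same metrics; the data is synthetic, so the numbers are only illustrative.

```python
# Compare the two models on an identical split with identical metrics; synthetic data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss

X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, min_samples_leaf=20, n_jobs=-1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_test, p):.3f}  Brier={brier_score_loss(y_test, p):.4f}")
# The metric gap is only one input; interpretability, calibration, and serving cost matter too.
```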
Likely follow-ups
How would you explain random forest predictions?
When does interpretability matter more than accuracy?
Why might gradient boosting outperform random forests?
Framework — Data mismatch -> leakage -> metric mismatch -> drift -> implementation -> feedback loops
Several failure modes are possible. The offline dataset may not match production traffic. There may be training-serving skew where features are computed differently online than offline. The validation split may have leaked future information or failed to respect time. The offline metric may not match the product objective. Production data may drift: user behavior, seasonality, acquisition channels, inventory, pricing, or external events can change. The model may also create feedback loops. For example, a recommendation model changes what users see, which changes future training data. Implementation issues are common: missing features, default values, latency timeouts, feature freshness problems, incorrect thresholding, or model version mismatch. Segment performance may also be poor even if the aggregate offline metric looked good. I would compare offline and online feature distributions, prediction distributions, calibration, segment metrics, logs, and business outcomes. Then decide whether to roll back, adjust thresholds, fix feature pipelines, retrain, or redesign the objective.
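A lightweight drift check is to compare offline training feature distributions against recent production features, column by column. The sketch below uses a two-sample KS test; the file names and flagging threshold are assumptions.

```python
# Feature drift / training-serving skew check; file names and the 0.1 threshold are assumptions.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_parquet("training_features.parquet")          # hypothetical offline snapshot
live = pd.read_parquet("serving_features_last_7d.parquet")    # hypothetical production log

for col in train.select_dtypes("number").columns:
    if col not in live.columns:
        print(f"{col:30s} MISSING ONLINE")
        continue
    stat, p = ks_2samp(train[col].dropna(), live[col].dropna())
    flag = "DRIFT?" if stat > 0.1 else ""
    print(f"{col:30s} KS={stat:.3f} p={p:.1e} {flag}")
# Also compare prediction score distributions, calibration, and segment metrics online vs offline.
```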
Likely follow-ups
What is training-serving skew?
How would you monitor model drift?
How do feedback loops affect recommender systems?
Model Evaluation and Metrics
Model evaluation questions test whether you can choose metrics that match the business problem. A model can have impressive accuracy and still be useless if the metric is wrong.
Framework — Cost of false positives versus false negatives
Optimize precision when false positives are expensive. For example, if a fraud model blocks legitimate customers, false positives create customer harm and revenue loss. High precision means that when the model flags something, it is usually correct. Optimize recall when false negatives are expensive. For example, in medical screening or severe fraud detection, missing a true positive can be more costly than investigating extra false positives. High recall means the model catches most actual positives. Most real systems require a tradeoff. The threshold should be chosen based on business costs, operational capacity, user harm, and downstream workflow. I would usually evaluate precision-recall curves, not just a single threshold, especially for imbalanced classes.
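One way to make the tradeoff explicit is to price both error types and sweep thresholds. The sketch below uses synthetic scores and assumed costs purely for illustration.

```python
# Pick a threshold by minimizing total business cost; scores, prevalence, and costs are assumptions.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=20_000)                       # rare positive class
y_score = np.clip(rng.normal(0.10 + 0.50 * y_true, 0.20), 0, 1)   # imperfect model scores

cost_fp = 5.0      # assumed cost of acting on a false positive
cost_fn = 100.0    # assumed cost of missing a true positive

thresholds = np.linspace(0.05, 0.95, 91)
costs = []
for t in thresholds:
    pred = (y_score >= t).astype(int)
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    costs.append(cost_fp * fp + cost_fn * fn)

best = thresholds[int(np.argmin(costs))]
pred_best = (y_score >= best).astype(int)
print(f"Cost-minimizing threshold: {best:.2f}")
print(f"Precision: {precision_score(y_true, pred_best):.2f}, "
      f"Recall: {recall_score(y_true, pred_best):.2f}")
```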
Likely follow-ups
Why can accuracy be misleading for rare events?
How would you choose a threshold?
What metric would you use for fraud detection?
Framework — Probability quality, not just ranking quality
Calibration means predicted probabilities match observed frequencies. If a calibrated model assigns 0.8 probability to 1,000 examples, about 800 should actually be positive. Calibration matters when probabilities drive decisions: risk scoring, pricing, medical triage, fraud review queues, churn interventions, or expected value calculations. A model can rank examples well with high AUC but still produce poorly calibrated probabilities. We can inspect calibration curves or reliability diagrams and metrics like Brier score. Calibration methods include Platt scaling, isotonic regression, and temperature scaling. However, calibration should be checked on validation data that reflects production distribution. In an interview, emphasize that not every application needs perfectly calibrated probabilities. If the model only ranks content, ranking metrics may matter more. If the number is interpreted as risk, calibration becomes critical.
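A quick way to demonstrate the idea is to compare raw and recalibrated probabilities on held-out data. The sketch below applies isotonic regression via scikit-learn to synthetic data, so the exact numbers are illustrative.

```python
# Check calibration (reliability) and apply isotonic recalibration; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

X, y = make_classification(n_samples=30_000, n_features=15, weights=[0.85], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

raw = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=1), method="isotonic", cv=3
).fit(X_train, y_train)

for name, model in [("raw", raw), ("isotonic", calibrated)]:
    p = model.predict_proba(X_test)[:, 1]
    frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10)
    print(f"{name}: AUC={roc_auc_score(y_test, p):.3f}  Brier={brier_score_loss(y_test, p):.4f}")
    # Reliability diagram data: predicted probability per bin vs observed positive rate.
    print("  (predicted, observed):",
          [(round(mp, 2), round(fp, 2)) for mp, fp in zip(mean_pred, frac_pos)])
```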
Likely follow-ups
Can a model have high AUC and poor calibration?
How would you improve calibration?
When does calibration matter for business decisions?
SQL and Python Questions
Data scientist interviews often include SQL and Python because strong modeling work still depends on accurate data extraction, transformation, validation, and exploratory analysis.
Framework — Cohort month -> activity month -> month offset -> aggregate
Create a cohort CTE with each user and their signup month. Create an activity CTE with distinct user_id and activity month for qualifying active events. Join activity to cohort by user_id, then calculate month_number as the difference between activity month and signup month. Group by cohort month and month_number, counting distinct active users. The denominator is the number of users in the original cohort. The numerator for month N is users from that cohort active in month N. Use a left join if you need to preserve months with zero retained users. Important details: define the active event, exclude test users, handle users who signed up near month boundaries, use a consistent timezone, and avoid counting multiple activity events per user-month. The result should be a cohort table where each row is a signup month and each column is the month offset, holding that cohort's retention rate for month N.
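If the interviewer allows Python instead of SQL, the same logic translates to pandas. A sketch with assumed table and column names:

```python
# A pandas sketch of the cohort retention logic; files and column names are assumptions.
import pandas as pd

users = pd.read_csv("users.csv", parse_dates=["signup_date"])      # user_id, signup_date
events = pd.read_csv("events.csv", parse_dates=["event_date"])     # user_id, event_date

users["cohort_month"] = users["signup_date"].dt.to_period("M")
events["activity_month"] = events["event_date"].dt.to_period("M")
activity = events[["user_id", "activity_month"]].drop_duplicates()  # one row per user-month

joined = activity.merge(users[["user_id", "cohort_month"]], on="user_id")
joined["month_number"] = (joined["activity_month"] - joined["cohort_month"]).apply(lambda d: d.n)

cohort_size = users.groupby("cohort_month")["user_id"].nunique()
active = joined.groupby(["cohort_month", "month_number"])["user_id"].nunique()
retention = active.div(cohort_size, level="cohort_month").unstack("month_number")
print(retention.round(3))   # rows: signup cohorts, columns: months since signup
```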
Likely follow-ups
How would you calculate rolling retention instead?
How would you visualize cohort retention?
What if activity data arrives late?
Framework — Quantify -> segment -> diagnose -> decide treatment
I would first quantify missingness by column: count, percentage, and data type. In pandas, df.isna().sum() and df.isna().mean() give a quick profile. Then I would inspect whether missingness is concentrated by time, segment, source, device, geography, or target label. The key question is why values are missing. Missing completely at random is different from missing because a user skipped a field, tracking failed, a device does not support an event, or a value is not applicable. Treatment depends on cause and model needs. Options include keeping missing as its own category, imputing with median or mode, using model-based imputation, excluding rows, or fixing upstream data collection. For modeling, I would fit imputation only on training data and apply it to validation/test to avoid leakage. I would also evaluate whether missingness itself is predictive. For example, missing income in a credit dataset or missing profile fields in a consumer product can carry signal, but using it may raise fairness or compliance concerns.
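A sketch of the profiling step plus leakage-safe imputation inside a scikit-learn pipeline is below; column names are placeholders and the features are assumed to be numeric.

```python
# Profile missingness, then impute inside a pipeline so statistics come from training data only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("dataset.csv")    # hypothetical file with numeric features and a "target" column
profile = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": df.isna().mean().round(3),
    "dtype": df.dtypes,
}).sort_values("missing_pct", ascending=False)
print(profile.head(15))

# Missingness itself can carry signal, so keep indicator columns alongside imputed values.
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)         # imputer medians are learned from the training split only
print("Held-out accuracy:", round(pipe.score(X_test, y_test), 3))
```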
Likely follow-ups
When would missingness be informative?
How do you avoid leakage during imputation?
What would you do if the target label has missing values?
Framework — Detect -> diagnose -> decide by business meaning
Outlier detection methods include summary statistics, histograms, box plots, z-scores, IQR rules, percentile thresholds, and model-based approaches. But detection is only the first step. The important question is whether the outlier is an error, a rare but valid case, or the most important part of the distribution. I would inspect outliers by source, timestamp, segment, and raw records. A negative age is likely a data error. A very large enterprise purchase may be valid and should not be removed from revenue analysis without reason. For modeling, outliers may require transformation, winsorization, robust models, or segment-specific treatment. Remove outliers only when there is a defensible reason: impossible value, duplicate event, instrumentation bug, test account, or records outside analysis scope. If valid outliers affect the conclusion, report sensitivity with and without them.
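A short sketch of the IQR flagging rule and a winsorized sensitivity check, on synthetic skewed data; the 1.5x multiplier and the 1st/99th percentile caps are conventional choices rather than fixed rules.

```python
# Flag outliers with the IQR rule, then report the metric with and without winsorization.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=3.5, sigma=0.6, size=10_000))  # skewed, like order values

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(f"Flagged {len(outliers)} rows ({len(outliers) / len(values):.2%}) for inspection")

# Winsorize at the 1st/99th percentiles as a sensitivity check, not as automatic removal.
winsorized = values.clip(lower=values.quantile(0.01), upper=values.quantile(0.99))
print(f"Mean raw: {values.mean():.1f}  Mean winsorized: {winsorized.mean():.1f}")
```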
Likely follow-ups
What is winsorization?
How can outliers affect linear regression?
When are outliers the signal?
Product Analytics and Business Case Questions
Many data scientist roles are embedded with product teams. These interviews test whether you can translate product ambiguity into metrics, analysis plans, experiments, and decisions.
Framework — Metric quality -> funnel diagnosis -> segment -> mechanism -> action
Clicks may be a weak proxy for value. First verify the result and define the funnel: impressions, clicks, add-to-cart, checkout, purchase, refunds, and repeat behavior. If clicks increased but purchases decreased, the recommendations may be attracting curiosity clicks without purchase intent or distracting users from better paths. Segment by user type, traffic source, product category, device, and recommendation surface. New users might click more because the module is prominent but find irrelevant items. Returning users might be disrupted from their normal purchase flow. Inspect recommendation quality: relevance, price mismatch, availability, delivery time, diversity, and whether recommended items are out of stock or low margin. Also check latency and page layout changes. Recommendation: do not ship based on clicks alone. Either roll back, limit exposure to segments where purchases are healthy, or redesign the objective to optimize downstream purchase quality instead of click-through rate.
Likely follow-ups
What primary metric would you choose?
How would you measure recommendation quality?
What if revenue increased but purchases decreased?
Framework — Liquidity -> quality -> balance -> retention -> unit economics
Marketplace health depends on both sides. The core concept is liquidity: can demand find suitable supply quickly, and can supply find enough demand to stay engaged? Metrics depend on marketplace type. For rideshare: match rate, time to match, ETA, cancellation rate, driver utilization, rider repeat rate, price surge frequency, and geographic coverage. For freelance marketplaces: percentage of jobs receiving qualified bids, time to first bid, hire rate, project completion, dispute rate, repeat hiring, and provider utilization. Segment by geography, category, time of day, user cohort, supply tier, and demand intent. Marketplace averages hide local imbalance. A marketplace can look healthy overall while failing in a specific city or category. Guardrails include trust and safety, fraud, quality complaints, refunds, churn on either side, and unit economics. Recommendations should identify whether the constrained side is supply or demand because growth levers differ completely.
Likely follow-ups
How would you solve a cold-start problem?
How do you know which side is constrained?
What metric would you show leadership weekly?
Framework — User value -> repeated behavior -> learning outcome -> guardrails
A North Star metric should capture durable user value, not just activity. For a language learning app, daily sessions alone may be too shallow because users can open the app without learning. A better candidate might be weekly active learners who complete a meaningful lesson with sufficient accuracy, or weekly learning minutes that meet quality criteria. I would consider the core value: helping users make progress in a language. Input metrics could include lesson starts, lesson completions, streaks, accuracy, review completion, speaking practice, and level progression. Outcome metrics could include retention, subscription conversion, placement improvement, or external proficiency assessments if available. Guardrails: burnout, low-quality rapid completions, cheating, notification opt-outs, churn, and user frustration. If the metric only rewards more time spent, it may encourage grind rather than learning. I would define the North Star with product and learning science stakeholders, then validate whether it predicts retention and user-reported progress.
Likely follow-ups
Why not use DAU as the North Star?
How would you prevent gaming the metric?
How would this differ for casual versus serious learners?
Modeling Case Studies
Modeling case interviews test the full data science workflow: problem framing, labels, features, baseline, evaluation, deployment, monitoring, and business value.
Framework — Objective -> labels -> features -> metrics -> intervention -> monitoring
First define fraud and the action. Are we blocking transactions, sending them to manual review, requiring step-up authentication, or scoring risk? The model objective should match the intervention because false positives can hurt legitimate customers. Labels may come from chargebacks, confirmed fraud investigations, user reports, or rule-based flags. Labels are delayed and imperfect, so account for label latency and noise. Features might include transaction amount, merchant, device, IP/geography mismatch, account age, velocity, payment method history, failed attempts, shipping distance, and prior disputes. Start with rules and a simple baseline, then compare models such as logistic regression, gradient boosted trees, or anomaly detection depending on label quality. Evaluation should emphasize precision-recall, recall at a fixed review capacity, false positive rate for legitimate users, dollar-weighted fraud caught, and calibration. Deployment requires monitoring drift, adversarial adaptation, fairness, latency, manual review capacity, feedback loops, and rollback. A fraud model is not just a prediction problem; it is an operational system.
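Recall at a fixed review capacity is worth being able to compute on the spot. The sketch below uses synthetic scores and an assumed review budget to show count-based and dollar-weighted versions.

```python
# Recall and precision at a fixed review budget; scores, rates, and capacity are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 100_000
is_fraud = rng.binomial(1, 0.002, size=n)                          # rare positives
amount = rng.lognormal(4, 1, size=n)                               # transaction dollars
score = np.clip(rng.normal(0.05 + 0.60 * is_fraud, 0.15), 0, 1)    # imperfect model scores

review_capacity = 500                                              # assumed daily review budget
txns = pd.DataFrame({"fraud": is_fraud, "amount": amount, "score": score})
top_k = txns.nlargest(review_capacity, "score")

recall_at_k = top_k["fraud"].sum() / is_fraud.sum()
dollar_recall = top_k.loc[top_k["fraud"] == 1, "amount"].sum() / amount[is_fraud == 1].sum()
precision_at_k = top_k["fraud"].mean()
print(f"recall@{review_capacity}: {recall_at_k:.1%}, "
      f"dollar-weighted recall: {dollar_recall:.1%}, "
      f"precision@{review_capacity}: {precision_at_k:.1%}")
```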
Likely follow-ups
How would you handle delayed labels?
What is the cost of false positives?
How would fraudsters adapt?
Framework — Target definition -> feature availability -> metric -> calibration -> user impact
Define the target as actual delivery duration from order confirmation to arrival, or break it into components: preparation time, courier assignment, pickup time, travel time, and handoff. Component modeling can be more interpretable and operationally useful. Features available at prediction time may include restaurant, cuisine, time of day, day of week, weather, distance, courier supply, current backlog, historical prep times, traffic, order size, and geographic zone. Avoid using features not known at prediction time, such as actual pickup time if predicting at order placement. Metrics should reflect user experience. MAE is easy to interpret, but underestimation may be worse than overestimation. Calibration matters: if we promise a delivery window, the order should arrive within that window at the expected rate. Deployment concerns include latency, real-time feature freshness, cold-start restaurants, holidays, weather shocks, and feedback loops from quoted ETAs influencing user decisions. Monitor error by restaurant, zone, time, and customer segment.
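If underestimation is costlier, one option is to quote an upper quantile rather than a point estimate. The sketch below compares a squared-error model with a quantile model in scikit-learn; the 0.8 quantile and the synthetic data are assumptions.

```python
# Quote an upper quantile so most orders arrive within the promised window; data is synthetic.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=20_000, n_features=8, noise=10.0, random_state=3)
y = np.abs(y) / 10 + 20                            # stand-in "delivery minutes", always positive
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

point_model = GradientBoostingRegressor(loss="squared_error").fit(X_train, y_train)
quoted_model = GradientBoostingRegressor(loss="quantile", alpha=0.8).fit(X_train, y_train)

for name, model in [("point estimate", point_model), ("quoted ETA (p80)", quoted_model)]:
    pred = model.predict(X_test)
    late_rate = np.mean(pred < y_test)             # share of orders arriving after the quote
    mae = np.mean(np.abs(pred - y_test))
    print(f"{name}: MAE={mae:.1f} min, late rate={late_rate:.1%}")
# The p80 quote trades a larger MAE for far fewer late deliveries, which may match user experience better.
```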
Likely follow-ups
Would you model total time or components?
How would you handle new restaurants?
What is worse: overestimating or underestimating ETA?
Behavioral and Communication Questions
Behavioral data science interviews focus on ambiguity, influence, technical communication, project impact, ethical judgment, and cases where the data did not support the preferred narrative.
Framework — Problem -> method -> decision -> impact -> lesson
Choose a project where your work changed a decision, product, process, or metric. Start with the business problem and why it mattered. Then explain your method at the right level: data sources, analysis or model, validation, and tradeoffs. The strongest answer connects technical work to business action. For example, a churn model helped customer success prioritize outreach, an experiment changed a launch decision, or a pricing analysis improved revenue without hurting conversion. Quantify impact if possible: revenue lift, churn reduction, time saved, cost reduction, better targeting, or improved decision speed. Also explain limitations and what you would improve next. Avoid making the answer a tool walkthrough. The interviewer cares less about which library you used and more about whether your work was correct, trusted, adopted, and useful.
Likely follow-ups
How did you measure impact?
What was the hardest technical challenge?
How did you get stakeholder buy-in?
Framework — Decision -> intuition -> evidence -> limitations -> action
I start with the decision the stakeholder needs to make, not the model architecture. Then I explain the model intuition in plain language: what it predicts, what signals it uses, how accurate it is, and where it should not be trusted. I use examples, feature importance or reason codes when appropriate, calibration plots, and simple tradeoff charts like precision versus recall. I avoid pretending the model is magic. Stakeholders need to understand limitations: data quality, uncertainty, bias, drift, and edge cases. Finally, I connect the model to action. For example, “This model identifies the top 10% of accounts most likely to churn, but the value comes from testing whether outreach to that group improves retention.” This keeps the conversation focused on decisions and outcomes.
Likely follow-ups
What if stakeholders only care about accuracy?
How do you communicate uncertainty?
When would you avoid using a complex model?
Framework — Expectation -> evidence -> communication -> decision -> outcome
Pick a real example where evidence challenged a preferred plan. Explain what stakeholders expected, what data you analyzed, and why the conclusion differed. Be specific about validation because credibility matters in these situations. Then describe how you communicated the finding. Strong answers show diplomacy: acknowledge the stakeholder goal, present evidence clearly, explain uncertainty, and offer alternatives. The goal is not to win an argument; it is to help the team make a better decision. Close with the outcome. Maybe the team delayed launch, changed targeting, ran a smaller test, or chose a different metric. If stakeholders ignored the recommendation, explain what happened and what you learned about influence and communication.
Likely follow-ups
How did you preserve the relationship?
What if leadership disagreed?
How did you make the analysis more trustworthy?
Practice these answers live
Interview Pilot gives you real-time Copilot answer suggestions during live interviews, so you can respond clearly when these questions come up.
