Data Engineer Interview Questions
Practice data engineer interview questions across SQL transformations, data modeling, ETL pipelines, orchestration, batch and streaming systems, Spark, data quality, observability, system design, and behavioral collaboration. Use this as a focused question list alongside the full Data Engineer Interview Guide.
21 questions
8 categories
Data Engineer
Updated May 2026
SQL and Data Transformation Questions
Data engineer SQL questions are usually about correctness and production-readiness: grain, deduplication, incremental logic, window functions, partitions, and performance.
Framework — Window function with deterministic ordering
Use a window function to assign a row number within each event_id partition. Order by the most reliable freshness column descending, such as ingestion_timestamp or updated_at. If ties are possible, add a deterministic tie-breaker like source_sequence_number or raw_file_name. The query pattern is: select all rows, calculate row_number() over (partition by event_id order by ingestion_timestamp desc, source_sequence_number desc), then keep row_number = 1 by filtering in an outer query (or with QUALIFY, where the engine supports it, since a window function cannot be referenced directly in the same query's WHERE clause). The important interview detail is determinism. If two records have the same timestamp and no tie-breaker, the result may be unstable between runs. Also clarify whether the latest event means latest ingestion time, latest event time, or latest source update time. For production, I would monitor duplicate rate, late-arriving events, null event_id count, and whether deduplication changes downstream metrics unexpectedly.
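The dedup pattern above can be sketched with SQLite's window-function support; the table and column names (raw_events, ingestion_ts, source_seq) are illustrative assumptions, not a fixed schema:

```python
import sqlite3

# Minimal sketch of deterministic dedup; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (event_id TEXT, payload TEXT,
                             ingestion_ts TEXT, source_seq INTEGER);
    INSERT INTO raw_events VALUES
        ('e1', 'v1', '2024-01-01 00:00:00', 1),
        ('e1', 'v2', '2024-01-01 00:05:00', 2),
        ('e2', 'v1', '2024-01-01 00:00:00', 1),
        ('e2', 'v2', '2024-01-01 00:00:00', 2);  -- tie on timestamp
""")

# The window function is computed in a subquery and filtered in the
# outer query; source_seq breaks the timestamp tie deterministically.
rows = conn.execute("""
    SELECT event_id, payload FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY event_id
                   ORDER BY ingestion_ts DESC, source_seq DESC
               ) AS rn
        FROM raw_events
    ) WHERE rn = 1
    ORDER BY event_id
""").fetchall()
print(rows)  # -> [('e1', 'v2'), ('e2', 'v2')]
```

Without the source_seq tie-breaker, event e2 could return either payload between runs, which is exactly the instability the answer warns about.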
Likely follow-ups
What if event_id is missing?
How would you handle late-arriving corrections?
How would this change for incremental processing?
Framework — Filter active events -> normalize date -> group by date and user
First define active user. Usually, not every event should count. Page views, bot events, background refreshes, and internal users may need exclusion. Choose meaningful events such as session_start, purchase, message_sent, or feature_used depending on the product. Create a transformation that filters to eligible events, converts event timestamps to the business timezone, groups by activity_date and user_id, and outputs one row per active user per day. If the final table is an aggregate, count distinct users by day. If the table is a serving model, store date, user_id, platform, country, and first_active_timestamp. Important production details include timezone, late-arriving events, bot/test filtering, deleted users, duplicate events, and partitioning by activity_date. For incremental builds, process recent partitions again to capture late events, not only the current day. A strong answer also includes validation: compare DAU against source event counts, monitor day-over-day changes, and check segment-level consistency.
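The filter-then-aggregate shape of this transformation can be sketched in SQLite; the event names, the eligible-event list, and the already-UTC timestamps are illustrative assumptions (a real pipeline would also convert to the business timezone):

```python
import sqlite3

# Sketch of the DAU aggregate; schema and event taxonomy are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_name TEXT, event_ts TEXT);
    INSERT INTO events VALUES
        ('u1', 'session_start', '2024-03-01 08:00:00'),
        ('u1', 'purchase',      '2024-03-01 09:00:00'),  -- same user, same day
        ('u2', 'page_view',     '2024-03-01 10:00:00'),  -- not an active event
        ('u3', 'session_start', '2024-03-02 11:00:00');
""")

dau = conn.execute("""
    SELECT date(event_ts) AS activity_date,
           COUNT(DISTINCT user_id) AS dau
    FROM events
    WHERE event_name IN ('session_start', 'purchase')  -- eligible events only
    GROUP BY activity_date
    ORDER BY activity_date
""").fetchall()
print(dau)  # -> [('2024-03-01', 1), ('2024-03-02', 1)]
```

Note that u2 does not count on 2024-03-01 because page views were excluded, and u1 counts once despite two eligible events, which is the distinct-user grain the answer describes.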
Likely follow-ups
Which events should count as active?
How would you handle late events?
Should DAU be a table or a metric query?
Framework — Scope -> query plan -> data growth -> joins -> partitions -> recent changes
First check whether the issue is query-specific, warehouse-wide, or infrastructure-related. If many queries slowed down, it may be cluster capacity, warehouse load, or service degradation. If one query slowed down, inspect the query and data. Look at recent changes: table size growth, schema changes, new joins, changed filters, missing partition pruning, stale statistics, data skew, exploding joins, or added distinct/order by operations. Review the query plan if available. Common causes include joining at the wrong grain, missing date filters, many-to-many joins, scanning all partitions, inefficient window functions, large shuffles, or using functions on partition columns that prevent pruning. Fixes may include filtering earlier, pre-aggregating, adding partition filters, clustering/sorting, materializing intermediate models, updating statistics, reducing selected columns, or correcting join logic. I would also add monitoring for runtime, bytes scanned, row counts, and cost.
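One of the causes listed, a many-to-many join at the wrong grain, is easy to demonstrate concretely; the order_items/shipments tables below are illustrative:

```python
import sqlite3

# Tiny demonstration of a many-to-many join exploding row counts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_items (order_id INTEGER, item TEXT);
    CREATE TABLE shipments   (order_id INTEGER, box TEXT);
    INSERT INTO order_items VALUES (1, 'i1'), (1, 'i2'), (1, 'i3');
    INSERT INTO shipments   VALUES (1, 'b1'), (1, 'b2');
""")

# 3 item rows x 2 shipment rows on the same order_id -> 6 output rows,
# so any per-item metric summed after this join would be doubled.
(n,) = conn.execute("""
    SELECT COUNT(*)
    FROM order_items oi JOIN shipments s ON oi.order_id = s.order_id
""").fetchone()
print(n)  # -> 6
```

This is why comparing input and output row counts at each join is a fast diagnostic when a query suddenly slows down or a metric inflates.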
Likely follow-ups
What is partition pruning?
How can a join explode row counts?
How would you prevent this regression?
Data Modeling and Warehousing Questions
Data modeling interviews test whether you can design schemas that are clear, scalable, cost-aware, and useful for analytics, machine learning, and operational reporting.
Framework — Business processes -> facts -> dimensions -> grain -> metrics
Start with core business processes: browsing, cart, orders, payments, shipments, returns, refunds, inventory, and marketing attribution. Each process may become a fact table or event model depending on usage. Important fact tables include fact_orders at one row per order, fact_order_items at one row per product per order, fact_payments, fact_refunds, fact_shipments, and fact_inventory_snapshot. Dimensions include dim_customer, dim_product, dim_date, dim_store or warehouse, dim_channel, and dim_campaign. Declare grain explicitly. Revenue by product category should usually come from order item grain, not order grain. Customer lifecycle metrics may use customer or subscription grain. Inventory analysis may need daily snapshot grain. Model considerations: slowly changing customer or product attributes, refunds and cancellations, discounts at order versus item level, currency, taxes, shipping, guest checkout, and late-arriving data. I would define source-of-truth metrics and build semantic models or marts for finance, marketing, product, and operations. Validation includes reconciling revenue to finance, order counts to transactional systems, refund totals to payments, and inventory to warehouse systems.
Likely follow-ups
What is the grain of fact_order_items?
How would you model refunds?
How would you handle product category changes over time?
Framework — Type 1 versus Type 2 based on history needs
The right approach depends on whether historical accuracy matters. Type 1 overwrites old values, which is simple and useful when only the latest value matters. Type 2 preserves history by creating a new dimension row for each attribute change, usually with effective_start_date, effective_end_date, current_flag, and a surrogate key. Example: if a customer changes region from West to East, Type 1 would update the customer row to East. Historical revenue reports would now show old revenue under East, which may be wrong. Type 2 would preserve the old West row and create a new East row, allowing facts to join to the correct version at event time. Type 2 adds complexity: surrogate keys, date range joins, late-arriving facts, current versus historical reporting, and larger dimension tables. Use it when reporting must reflect attributes as they were at the time of the event. A strong answer explains the business need before choosing the SCD type.
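The West-to-East example above can be sketched as a Type 2 dimension with a date-range join; the schema, dates, and the 9999-12-31 open-end convention are illustrative assumptions:

```python
import sqlite3

# Sketch of a point-in-time join against a Type 2 dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER, customer_id TEXT, region TEXT,
        effective_start_date TEXT, effective_end_date TEXT);
    -- Customer c1 moved from West to East on 2024-06-01.
    INSERT INTO dim_customer VALUES
        (1, 'c1', 'West', '2020-01-01', '2024-06-01'),
        (2, 'c1', 'East', '2024-06-01', '9999-12-31');

    CREATE TABLE fact_orders (order_id TEXT, customer_id TEXT,
                              order_date TEXT, revenue REAL);
    INSERT INTO fact_orders VALUES
        ('o1', 'c1', '2024-03-15', 100.0),   -- before the move
        ('o2', 'c1', '2024-07-15', 200.0);   -- after the move
""")

# Each fact joins to the dimension row valid at the order date.
rows = conn.execute("""
    SELECT f.order_id, d.region
    FROM fact_orders f
    JOIN dim_customer d
      ON f.customer_id = d.customer_id
     AND f.order_date >= d.effective_start_date
     AND f.order_date <  d.effective_end_date
    ORDER BY f.order_id
""").fetchall()
print(rows)  # -> [('o1', 'West'), ('o2', 'East')]
```

A Type 1 overwrite would attribute both orders to East; the date-range join is the extra complexity Type 2 buys its historical accuracy with.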
Likely follow-ups
When is Type 1 acceptable?
How do facts join to a Type 2 dimension?
What happens with late-arriving facts?
Framework — Governance and flexibility versus query simplicity and performance
A star schema separates facts and dimensions. It is useful when you need reusable dimensions, consistent metrics, governance, clear grain, and flexible analysis. It reduces duplication and makes changes to dimensions easier to manage. A wide denormalized table can be useful for specific high-usage analytics, machine learning features, or dashboards where query simplicity and performance matter. It can reduce join complexity for consumers but may duplicate data and create governance risk if many wide tables define metrics differently. In practice, I would often maintain clean canonical facts and dimensions, then publish curated marts or wide tables for specific use cases. The warehouse foundation should remain trustworthy, while downstream models optimize for user needs. The decision depends on scale, consumers, BI tool behavior, cost, latency, and how often business definitions change.
Likely follow-ups
Which is better for BI dashboards?
How do you avoid metric inconsistency?
What is a semantic layer?
Pipeline Design and Orchestration
Pipeline interviews test whether you can design workflows that are idempotent, observable, recoverable, and appropriate for freshness and cost requirements.
Framework — Extract -> land raw -> transform -> validate -> publish -> monitor
Clarify requirements first: data freshness, order volume, acceptable lag, source database load constraints, downstream consumers, and whether historical corrections occur. A simple design: extract orders incrementally using updated_at or CDC, land raw data in object storage or a raw warehouse table, transform into staged tables, then publish modeled facts such as fact_orders and fact_order_items. Orchestrate with Airflow, Dagster, dbt Cloud, or a managed workflow tool. The pipeline should be idempotent. If it reruns for a date or batch, it should replace or merge data deterministically rather than append duplicates. Use watermarks carefully, and reprocess a lookback window to capture late updates. Validation checks: row counts, null primary keys, duplicate order IDs, revenue totals, status distribution, freshness, and reconciliation to source. Monitoring should alert on failures, unusual volume changes, high duplicate rates, or delayed data. Document ownership and downstream dependencies.
Likely follow-ups
Why is updated_at not always enough?
How would you avoid loading duplicate orders?
How would you backfill two years of history?
Framework — Same input and rerun produce same output
An idempotent pipeline can be run multiple times with the same input and produce the same correct output without duplicates or unintended side effects. This matters because pipelines fail, get retried, and require backfills. Example: a non-idempotent pipeline appends yesterday's orders every time it runs. If it is retried after a failure, revenue doubles. An idempotent version would delete and replace the target partition, merge by primary key, or write to a temporary table then atomically swap. Idempotency requires clear keys, partition strategy, deterministic transformations, and careful handling of side effects. It is one of the most important production data engineering concepts because recoverability depends on it.
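The delete-and-replace pattern can be sketched in a few lines; the fact_orders table and partitioning by order_date are illustrative assumptions:

```python
import sqlite3

# Sketch of an idempotent partition load: delete the target date
# partition, then insert, inside a single transaction.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_orders (order_id TEXT, order_date TEXT, amount REAL)"
)

def load_partition(conn, order_date, rows):
    """Replace one date partition deterministically; safe to rerun."""
    with conn:  # transaction: delete + insert commit atomically together
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?",
                     (order_date,))
        conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

batch = [("o1", "2024-05-01", 10.0), ("o2", "2024-05-01", 20.0)]
load_partition(conn, "2024-05-01", batch)
load_partition(conn, "2024-05-01", batch)  # a retry does not duplicate rows

(n,) = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()
print(n)  # -> 2
```

A naive append version of load_partition would leave four rows after the retry, which is the doubled-revenue failure mode described above.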
Likely follow-ups
How do you make an append-only pipeline idempotent?
What is an atomic swap?
How does idempotency help with backfills?
Framework — Detect -> classify -> protect -> communicate -> migrate
First detect schema changes automatically through schema registry, metadata checks, contract tests, or ingestion validation. Then classify the change: additive column, removed column, renamed column, type change, nullability change, or semantic change. Additive nullable columns are usually safe. Removed columns, renamed fields, and type changes can break downstream transformations and dashboards. Semantic changes are especially dangerous because the schema may look valid while meaning changes. Protect the pipeline with raw data preservation, compatibility checks, alerts, and contracts for critical sources. For breaking changes, coordinate with source owners, update transformations, backfill if needed, and communicate downstream impact. A mature answer includes data contracts, versioning, lineage, tests, and clear ownership so schema changes do not silently corrupt analytics.
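The classification step can be sketched as a comparison of expected versus observed column-to-type mappings; the rules here are simplified, and semantic changes (same schema, different meaning) cannot be caught this way:

```python
# Sketch of classifying a schema change; classification rules and the
# column/type examples are illustrative.
def classify_schema_change(expected, observed):
    changes = []
    for col in observed.keys() - expected.keys():
        changes.append(f"additive: {col}")   # usually safe if nullable
    for col in expected.keys() - observed.keys():
        changes.append(f"removed: {col}")    # likely breaking downstream
    for col in expected.keys() & observed.keys():
        if expected[col] != observed[col]:
            changes.append(
                f"type change: {col} {expected[col]} -> {observed[col]}"
            )
    return sorted(changes)

expected = {"order_id": "string", "amount": "double"}
observed = {"order_id": "string", "amount": "string", "coupon": "string"}
print(classify_schema_change(expected, observed))
# -> ['additive: coupon', 'type change: amount double -> string']
```

A check like this can run at ingestion time and route additive changes to a log while removed columns and type changes page the pipeline owner.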
Likely follow-ups
What is a data contract?
How do you handle a column type change?
What if the source team gives no warning?
Batch and Streaming Systems
Batch and streaming questions test whether you understand latency, throughput, ordering, state, windowing, replay, and the operational tradeoffs between simpler and more real-time architectures.
Framework — Decision latency versus complexity
Choose streaming when low latency creates meaningful value. Examples include fraud detection, operational alerts, real-time personalization, inventory updates, live experimentation metrics, and user-facing product features that need fresh events. Batch is often better for daily reporting, historical analytics, finance reconciliation, and transformations where latency of hours is acceptable. Batch is simpler to operate, easier to replay, and often cheaper. The decision should consider freshness requirements, event volume, ordering needs, stateful processing, fault tolerance, team expertise, cost, and downstream consumers. Streaming is not automatically better. It adds complexity around duplicates, late events, checkpoints, schema evolution, and monitoring. A strong answer says: use the simplest architecture that meets the business requirement. If the business only needs daily metrics, do not build a streaming system for prestige.
Likely follow-ups
What are late events?
How do you handle duplicates in streaming?
What does exactly-once mean?
Framework — Events -> stream -> enrichment -> rules/model -> action -> storage -> monitoring
Start with requirements: event volume, latency target, fraud decision type, tolerance for false positives, and whether decisions block transactions or trigger review. A possible design: payment events are published to Kafka or Kinesis. A stream processor consumes events, validates schema, deduplicates by event_id, enriches with user, device, merchant, and historical velocity features, then applies rules or a model. High-risk events trigger a decision service, manual review queue, or step-up authentication. Raw events and decisions are written to durable storage for audit and model training. Key engineering concerns: duplicate events, out-of-order events, feature freshness, state store size, exactly-once or effectively-once processing, backpressure, dead-letter queues, replay, latency monitoring, and model/rule versioning. Data quality and governance matter because fraud decisions affect customers. Log reason codes, monitor false positive rates, and support rollback if a rule or model behaves badly.
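The dedup-then-decide step can be sketched with in-memory keyed state; in a real stream processor this state would live in a checkpointed store, and the event shape and the toy amount threshold are illustrative assumptions standing in for real features or a model:

```python
# Sketch of effectively-once processing via keyed dedup state.
def process_stream(events, seen=None):
    seen = set() if seen is None else seen  # checkpointed state in production
    decisions = []
    for event in events:
        if event["event_id"] in seen:       # redelivered duplicate: skip
            continue
        seen.add(event["event_id"])
        # Toy rule in place of real velocity features / fraud model:
        risk = "review" if event["amount"] > 1000 else "allow"
        decisions.append((event["event_id"], risk))
    return decisions

events = [
    {"event_id": "p1", "amount": 50},
    {"event_id": "p2", "amount": 5000},
    {"event_id": "p1", "amount": 50},   # duplicate delivery from the broker
]
print(process_stream(events))  # -> [('p1', 'allow'), ('p2', 'review')]
```

Because the consumer may see the same payment twice after a retry, the dedup-by-event_id state is what turns at-least-once delivery into effectively-once decisions.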
Likely follow-ups
How would you compute velocity features?
How would you replay events safely?
What is the dead-letter queue for?
Spark and Distributed Processing Questions
Spark and distributed processing questions test whether you understand partitioning, shuffles, skew, caching, joins, file formats, and why jobs fail or become expensive.
Framework — Data redistribution across partitions
A shuffle happens when Spark needs to redistribute data across partitions, usually for operations like groupBy, join, distinct, orderBy, and repartition. It is expensive because data must move across the network, be serialized/deserialized, written to disk, and coordinated across executors. Shuffles often dominate job runtime and can cause failures when data is skewed or intermediate data is too large. A single hot key can send too much data to one partition, creating stragglers or out-of-memory errors. Ways to reduce shuffle cost include filtering early, selecting only needed columns, pre-aggregating, using broadcast joins for small tables, partitioning by join keys, salting skewed keys, avoiding unnecessary distinct/orderBy, and tuning partition counts. A strong answer connects the concept to practical debugging: inspect the Spark UI, stages, shuffle read/write, skew, spill, task duration, and executor memory.
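Key salting, one of the mitigations listed, can be illustrated without a cluster as a two-stage aggregation; the salt factor and data are illustrative, and in Spark the same idea is expressed by appending a random salt column before the first groupBy:

```python
import random
from collections import Counter

# Toy two-stage aggregation with key salting, mimicking how a skewed
# groupBy key can be split across tasks. Salt factor is illustrative.
random.seed(0)
SALT_FACTOR = 4

records = [("hot_key", 1) for _ in range(1000)] + [("rare_key", 1)]

# Stage 1: aggregate on (key, salt), spreading the hot key over
# SALT_FACTOR partial aggregates instead of one straggler task.
partial = Counter()
for key, value in records:
    partial[(key, random.randrange(SALT_FACTOR))] += value

# Stage 2: strip the salt and merge the much smaller partials.
final = Counter()
for (key, _salt), value in partial.items():
    final[key] += value

print(dict(final))  # -> {'hot_key': 1000, 'rare_key': 1}
```

The totals are identical to an unsalted groupBy; the win is that no single stage-1 task has to hold all 1000 hot-key rows, which is what removes the straggler or out-of-memory failure.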
Likely follow-ups
What causes data skew?
When would you use a broadcast join?
How do you debug a slow Spark job?
Framework — Identify stage -> reduce data -> fix skew -> tune partitions -> avoid bad operations
First identify where the failure occurs using Spark UI or logs: which stage, operation, partition, and executor. OOM can come from skew, large joins, collect operations, oversized partitions, inefficient UDFs, or caching too much data. Then reduce data early: filter rows, select only necessary columns, push predicates, and avoid reading unnecessary partitions. Check joins: if one table is small, use broadcast join. If a key is skewed, use salting or split heavy keys. If partitions are too large, increase partition count; if too many tiny partitions, coalesce appropriately. Avoid collect() on large data, avoid Python UDFs when native functions work, and cache only when reused and when memory allows. For file output, avoid too many small files and choose columnar formats like Parquet with compression. If tuning resources is needed, adjust executor memory, cores, shuffle partitions, and memory overhead, but do not rely only on bigger clusters. Fixing data shape and query plan is usually more durable.
Likely follow-ups
How do you detect skew?
Why can too many partitions be bad?
When should you cache a DataFrame?
Framework — Columnar storage, schema, compression, predicate pushdown
Parquet is a columnar file format, which is efficient for analytics because queries often read only a subset of columns. It supports compression, encodes data types, stores schema, and enables predicate pushdown and column pruning in many engines. CSV is row-oriented plain text. It is human-readable and portable, but it lacks strong types, can be larger, slower to scan, more error-prone with delimiters/escaping, and cannot skip irrelevant columns efficiently. For data lakes and warehouses, Parquet usually reduces storage cost and improves query performance. CSV can still be useful for simple exports, small files, or interoperability, but it is not ideal as the main analytics storage format.
Likely follow-ups
What is predicate pushdown?
When might CSV still be acceptable?
How do small files affect query performance?
Data Quality and Observability
Data quality questions test whether you can catch bad data before it breaks dashboards, models, product features, finance reports, or customer-facing systems.
Framework — Freshness -> completeness -> uniqueness -> validity -> reconciliation
For a revenue pipeline, I would add checks for freshness, row count, null primary keys, duplicate transaction IDs, valid statuses, valid currencies, non-negative amounts where applicable, accepted date ranges, and referential integrity between orders, payments, refunds, and customers. I would also add reconciliation checks: total revenue compared with source payment systems, finance reports, or previous pipeline version. Monitor day-over-day and week-over-week changes with thresholds, but account for seasonality and known events. Business rules matter: canceled orders should not count as revenue, refunds should reduce net revenue, test transactions should be excluded, and currency conversion should use the correct rate. Alerts should be actionable, routed to an owner, and include context: failed check, affected table, severity, downstream dependencies, and suggested runbook. Too many noisy alerts will be ignored.
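A few of these checks can be sketched as simple batch assertions; the row shape, check names, and the refund rule are illustrative assumptions:

```python
# Sketch of lightweight quality checks on a revenue batch.
def run_checks(rows):
    failures = []
    ids = [r["transaction_id"] for r in rows]
    if any(i is None for i in ids):
        failures.append("null transaction_id")
    if len(ids) != len(set(ids)):
        failures.append("duplicate transaction_id")
    if any(r["amount"] < 0 and r["status"] != "refund" for r in rows):
        failures.append("negative amount on non-refund")
    if not rows:
        failures.append("empty batch")
    return failures

batch = [
    {"transaction_id": "t1", "amount": 30.0, "status": "settled"},
    {"transaction_id": "t1", "amount": 30.0, "status": "settled"},  # dupe
    {"transaction_id": "t2", "amount": -5.0, "status": "settled"},  # invalid
]
print(run_checks(batch))
# -> ['duplicate transaction_id', 'negative amount on non-refund']
```

In production these checks would run inside the orchestrator (or a tool like dbt tests or Great Expectations), block publishing on hard failures, and attach the failing check names to the alert so the on-call engineer gets the context the answer describes.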
Likely follow-ups
How do you choose alert thresholds?
What if a check fails during month-end close?
How would you prevent duplicate revenue?
Framework — Assess impact -> compare versions -> inspect lineage -> mitigate -> root cause
First assess impact: which dashboard, metric, users, date range, and decisions are affected. Determine whether the number is wrong or whether the deployment corrected a previous issue. Compare the new output to the previous version by table, partition, row count, distinct count, metric totals, and key segments. Use lineage to identify upstream changes. Inspect code changes, schema changes, filters, joins, deduplication, and date logic. If the metric is wrong and business users are affected, mitigate quickly: roll back the transformation, restore previous table version, disable the dashboard, or add a warning. Then perform root-cause analysis and backfill corrected data if needed. Communication matters. Notify stakeholders with what changed, what is affected, confidence level, expected fix time, and whether past decisions need review.
Likely follow-ups
How do you know whether the old or new number is correct?
What rollback options would you want?
How would lineage help?
Data Engineering System Design
Data engineering system design interviews evaluate architecture judgment: sources, ingestion, storage, transformation, serving, quality, lineage, governance, and cost.
Framework — Instrumentation -> ingestion -> storage -> processing -> modeling -> serving -> governance
Start with requirements: event volume, latency, consumers, retention, schema evolution, privacy, and reliability. Product analytics events may power dashboards, experiments, personalization, and machine learning features. Clients emit events through an SDK with event_name, user_id, session_id, timestamp, properties, app version, platform, and anonymous ID. Events go to an ingestion API or event collector, then into a durable stream such as Kafka, Kinesis, or Pub/Sub. Raw events are written to object storage for replay and to a warehouse or lakehouse for analytics. Processing includes schema validation, deduplication, bot/internal filtering, sessionization, identity resolution, and transformation into modeled tables such as fact_events, fact_sessions, dim_users, and product-specific marts. Batch and streaming paths may coexist depending on freshness needs. Quality and governance: event contracts, schema registry, PII handling, consent enforcement, lineage, ownership, freshness checks, volume anomaly alerts, and documentation. Serving layers include BI dashboards, experimentation analysis, reverse ETL, and feature stores. Tradeoffs include real-time versus batch cost, strict versus flexible schema, raw event retention, identity complexity, and how much logic belongs in ingestion versus transformation.
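Sessionization, one of the processing steps listed, is worth being able to sketch; the gap-based algorithm below assumes a single user's time-sorted events and a 30-minute timeout, which is a common but illustrative choice:

```python
from datetime import datetime, timedelta

# Sketch of gap-based sessionization for one user's ordered events.
SESSION_GAP = timedelta(minutes=30)

def sessionize(timestamps):
    """Assign a session number to each timestamp (assumed sorted)."""
    sessions, current = [], 0
    for i, ts in enumerate(timestamps):
        if i > 0 and ts - timestamps[i - 1] > SESSION_GAP:
            current += 1  # gap exceeded -> start a new session
        sessions.append(current)
    return sessions

ts = [datetime(2024, 1, 1, 9, 0),
      datetime(2024, 1, 1, 9, 10),   # 10-minute gap: same session
      datetime(2024, 1, 1, 10, 0),   # 50-minute gap: new session
      datetime(2024, 1, 1, 10, 5)]
print(sessionize(ts))  # -> [0, 0, 1, 1]
```

In SQL the same logic is typically a LAG window over event time followed by a running SUM of new-session flags; late-arriving events are the complication, since they can merge or split previously computed sessions and force a lookback reprocess.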
Likely follow-ups
How would you handle schema evolution?
How would you deduplicate events?
What tables would analysts query?
Framework — Offline/online consistency -> feature definitions -> freshness -> serving -> monitoring
A feature store should provide reusable, reliable features for training and serving. Requirements include feature definitions, point-in-time correctness, offline training data, online low-latency serving, freshness, access control, and monitoring. Offline features can be stored in a warehouse or lakehouse, partitioned by date and entity. Online features can be served from a low-latency key-value store. A feature registry stores metadata: feature name, owner, entity key, transformation logic, freshness SLA, source tables, and description. Point-in-time correctness is critical. Training data must use feature values available at prediction time, not future values. For online/offline consistency, the same transformation logic should be reused when possible, or thoroughly tested if separate pipelines exist. Monitoring should cover feature freshness, null rates, distribution drift, online serving latency, training-serving skew, and downstream model impact. Governance includes PII controls, lineage, deprecation, and ownership.
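Point-in-time correctness can be sketched as an as-of lookup: for each prediction timestamp, take the latest feature value computed at or before that time. The integer timestamps and data shapes below are illustrative:

```python
import bisect

# Sketch of a point-in-time (as-of) feature lookup.
def as_of(feature_history, ts):
    """feature_history: time-sorted list of (computed_at, value) pairs.
    Returns the value in effect at ts, or None if none exists yet."""
    times = [t for t, _ in feature_history]
    i = bisect.bisect_right(times, ts) - 1
    return feature_history[i][1] if i >= 0 else None

history = [(1, 0.2), (5, 0.7), (9, 0.9)]  # feature computed at t=1, 5, 9
print([as_of(history, t) for t in (0, 1, 6, 10)])
# -> [None, 0.2, 0.7, 0.9]
```

Using bisect_right with ts=6 correctly returns the value computed at t=5, never the future t=9 value; joining training labels against feature values from the future is the leakage that point-in-time correctness prevents.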
Likely follow-ups
What is point-in-time correctness?
How do you avoid training-serving skew?
What features need online serving?
Behavioral and Collaboration Questions
Behavioral data engineering interviews focus on production ownership, incident response, cross-functional communication, prioritization, and building systems other teams can trust.
Framework — Impact -> triage -> mitigation -> root cause -> prevention
Choose an incident with real impact: missing data, wrong metrics, delayed pipeline, duplicate records, broken dashboard, model feature issue, or downstream product impact. Start with who was affected and why it mattered. Then explain your triage process: logs, orchestration status, source freshness, recent deployments, schema changes, row counts, partitions, lineage, and downstream dependencies. Describe how you mitigated the issue: rerun, rollback, patch, backfill, or stakeholder warning. The strongest answer includes prevention. Did you add data quality checks, alerts, retries, schema contracts, idempotent writes, runbooks, lineage, or better deployment controls? Avoid blaming upstream teams. Show ownership of the system and communication. Close with what you learned about reliability and how your process changed afterward.
Likely follow-ups
How did you communicate impact?
What would have caught the issue earlier?
How did you prevent recurrence?
Framework — Clarify -> reproduce -> trace lineage -> resolve -> document
First clarify exactly what looks wrong: metric, table, date range, dashboard, filter, expected value, and business reason. “The data is wrong” needs to become a reproducible issue. Then compare the analyst query or dashboard to source tables and model definitions. Check grain, filters, joins, timezones, freshness, metric definitions, and recent changes. Use lineage to trace upstream dependencies. If the data is wrong, fix it and communicate impact. If the data is correct but the definition differs from expectation, align on metric definition and documentation. If uncertainty remains, explain what is known, what is being checked, and when you will follow up. The relationship matters. Analysts are data platform customers, and trust is built through fast, clear, and evidence-based responses.
Likely follow-ups
What if the analyst is using the table incorrectly?
How do you prevent repeated confusion?
What documentation helps most?
Framework — Impact -> reliability -> urgency -> dependencies -> leverage
I prioritize based on business impact, reliability risk, urgency, downstream dependencies, and leverage. A broken revenue pipeline or compliance issue usually outranks a nice-to-have dashboard model. A platform improvement may outrank a one-off request if it unblocks many teams. I also distinguish between urgent incidents, strategic platform work, stakeholder requests, technical debt, and operational toil. If the team only handles tickets, reliability and platform quality decay. If the team only builds platform abstractions, business teams may be blocked. A strong prioritization process includes ownership, SLAs, severity levels, roadmap alignment, and transparent tradeoffs. I would communicate what is being done, what is deferred, why, and what risk that creates.
Likely follow-ups
How do you justify technical debt work?
What if leadership wants a dashboard urgently?
How do you balance incidents and roadmap?