Prepare for data engineer interviews with SQL, data modeling, ETL pipelines, batch and streaming systems, Spark, orchestration, warehouses, data quality, and system design questions.
Data engineer interviews test whether you can build reliable data systems: ingest raw data, model it clearly, transform it efficiently, enforce quality, orchestrate pipelines, and make data available to analysts, scientists, products, and business teams.
4-6
Typical interview rounds
45-75 min
Technical round length
6+
Core DE skill areas
5-8 wks
Recommended prep window
What data engineer interviewers are evaluating
—
SQL depth: can you write accurate, performant queries across joins, windows, deduplication, and incremental logic?
—
Data modeling: can you design tables, schemas, grains, partitions, and slowly changing dimensions for real use cases?
—
Pipeline design: can you build batch and streaming workflows that are reliable, observable, and recoverable?
—
Distributed systems judgment: can you reason about Spark, partitioning, shuffles, state, latency, and throughput?
—
Data quality: can you prevent bad data from silently reaching dashboards, models, and production features?
—
Platform thinking: can you balance cost, performance, freshness, governance, lineage, and developer experience?
—
Production ownership: can you debug incidents, backfill safely, manage schema changes, and communicate impact?
Data engineering is reliability engineering for data
Strong data engineers do more than move data from A to B. They make data trustworthy, understandable, discoverable, scalable, and recoverable when something breaks.
Data Engineer Interview Process
Data engineering loops usually include SQL, Python or coding, data modeling, pipeline design, system design, Spark or distributed processing, debugging, and behavioral interviews.
Typical data engineer interview stages
1
Recruiter screen: confirms role fit, stack, compensation, location, and domain experience.
2
Hiring manager screen: covers pipeline ownership, data modeling experience, production incidents, and collaboration with analytics or product teams.
Technical rounds: system design, Spark/Flink/Kafka, state, partitioning, scaling, and operational incidents.
Strong signal: builds trusted models that analysts and business teams can use confidently; designs resilient systems that handle high volume, low latency, and failures.
Common mistake: modeling tables without clear grain, ownership, or metric definitions; designing streaming architecture without exactly-once, replay, state, or monitoring considerations.
Know which data engineering role you are targeting
A warehouse-focused analytics engineer interview is different from a streaming platform data engineer interview. Tailor preparation to the stack and responsibilities in the job description.
SQL and Data Transformation Questions
Data engineer SQL questions are usually about correctness and production-readiness: grain, deduplication, incremental logic, window functions, partitions, and performance.
Framework — Window function with deterministic ordering
Use a window function to assign a row number within each event_id partition. Order by the most reliable freshness column descending, such as ingestion_timestamp or updated_at. If ties are possible, add a deterministic tie-breaker like source_sequence_number or raw_file_name.
The query pattern is: select all rows, calculate row_number() over(partition by event_id order by ingestion_timestamp desc, source_sequence_number desc), then filter where row_number = 1.
The important interview detail is determinism. If two records have the same timestamp and no tie-breaker, the result may be unstable between runs. Also clarify whether the latest event means latest ingestion time, latest event time, or latest source update time.
For production, I would monitor duplicate rate, late-arriving events, null event_id count, and whether deduplication changes downstream metrics unexpectedly.
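A minimal sketch of this pattern as a SQL string, assuming a raw_events table with event_id, ingestion_timestamp, and source_sequence_number columns; it could be submitted through spark.sql or any warehouse client.

```python
# Hypothetical dedup query: keep the freshest record per event_id,
# with source_sequence_number as a deterministic tie-breaker.
DEDUP_SQL = """
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY ingestion_timestamp DESC, source_sequence_number DESC
        ) AS rn
    FROM raw_events
)
SELECT *
FROM ranked
WHERE rn = 1   -- drop the rn helper column downstream or list columns explicitly
"""

# deduped = spark.sql(DEDUP_SQL)  # or run through your warehouse client of choice
```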
Likely follow-ups
What if event_id is missing?
How would you handle late-arriving corrections?
How would this change for incremental processing?
Framework — Filter active events -> normalize date -> group by date and user
First define active user. Usually, not every event should count. Page views, bot events, background refreshes, and internal users may need exclusion. Choose meaningful events such as session_start, purchase, message_sent, or feature_used depending on the product.
Create a transformation that filters to eligible events, converts event timestamps to the business timezone, groups by activity_date and user_id, and outputs one row per active user per day. If the final table is an aggregate, count distinct users by day. If the table is a serving model, store date, user_id, platform, country, and first_active_timestamp.
Important production details include timezone, late-arriving events, bot/test filtering, deleted users, duplicate events, and partitioning by activity_date. For incremental builds, process recent partitions again to capture late events, not only the current day.
A strong answer also includes validation: compare DAU against source event counts, monitor day-over-day changes, and check segment-level consistency.
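A hedged PySpark sketch of the filter, normalize, and group shape described above; the table name, event list, timezone, and bot/internal flags are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ACTIVE_EVENTS = ["session_start", "purchase", "message_sent", "feature_used"]  # assumed event list

events = spark.table("raw_events")  # assumed source table

daily_active = (
    events
    .filter(F.col("event_name").isin(ACTIVE_EVENTS))
    .filter(~F.col("is_bot") & ~F.col("is_internal_user"))  # assumed boolean flags
    .withColumn(
        "activity_date",
        F.to_date(F.from_utc_timestamp("event_timestamp", "America/Los_Angeles")),  # business timezone assumed
    )
    .select("activity_date", "user_id")
    .dropDuplicates(["activity_date", "user_id"])  # one row per active user per day
)

# daily_active.write.mode("overwrite").partitionBy("activity_date").saveAsTable("analytics.daily_active_users")
```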
Likely follow-ups
Which events should count as active?
How would you handle late events?
Should DAU be a table or a metric query?
Framework — Scope -> query plan -> data growth -> joins -> partitions -> recent changes
First check whether the issue is query-specific, warehouse-wide, or infrastructure-related. If many queries slowed down, it may be cluster capacity, warehouse load, or service degradation. If one query slowed down, inspect the query and data.
Look at recent changes: table size growth, schema changes, new joins, changed filters, missing partition pruning, stale statistics, data skew, exploding joins, or added distinct/order by operations. Review the query plan if available.
Common causes include joining at the wrong grain, missing date filters, many-to-many joins, scanning all partitions, inefficient window functions, large shuffles, or using functions on partition columns that prevent pruning.
Fixes may include filtering earlier, pre-aggregating, adding partition filters, clustering/sorting, materializing intermediate models, updating statistics, reducing selected columns, or correcting join logic. I would also add monitoring for runtime, bytes scanned, row counts, and cost.
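To make the partition-pruning point concrete, a small illustration assuming a table partitioned by event_date; the exact date-function syntax varies by warehouse.

```python
# Assumed: analytics.events is partitioned by event_date.

# Wrapping the partition column in a function often prevents partition pruning,
# so the engine scans every partition:
BAD_SQL = """
SELECT COUNT(*) FROM analytics.events
WHERE DATE_TRUNC('month', event_date) = DATE '2024-01-01'
"""

# Filtering directly on the partition column lets the engine prune to one month of partitions:
GOOD_SQL = """
SELECT COUNT(*) FROM analytics.events
WHERE event_date >= DATE '2024-01-01'
  AND event_date <  DATE '2024-02-01'
"""
```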
Likely follow-ups
What is partition pruning?
How can a join explode row counts?
How would you prevent this regression?
Data Modeling and Warehousing Questions
Data modeling interviews test whether you can design schemas that are clear, scalable, cost-aware, and useful for analytics, machine learning, and operational reporting.
Data modeling concepts to know
Fact table
A table containing measurable business events or transactions, such as orders, payments, sessions, or shipments.
Dimension table
A table containing descriptive context for facts, such as customers, products, stores, campaigns, or dates.
Grain
The level represented by each row. Declaring grain is essential before building facts, dimensions, metrics, or joins.
Slowly changing dimension
A dimension design that tracks how attributes change over time, such as customer segment, address, or account owner.
Start with core business processes: browsing, cart, orders, payments, shipments, returns, refunds, inventory, and marketing attribution. Each process may become a fact table or event model depending on usage.
Important fact tables include fact_orders at one row per order, fact_order_items at one row per product per order, fact_payments, fact_refunds, fact_shipments, and fact_inventory_snapshot. Dimensions include dim_customer, dim_product, dim_date, dim_store or warehouse, dim_channel, and dim_campaign.
Declare grain explicitly. Revenue by product category should usually come from order item grain, not order grain. Customer lifecycle metrics may use customer or subscription grain. Inventory analysis may need daily snapshot grain.
Model considerations: slowly changing customer or product attributes, refunds and cancellations, discounts at order versus item level, currency, taxes, shipping, guest checkout, and late-arriving data. I would define source-of-truth metrics and build semantic models or marts for finance, marketing, product, and operations.
Validation includes reconciling revenue to finance, order counts to transactional systems, refund totals to payments, and inventory to warehouse systems.
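A minimal DDL sketch showing how the item-grain fact might declare its grain; column names, types, and the partitioning or clustering clause are assumptions and depend on the warehouse.

```python
FACT_ORDER_ITEMS_DDL = """
-- Grain: one row per product per order (order_id, order_item_id).
CREATE TABLE IF NOT EXISTS mart.fact_order_items (
    order_item_id     STRING,          -- natural key for the line item
    order_id          STRING,
    customer_key      BIGINT,          -- surrogate key into dim_customer
    product_key       BIGINT,          -- surrogate key into dim_product
    order_date_key    INT,             -- yyyymmdd key into dim_date
    quantity          INT,
    gross_item_amount DECIMAL(18, 2),
    item_discount     DECIMAL(18, 2),
    item_tax          DECIMAL(18, 2)
)
-- partition or cluster by order date using whatever clause the warehouse supports
"""
```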
Likely follow-ups
What is the grain of fact_order_items?
How would you model refunds?
How would you handle product category changes over time?
Framework — Type 1 versus Type 2 based on history needs
The right approach depends on whether historical accuracy matters. Type 1 overwrites old values, which is simple and useful when only the latest value matters. Type 2 preserves history by creating a new dimension row for each attribute change, usually with effective_start_date, effective_end_date, current_flag, and a surrogate key.
Example: if a customer changes region from West to East, Type 1 would update the customer row to East. Historical revenue reports would now show old revenue under East, which may be wrong. Type 2 would preserve the old West row and create a new East row, allowing facts to join to the correct version at event time.
Type 2 adds complexity: surrogate keys, date range joins, late-arriving facts, current versus historical reporting, and larger dimension tables. Use it when reporting must reflect attributes as they were at the time of the event.
A strong answer explains the business need before choosing the SCD type.
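A sketch of the date-range join that attaches each fact row to the dimension version in effect at event time, assuming effective_start_date and effective_end_date columns as described above.

```python
SCD2_JOIN_SQL = """
-- Join each order to the customer attributes that were current when the order happened.
SELECT f.order_id,
       f.order_timestamp,
       d.region
FROM fact_orders AS f
JOIN dim_customer AS d
  ON  f.customer_id = d.customer_id
  AND f.order_timestamp >= d.effective_start_date
  AND f.order_timestamp <  COALESCE(d.effective_end_date, TIMESTAMP '9999-12-31 00:00:00')
"""
```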
Likely follow-ups
When is Type 1 acceptable?
How do facts join to a Type 2 dimension?
What happens with late-arriving facts?
Framework — Governance and flexibility versus query simplicity and performance
A star schema separates facts and dimensions. It is useful when you need reusable dimensions, consistent metrics, governance, clear grain, and flexible analysis. It reduces duplication and makes changes to dimensions easier to manage.
A wide denormalized table can be useful for specific high-usage analytics, machine learning features, or dashboards where query simplicity and performance matter. It can reduce join complexity for consumers but may duplicate data and create governance risk if many wide tables define metrics differently.
In practice, I would often maintain clean canonical facts and dimensions, then publish curated marts or wide tables for specific use cases. The warehouse foundation should remain trustworthy, while downstream models optimize for user needs.
The decision depends on scale, consumers, BI tool behavior, cost, latency, and how often business definitions change.
Likely follow-ups
Which is better for BI dashboards?
How do you avoid metric inconsistency?
What is a semantic layer?
Pipeline Design and Orchestration
Pipeline interviews test whether you can design workflows that are idempotent, observable, recoverable, and appropriate for freshness and cost requirements.
A reliable pipeline design flow
1
Clarify source systems, data volume, freshness needs, downstream consumers, and failure tolerance.
2
Choose ingestion pattern: batch extract, CDC, event stream, API pull, file drop, or managed connector.
3
Define landing zone, raw storage, schema handling, and replay strategy.
4
Transform data through clear layers: raw, cleaned/staged, modeled, and serving marts.
5
Make pipelines idempotent so reruns do not duplicate or corrupt data.
6
Add data quality checks, lineage, logging, alerts, metrics, and owner information.
7
Plan for backfills, schema evolution, late data, retries, partial failures, and cost controls.
Framework — Extract -> land raw -> transform -> validate -> publish -> monitor
Clarify requirements first: data freshness, order volume, acceptable lag, source database load constraints, downstream consumers, and whether historical corrections occur.
A simple design: extract orders incrementally using updated_at or CDC, land raw data in object storage or a raw warehouse table, transform into staged tables, then publish modeled facts such as fact_orders and fact_order_items. Orchestrate with Airflow, Dagster, dbt Cloud, or a managed workflow tool.
The pipeline should be idempotent. If it reruns for a date or batch, it should replace or merge data deterministically rather than append duplicates. Use watermarks carefully, and reprocess a lookback window to capture late updates.
Validation checks: row counts, null primary keys, duplicate order IDs, revenue totals, status distribution, freshness, and reconciliation to source. Monitoring should alert on failures, unusual volume changes, high duplicate rates, or delayed data. Document ownership and downstream dependencies.
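A sketch of the incremental extract plus idempotent publish steps, assuming an updated_at watermark with a lookback window and a MERGE-capable warehouse; table and column names are illustrative.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 3  # assumed lookback to capture late-arriving updates

def build_extract_sql(run_date: date) -> str:
    """Incremental extract on updated_at with a lookback window (illustrative)."""
    since = run_date - timedelta(days=LOOKBACK_DAYS)
    return f"""
        SELECT order_id, customer_id, status, order_total, updated_at
        FROM source_db.orders
        WHERE updated_at >= DATE '{since.isoformat()}'
    """

# Publish step: merge by primary key so reruns replace rather than duplicate.
MERGE_SQL = """
MERGE INTO mart.fact_orders AS t
USING staging.orders_increment AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
    status      = s.status,
    order_total = s.order_total,
    updated_at  = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, status, order_total, updated_at)
VALUES (s.order_id, s.customer_id, s.status, s.order_total, s.updated_at)
"""
```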
Likely follow-ups
Why is updated_at not always enough?
How would you avoid loading duplicate orders?
How would you backfill two years of history?
Framework — Same input and rerun produce same output
An idempotent pipeline can be run multiple times with the same input and produce the same correct output without duplicates or unintended side effects. This matters because pipelines fail, get retried, and require backfills.
Example: a non-idempotent pipeline appends yesterday's orders every time it runs. If it is retried after a failure, revenue doubles. An idempotent version would delete and replace the target partition, merge by primary key, or write to a temporary table then atomically swap.
Idempotency requires clear keys, partition strategy, deterministic transformations, and careful handling of side effects. It is one of the most important production data engineering concepts because recoverability depends on it.
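A minimal PySpark sketch of the delete-and-replace-partition approach, assuming a target table that already exists and is partitioned by order_date.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Overwrite only the partitions present in the incoming data, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

run_date = "2024-06-01"  # the partition being (re)processed

daily_orders = (
    spark.table("staging.orders")          # assumed staging table
    .filter(F.col("order_date") == run_date)
)

# Rerunning this job for the same date replaces the partition instead of appending duplicates.
daily_orders.write.insertInto("mart.orders_daily", overwrite=True)  # target assumed to exist
```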
Likely follow-ups
How do you make an append-only pipeline idempotent?
First detect schema changes automatically through schema registry, metadata checks, contract tests, or ingestion validation. Then classify the change: additive column, removed column, renamed column, type change, nullability change, or semantic change.
Additive nullable columns are usually safe. Removed columns, renamed fields, and type changes can break downstream transformations and dashboards. Semantic changes are especially dangerous because the schema may look valid while meaning changes.
Protect the pipeline with raw data preservation, compatibility checks, alerts, and contracts for critical sources. For breaking changes, coordinate with source owners, update transformations, backfill if needed, and communicate downstream impact.
A mature answer includes data contracts, versioning, lineage, tests, and clear ownership so schema changes do not silently corrupt analytics.
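A lightweight illustration of classifying a detected schema change in Python; real systems typically rely on a schema registry or contract tooling, and the expected column list here is an assumption.

```python
# Classify a source schema change as additive, removed, or type-changed.
EXPECTED_COLUMNS = {
    "order_id": "string",
    "customer_id": "string",
    "order_total": "decimal(18,2)",
    "updated_at": "timestamp",
}

def classify_schema_change(actual_columns: dict[str, str]) -> dict[str, set[str]]:
    expected = set(EXPECTED_COLUMNS)
    actual = set(actual_columns)
    return {
        "added": actual - expected,        # usually safe if nullable
        "removed": expected - actual,      # breaking for downstream models
        "type_changed": {
            col for col in expected & actual
            if actual_columns[col] != EXPECTED_COLUMNS[col]  # breaking or risky
        },
    }

# Typical policy: let additive columns flow to the raw layer,
# but fail the pipeline or alert the owner on removed or type-changed columns.
```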
Likely follow-ups
What is a data contract?
How do you handle a column type change?
What if the source team gives no warning?
Batch and Streaming Systems
Batch and streaming questions test whether you understand latency, throughput, ordering, state, windowing, replay, and the operational tradeoffs between simpler and more real-time architectures.
Batch Processing
Best for: Periodic reporting, historical backfills, large transformations, cost-efficient analytics
Main tradeoff: Higher latency but simpler operations and easier replay
Watch out for: Late data, partial loads, long runtimes, backfill cost
Streaming Processing
Best for: Real-time features, alerts, fraud detection, live dashboards, low-latency decisions
Main tradeoff: Lower latency but more complexity around state, ordering, and failure handling
Watch out for: Duplicates, out-of-order events, checkpointing, state growth, exactly-once semantics
Framework — Decision latency versus complexity
Choose streaming when low latency creates meaningful value. Examples include fraud detection, operational alerts, real-time personalization, inventory updates, live experimentation metrics, and user-facing product features that need fresh events.
Batch is often better for daily reporting, historical analytics, finance reconciliation, and transformations where latency of hours is acceptable. Batch is simpler to operate, easier to replay, and often cheaper.
The decision should consider freshness requirements, event volume, ordering needs, stateful processing, fault tolerance, team expertise, cost, and downstream consumers. Streaming is not automatically better. It adds complexity around duplicates, late events, checkpoints, schema evolution, and monitoring.
A strong answer says: use the simplest architecture that meets the business requirement. If the business only needs daily metrics, do not build a streaming system for prestige.
Start with requirements: event volume, latency target, fraud decision type, tolerance for false positives, and whether decisions block transactions or trigger review.
A possible design: payment events are published to Kafka or Kinesis. A stream processor consumes events, validates schema, deduplicates by event_id, enriches with user, device, merchant, and historical velocity features, then applies rules or a model. High-risk events trigger a decision service, manual review queue, or step-up authentication. Raw events and decisions are written to durable storage for audit and model training.
Key engineering concerns: duplicate events, out-of-order events, feature freshness, state store size, exactly-once or effectively-once processing, backpressure, dead-letter queues, replay, latency monitoring, and model/rule versioning.
Data quality and governance matter because fraud decisions affect customers. Log reason codes, monitor false positive rates, and support rollback if a rule or model behaves badly.
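A hedged Structured Streaming sketch of the ingestion and deduplication steps, assuming a Kafka topic named payment_events and a simplified event schema; enrichment and scoring are left as comments.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

PAYMENT_SCHEMA = StructType([            # simplified, assumed event shape
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

payments = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker address
    .option("subscribe", "payment_events")              # assumed topic name
    .load()
)

deduped = (
    payments
    .select(F.from_json(F.col("value").cast("string"), PAYMENT_SCHEMA).alias("e"))
    .select("e.*")
    .withWatermark("event_time", "10 minutes")          # bound state for late/out-of-order events
    # Including the event-time column keeps dedup state bounded by the watermark
    # (Spark 3.5+ also offers dropDuplicatesWithinWatermark).
    .dropDuplicates(["event_id", "event_time"])
    # enrichment with velocity features and rule/model scoring would go here
)

query = (
    deduped.writeStream
    .format("parquet")                                   # durable audit copy; a decision sink would differ
    .option("path", "s3://bucket/fraud/scored/")         # assumed path
    .option("checkpointLocation", "s3://bucket/fraud/checkpoints/")
    .start()
)
```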
Likely follow-ups
How would you compute velocity features?
How would you replay events safely?
What is the dead-letter queue for?
Spark and Distributed Processing Questions
Spark and distributed processing questions test whether you understand partitioning, shuffles, skew, caching, joins, file formats, and why jobs fail or become expensive.
Framework — Data redistribution across partitions
A shuffle happens when Spark needs to redistribute data across partitions, usually for operations like groupBy, join, distinct, orderBy, and repartition. It is expensive because data must move across the network, be serialized/deserialized, written to disk, and coordinated across executors.
Shuffles often dominate job runtime and can cause failures when data is skewed or intermediate data is too large. A single hot key can send too much data to one partition, creating stragglers or out-of-memory errors.
Ways to reduce shuffle cost include filtering early, selecting only needed columns, pre-aggregating, using broadcast joins for small tables, partitioning by join keys, salting skewed keys, avoiding unnecessary distinct/orderBy, and tuning partition counts.
A strong answer connects the concept to practical debugging: inspect the Spark UI, stages, shuffle read/write, skew, spill, task duration, and executor memory.
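A short PySpark illustration of avoiding a shuffle of the large side with a broadcast join; table names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.table("analytics.fact_events")        # large fact table (assumed)
dim_country = spark.table("analytics.dim_country")   # small dimension (assumed)

# Broadcasting the small dimension avoids shuffling the large fact table:
# each executor receives a full copy of dim_country and joins locally.
joined = events.join(F.broadcast(dim_country), on="country_code", how="left")

# Compare plans with joined.explain() and the shuffle read/write metrics in the Spark UI.
```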
Likely follow-ups
What causes data skew?
When would you use a broadcast join?
How do you debug a slow Spark job?
Framework — Identify stage -> reduce data -> fix skew -> tune partitions -> avoid bad operations
First identify where the failure occurs using Spark UI or logs: which stage, operation, partition, and executor. OOM can come from skew, large joins, collect operations, oversized partitions, inefficient UDFs, or caching too much data.
Then reduce data early: filter rows, select only necessary columns, push predicates, and avoid reading unnecessary partitions. Check joins: if one table is small, use broadcast join. If a key is skewed, use salting or split heavy keys. If partitions are too large, increase partition count; if too many tiny partitions, coalesce appropriately.
Avoid collect() on large data, avoid Python UDFs when native functions work, and cache only when reused and when memory allows. For file output, avoid too many small files and choose columnar formats like Parquet with compression.
If tuning resources is needed, adjust executor memory, cores, shuffle partitions, and memory overhead, but do not rely only on bigger clusters. Fixing data shape and query plan is usually more durable.
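A sketch of the key-salting technique mentioned above, assuming the skew is on user_id and the other table is too large to broadcast.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
NUM_SALTS = 16  # assumed; size this to the observed skew

big = spark.table("logs.requests")   # skewed on user_id (assumed)
small = spark.table("dims.users")    # too large to broadcast (assumed)

# Spread hot keys across NUM_SALTS partitions by salting the join key on the big side,
# and replicate the small side so every salt value still finds its match.
big_salted = big.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
small_replicated = small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
)

joined = big_salted.join(small_replicated, on=["user_id", "salt"]).drop("salt")
```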
Parquet is a columnar file format, which is efficient for analytics because queries often read only a subset of columns. It supports compression, encodes data types, stores schema, and enables predicate pushdown and column pruning in many engines.
CSV is row-oriented plain text. It is human-readable and portable, but it lacks strong types, can be larger, slower to scan, more error-prone with delimiters/escaping, and cannot skip irrelevant columns efficiently.
For data lakes and warehouses, Parquet usually reduces storage cost and improves query performance. CSV can still be useful for simple exports, small files, or interoperability, but it is not ideal as the main analytics storage format.
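A small PySpark illustration of converting a CSV export to compressed Parquet and reading only the needed columns; the paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", True).csv("s3://bucket/exports/orders.csv")  # placeholder path

# Columnar storage with compression; analytics engines can read just the needed columns.
df.write.mode("overwrite").option("compression", "snappy").parquet("s3://bucket/lake/orders/")

# Selecting two columns scans a fraction of the bytes that a full CSV scan would require.
spark.read.parquet("s3://bucket/lake/orders/").select("order_id", "order_total").show(5)
```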
Likely follow-ups
What is predicate pushdown?
When might CSV still be acceptable?
How do small files affect query performance?
Data Quality and Observability
Data quality questions test whether you can catch bad data before it breaks dashboards, models, product features, finance reports, or customer-facing systems.
For a revenue pipeline, I would add checks for freshness, row count, null primary keys, duplicate transaction IDs, valid statuses, valid currencies, non-negative amounts where applicable, accepted date ranges, and referential integrity between orders, payments, refunds, and customers.
I would also add reconciliation checks: total revenue compared with source payment systems, finance reports, or previous pipeline version. Monitor day-over-day and week-over-week changes with thresholds, but account for seasonality and known events.
Business rules matter: canceled orders should not count as revenue, refunds should reduce net revenue, test transactions should be excluded, and currency conversion should use the correct rate.
Alerts should be actionable, routed to an owner, and include context: failed check, affected table, severity, downstream dependencies, and suggested runbook. Too many noisy alerts will be ignored.
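A minimal sketch of a few of these checks in PySpark, assuming a mart.fact_payments table; production setups usually express the same checks in a testing framework with ownership and alert routing.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
revenue = spark.table("mart.fact_payments")   # assumed table name

checks = {
    "null_transaction_id": revenue.filter(F.col("transaction_id").isNull()).count(),
    "duplicate_transaction_id": (
        revenue.count() - revenue.select("transaction_id").distinct().count()
    ),
    "negative_amount": revenue.filter(F.col("amount") < 0).count(),
    "invalid_currency": revenue.filter(~F.col("currency").isin("USD", "EUR", "GBP")).count(),
}

failed = {name: n for name, n in checks.items() if n > 0}
if failed:
    # In production this would alert an owner with table, severity, and a runbook link.
    raise ValueError(f"Revenue quality checks failed: {failed}")
```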
First assess impact: which dashboard, metric, users, date range, and decisions are affected. Determine whether the number is wrong or whether the deployment corrected a previous issue.
Compare the new output to the previous version by table, partition, row count, distinct count, metric totals, and key segments. Use lineage to identify upstream changes. Inspect code changes, schema changes, filters, joins, deduplication, and date logic.
If the metric is wrong and business users are affected, mitigate quickly: roll back the transformation, restore previous table version, disable the dashboard, or add a warning. Then perform root-cause analysis and backfill corrected data if needed.
Communication matters. Notify stakeholders with what changed, what is affected, confidence level, expected fix time, and whether past decisions need review.
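A sketch of the partition-level comparison, assuming the previous output was preserved as a backup table; table and column names are illustrative.

```python
# Compare the new model output to the previous version by day.
DIFF_SQL = """
SELECT COALESCE(n.order_date, o.order_date) AS order_date,
       o.revenue                            AS old_revenue,
       n.revenue                            AS new_revenue,
       n.revenue - o.revenue                AS diff
FROM (SELECT order_date, SUM(amount) AS revenue
      FROM mart.fact_payments GROUP BY order_date) AS n
FULL OUTER JOIN
     (SELECT order_date, SUM(amount) AS revenue
      FROM mart.fact_payments_backup GROUP BY order_date) AS o
  ON n.order_date = o.order_date
ORDER BY ABS(n.revenue - o.revenue) DESC
"""
```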
Likely follow-ups
How do you know whether the old or new number is correct?
What rollback options would you want?
How would lineage help?
Worked Example
Production data incident checklist
The executive revenue dashboard is missing yesterday's data two hours before a leadership meeting.
1
Triage
Check pipeline status, source freshness, warehouse table partitions, failed tasks, and whether the issue affects only revenue or all dashboards.
2
Mitigate
If source data exists, rerun the affected partition. If not, annotate the dashboard and provide the latest available number with caveat.
3
Communicate
Tell stakeholders what is missing, what decisions are affected, expected resolution time, and whether numbers may change.
4
Prevent
Add freshness alerts, upstream source checks, SLA monitoring, and a runbook for revenue pipeline failures.
Result
The response protects trust by combining technical recovery with clear stakeholder communication.
Data Engineering System Design
Data engineering system design interviews evaluate architecture judgment: sources, ingestion, storage, transformation, serving, quality, lineage, governance, and cost.
Start with requirements: event volume, latency, consumers, retention, schema evolution, privacy, and reliability. Product analytics events may power dashboards, experiments, personalization, and machine learning features.
Clients emit events through an SDK with event_name, user_id, session_id, timestamp, properties, app version, platform, and anonymous ID. Events go to an ingestion API or event collector, then into a durable stream like Kafka/Kinesis/Pub/Sub. Raw events are written to object storage for replay and to a warehouse or lakehouse for analytics.
Processing includes schema validation, deduplication, bot/internal filtering, sessionization, identity resolution, and transformation into modeled tables such as fact_events, fact_sessions, dim_users, and product-specific marts. Batch and streaming paths may coexist depending on freshness needs.
Quality and governance: event contracts, schema registry, PII handling, consent enforcement, lineage, ownership, freshness checks, volume anomaly alerts, and documentation. Serving layers include BI dashboards, experimentation analysis, reverse ETL, and feature stores.
Tradeoffs include real-time versus batch cost, strict versus flexible schema, raw event retention, identity complexity, and how much logic belongs in ingestion versus transformation.
A feature store should provide reusable, reliable features for training and serving. Requirements include feature definitions, point-in-time correctness, offline training data, online low-latency serving, freshness, access control, and monitoring.
Offline features can be stored in a warehouse or lakehouse, partitioned by date and entity. Online features can be served from a low-latency key-value store. A feature registry stores metadata: feature name, owner, entity key, transformation logic, freshness SLA, source tables, and description.
Point-in-time correctness is critical. Training data must use feature values available at prediction time, not future values. For online/offline consistency, the same transformation logic should be reused when possible, or thoroughly tested if separate pipelines exist.
Monitoring should cover feature freshness, null rates, distribution drift, online serving latency, training-serving skew, and downstream model impact. Governance includes PII controls, lineage, deprecation, and ownership.
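A sketch of a point-in-time join as a SQL string, assuming labels and user_features tables keyed by user_id with timestamps; the row-number filter keeps only the latest feature value available before each prediction time.

```python
POINT_IN_TIME_SQL = """
WITH joined AS (
    SELECT l.user_id,
           l.prediction_ts,
           l.label,
           f.feature_value,
           ROW_NUMBER() OVER (
               PARTITION BY l.user_id, l.prediction_ts
               ORDER BY f.feature_ts DESC
           ) AS rn
    FROM labels AS l
    JOIN user_features AS f
      ON  f.user_id = l.user_id
      AND f.feature_ts <= l.prediction_ts   -- never look into the future
)
SELECT user_id, prediction_ts, label, feature_value
FROM joined
WHERE rn = 1
"""
```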
Likely follow-ups
What is point-in-time correctness?
How do you avoid training-serving skew?
What features need online serving?
Behavioral and Collaboration Questions
Behavioral data engineering interviews focus on production ownership, incident response, cross-functional communication, prioritization, and building systems other teams can trust.
Choose an incident with real impact: missing data, wrong metrics, delayed pipeline, duplicate records, broken dashboard, model feature issue, or downstream product impact. Start with who was affected and why it mattered.
Then explain your triage process: logs, orchestration status, source freshness, recent deployments, schema changes, row counts, partitions, lineage, and downstream dependencies. Describe how you mitigated the issue: rerun, rollback, patch, backfill, or stakeholder warning.
The strongest answer includes prevention. Did you add data quality checks, alerts, retries, schema contracts, idempotent writes, runbooks, lineage, or better deployment controls? Avoid blaming upstream teams. Show ownership of the system and communication.
Close with what you learned about reliability and how your process changed afterward.
First clarify exactly what looks wrong: metric, table, date range, dashboard, filter, expected value, and business reason. “The data is wrong” needs to become a reproducible issue.
Then compare the analyst query or dashboard to source tables and model definitions. Check grain, filters, joins, timezones, freshness, metric definitions, and recent changes. Use lineage to trace upstream dependencies.
If the data is wrong, fix it and communicate impact. If the data is correct but the definition differs from expectation, align on metric definition and documentation. If uncertainty remains, explain what is known, what is being checked, and when you will follow up.
The relationship matters. Analysts are data platform customers, and trust is built through fast, clear, and evidence-based responses.
Likely follow-ups
What if the analyst is using the table incorrectly?
I prioritize based on business impact, reliability risk, urgency, downstream dependencies, and leverage. A broken revenue pipeline or compliance issue usually outranks a nice-to-have dashboard model. A platform improvement may outrank a one-off request if it unblocks many teams.
I also distinguish between urgent incidents, strategic platform work, stakeholder requests, technical debt, and operational toil. If the team only handles tickets, reliability and platform quality decay. If the team only builds platform abstractions, business teams may be blocked.
A strong prioritization process includes ownership, SLAs, severity levels, roadmap alignment, and transparent tradeoffs. I would communicate what is being done, what is deferred, why, and what risk that creates.
Likely follow-ups
How do you justify technical debt work?
What if leadership wants a dashboard urgently?
How do you balance incidents and roadmap?
Data Engineer Prep Strategy
Data engineer prep should combine SQL, data modeling, Python, pipeline design, distributed systems, cloud warehouse concepts, and production incident storytelling.
6-week data engineer interview prep plan
1
Week 1: SQL depth. Practice joins, windows, deduplication, incremental models, cohorts, partitions, and performance debugging.
2
Week 2: data modeling. Practice facts, dimensions, grains, SCDs, event models, warehouse marts, and metric definitions.
3
Week 3: pipelines. Practice ETL/ELT design, idempotency, orchestration, backfills, late data, schema changes, and data quality checks.
4
Week 4: distributed processing. Practice Spark partitioning, shuffles, skew, broadcast joins, file formats, and batch versus streaming tradeoffs.
5
Week 5: system design. Practice product analytics platforms, CDC ingestion, feature stores, streaming fraud pipelines, and warehouse architecture.
6
Week 6: mock interviews and stories. Prepare pipeline incident, data quality, stakeholder, prioritization, and platform improvement examples.
Role-specific prep by data engineering track
—
Analytics engineering: focus on dbt, SQL models, semantic layers, metric definitions, BI reliability, and stakeholder workflows.
—
Platform data engineering: focus on ingestion systems, orchestration, lineage, governance, access controls, and developer experience.
—
Streaming data engineering: focus on Kafka, Flink/Spark streaming, state, windows, duplicates, ordering, replay, and latency.
—
ML data engineering: focus on feature pipelines, feature stores, point-in-time correctness, training data, and monitoring.
—
Cloud warehouse engineering: focus on Snowflake, BigQuery, Databricks, partitioning, clustering, cost controls, and workload management.
Do not talk only about tools
Interviewers care less that you know Airflow, dbt, Spark, or Kafka by name and more that you understand reliability, data correctness, tradeoffs, and failure modes.
Key Takeaway
Great data engineer interview answers combine SQL correctness, data modeling clarity, pipeline reliability, distributed systems judgment, and production ownership. The best candidates show they can build data systems other teams can trust.
Practice these questions live
Interview Pilot gives you real-time Interview Copilot answer suggestions during live interviews, so you can respond clearly when Data Engineer questions come up.