Prepare for data engineer interviews with SQL, data modeling, ETL pipelines, batch and streaming systems, Spark, orchestration, warehouses, data quality, and system design questions.
Data engineer interviews test whether you can build reliable data systems: ingest raw data, model it clearly, transform it efficiently, enforce quality, orchestrate pipelines, and make data available to analysts, scientists, products, and business teams.
4-6
Typical interview rounds
45-75 min
Technical round length
6+
Core DE skill areas
5-8 wks
Recommended prep window
What data engineer interviewers are evaluating
—
SQL depth: can you write accurate, performant queries across joins, windows, deduplication, and incremental logic?
—
Data modeling: can you design tables, schemas, grains, partitions, and slowly changing dimensions for real use cases?
—
Pipeline design: can you build batch and streaming workflows that are reliable, observable, and recoverable?
—
Distributed systems judgment: can you reason about Spark, partitioning, shuffles, state, latency, and throughput?
—
Data quality: can you prevent bad data from silently reaching dashboards, models, and production features?
—
Platform thinking: can you balance cost, performance, freshness, governance, lineage, and developer experience?
—
Production ownership: can you debug incidents, backfill safely, manage schema changes, and communicate impact?
Data engineering is reliability engineering for data
Strong data engineers do more than move data from A to B. They make data trustworthy, understandable, discoverable, scalable, and recoverable when something breaks.
Data Engineer Interview Process
Data engineering loops usually include SQL, Python or coding, data modeling, pipeline design, system design, Spark or distributed processing, debugging, and behavioral interviews.
Typical data engineer interview stages
1
Recruiter screen: confirms role fit, stack, compensation, location, and domain experience.
2
Hiring manager screen: covers pipeline ownership, data modeling experience, production incidents, and collaboration with analytics or product teams.
Technical rounds: system design, Spark/Flink/Kafka, state, partitioning, scaling, and operational incidents.
Strong signal: builds trusted models that analysts and business teams can use confidently; designs resilient systems that handle high volume, low latency, and failures.
Common mistake: modeling tables without clear grain, ownership, or metric definitions; designing streaming architecture without exactly-once, replay, state, or monitoring considerations.
Know which data engineering role you are targeting
A warehouse-focused analytics engineer interview is different from a streaming platform data engineer interview. Tailor preparation to the stack and responsibilities in the job description.
SQL and Data Transformation Questions
Data engineer SQL questions are usually about correctness and production-readiness: grain, deduplication, incremental logic, window functions, partitions, and performance.
Framework — Window function with deterministic ordering
Use a window function to assign a row number within each event_id partition. Order by the most reliable freshness column descending, such as ingestion_timestamp or updated_at. If ties are possible, add a deterministic tie-breaker like source_sequence_number or raw_file_name.
The query pattern is: select all rows, calculate row_number() over(partition by event_id order by ingestion_timestamp desc, source_sequence_number desc), then filter where row_number = 1.
The important interview detail is determinism. If two records have the same timestamp and no tie-breaker, the result may be unstable between runs. Also clarify whether the latest event means latest ingestion time, latest event time, or latest source update time.
For production, I would monitor duplicate rate, late-arriving events, null event_id count, and whether deduplication changes downstream metrics unexpectedly.
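A minimal sketch of this pattern as a SQL string, assuming a raw_events table with event_id, ingestion_timestamp, and source_sequence_number columns; it could be submitted through spark.sql or any warehouse client.

```python
# Hypothetical dedup query: keep the freshest record per event_id,
# with source_sequence_number as a deterministic tie-breaker.
DEDUP_SQL = """
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY ingestion_timestamp DESC, source_sequence_number DESC
        ) AS rn
    FROM raw_events
)
SELECT *
FROM ranked
WHERE rn = 1   -- drop the rn helper column downstream or list columns explicitly
"""

# deduped = spark.sql(DEDUP_SQL)  # or run through your warehouse client of choice
```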
Likely follow-ups
What if event_id is missing?
How would you handle late-arriving corrections?
How would this change for incremental processing?
Framework — Filter active events -> normalize date -> group by date and user
First define active user. Usually, not every event should count. Page views, bot events, background refreshes, and internal users may need exclusion. Choose meaningful events such as session_start, purchase, message_sent, or feature_used depending on the product.
Create a transformation that filters to eligible events, converts event timestamps to the business timezone, groups by activity_date and user_id, and outputs one row per active user per day. If the final table is an aggregate, count distinct users by day. If the table is a serving model, store date, user_id, platform, country, and first_active_timestamp.
Important production details include timezone, late-arriving events, bot/test filtering, deleted users, duplicate events, and partitioning by activity_date. For incremental builds, process recent partitions again to capture late events, not only the current day.
A strong answer also includes validation: compare DAU against source event counts, monitor day-over-day changes, and check segment-level consistency.
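A hedged PySpark sketch of the filter, normalize, and group shape described above; the table name, event list, timezone, and bot/internal flags are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ACTIVE_EVENTS = ["session_start", "purchase", "message_sent", "feature_used"]  # assumed event list

events = spark.table("raw_events")  # assumed source table

daily_active = (
    events
    .filter(F.col("event_name").isin(ACTIVE_EVENTS))
    .filter(~F.col("is_bot") & ~F.col("is_internal_user"))  # assumed boolean flags
    .withColumn(
        "activity_date",
        F.to_date(F.from_utc_timestamp("event_timestamp", "America/Los_Angeles")),  # business timezone assumed
    )
    .select("activity_date", "user_id")
    .dropDuplicates(["activity_date", "user_id"])  # one row per active user per day
)

# daily_active.write.mode("overwrite").partitionBy("activity_date").saveAsTable("analytics.daily_active_users")
```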
Likely follow-ups
Which events should count as active?
How would you handle late events?
Should DAU be a table or a metric query?
Framework — Scope -> query plan -> data growth -> joins -> partitions -> recent changes
First check whether the issue is query-specific, warehouse-wide, or infrastructure-related. If many queries slowed down, it may be cluster capacity, warehouse load, or service degradation. If one query slowed down, inspect the query and data.
Look at recent changes: table size growth, schema changes, new joins, changed filters, missing partition pruning, stale statistics, data skew, exploding joins, or added distinct/order by operations. Review the query plan if available.
Common causes include joining at the wrong grain, missing date filters, many-to-many joins, scanning all partitions, inefficient window functions, large shuffles, or using functions on partition columns that prevent pruning.
Fixes may include filtering earlier, pre-aggregating, adding partition filters, clustering/sorting, materializing intermediate models, updating statistics, reducing selected columns, or correcting join logic. I would also add monitoring for runtime, bytes scanned, row counts, and cost.
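To make the partition-pruning point concrete, a small illustration assuming a table partitioned by event_date; the exact date-function syntax varies by warehouse.

```python
# Assumed: analytics.events is partitioned by event_date.

# Wrapping the partition column in a function often prevents partition pruning,
# so the engine scans every partition:
BAD_SQL = """
SELECT COUNT(*) FROM analytics.events
WHERE DATE_TRUNC('month', event_date) = DATE '2024-01-01'
"""

# Filtering directly on the partition column lets the engine prune to one month of partitions:
GOOD_SQL = """
SELECT COUNT(*) FROM analytics.events
WHERE event_date >= DATE '2024-01-01'
  AND event_date <  DATE '2024-02-01'
"""
```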
Likely follow-ups
What is partition pruning?
How can a join explode row counts?
How would you prevent this regression?
Data Modeling and Warehousing Questions
Data modeling interviews test whether you can design schemas that are clear, scalable, cost-aware, and useful for analytics, machine learning, and operational reporting.
Data modeling concepts to know
Fact table
A table containing measurable business events or transactions, such as orders, payments, sessions, or shipments.
Dimension table
A table containing descriptive context for facts, such as customers, products, stores, campaigns, or dates.
Grain
The level represented by each row. Declaring grain is essential before building facts, dimensions, metrics, or joins.
Slowly changing dimension
A dimension design that tracks how attributes change over time, such as customer segment, address, or account owner.
Start with core business processes: browsing, cart, orders, payments, shipments, returns, refunds, inventory, and marketing attribution. Each process may become a fact table or event model depending on usage.
Important fact tables include fact_orders at one row per order, fact_order_items at one row per product per order, fact_payments, fact_refunds, fact_shipments, and fact_inventory_snapshot. Dimensions include dim_customer, dim_product, dim_date, dim_store or warehouse, dim_channel, and dim_campaign.
Declare grain explicitly. Revenue by product category should usually come from order item grain, not order grain. Customer lifecycle metrics may use customer or subscription grain. Inventory analysis may need daily snapshot grain.
Model considerations: slowly changing customer or product attributes, refunds and cancellations, discounts at order versus item level, currency, taxes, shipping, guest checkout, and late-arriving data. I would define source-of-truth metrics and build semantic models or marts for finance, marketing, product, and operations.
Validation includes reconciling revenue to finance, order counts to transactional systems, refund totals to payments, and inventory to warehouse systems.
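A minimal DDL sketch showing how the item-grain fact might declare its grain; column names, types, and the partitioning or clustering clause are assumptions and depend on the warehouse.

```python
FACT_ORDER_ITEMS_DDL = """
-- Grain: one row per product per order (order_id, order_item_id).
CREATE TABLE IF NOT EXISTS mart.fact_order_items (
    order_item_id     STRING,          -- natural key for the line item
    order_id          STRING,
    customer_key      BIGINT,          -- surrogate key into dim_customer
    product_key       BIGINT,          -- surrogate key into dim_product
    order_date_key    INT,             -- yyyymmdd key into dim_date
    quantity          INT,
    gross_item_amount DECIMAL(18, 2),
    item_discount     DECIMAL(18, 2),
    item_tax          DECIMAL(18, 2)
)
-- partition or cluster by order date using whatever clause the warehouse supports
"""
```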
Likely follow-ups
What is the grain of fact_order_items?
How would you model refunds?
How would you handle product category changes over time?
Framework — Type 1 versus Type 2 based on history needs
The right approach depends on whether historical accuracy matters. Type 1 overwrites old values, which is simple and useful when only the latest value matters. Type 2 preserves history by creating a new dimension row for each attribute change, usually with effective_start_date, effective_end_date, current_flag, and a surrogate key.
Example: if a customer changes region from West to East, Type 1 would update the customer row to East. Historical revenue reports would now show old revenue under East, which may be wrong. Type 2 would preserve the old West row and create a new East row, allowing facts to join to the correct version at event time.
Type 2 adds complexity: surrogate keys, date range joins, late-arriving facts, current versus historical reporting, and larger dimension tables. Use it when reporting must reflect attributes as they were at the time of the event.
A strong answer explains the business need before choosing the SCD type.
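A sketch of the date-range join that attaches each fact row to the dimension version in effect at event time, assuming effective_start_date and effective_end_date columns as described above.

```python
SCD2_JOIN_SQL = """
-- Join each order to the customer attributes that were current when the order happened.
SELECT f.order_id,
       f.order_timestamp,
       d.region
FROM fact_orders AS f
JOIN dim_customer AS d
  ON  f.customer_id = d.customer_id
  AND f.order_timestamp >= d.effective_start_date
  AND f.order_timestamp <  COALESCE(d.effective_end_date, TIMESTAMP '9999-12-31 00:00:00')
"""
```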
Likely follow-ups
When is Type 1 acceptable?
How do facts join to a Type 2 dimension?
What happens with late-arriving facts?
Framework — Governance and flexibility versus query simplicity and performance
A star schema separates facts and dimensions. It is useful when you need reusable dimensions, consistent metrics, governance, clear grain, and flexible analysis. It reduces duplication and makes changes to dimensions easier to manage.
A wide denormalized table can be useful for specific high-usage analytics, machine learning features, or dashboards where query simplicity and performance matter. It can reduce join complexity for consumers but may duplicate data and create governance risk if many wide tables define metrics differently.
In practice, I would often maintain clean canonical facts and dimensions, then publish curated marts or wide tables for specific use cases. The warehouse foundation should remain trustworthy, while downstream models optimize for user needs.
The decision depends on scale, consumers, BI tool behavior, cost, latency, and how often business definitions change.
Likely follow-ups
Which is better for BI dashboards?
How do you avoid metric inconsistency?
What is a semantic layer?
Pipeline Design and Orchestration
Pipeline interviews test whether you can design workflows that are idempotent, observable, recoverable, and appropriate for freshness and cost requirements.
A reliable pipeline design flow
1
Clarify source systems, data volume, freshness needs, downstream consumers, and failure tolerance.
2
Choose ingestion pattern: batch extract, CDC, event stream, API pull, file drop, or managed connector.
3
Define landing zone, raw storage, schema handling, and replay strategy.
4
Transform data through clear layers: raw, cleaned/staged, modeled, and serving marts.
5
Make pipelines idempotent so reruns do not duplicate or corrupt data.
6
Add data quality checks, lineage, logging, alerts, metrics, and owner information.
7
Plan for backfills, schema evolution, late data, retries, partial failures, and cost controls.
Framework — Extract -> land raw -> transform -> validate -> publish -> monitor
Clarify requirements first: data freshness, order volume, acceptable lag, source database load constraints, downstream consumers, and whether historical corrections occur.
A simple design: extract orders incrementally using updated_at or CDC, land raw data in object storage or a raw warehouse table, transform into staged tables, then publish modeled facts such as fact_orders and fact_order_items. Orchestrate with Airflow, Dagster, dbt Cloud, or a managed workflow tool.
The pipeline should be idempotent. If it reruns for a date or batch, it should replace or merge data deterministically rather than append duplicates. Use watermarks carefully, and reprocess a lookback window to capture late updates.
Validation checks: row counts, null primary keys, duplicate order IDs, revenue totals, status distribution, freshness, and reconciliation to source. Monitoring should alert on failures, unusual volume changes, high duplicate rates, or delayed data. Document ownership and downstream dependencies.
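A sketch of the incremental extract plus idempotent publish steps, assuming an updated_at watermark with a lookback window and a MERGE-capable warehouse; table and column names are illustrative.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 3  # assumed lookback to capture late-arriving updates

def build_extract_sql(run_date: date) -> str:
    """Incremental extract on updated_at with a lookback window (illustrative)."""
    since = run_date - timedelta(days=LOOKBACK_DAYS)
    return f"""
        SELECT order_id, customer_id, status, order_total, updated_at
        FROM source_db.orders
        WHERE updated_at >= DATE '{since.isoformat()}'
    """

# Publish step: merge by primary key so reruns replace rather than duplicate.
MERGE_SQL = """
MERGE INTO mart.fact_orders AS t
USING staging.orders_increment AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
    status      = s.status,
    order_total = s.order_total,
    updated_at  = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, status, order_total, updated_at)
VALUES (s.order_id, s.customer_id, s.status, s.order_total, s.updated_at)
"""
```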
Likely follow-ups
Why is updated_at not always enough?
How would you avoid loading duplicate orders?
How would you backfill two years of history?
Framework — Same input and rerun produce same output
An idempotent pipeline can be run multiple times with the same input and produce the same correct output without duplicates or unintended side effects. This matters because pipelines fail, get retried, and require backfills.
Example: a non-idempotent pipeline appends yesterday's orders every time it runs. If it is retried after a failure, revenue doubles. An idempotent version would delete and replace the target partition, merge by primary key, or write to a temporary table then atomically swap.
Idempotency requires clear keys, partition strategy, deterministic transformations, and careful handling of side effects. It is one of the most important production data engineering concepts because recoverability depends on it.
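A minimal PySpark sketch of the delete-and-replace-partition approach, assuming a target table that already exists and is partitioned by order_date.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Overwrite only the partitions present in the incoming data, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

run_date = "2024-06-01"  # the partition being (re)processed

daily_orders = (
    spark.table("staging.orders")          # assumed staging table
    .filter(F.col("order_date") == run_date)
)

# Rerunning this job for the same date replaces the partition instead of appending duplicates.
daily_orders.write.insertInto("mart.orders_daily", overwrite=True)  # target assumed to exist
```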
Likely follow-ups
How do you make an append-only pipeline idempotent?
First detect schema changes automatically through schema registry, metadata checks, contract tests, or ingestion validation. Then classify the change: additive column, removed column, renamed column, type change, nullability change, or semantic change.
Additive nullable columns are usually safe. Removed columns, renamed fields, and type changes can break downstream transformations and dashboards. Semantic changes are especially dangerous because the schema may look valid while meaning changes.
Protect the pipeline with raw data preservation, compatibility checks, alerts, and contracts for critical sources. For breaking changes, coordinate with source owners, update transformations, backfill if needed, and communicate downstream impact.
A mature answer includes data contracts, versioning, lineage, tests, and clear ownership so schema changes do not silently corrupt analytics.
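A lightweight illustration of classifying a detected schema change in Python; real systems typically rely on a schema registry or contract tooling, and the expected column list here is an assumption.

```python
# Classify a source schema change as additive, removed, or type-changed.
EXPECTED_COLUMNS = {
    "order_id": "string",
    "customer_id": "string",
    "order_total": "decimal(18,2)",
    "updated_at": "timestamp",
}

def classify_schema_change(actual_columns: dict[str, str]) -> dict[str, set[str]]:
    expected = set(EXPECTED_COLUMNS)
    actual = set(actual_columns)
    return {
        "added": actual - expected,        # usually safe if nullable
        "removed": expected - actual,      # breaking for downstream models
        "type_changed": {
            col for col in expected & actual
            if actual_columns[col] != EXPECTED_COLUMNS[col]  # breaking or risky
        },
    }

# Typical policy: let additive columns flow to the raw layer,
# but fail the pipeline or alert the owner on removed or type-changed columns.
```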
Likely follow-ups
What is a data contract?
How do you handle a column type change?
What if the source team gives no warning?
Batch and Streaming Systems
Batch and streaming questions test whether you understand latency, throughput, ordering, state, windowing, replay, and the operational tradeoffs between simpler and more real-time architectures.
Batch Processing
Best for: Periodic reporting, historical backfills, large transformations, cost-efficient analytics
Main tradeoff: Higher latency but simpler operations and easier replay
Watch out for: Late data, partial loads, long runtimes, backfill cost
Streaming Processing
Best for: Real-time features, alerts, fraud detection, live dashboards, low-latency decisions
Main tradeoff: Lower latency but more complexity around state, ordering, and failure handling
Watch out for: Duplicates, out-of-order events, checkpointing, state growth, exactly-once semantics
Framework — Decision latency versus complexity
Choose streaming when low latency creates meaningful value. Examples include fraud detection, operational alerts, real-time personalization, inventory updates, live experimentation metrics, and user-facing product features that need fresh events.
Batch is often better for daily reporting, historical analytics, finance reconciliation, and transformations where latency of hours is acceptable. Batch is simpler to operate, easier to replay, and often cheaper.
The decision should consider freshness requirements, event volume, ordering needs, stateful processing, fault tolerance, team expertise, cost, and downstream consumers. Streaming is not automatically better. It adds complexity around duplicates, late events, checkpoints, schema evolution, and monitoring.
A strong answer says: use the simplest architecture that meets the business requirement. If the business only needs daily metrics, do not build a streaming system for prestige.
Start with requirements: event volume, latency target, fraud decision type, tolerance for false positives, and whether decisions block transactions or trigger review.
A possible design: payment events are published to Kafka or Kinesis. A stream processor consumes events, validates schema, deduplicates by event_id, enriches with user, device, merchant, and historical velocity features, then applies rules or a model. High-risk events trigger a decision service, manual review queue, or step-up authentication. Raw events and decisions are written to durable storage for audit and model training.
Key engineering concerns: duplicate events, out-of-order events, feature freshness, state store size, exactly-once or effectively-once processing, backpressure, dead-letter queues, replay, latency monitoring, and model/rule versioning.
Data quality and governance matter because fraud decisions affect customers. Log reason codes, monitor false positive rates, and support rollback if a rule or model behaves badly.
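A hedged Structured Streaming sketch of the ingestion and deduplication steps, assuming a Kafka topic named payment_events and a simplified event schema; enrichment and scoring are left as comments.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

PAYMENT_SCHEMA = StructType([            # simplified, assumed event shape
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

payments = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker address
    .option("subscribe", "payment_events")              # assumed topic name
    .load()
)

deduped = (
    payments
    .select(F.from_json(F.col("value").cast("string"), PAYMENT_SCHEMA).alias("e"))
    .select("e.*")
    .withWatermark("event_time", "10 minutes")          # bound state for late/out-of-order events
    # Including the event-time column keeps dedup state bounded by the watermark
    # (Spark 3.5+ also offers dropDuplicatesWithinWatermark).
    .dropDuplicates(["event_id", "event_time"])
    # enrichment with velocity features and rule/model scoring would go here
)

query = (
    deduped.writeStream
    .format("parquet")                                   # durable audit copy; a decision sink would differ
    .option("path", "s3://bucket/fraud/scored/")         # assumed path
    .option("checkpointLocation", "s3://bucket/fraud/checkpoints/")
    .start()
)
```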
Likely follow-ups
How would you compute velocity features?
How would you replay events safely?
What is the dead-letter queue for?
Spark and Distributed Processing Questions
Spark and distributed processing questions test whether you understand partitioning, shuffles, skew, caching, joins, file formats, and why jobs fail or become expensive.
Framework — Data redistribution across partitions
A shuffle happens when Spark needs to redistribute data across partitions, usually for operations like groupBy, join, distinct, orderBy, and repartition. It is expensive because data must move across the network, be serialized/deserialized, written to disk, and coordinated across executors.
Shuffles often dominate job runtime and can cause failures when data is skewed or intermediate data is too large. A single hot key can send too much data to one partition, creating stragglers or out-of-memory errors.
Ways to reduce shuffle cost include filtering early, selecting only needed columns, pre-aggregating, using broadcast joins for small tables, partitioning by join keys, salting skewed keys, avoiding unnecessary distinct/orderBy, and tuning partition counts.
A strong answer connects the concept to practical debugging: inspect the Spark UI, stages, shuffle read/write, skew, spill, task duration, and executor memory.
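A short PySpark illustration of avoiding a shuffle of the large side with a broadcast join; table names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.table("analytics.fact_events")        # large fact table (assumed)
dim_country = spark.table("analytics.dim_country")   # small dimension (assumed)

# Broadcasting the small dimension avoids shuffling the large fact table:
# each executor receives a full copy of dim_country and joins locally.
joined = events.join(F.broadcast(dim_country), on="country_code", how="left")

# Compare plans with joined.explain() and the shuffle read/write metrics in the Spark UI.
```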
Likely follow-ups
What causes data skew?
When would you use a broadcast join?
How do you debug a slow Spark job?
Framework — Identify stage -> reduce data -> fix skew -> tune partitions -> avoid bad operations
First identify where the failure occurs using Spark UI or logs: which stage, operation, partition, and executor. OOM can come from skew, large joins, collect operations, oversized partitions, inefficient UDFs, or caching too much data.
Then reduce data early: filter rows, select only necessary columns, push predicates, and avoid reading unnecessary partitions. Check joins: if one table is small, use broadcast join. If a key is skewed, use salting or split heavy keys. If partitions are too large, increase partition count; if too many tiny partitions, coalesce appropriately.
Avoid collect() on large data, avoid Python UDFs when native functions work, and cache only when reused and when memory allows. For file output, avoid too many small files and choose columnar formats like Parquet with compression.
If tuning resources is needed, adjust executor memory, cores, shuffle partitions, and memory overhead, but do not rely only on bigger clusters. Fixing data shape and query plan is usually more durable.
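A sketch of the key-salting technique mentioned above, assuming the skew is on user_id and the other table is too large to broadcast.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
NUM_SALTS = 16  # assumed; size this to the observed skew

big = spark.table("logs.requests")   # skewed on user_id (assumed)
small = spark.table("dims.users")    # too large to broadcast (assumed)

# Spread hot keys across NUM_SALTS partitions by salting the join key on the big side,
# and replicate the small side so every salt value still finds its match.
big_salted = big.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
small_replicated = small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
)

joined = big_salted.join(small_replicated, on=["user_id", "salt"]).drop("salt")
```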
Parquet is a columnar file format, which is efficient for analytics because queries often read only a subset of columns. It supports compression, encodes data types, stores schema, and enables predicate pushdown and column pruning in many engines.
CSV is row-oriented plain text. It is human-readable and portable, but it lacks strong types, can be larger, slower to scan, more error-prone with delimiters/escaping, and cannot skip irrelevant columns efficiently.
For data lakes and warehouses, Parquet usually reduces storage cost and improves query performance. CSV can still be useful for simple exports, small files, or interoperability, but it is not ideal as the main analytics storage format.
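A small PySpark illustration of converting a CSV export to compressed Parquet and reading only the needed columns; the paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", True).csv("s3://bucket/exports/orders.csv")  # placeholder path

# Columnar storage with compression; analytics engines can read just the needed columns.
df.write.mode("overwrite").option("compression", "snappy").parquet("s3://bucket/lake/orders/")

# Selecting two columns scans a fraction of the bytes that a full CSV scan would require.
spark.read.parquet("s3://bucket/lake/orders/").select("order_id", "order_total").show(5)
```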
Likely follow-ups
What is predicate pushdown?
When might CSV still be acceptable?
How do small files affect query performance?
Data Quality and Observability
Data quality questions test whether you can catch bad data before it breaks dashboards, models, product features, finance reports, or customer-facing systems.
For a revenue pipeline, I would add checks for freshness, row count, null primary keys, duplicate transaction IDs, valid statuses, valid currencies, non-negative amounts where applicable, accepted date ranges, and referential integrity between orders, payments, refunds, and customers.
I would also add reconciliation checks: total revenue compared with source payment systems, finance reports, or previous pipeline version. Monitor day-over-day and week-over-week changes with thresholds, but account for seasonality and known events.
Business rules matter: canceled orders should not count as revenue, refunds should reduce net revenue, test transactions should be excluded, and currency conversion should use the correct rate.
Alerts should be actionable, routed to an owner, and include context: failed check, affected table, severity, downstream dependencies, and suggested runbook. Too many noisy alerts will be ignored.
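A minimal sketch of a few of these checks in PySpark, assuming a mart.fact_payments table; production setups usually express the same checks in a testing framework with ownership and alert routing.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
revenue = spark.table("mart.fact_payments")   # assumed table name

checks = {
    "null_transaction_id": revenue.filter(F.col("transaction_id").isNull()).count(),
    "duplicate_transaction_id": (
        revenue.count() - revenue.select("transaction_id").distinct().count()
    ),
    "negative_amount": revenue.filter(F.col("amount") < 0).count(),
    "invalid_currency": revenue.filter(~F.col("currency").isin("USD", "EUR", "GBP")).count(),
}

failed = {name: n for name, n in checks.items() if n > 0}
if failed:
    # In production this would alert an owner with table, severity, and a runbook link.
    raise ValueError(f"Revenue quality checks failed: {failed}")
```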
First assess impact: which dashboard, metric, users, date range, and decisions are affected. Determine whether the number is wrong or whether the deployment corrected a previous issue.
Compare the new output to the previous version by table, partition, row count, distinct count, metric totals, and key segments. Use lineage to identify upstream changes. Inspect code changes, schema changes, filters, joins, deduplication, and date logic.
If the metric is wrong and business users are affected, mitigate quickly: roll back the transformation, restore previous table version, disable the dashboard, or add a warning. Then perform root-cause analysis and backfill corrected data if needed.
Communication matters. Notify stakeholders with what changed, what is affected, confidence level, expected fix time, and whether past decisions need review.
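A sketch of the partition-level comparison, assuming the previous output was preserved as a backup table; table and column names are illustrative.

```python
# Compare the new model output to the previous version by day.
DIFF_SQL = """
SELECT COALESCE(n.order_date, o.order_date) AS order_date,
       o.revenue                            AS old_revenue,
       n.revenue                            AS new_revenue,
       n.revenue - o.revenue                AS diff
FROM (SELECT order_date, SUM(amount) AS revenue
      FROM mart.fact_payments GROUP BY order_date) AS n
FULL OUTER JOIN
     (SELECT order_date, SUM(amount) AS revenue
      FROM mart.fact_payments_backup GROUP BY order_date) AS o
  ON n.order_date = o.order_date
ORDER BY ABS(n.revenue - o.revenue) DESC
"""
```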
Likely follow-ups
How do you know whether the old or new number is correct?
What rollback options would you want?
How would lineage help?
Worked Example
Production data incident checklist
The executive revenue dashboard is missing yesterday's data two hours before a leadership meeting.
1
Triage
Check pipeline status, source freshness, warehouse table partitions, failed tasks, and whether the issue affects only revenue or all dashboards.
2
Mitigate
If source data exists, rerun the affected partition. If not, annotate the dashboard and provide the latest available number with caveat.
3
Communicate
Tell stakeholders what is missing, what decisions are affected, expected resolution time, and whether numbers may change.
4
Prevent
Add freshness alerts, upstream source checks, SLA monitoring, and a runbook for revenue pipeline failures.
Result
The response protects trust by combining technical recovery with clear stakeholder communication.
Data Engineering System Design
Data engineering system design interviews evaluate architecture judgment: sources, ingestion, storage, transformation, serving, quality, lineage, governance, and cost.
Start with requirements: event volume, latency, consumers, retention, schema evolution, privacy, and reliability. Product analytics events may power dashboards, experiments, personalization, and machine learning features.
Clients emit events through an SDK with event_name, user_id, session_id, timestamp, properties, app version, platform, and anonymous ID. Events go to an ingestion API or event collector, then into a durable stream like Kafka/Kinesis/Pub/Sub. Raw events are written to object storage for replay and to a warehouse or lakehouse for analytics.
Processing includes schema validation, deduplication, bot/internal filtering, sessionization, identity resolution, and transformation into modeled tables such as fact_events, fact_sessions, dim_users, and product-specific marts. Batch and streaming paths may coexist depending on freshness needs.
Quality and governance: event contracts, schema registry, PII handling, consent enforcement, lineage, ownership, freshness checks, volume anomaly alerts, and documentation. Serving layers include BI dashboards, experimentation analysis, reverse ETL, and feature stores.
Tradeoffs include real-time versus batch cost, strict versus flexible schema, raw event retention, identity complexity, and how much logic belongs in ingestion versus transformation.
A feature store should provide reusable, reliable features for training and serving. Requirements include feature definitions, point-in-time correctness, offline training data, online low-latency serving, freshness, access control, and monitoring.
Offline features can be stored in a warehouse or lakehouse, partitioned by date and entity. Online features can be served from a low-latency key-value store. A feature registry stores metadata: feature name, owner, entity key, transformation logic, freshness SLA, source tables, and description.
Point-in-time correctness is critical. Training data must use feature values available at prediction time, not future values. For online/offline consistency, the same transformation logic should be reused when possible, or thoroughly tested if separate pipelines exist.
Monitoring should cover feature freshness, null rates, distribution drift, online serving latency, training-serving skew, and downstream model impact. Governance includes PII controls, lineage, deprecation, and ownership.
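A sketch of a point-in-time join as a SQL string, assuming labels and user_features tables keyed by user_id with timestamps; the row-number filter keeps only the latest feature value available before each prediction time.

```python
POINT_IN_TIME_SQL = """
WITH joined AS (
    SELECT l.user_id,
           l.prediction_ts,
           l.label,
           f.feature_value,
           ROW_NUMBER() OVER (
               PARTITION BY l.user_id, l.prediction_ts
               ORDER BY f.feature_ts DESC
           ) AS rn
    FROM labels AS l
    JOIN user_features AS f
      ON  f.user_id = l.user_id
      AND f.feature_ts <= l.prediction_ts   -- never look into the future
)
SELECT user_id, prediction_ts, label, feature_value
FROM joined
WHERE rn = 1
"""
```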
Likely follow-ups
What is point-in-time correctness?
How do you avoid training-serving skew?
What features need online serving?
Behavioral and Collaboration Questions
Behavioral data engineering interviews focus on production ownership, incident response, cross-functional communication, prioritization, and building systems other teams can trust.
Choose an incident with real impact: missing data, wrong metrics, delayed pipeline, duplicate records, broken dashboard, model feature issue, or downstream product impact. Start with who was affected and why it mattered.
Then explain your triage process: logs, orchestration status, source freshness, recent deployments, schema changes, row counts, partitions, lineage, and downstream dependencies. Describe how you mitigated the issue: rerun, rollback, patch, backfill, or stakeholder warning.
The strongest answer includes prevention. Did you add data quality checks, alerts, retries, schema contracts, idempotent writes, runbooks, lineage, or better deployment controls? Avoid blaming upstream teams. Show ownership of the system and communication.
Close with what you learned about reliability and how your process changed afterward.
First clarify exactly what looks wrong: metric, table, date range, dashboard, filter, expected value, and business reason. “The data is wrong” needs to become a reproducible issue.
Then compare the analyst query or dashboard to source tables and model definitions. Check grain, filters, joins, timezones, freshness, metric definitions, and recent changes. Use lineage to trace upstream dependencies.
If the data is wrong, fix it and communicate impact. If the data is correct but the definition differs from expectation, align on metric definition and documentation. If uncertainty remains, explain what is known, what is being checked, and when you will follow up.
The relationship matters. Analysts are data platform customers, and trust is built through fast, clear, and evidence-based responses.
Likely follow-ups
What if the analyst is using the table incorrectly?
I prioritize based on business impact, reliability risk, urgency, downstream dependencies, and leverage. A broken revenue pipeline or compliance issue usually outranks a nice-to-have dashboard model. A platform improvement may outrank a one-off request if it unblocks many teams.
I also distinguish between urgent incidents, strategic platform work, stakeholder requests, technical debt, and operational toil. If the team only handles tickets, reliability and platform quality decay. If the team only builds platform abstractions, business teams may be blocked.
A strong prioritization process includes ownership, SLAs, severity levels, roadmap alignment, and transparent tradeoffs. I would communicate what is being done, what is deferred, why, and what risk that creates.
Likely follow-ups
How do you justify technical debt work?
What if leadership wants a dashboard urgently?
How do you balance incidents and roadmap?
Data Engineer Prep Strategy
Data engineer prep should combine SQL, data modeling, Python, pipeline design, distributed systems, cloud warehouse concepts, and production incident storytelling.
6-week data engineer interview prep plan
1
Week 1: SQL depth. Practice joins, windows, deduplication, incremental models, cohorts, partitions, and performance debugging.
2
Week 2: data modeling. Practice facts, dimensions, grains, SCDs, event models, warehouse marts, and metric definitions.
3
Week 3: pipelines. Practice ETL/ELT design, idempotency, orchestration, backfills, late data, schema changes, and data quality checks.
4
Week 4: distributed processing. Practice Spark partitioning, shuffles, skew, broadcast joins, file formats, and batch versus streaming tradeoffs.
5
Week 5: system design. Practice product analytics platforms, CDC ingestion, feature stores, streaming fraud pipelines, and warehouse architecture.
6
Week 6: mock interviews and stories. Prepare pipeline incident, data quality, stakeholder, prioritization, and platform improvement examples.
Role-specific prep by data engineering track
—
Analytics engineering: focus on dbt, SQL models, semantic layers, metric definitions, BI reliability, and stakeholder workflows.
—
Platform data engineering: focus on ingestion systems, orchestration, lineage, governance, access controls, and developer experience.
—
Streaming data engineering: focus on Kafka, Flink/Spark streaming, state, windows, duplicates, ordering, replay, and latency.
—
ML data engineering: focus on feature pipelines, feature stores, point-in-time correctness, training data, and monitoring.
—
Cloud warehouse engineering: focus on Snowflake, BigQuery, Databricks, partitioning, clustering, cost controls, and workload management.
Do not talk only about tools
Interviewers care less that you know Airflow, dbt, Spark, or Kafka by name and more that you understand reliability, data correctness, tradeoffs, and failure modes.
Key Takeaway
Great data engineer interview answers combine SQL correctness, data modeling clarity, pipeline reliability, distributed systems judgment, and production ownership. The best candidates show they can build data systems other teams can trust.
Practice these questions live
Interview Pilot gives you real-time Interview Copilot answer suggestions during live interviews, so you can respond clearly when Data Engineer questions come up.