Data Engineer Interview Questions
Practice data engineer interview questions across SQL transformations, data modeling, ETL pipelines, orchestration, batch and streaming systems, Spark, data quality, observability, system design, and behavioral collaboration. Use this as a focused question list alongside the full Data Engineer Interview Guide.
21 questions
8 categories
Data Engineer
Updated May 2026
SQL and Data Transformation Questions
Data engineer SQL questions are usually about correctness and production-readiness: grain, deduplication, incremental logic, window functions, partitions, and performance.
Framework — Window function with deterministic ordering
Use a window function to assign a row number within each event_id partition. Order by the most reliable freshness column descending, such as ingestion_timestamp or updated_at. If ties are possible, add a deterministic tie-breaker like source_sequence_number or raw_file_name. The query pattern is: select all rows, calculate row_number() over (partition by event_id order by ingestion_timestamp desc, source_sequence_number desc), then keep row_number = 1 by filtering in an outer query (or with QUALIFY, where the engine supports it, since a window function cannot be referenced directly in the same query's WHERE clause). The important interview detail is determinism. If two records have the same timestamp and no tie-breaker, the result may be unstable between runs. Also clarify whether the latest event means latest ingestion time, latest event time, or latest source update time. For production, I would monitor duplicate rate, late-arriving events, null event_id count, and whether deduplication changes downstream metrics unexpectedly.
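The dedup pattern above can be sketched with SQLite's window-function support; the table and column names (raw_events, ingestion_ts, source_seq) are illustrative assumptions, not a fixed schema:

```python
import sqlite3

# Minimal sketch of deterministic dedup; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (event_id TEXT, payload TEXT,
                             ingestion_ts TEXT, source_seq INTEGER);
    INSERT INTO raw_events VALUES
        ('e1', 'v1', '2024-01-01 00:00:00', 1),
        ('e1', 'v2', '2024-01-01 00:05:00', 2),
        ('e2', 'v1', '2024-01-01 00:00:00', 1),
        ('e2', 'v2', '2024-01-01 00:00:00', 2);  -- tie on timestamp
""")

# The window function is computed in a subquery and filtered in the
# outer query; source_seq breaks the timestamp tie deterministically.
rows = conn.execute("""
    SELECT event_id, payload FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY event_id
                   ORDER BY ingestion_ts DESC, source_seq DESC
               ) AS rn
        FROM raw_events
    ) WHERE rn = 1
    ORDER BY event_id
""").fetchall()
print(rows)  # -> [('e1', 'v2'), ('e2', 'v2')]
```

Without the source_seq tie-breaker, event e2 could return either payload between runs, which is exactly the instability the answer warns about.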
Likely follow-ups
What if event_id is missing?
How would you handle late-arriving corrections?
How would this change for incremental processing?
Framework — Filter active events -> normalize date -> group by date and user
First define active user. Usually, not every event should count. Page views, bot events, background refreshes, and internal users may need exclusion. Choose meaningful events such as session_start, purchase, message_sent, or feature_used depending on the product. Create a transformation that filters to eligible events, converts event timestamps to the business timezone, groups by activity_date and user_id, and outputs one row per active user per day. If the final table is an aggregate, count distinct users by day. If the table is a serving model, store date, user_id, platform, country, and first_active_timestamp. Important production details include timezone, late-arriving events, bot/test filtering, deleted users, duplicate events, and partitioning by activity_date. For incremental builds, process recent partitions again to capture late events, not only the current day. A strong answer also includes validation: compare DAU against source event counts, monitor day-over-day changes, and check segment-level consistency.
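The filter-then-aggregate shape of this transformation can be sketched in SQLite; the event names, the eligible-event list, and the already-UTC timestamps are illustrative assumptions (a real pipeline would also convert to the business timezone):

```python
import sqlite3

# Sketch of the DAU aggregate; schema and event taxonomy are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_name TEXT, event_ts TEXT);
    INSERT INTO events VALUES
        ('u1', 'session_start', '2024-03-01 08:00:00'),
        ('u1', 'purchase',      '2024-03-01 09:00:00'),  -- same user, same day
        ('u2', 'page_view',     '2024-03-01 10:00:00'),  -- not an active event
        ('u3', 'session_start', '2024-03-02 11:00:00');
""")

dau = conn.execute("""
    SELECT date(event_ts) AS activity_date,
           COUNT(DISTINCT user_id) AS dau
    FROM events
    WHERE event_name IN ('session_start', 'purchase')  -- eligible events only
    GROUP BY activity_date
    ORDER BY activity_date
""").fetchall()
print(dau)  # -> [('2024-03-01', 1), ('2024-03-02', 1)]
```

Note that u2 does not count on 2024-03-01 because page views were excluded, and u1 counts once despite two eligible events, which is the distinct-user grain the answer describes.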
Likely follow-ups
Which events should count as active?
How would you handle late events?
Should DAU be a table or a metric query?
Framework — Scope -> query plan -> data growth -> joins -> partitions -> recent changes
First check whether the issue is query-specific, warehouse-wide, or infrastructure-related. If many queries slowed down, it may be cluster capacity, warehouse load, or service degradation. If one query slowed down, inspect the query and data. Look at recent changes: table size growth, schema changes, new joins, changed filters, missing partition pruning, stale statistics, data skew, exploding joins, or added distinct/order by operations. Review the query plan if available. Common causes include joining at the wrong grain, missing date filters, many-to-many joins, scanning all partitions, inefficient window functions, large shuffles, or using functions on partition columns that prevent pruning. Fixes may include filtering earlier, pre-aggregating, adding partition filters, clustering/sorting, materializing intermediate models, updating statistics, reducing selected columns, or correcting join logic. I would also add monitoring for runtime, bytes scanned, row counts, and cost.
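One of the causes listed, a many-to-many join at the wrong grain, is easy to demonstrate concretely; the order_items/shipments tables below are illustrative:

```python
import sqlite3

# Tiny demonstration of a many-to-many join exploding row counts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_items (order_id INTEGER, item TEXT);
    CREATE TABLE shipments   (order_id INTEGER, box TEXT);
    INSERT INTO order_items VALUES (1, 'i1'), (1, 'i2'), (1, 'i3');
    INSERT INTO shipments   VALUES (1, 'b1'), (1, 'b2');
""")

# 3 item rows x 2 shipment rows on the same order_id -> 6 output rows,
# so any per-item metric summed after this join would be doubled.
(n,) = conn.execute("""
    SELECT COUNT(*)
    FROM order_items oi JOIN shipments s ON oi.order_id = s.order_id
""").fetchone()
print(n)  # -> 6
```

This is why comparing input and output row counts at each join is a fast diagnostic when a query suddenly slows down or a metric inflates.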
Likely follow-ups
What is partition pruning?
How can a join explode row counts?
How would you prevent this regression?
Data Modeling and Warehousing Questions
Data modeling interviews test whether you can design schemas that are clear, scalable, cost-aware, and useful for analytics, machine learning, and operational reporting.
Framework — Business processes -> facts -> dimensions -> grain -> metrics
Start with core business processes: browsing, cart, orders, payments, shipments, returns, refunds, inventory, and marketing attribution. Each process may become a fact table or event model depending on usage. Important fact tables include fact_orders at one row per order, fact_order_items at one row per product per order, fact_payments, fact_refunds, fact_shipments, and fact_inventory_snapshot. Dimensions include dim_customer, dim_product, dim_date, dim_store or warehouse, dim_channel, and dim_campaign. Declare grain explicitly. Revenue by product category should usually come from order item grain, not order grain. Customer lifecycle metrics may use customer or subscription grain. Inventory analysis may need daily snapshot grain. Model considerations: slowly changing customer or product attributes, refunds and cancellations, discounts at order versus item level, currency, taxes, shipping, guest checkout, and late-arriving data. I would define source-of-truth metrics and build semantic models or marts for finance, marketing, product, and operations. Validation includes reconciling revenue to finance, order counts to transactional systems, refund totals to payments, and inventory to warehouse systems.
Likely follow-ups
What is the grain of fact_order_items?
How would you model refunds?
How would you handle product category changes over time?
Framework — Type 1 versus Type 2 based on history needs
The right approach depends on whether historical accuracy matters. Type 1 overwrites old values, which is simple and useful when only the latest value matters. Type 2 preserves history by creating a new dimension row for each attribute change, usually with effective_start_date, effective_end_date, current_flag, and a surrogate key. Example: if a customer changes region from West to East, Type 1 would update the customer row to East. Historical revenue reports would now show old revenue under East, which may be wrong. Type 2 would preserve the old West row and create a new East row, allowing facts to join to the correct version at event time. Type 2 adds complexity: surrogate keys, date range joins, late-arriving facts, current versus historical reporting, and larger dimension tables. Use it when reporting must reflect attributes as they were at the time of the event. A strong answer explains the business need before choosing the SCD type.
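The West-to-East example above can be sketched as a Type 2 dimension with a date-range join; the schema, dates, and the 9999-12-31 open-end convention are illustrative assumptions:

```python
import sqlite3

# Sketch of a point-in-time join against a Type 2 dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER, customer_id TEXT, region TEXT,
        effective_start_date TEXT, effective_end_date TEXT);
    -- Customer c1 moved from West to East on 2024-06-01.
    INSERT INTO dim_customer VALUES
        (1, 'c1', 'West', '2020-01-01', '2024-06-01'),
        (2, 'c1', 'East', '2024-06-01', '9999-12-31');

    CREATE TABLE fact_orders (order_id TEXT, customer_id TEXT,
                              order_date TEXT, revenue REAL);
    INSERT INTO fact_orders VALUES
        ('o1', 'c1', '2024-03-15', 100.0),   -- before the move
        ('o2', 'c1', '2024-07-15', 200.0);   -- after the move
""")

# Each fact joins to the dimension row valid at the order date.
rows = conn.execute("""
    SELECT f.order_id, d.region
    FROM fact_orders f
    JOIN dim_customer d
      ON f.customer_id = d.customer_id
     AND f.order_date >= d.effective_start_date
     AND f.order_date <  d.effective_end_date
    ORDER BY f.order_id
""").fetchall()
print(rows)  # -> [('o1', 'West'), ('o2', 'East')]
```

A Type 1 overwrite would attribute both orders to East; the date-range join is the extra complexity Type 2 buys its historical accuracy with.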
Likely follow-ups
When is Type 1 acceptable?
How do facts join to a Type 2 dimension?
What happens with late-arriving facts?
Framework — Governance and flexibility versus query simplicity and performance
A star schema separates facts and dimensions. It is useful when you need reusable dimensions, consistent metrics, governance, clear grain, and flexible analysis. It reduces duplication and makes changes to dimensions easier to manage. A wide denormalized table can be useful for specific high-usage analytics, machine learning features, or dashboards where query simplicity and performance matter. It can reduce join complexity for consumers but may duplicate data and create governance risk if many wide tables define metrics differently. In practice, I would often maintain clean canonical facts and dimensions, then publish curated marts or wide tables for specific use cases. The warehouse foundation should remain trustworthy, while downstream models optimize for user needs. The decision depends on scale, consumers, BI tool behavior, cost, latency, and how often business definitions change.
Likely follow-ups
Which is better for BI dashboards?
How do you avoid metric inconsistency?
What is a semantic layer?
Pipeline Design and Orchestration
Pipeline interviews test whether you can design workflows that are idempotent, observable, recoverable, and appropriate for freshness and cost requirements.
Framework — Extract -> land raw -> transform -> validate -> publish -> monitor
Clarify requirements first: data freshness, order volume, acceptable lag, source database load constraints, downstream consumers, and whether historical corrections occur. A simple design: extract orders incrementally using updated_at or CDC, land raw data in object storage or a raw warehouse table, transform into staged tables, then publish modeled facts such as fact_orders and fact_order_items. Orchestrate with Airflow, Dagster, dbt Cloud, or a managed workflow tool. The pipeline should be idempotent. If it reruns for a date or batch, it should replace or merge data deterministically rather than append duplicates. Use watermarks carefully, and reprocess a lookback window to capture late updates. Validation checks: row counts, null primary keys, duplicate order IDs, revenue totals, status distribution, freshness, and reconciliation to source. Monitoring should alert on failures, unusual volume changes, high duplicate rates, or delayed data. Document ownership and downstream dependencies.
Likely follow-ups
Why is updated_at not always enough?
How would you avoid loading duplicate orders?
How would you backfill two years of history?
Framework — Same input and rerun produce same output
An idempotent pipeline can be run multiple times with the same input and produce the same correct output without duplicates or unintended side effects. This matters because pipelines fail, get retried, and require backfills. Example: a non-idempotent pipeline appends yesterday's orders every time it runs. If it is retried after a failure, revenue doubles. An idempotent version would delete and replace the target partition, merge by primary key, or write to a temporary table then atomically swap. Idempotency requires clear keys, partition strategy, deterministic transformations, and careful handling of side effects. It is one of the most important production data engineering concepts because recoverability depends on it.
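The delete-and-replace pattern can be sketched in a few lines; the fact_orders table and partitioning by order_date are illustrative assumptions:

```python
import sqlite3

# Sketch of an idempotent partition load: delete the target date
# partition, then insert, inside a single transaction.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_orders (order_id TEXT, order_date TEXT, amount REAL)"
)

def load_partition(conn, order_date, rows):
    """Replace one date partition deterministically; safe to rerun."""
    with conn:  # transaction: delete + insert commit atomically together
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?",
                     (order_date,))
        conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

batch = [("o1", "2024-05-01", 10.0), ("o2", "2024-05-01", 20.0)]
load_partition(conn, "2024-05-01", batch)
load_partition(conn, "2024-05-01", batch)  # a retry does not duplicate rows

(n,) = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()
print(n)  # -> 2
```

A naive append version of load_partition would leave four rows after the retry, which is the doubled-revenue failure mode described above.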
Likely follow-ups
How do you make an append-only pipeline idempotent?
What is an atomic swap?
How does idempotency help with backfills?
Framework — Detect -> classify -> protect -> communicate -> migrate
First detect schema changes automatically through schema registry, metadata checks, contract tests, or ingestion validation. Then classify the change: additive column, removed column, renamed column, type change, nullability change, or semantic change. Additive nullable columns are usually safe. Removed columns, renamed fields, and type changes can break downstream transformations and dashboards. Semantic changes are especially dangerous because the schema may look valid while meaning changes. Protect the pipeline with raw data preservation, compatibility checks, alerts, and contracts for critical sources. For breaking changes, coordinate with source owners, update transformations, backfill if needed, and communicate downstream impact. A mature answer includes data contracts, versioning, lineage, tests, and clear ownership so schema changes do not silently corrupt analytics.
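The classification step can be sketched as a comparison of expected versus observed column-to-type mappings; the rules here are simplified, and semantic changes (same schema, different meaning) cannot be caught this way:

```python
# Sketch of classifying a schema change; classification rules and the
# column/type examples are illustrative.
def classify_schema_change(expected, observed):
    changes = []
    for col in observed.keys() - expected.keys():
        changes.append(f"additive: {col}")   # usually safe if nullable
    for col in expected.keys() - observed.keys():
        changes.append(f"removed: {col}")    # likely breaking downstream
    for col in expected.keys() & observed.keys():
        if expected[col] != observed[col]:
            changes.append(
                f"type change: {col} {expected[col]} -> {observed[col]}"
            )
    return sorted(changes)

expected = {"order_id": "string", "amount": "double"}
observed = {"order_id": "string", "amount": "string", "coupon": "string"}
print(classify_schema_change(expected, observed))
# -> ['additive: coupon', 'type change: amount double -> string']
```

A check like this can run at ingestion time and route additive changes to a log while removed columns and type changes page the pipeline owner.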
Likely follow-ups
What is a data contract?
How do you handle a column type change?
What if the source team gives no warning?
Batch and Streaming Systems
Batch and streaming questions test whether you understand latency, throughput, ordering, state, windowing, replay, and the operational tradeoffs between simpler and more real-time architectures.
Framework — Decision latency versus complexity
Choose streaming when low latency creates meaningful value. Examples include fraud detection, operational alerts, real-time personalization, inventory updates, live experimentation metrics, and user-facing product features that need fresh events. Batch is often better for daily reporting, historical analytics, finance reconciliation, and transformations where latency of hours is acceptable. Batch is simpler to operate, easier to replay, and often cheaper. The decision should consider freshness requirements, event volume, ordering needs, stateful processing, fault tolerance, team expertise, cost, and downstream consumers. Streaming is not automatically better. It adds complexity around duplicates, late events, checkpoints, schema evolution, and monitoring. A strong answer says: use the simplest architecture that meets the business requirement. If the business only needs daily metrics, do not build a streaming system for prestige.
Likely follow-ups
What are late events?
How do you handle duplicates in streaming?
What does exactly-once mean?
Framework — Events -> stream -> enrichment -> rules/model -> action -> storage -> monitoring
Start with requirements: event volume, latency target, fraud decision type, tolerance for false positives, and whether decisions block transactions or trigger review. A possible design: payment events are published to Kafka or Kinesis. A stream processor consumes events, validates schema, deduplicates by event_id, enriches with user, device, merchant, and historical velocity features, then applies rules or a model. High-risk events trigger a decision service, manual review queue, or step-up authentication. Raw events and decisions are written to durable storage for audit and model training. Key engineering concerns: duplicate events, out-of-order events, feature freshness, state store size, exactly-once or effectively-once processing, backpressure, dead-letter queues, replay, latency monitoring, and model/rule versioning. Data quality and governance matter because fraud decisions affect customers. Log reason codes, monitor false positive rates, and support rollback if a rule or model behaves badly.
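The dedup-then-decide step can be sketched with in-memory keyed state; in a real stream processor this state would live in a checkpointed store, and the event shape and the toy amount threshold are illustrative assumptions standing in for real features or a model:

```python
# Sketch of effectively-once processing via keyed dedup state.
def process_stream(events, seen=None):
    seen = set() if seen is None else seen  # checkpointed state in production
    decisions = []
    for event in events:
        if event["event_id"] in seen:       # redelivered duplicate: skip
            continue
        seen.add(event["event_id"])
        # Toy rule in place of real velocity features / fraud model:
        risk = "review" if event["amount"] > 1000 else "allow"
        decisions.append((event["event_id"], risk))
    return decisions

events = [
    {"event_id": "p1", "amount": 50},
    {"event_id": "p2", "amount": 5000},
    {"event_id": "p1", "amount": 50},   # duplicate delivery from the broker
]
print(process_stream(events))  # -> [('p1', 'allow'), ('p2', 'review')]
```

Because the consumer may see the same payment twice after a retry, the dedup-by-event_id state is what turns at-least-once delivery into effectively-once decisions.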
Likely follow-ups
How would you compute velocity features?
How would you replay events safely?
What is the dead-letter queue for?
Spark and Distributed Processing Questions
Spark and distributed processing questions test whether you understand partitioning, shuffles, skew, caching, joins, file formats, and why jobs fail or become expensive.
Framework — Data redistribution across partitions
A shuffle happens when Spark needs to redistribute data across partitions, usually for operations like groupBy, join, distinct, orderBy, and repartition. It is expensive because data must move across the network, be serialized/deserialized, written to disk, and coordinated across executors. Shuffles often dominate job runtime and can cause failures when data is skewed or intermediate data is too large. A single hot key can send too much data to one partition, creating stragglers or out-of-memory errors. Ways to reduce shuffle cost include filtering early, selecting only needed columns, pre-aggregating, using broadcast joins for small tables, partitioning by join keys, salting skewed keys, avoiding unnecessary distinct/orderBy, and tuning partition counts. A strong answer connects the concept to practical debugging: inspect the Spark UI, stages, shuffle read/write, skew, spill, task duration, and executor memory.
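Key salting, one of the mitigations listed, can be illustrated without a cluster as a two-stage aggregation; the salt factor and data are illustrative, and in Spark the same idea is expressed by appending a random salt column before the first groupBy:

```python
import random
from collections import Counter

# Toy two-stage aggregation with key salting, mimicking how a skewed
# groupBy key can be split across tasks. Salt factor is illustrative.
random.seed(0)
SALT_FACTOR = 4

records = [("hot_key", 1) for _ in range(1000)] + [("rare_key", 1)]

# Stage 1: aggregate on (key, salt), spreading the hot key over
# SALT_FACTOR partial aggregates instead of one straggler task.
partial = Counter()
for key, value in records:
    partial[(key, random.randrange(SALT_FACTOR))] += value

# Stage 2: strip the salt and merge the much smaller partials.
final = Counter()
for (key, _salt), value in partial.items():
    final[key] += value

print(dict(final))  # -> {'hot_key': 1000, 'rare_key': 1}
```

The totals are identical to an unsalted groupBy; the win is that no single stage-1 task has to hold all 1000 hot-key rows, which is what removes the straggler or out-of-memory failure.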
Likely follow-ups
What causes data skew?
When would you use a broadcast join?
How do you debug a slow Spark job?
Framework — Identify stage -> reduce data -> fix skew -> tune partitions -> avoid bad operations
First identify where the failure occurs using Spark UI or logs: which stage, operation, partition, and executor. OOM can come from skew, large joins, collect operations, oversized partitions, inefficient UDFs, or caching too much data. Then reduce data early: filter rows, select only necessary columns, push predicates, and avoid reading unnecessary partitions. Check joins: if one table is small, use broadcast join. If a key is skewed, use salting or split heavy keys. If partitions are too large, increase partition count; if too many tiny partitions, coalesce appropriately. Avoid collect() on large data, avoid Python UDFs when native functions work, and cache only when reused and when memory allows. For file output, avoid too many small files and choose columnar formats like Parquet with compression. If tuning resources is needed, adjust executor memory, cores, shuffle partitions, and memory overhead, but do not rely only on bigger clusters. Fixing data shape and query plan is usually more durable.
Likely follow-ups
How do you detect skew?
Why can too many partitions be bad?
When should you cache a DataFrame?
Framework — Columnar storage, schema, compression, predicate pushdown
Parquet is a columnar file format, which is efficient for analytics because queries often read only a subset of columns. It supports compression, encodes data types, stores schema, and enables predicate pushdown and column pruning in many engines. CSV is row-oriented plain text. It is human-readable and portable, but it lacks strong types, can be larger, slower to scan, more error-prone with delimiters/escaping, and cannot skip irrelevant columns efficiently. For data lakes and warehouses, Parquet usually reduces storage cost and improves query performance. CSV can still be useful for simple exports, small files, or interoperability, but it is not ideal as the main analytics storage format.
Likely follow-ups
What is predicate pushdown?
When might CSV still be acceptable?
How do small files affect query performance?
Data Quality and Observability
Data quality questions test whether you can catch bad data before it breaks dashboards, models, product features, finance reports, or customer-facing systems.
Framework — Freshness -> completeness -> uniqueness -> validity -> reconciliation
For a revenue pipeline, I would add checks for freshness, row count, null primary keys, duplicate transaction IDs, valid statuses, valid currencies, non-negative amounts where applicable, accepted date ranges, and referential integrity between orders, payments, refunds, and customers. I would also add reconciliation checks: total revenue compared with source payment systems, finance reports, or previous pipeline version. Monitor day-over-day and week-over-week changes with thresholds, but account for seasonality and known events. Business rules matter: canceled orders should not count as revenue, refunds should reduce net revenue, test transactions should be excluded, and currency conversion should use the correct rate. Alerts should be actionable, routed to an owner, and include context: failed check, affected table, severity, downstream dependencies, and suggested runbook. Too many noisy alerts will be ignored.
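A few of these checks can be sketched as simple batch assertions; the row shape, check names, and the refund rule are illustrative assumptions:

```python
# Sketch of lightweight quality checks on a revenue batch.
def run_checks(rows):
    failures = []
    ids = [r["transaction_id"] for r in rows]
    if any(i is None for i in ids):
        failures.append("null transaction_id")
    if len(ids) != len(set(ids)):
        failures.append("duplicate transaction_id")
    if any(r["amount"] < 0 and r["status"] != "refund" for r in rows):
        failures.append("negative amount on non-refund")
    if not rows:
        failures.append("empty batch")
    return failures

batch = [
    {"transaction_id": "t1", "amount": 30.0, "status": "settled"},
    {"transaction_id": "t1", "amount": 30.0, "status": "settled"},  # dupe
    {"transaction_id": "t2", "amount": -5.0, "status": "settled"},  # invalid
]
print(run_checks(batch))
# -> ['duplicate transaction_id', 'negative amount on non-refund']
```

In production these checks would run inside the orchestrator (or a tool like dbt tests or Great Expectations), block publishing on hard failures, and attach the failing check names to the alert so the on-call engineer gets the context the answer describes.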
Likely follow-ups
How do you choose alert thresholds?
What if a check fails during month-end close?
How would you prevent duplicate revenue?
Framework — Assess impact -> compare versions -> inspect lineage -> mitigate -> root cause
First assess impact: which dashboard, metric, users, date range, and decisions are affected. Determine whether the number is wrong or whether the deployment corrected a previous issue. Compare the new output to the previous version by table, partition, row count, distinct count, metric totals, and key segments. Use lineage to identify upstream changes. Inspect code changes, schema changes, filters, joins, deduplication, and date logic. If the metric is wrong and business users are affected, mitigate quickly: roll back the transformation, restore previous table version, disable the dashboard, or add a warning. Then perform root-cause analysis and backfill corrected data if needed. Communication matters. Notify stakeholders with what changed, what is affected, confidence level, expected fix time, and whether past decisions need review.
Likely follow-ups
How do you know whether the old or new number is correct?
What rollback options would you want?
How would lineage help?
Data Engineering System Design
Data engineering system design interviews evaluate architecture judgment: sources, ingestion, storage, transformation, serving, quality, lineage, governance, and cost.
Framework — Instrumentation -> ingestion -> storage -> processing -> modeling -> serving -> governance
Start with requirements: event volume, latency, consumers, retention, schema evolution, privacy, and reliability. Product analytics events may power dashboards, experiments, personalization, and machine learning features. Clients emit events through an SDK with event_name, user_id, session_id, timestamp, properties, app version, platform, and anonymous ID. Events go to an ingestion API or event collector, then into a durable stream such as Kafka, Kinesis, or Pub/Sub. Raw events are written to object storage for replay and to a warehouse or lakehouse for analytics. Processing includes schema validation, deduplication, bot/internal filtering, sessionization, identity resolution, and transformation into modeled tables such as fact_events, fact_sessions, dim_users, and product-specific marts. Batch and streaming paths may coexist depending on freshness needs. Quality and governance: event contracts, schema registry, PII handling, consent enforcement, lineage, ownership, freshness checks, volume anomaly alerts, and documentation. Serving layers include BI dashboards, experimentation analysis, reverse ETL, and feature stores. Tradeoffs include real-time versus batch cost, strict versus flexible schema, raw event retention, identity complexity, and how much logic belongs in ingestion versus transformation.
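Sessionization, one of the processing steps listed, is worth being able to sketch; the gap-based algorithm below assumes a single user's time-sorted events and a 30-minute timeout, which is a common but illustrative choice:

```python
from datetime import datetime, timedelta

# Sketch of gap-based sessionization for one user's ordered events.
SESSION_GAP = timedelta(minutes=30)

def sessionize(timestamps):
    """Assign a session number to each timestamp (assumed sorted)."""
    sessions, current = [], 0
    for i, ts in enumerate(timestamps):
        if i > 0 and ts - timestamps[i - 1] > SESSION_GAP:
            current += 1  # gap exceeded -> start a new session
        sessions.append(current)
    return sessions

ts = [datetime(2024, 1, 1, 9, 0),
      datetime(2024, 1, 1, 9, 10),   # 10-minute gap: same session
      datetime(2024, 1, 1, 10, 0),   # 50-minute gap: new session
      datetime(2024, 1, 1, 10, 5)]
print(sessionize(ts))  # -> [0, 0, 1, 1]
```

In SQL the same logic is typically a LAG window over event time followed by a running SUM of new-session flags; late-arriving events are the complication, since they can merge or split previously computed sessions and force a lookback reprocess.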
Likely follow-ups
How would you handle schema evolution?
How would you deduplicate events?
What tables would analysts query?
Framework — Offline/online consistency -> feature definitions -> freshness -> serving -> monitoring
A feature store should provide reusable, reliable features for training and serving. Requirements include feature definitions, point-in-time correctness, offline training data, online low-latency serving, freshness, access control, and monitoring. Offline features can be stored in a warehouse or lakehouse, partitioned by date and entity. Online features can be served from a low-latency key-value store. A feature registry stores metadata: feature name, owner, entity key, transformation logic, freshness SLA, source tables, and description. Point-in-time correctness is critical. Training data must use feature values available at prediction time, not future values. For online/offline consistency, the same transformation logic should be reused when possible, or thoroughly tested if separate pipelines exist. Monitoring should cover feature freshness, null rates, distribution drift, online serving latency, training-serving skew, and downstream model impact. Governance includes PII controls, lineage, deprecation, and ownership.
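Point-in-time correctness can be sketched as an as-of lookup: for each prediction timestamp, take the latest feature value computed at or before that time. The integer timestamps and data shapes below are illustrative:

```python
import bisect

# Sketch of a point-in-time (as-of) feature lookup.
def as_of(feature_history, ts):
    """feature_history: time-sorted list of (computed_at, value) pairs.
    Returns the value in effect at ts, or None if none exists yet."""
    times = [t for t, _ in feature_history]
    i = bisect.bisect_right(times, ts) - 1
    return feature_history[i][1] if i >= 0 else None

history = [(1, 0.2), (5, 0.7), (9, 0.9)]  # feature computed at t=1, 5, 9
print([as_of(history, t) for t in (0, 1, 6, 10)])
# -> [None, 0.2, 0.7, 0.9]
```

Using bisect_right with ts=6 correctly returns the value computed at t=5, never the future t=9 value; joining training labels against feature values from the future is the leakage that point-in-time correctness prevents.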
Likely follow-ups
What is point-in-time correctness?
How do you avoid training-serving skew?
What features need online serving?
Behavioral and Collaboration Questions
Behavioral data engineering interviews focus on production ownership, incident response, cross-functional communication, prioritization, and building systems other teams can trust.
Framework — Impact -> triage -> mitigation -> root cause -> prevention
Choose an incident with real impact: missing data, wrong metrics, delayed pipeline, duplicate records, broken dashboard, model feature issue, or downstream product impact. Start with who was affected and why it mattered. Then explain your triage process: logs, orchestration status, source freshness, recent deployments, schema changes, row counts, partitions, lineage, and downstream dependencies. Describe how you mitigated the issue: rerun, rollback, patch, backfill, or stakeholder warning. The strongest answer includes prevention. Did you add data quality checks, alerts, retries, schema contracts, idempotent writes, runbooks, lineage, or better deployment controls? Avoid blaming upstream teams. Show ownership of the system and communication. Close with what you learned about reliability and how your process changed afterward.
Likely follow-ups
How did you communicate impact?
What would have caught the issue earlier?
How did you prevent recurrence?
Framework — Clarify -> reproduce -> trace lineage -> resolve -> document
First clarify exactly what looks wrong: metric, table, date range, dashboard, filter, expected value, and business reason. “The data is wrong” needs to become a reproducible issue. Then compare the analyst query or dashboard to source tables and model definitions. Check grain, filters, joins, timezones, freshness, metric definitions, and recent changes. Use lineage to trace upstream dependencies. If the data is wrong, fix it and communicate impact. If the data is correct but the definition differs from expectation, align on metric definition and documentation. If uncertainty remains, explain what is known, what is being checked, and when you will follow up. The relationship matters. Analysts are data platform customers, and trust is built through fast, clear, and evidence-based responses.
Likely follow-ups
What if the analyst is using the table incorrectly?
How do you prevent repeated confusion?
What documentation helps most?
Framework — Impact -> reliability -> urgency -> dependencies -> leverage
I prioritize based on business impact, reliability risk, urgency, downstream dependencies, and leverage. A broken revenue pipeline or compliance issue usually outranks a nice-to-have dashboard model. A platform improvement may outrank a one-off request if it unblocks many teams. I also distinguish between urgent incidents, strategic platform work, stakeholder requests, technical debt, and operational toil. If the team only handles tickets, reliability and platform quality decay. If the team only builds platform abstractions, business teams may be blocked. A strong prioritization process includes ownership, SLAs, severity levels, roadmap alignment, and transparent tradeoffs. I would communicate what is being done, what is deferred, why, and what risk that creates.
Likely follow-ups
How do you justify technical debt work?
What if leadership wants a dashboard urgently?
How do you balance incidents and roadmap?