Visit Official SkillCertPro Website :-
For a full set of 590 questions. Go to
https://skillcertpro.com/product/azure-databricks-data-engineer-associate-dp-750-exam-questions/
SkillCertPro offers detailed explanations to each question which helps to understand the concepts better.
It is recommended to score above 85% in SkillCertPro exams before attempting a real exam.
SkillCertPro updates exam questions every 2 weeks.
You will get life time access and life time free updates
SkillCertPro assures 100% pass guarantee in first attempt.
Question 1:
A data governance team needs to discover all tables across the organization that contain columns tagged as 'pii'. Which approach provides this discovery capability?
A. Query the system.information_schema.column_tags table filtering on tag_name = 'pii' to find all tagged columns across all catalogs in the metastore
B. Run a Python script that iterates through every table and checks for PII tags using the Databricks REST API
C. Use the Unity Catalog data explorer to manually browse each schema and visually identify tagged columns
D. Search the workspace Hive metastore for tables with 'pii' in their table properties dictionary
Answer: A
Explanation:
Option A (Correct): Unity Catalog exposes metadata through the system.information_schema schema. The column_tags table contains a comprehensive, cross-catalog mapping of tags to columns. Querying this system table with the filter tag_name = 'pii' gives an immediate, organization-wide inventory of all sensitive columns, which is the standard and most efficient discovery method.
Option B (Incorrect): While the REST API can retrieve metadata, iterating through every table using custom scripts is an anti-pattern. It is inefficient, difficult to scale, and requires significant maintenance compared to a single, optimized SQL query against the system tables.
Option C (Incorrect): Manually browsing the Data Explorer is unsuitable for organizational-level discovery. This approach is prone to human error, does not scale as the number of schemas and catalogs grows, and does not provide an exportable or auditable report of sensitive data.
Option D (Incorrect): Searching the legacy "Hive metastore" is fundamentally incorrect for Unity Catalog-governed environments. Furthermore, relying on arbitrary table properties is not a substitute for the structured, tag-based governance model enforced by Unity Catalog.
Note:
For effective data governance, leverage the power of Unity Catalog's metadata layer:
Centralized Metadata: System tables are updated automatically as new tables are created or tags are assigned, ensuring your discovery queries always reflect the current state of the environment.
Auditability: Because these queries are executed against standard system schemas, they are themselves auditable and can be integrated into automated compliance reporting pipelines.
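The discovery query from Option A can be sketched as below. This is a minimal illustration, not an official script: the helper name build_pii_discovery_query is hypothetical, and the selected column names (catalog_name, schema_name, table_name, column_name, tag_value) are assumed from the information_schema layout. The helper only builds the SQL text, so it runs without a cluster.

```python
def build_pii_discovery_query(tag_name: str = "pii") -> str:
    """Build a SQL query listing every column carrying the given tag.

    Hypothetical helper for illustration; the column names are assumptions
    about system.information_schema.column_tags.
    """
    # Escape single quotes so the tag name cannot break out of the literal.
    safe_tag = tag_name.replace("'", "''")
    return (
        "SELECT catalog_name, schema_name, table_name, column_name, tag_value\n"
        "FROM system.information_schema.column_tags\n"
        f"WHERE tag_name = '{safe_tag}'\n"
        "ORDER BY catalog_name, schema_name, table_name, column_name"
    )


query = build_pii_discovery_query()
print(query)

# In a Databricks notebook (where a `spark` session is assumed to exist):
# spark.sql(query).show(truncate=False)
```

Keeping the query in one place makes it easy to schedule as part of a recurring compliance report.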
Question 2:
A data engineer discovers that MERGE INTO operations on a Delta table are running slowly. The source DataFrame is 50 MB but the target table is 2 TB. Which optimization reduces MERGE execution time?
A. Partition the target table by the merge key to ensure only relevant partitions are scanned
B. Convert the MERGE to separate DELETE and INSERT statements that can execute in parallel
C. Add a broadcast hint to the source DataFrame in the MERGE statement to eliminate the shuffle join with the large target table
D. Run ANALYZE TABLE on the target to update column statistics before executing the MERGE
Answer: C
Explanation:
C. Add a broadcast hint to the source DataFrame in the MERGE statement to eliminate the shuffle join with the large target table
This is correct because when the source DataFrame is small (50 MB) and the target Delta table is very large (2 TB), performing a full shuffle join is highly inefficient. A broadcast join (enabled by raising spark.sql.autoBroadcastJoinThreshold or by a broadcast hint such as /*+ BROADCAST(source) */) replicates the small source DataFrame to all executor nodes, eliminating the expensive shuffle operation. Because the default threshold is 10 MB, a 50 MB source is not broadcast automatically, which is why the hint matters here. Spark can then identify matching records between the small source and the large target without shuffling the 2 TB target dataset. For MERGE operations, reducing the join cost is critical, and broadcasting the small source is a common optimization pattern in Databricks.
Incorrect Options & Why:
A. Partition the target table by the merge key to ensure only relevant partitions are scanned
Incorrect. While partitioning can improve query performance, it does not help the specific issue of a small source joining with a massive target. The bottleneck here is the shuffle join, not full table scans. Additionally, adding partitions after the table has already grown to 2 TB is an expensive and disruptive operation (it requires rewriting the entire table). The broadcast hint (option C) is a simpler, non-disruptive optimization that directly addresses the join strategy.
B. Convert the MERGE to separate DELETE and INSERT statements that can execute in parallel
Incorrect. Breaking MERGE into separate DELETE and INSERT operations does not inherently reduce join costs. The DELETE operation would still need to identify which rows to delete, likely requiring the same expensive join. Parallel execution of separate operations does not eliminate the shuffle and may even introduce data consistency issues and additional I/O overhead. MERGE is already optimized by Delta Lake; manually splitting it is not a recommended optimization.
D. Run ANALYZE TABLE on the target to update column statistics before executing the MERGE
Incorrect. ANALYZE TABLE updates statistics (such as number of distinct values and min/max) that help Spark optimize query plans. While useful for certain queries, it does not change the join strategy from shuffle to broadcast. The fundamental issue is the size mismatch between source and target; broadcasting the source (option C) directly addresses that, whereas updated statistics alone do not eliminate the shuffle join.
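The idea behind a broadcast join can be illustrated in plain Python. This is a conceptual sketch only, not Spark's implementation: the function name broadcast_hash_join is hypothetical, the sample data is made up, and source keys are assumed to be unique. The small source becomes an in-memory lookup table that every target partition consults locally, so target rows never move between partitions.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Tuple[Hashable, str]


def broadcast_hash_join(
    small_source: Sequence[Row],
    target_partitions: Sequence[Sequence[Row]],
) -> List[Tuple[Hashable, str, str]]:
    """Match target rows against a small source without moving target rows."""
    # "Broadcast": one copy of the small side, shared by every partition.
    lookup: Dict[Hashable, str] = dict(small_source)

    matches = []
    # Each partition is processed where it lives; only the lookup is shared.
    for partition in target_partitions:
        for key, target_value in partition:
            if key in lookup:
                matches.append((key, target_value, lookup[key]))
    return matches


source = [(2, "new_b"), (9, "new_z")]
partitions = [[(1, "a"), (2, "b")], [(3, "c")]]
result = broadcast_hash_join(source, partitions)
print(result)
```

The cost of the lookup scales with the small side, which is why the technique only suits a source that fits comfortably in executor memory.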
Question 3:
A data engineer processes data where the same logical record exists in two source tables with complementary attributes. Record matching is based on a shared customer_id. Which join type combines all attributes from both sources while preserving all records?
A. Use an INNER JOIN on customer_id to get only records that exist in both tables with complete attributes
B. Use a FULL OUTER JOIN on customer_id to retain all records from both tables, with nulls where no match exists in the other table
C. Use a LEFT JOIN on customer_id to keep all records from the first table and matching records from the second
D. Use a CROSS JOIN to combine every record from both tables and then filter to matching customer_ids
Answer: B
Explanation:
B. Use a FULL OUTER JOIN on customer_id to retain all records from both tables, with nulls where no match exists in the other table
This is correct because a FULL OUTER JOIN returns all records from both tables, matching them on customer_id where possible. If a record exists only in the left table, it is included with NULL values for columns from the right table. If a record exists only in the right table, it is included with NULL values for columns from the left table. This ensures that no data is lost from either source while combining complementary attributes, which is exactly the requirement when two tables have overlapping but not identical sets of records.
Incorrect Options & Why:
A. Use an INNER JOIN on customer_id to get only records that exist in both tables with complete attributes
Incorrect. An INNER JOIN retains only records that have matching customer_id values in both tables. Records that exist in only one of the source tables would be completely excluded from the result, violating the requirement to "preserve all records" from both sources.
C. Use a LEFT JOIN on customer_id to keep all records from the first table and matching records from the second
Incorrect. A LEFT JOIN (or LEFT OUTER JOIN) preserves all records from the left (first) table and only matching records from the right (second) table. Records that exist only in the right table would be lost, which does not preserve all records from both sources.
D. Use a CROSS JOIN to combine every record from both tables and then filter to matching customer_ids
Incorrect. A CROSS JOIN produces a Cartesian product (every row from table A combined with every row from table B), which is computationally expensive and inefficient. Filtering this result to matching customer_id values is logically equivalent to an INNER JOIN, not a FULL OUTER JOIN. This approach would still lose records that exist in only one table, and it introduces massive intermediate result sets, making it impractical for any real-world dataset.
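The FULL OUTER JOIN semantics described above can be checked with a small plain-Python sketch. It is an illustration only: the function name full_outer_join and the sample attributes (email, tier) are hypothetical, and each table is assumed to hold one row per customer_id.

```python
from typing import Dict, List, Optional, Tuple


def full_outer_join(
    left: Dict[int, str], right: Dict[int, str]
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Combine two keyed tables, keeping every key from either side."""
    # The union of keys is what distinguishes FULL OUTER from INNER or LEFT.
    all_ids = sorted(set(left) | set(right))
    # dict.get returns None for a missing key, mirroring SQL NULL.
    return [(cid, left.get(cid), right.get(cid)) for cid in all_ids]


emails = {1: "a@example.com", 2: "b@example.com"}  # customer_id -> email
tiers = {2: "gold", 3: "silver"}                   # customer_id -> tier
combined = full_outer_join(emails, tiers)
for row in combined:
    print(row)
```

Replacing the union with an intersection would give INNER JOIN behavior, and iterating only over the left keys would give LEFT JOIN behavior.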
Question 4:
A data engineer needs to set up data lineage tracking for a complex pipeline that reads from multiple sources and writes to several target tables. Which configuration enables comprehensive lineage capture?
A. Enable the workspace-level lineage feature flag and register each pipeline in the lineage configuration file
B. Create a custom lineage table that the pipeline updates manually with source-to-target mappings after each run
C. Unity Catalog automatically captures table and column-level lineage from all Spark operations and pipeline executions without requiring additional configuration
D. Install the lineage tracking library on each cluster and configure it to emit lineage events to the metastore
Answer: C
Explanation:
Option A (Incorrect): There is no “workspace-level lineage feature flag“ or “lineage configuration file“ that needs to be registered. This would imply a manual, fragile approach that contradicts the automated design of Unity Catalog.
Option B (Incorrect): Creating a custom lineage table is an anti-pattern. Maintaining manual mappings is prone to human error, does not keep pace with evolving pipeline logic, and fails to leverage the built-in, platform-native metadata extraction capabilities of Unity Catalog.
Option C (Correct): Unity Catalog automatically captures lineage at both the table and column level. As long as your data is processed using Databricks runtime and stored in Unity Catalog-managed tables, the platform observes the Spark execution plan. It automatically records the relationships between source tables and target tables, providing a lineage graph that persists without any additional configuration or manual instrumentation.
Option D (Incorrect): Installing a custom library for lineage is unnecessary and unsupported. Unity Catalog's lineage engine is part of the core platform infrastructure; it does not rely on third-party libraries or custom agents running on clusters.
Understanding Automated Lineage
To leverage lineage for impact analysis and troubleshooting, understand how Unity Catalog constructs the graph:
Granularity: Lineage is tracked at both the table level (which tables depend on which) and the column level (which upstream columns influence a specific downstream column).
Visibility: You can access this lineage directly in the Unity Catalog Data Explorer UI or programmatically via the Unity Catalog REST API.
Scope: It captures operations from Spark SQL, Delta Live Tables (DLT), and other standard operations automatically, simplifying compliance audits and debugging.
Question 5:
A Spark job frequently spills data to disk during shuffle operations. The DAG in the Spark UI confirms high shuffle spill metrics. Which configuration change reduces spilling?
A. Decrease shuffle partitions to reduce the total number of shuffle files written to disk
B. Disable shuffle spilling entirely so tasks fail immediately instead of degrading performance with disk I/O
C. Set spark.shuffle.compress to false to reduce CPU overhead during shuffle operations
D. Increase spark.sql.shuffle.partitions to distribute data across more tasks, reducing the memory requirement per task
Answer: D
Explanation:
D. Increase spark.sql.shuffle.partitions to distribute data across more tasks, reducing the memory requirement per task
This is correct because shuffle spill occurs when a task's memory buffer exceeds its available memory, forcing data to be written to disk. By increasing spark.sql.shuffle.partitions (e.g., from the default 200 to a higher value), the same total data volume is divided into smaller partitions. Each task processes fewer records, reducing the memory footprint per task and decreasing the likelihood of spilling to disk. This directly addresses high shuffle spill metrics visible in the Spark UI.
Incorrect Options & Why:
A. Decrease shuffle partitions to reduce the total number of shuffle files written to disk
Incorrect. Decreasing the number of shuffle partitions makes each partition larger, which increases the memory required per task. This will worsen spill, not reduce it. Fewer partitions mean each task must handle more data, leading to more frequent disk spill, not less.
C. Set spark.shuffle.compress to false to reduce CPU overhead during shuffle operations
Incorrect. Disabling shuffle compression reduces CPU overhead but does not address memory pressure or spill. In fact, without compression, shuffle data takes more memory and disk space, potentially increasing spill. Spill is a memory issue, not a compression or CPU issue.
B. Disable shuffle spilling entirely so tasks fail immediately instead of degrading performance with disk I/O
Incorrect. Shuffle spilling cannot be disabled in Spark; it is a safety mechanism to prevent out-of-memory (OOM) errors. Intentionally causing tasks to fail instead of spilling is not a valid optimization. The goal is to reduce spill, not prevent it by failing tasks.
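The arithmetic behind Option D can be sketched as follows. The figures are hypothetical: a 100 GB shuffle and a 128 MB per-partition target are assumptions chosen for illustration, not values from the question, and the calculation assumes data is spread evenly (no skew).

```python
import math


def partition_size_mb(shuffle_mb: float, num_partitions: int) -> float:
    """Average data volume each shuffle task must hold, assuming no skew."""
    return shuffle_mb / num_partitions


def partitions_for_target(shuffle_mb: float, target_mb: float) -> int:
    """Smallest partition count that keeps each partition at or below target_mb."""
    return math.ceil(shuffle_mb / target_mb)


shuffle_mb = 100 * 1024  # hypothetical 100 GB shuffle stage
current_size = partition_size_mb(shuffle_mb, 200)   # default of 200 partitions
suggested = partitions_for_target(shuffle_mb, 128)  # assumed 128 MB target

print(f"per-partition size at 200 partitions: {current_size} MB")
print(f"partitions needed for 128 MB each: {suggested}")

# In a Databricks notebook (where a `spark` session is assumed to exist):
# spark.conf.set("spark.sql.shuffle.partitions", str(suggested))
```

With skewed keys the largest partition can be far above the average, so the estimate is a starting point rather than a guarantee.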
Question 6:
An external auditor requests evidence that a specific production table has not been modified by unauthorized users in the past quarter. Which Unity Catalog feature provides this evidence?
A. Query the system.access.audit table filtered by the table name and date range to show all write operations with user identities
B. Query the system.billing.usage table filtered by the table name to show which users consumed compute accessing it
C. Export the Delta table's transaction log JSON files which contain the user identity for each commit
D. Run DESCRIBE HISTORY on the table which shows all modification operations with user identifiers for each version
Answer: A
Explanation:
A. Query the system.access.audit table filtered by the table name and date range to show all write operations with user identities is correct because the system.access.audit system table is the authoritative source for audit logs in Unity Catalog. It records all actions, including write operations such as updates, inserts, and deletes, along with the identity of the user who performed them. This table retains data for 365 days, which covers the "past quarter" requirement, and can be filtered by table name (request_params.full_name_arg) and date range (event_date) to provide the specific evidence an auditor would require.
Incorrect Options & Why:
C. Export the Delta table's transaction log JSON files is incorrect because while the Delta transaction log does record operations, it is a low-level file format not designed for direct querying or external auditing. It lacks the structured, queryable, and centralized governance features of Unity Catalog's system tables. Accessing it directly is impractical and does not represent a standard, supportable audit mechanism.
B. Query the system.billing.usage table is incorrect because this table is designed for tracking compute usage and costs, not for recording data modification events or user identities for audit purposes. It provides no information on who modified a specific table.
D. Run DESCRIBE HISTORY on the table is incorrect because the DESCRIBE HISTORY command, while useful for viewing a table's version history, has a critical limitation. By default, the transaction log history is only retained for 30 days (configurable up to a limit), which is insufficient for a query looking back over a "past quarter" (approximately 90 days). This makes it an unreliable source for long-term audit evidence.
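A query of the kind Option A describes might be assembled as below. This is a sketch under assumptions: the helper name build_table_audit_query and the table name main.sales.orders are hypothetical, and the columns event_time, user_identity.email, and action_name are assumed from the audit table's schema. Only request_params.full_name_arg and event_date come from the explanation above. No filter on specific write actions is applied, because the exact action names are not given here.

```python
from datetime import date


def build_table_audit_query(full_table_name: str, start: date, end: date) -> str:
    """Build SQL listing audited actions on one table within a date range."""
    if start > end:
        raise ValueError("start date must not be after end date")
    # Escape single quotes so the table name stays inside its literal.
    safe_name = full_table_name.replace("'", "''")
    return (
        "SELECT event_time, user_identity.email AS user_email, action_name\n"
        "FROM system.access.audit\n"
        f"WHERE request_params.full_name_arg = '{safe_name}'\n"
        f"  AND event_date BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'\n"
        "ORDER BY event_time"
    )


# Hypothetical table and quarter, for illustration only.
query = build_table_audit_query("main.sales.orders", date(2024, 1, 1), date(2024, 3, 31))
print(query)

# In a Databricks notebook (where a `spark` session is assumed to exist):
# spark.sql(query).show(truncate=False)
```

An auditor would typically narrow the result further by action_name once the relevant write actions have been identified in the output.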
Question 7:
A data engineer needs to implement a data quality check that rejects writes to a Delta table where the order_quantity column is negative. Which feature enforces this at write time?
A. Create a view with a WHERE clause that filters negative values at read time
B. Add a CHECK constraint: ALTER TABLE orders ADD CONSTRAINT positive_qty CHECK (order_quantity >= 0)
C. Configure Auto Loader to skip records with negative quantities during ingestion
D. Write a pre-processing UDF that converts negative values to zero before each write operation
Answer: B
Explanation:
Option A (Incorrect): A view only filters data at read time. This does not prevent invalid (negative) data from being physically written to the underlying table. It fails to ensure the persistence layer maintains the required data quality standards.
Option B (Correct): Adding a CHECK constraint (e.g., ALTER TABLE orders ADD CONSTRAINT positive_qty CHECK (order_quantity >= 0)) is the declarative, native way to enforce data quality rules at write time. If a write operation attempts to insert or update a record where order_quantity is less than zero, the transaction will fail, ensuring that the table remains in a consistent and valid state.
Option C (Incorrect): While Auto Loader is an excellent tool for incremental ingestion, configuring it to “skip records“ is a data loss strategy, not an enforcement strategy. Furthermore, skipping records does not protect the table from being corrupted by batch writes, manual updates, or other ingestion sources.
Option D (Incorrect): While a UDF can transform data, relying on custom pre-processing code is not as robust or performant as a native Delta constraint. Constraints are managed by the storage layer and are universally enforced regardless of the source or method used to perform the write (e.g., SQL, Python, or even manual batch inserts).
Note:
To maintain a high-quality Medallion Architecture, leverage native Delta Lake features for data governance:
Constraint Enforcement: Always prefer native SQL constraints over application-side filtering to ensure consistent data quality across all possible write paths.
Performance: Delta Lake validates these constraints during the commit phase, ensuring minimal impact on throughput while guaranteeing data validity.
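The write-time behavior of a CHECK constraint can be observed locally with Python's built-in sqlite3 module. SQLite is used here purely as a stand-in so the example runs without a cluster; it is not Delta Lake, and on Databricks the constraint would be added with the ALTER TABLE statement from Option B. The order_id column and the sample values are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for a Delta table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER,
        order_quantity INTEGER,
        CONSTRAINT positive_qty CHECK (order_quantity >= 0)
    )
    """
)

# A valid row is accepted.
conn.execute("INSERT INTO orders VALUES (1, 5)")

# A negative quantity violates the constraint and the write is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (2, -3)")
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print(f"write rejected: {err}")

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"rows stored: {row_count}")
conn.close()
```

The point of the demonstration is that the rejection happens in the storage layer, so it applies no matter which client or code path issues the write.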
Question 8:
A data engineer needs to convert a table from long format to wide format. The table has columns for product_id, attribute_name, and attribute_value. Each product has multiple attribute rows. Which SQL operation restructures this?
A. Use UNPIVOT to transform the rows into columns using the attribute_name as the column header
B. Use PIVOT on the attribute_name column to create a separate column for each distinct attribute, populated with the corresponding attribute_value
C. Use GROUP BY product_id with COLLECT_LIST on attribute_name and attribute_value to create array columns
D. Use LATERAL VIEW explode to expand each attribute into its own column dynamically
Answer: B
Explanation:
B. Use PIVOT on the attribute_name column to create a separate column for each distinct attribute, populated with the corresponding attribute_value
This is correct because PIVOT is the standard SQL operation to convert long format (rows) to wide format (columns). The attribute_name values become column headers, and attribute_value fills the cells. In Spark SQL on Databricks, PIVOT restructures the table exactly as described.
Incorrect:
A. Use UNPIVOT…
Incorrect. UNPIVOT does the opposite: it converts wide format to long format, not long to wide.
C. Use GROUP BY product_id with COLLECT_LIST…
Incorrect. COLLECT_LIST creates arrays (nested structures), not separate columns. This results in a complex type, not a true wide-format relational table.
D. Use LATERAL VIEW explode…
Incorrect. LATERAL VIEW explode splits array or map rows into multiple rows (longer format), not into columns. It does not pivot.
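The long-to-wide reshaping that PIVOT performs can be mimicked in plain Python to make the transformation concrete. This is an illustration only: the function name pivot_long_to_wide and the sample attributes (color, size) are hypothetical, and each (product_id, attribute_name) pair is assumed to appear at most once.

```python
from typing import Dict, List, Optional, Tuple

LongRow = Tuple[int, str, str]  # (product_id, attribute_name, attribute_value)


def pivot_long_to_wide(rows: List[LongRow]) -> Dict[int, Dict[str, Optional[str]]]:
    """Turn one row per attribute into one record per product."""
    # Every distinct attribute_name becomes a column of the wide result.
    columns = sorted({name for _, name, _ in rows})

    wide: Dict[int, Dict[str, Optional[str]]] = {}
    for product_id, name, value in rows:
        # Start each product with all columns set to None (SQL NULL).
        record = wide.setdefault(product_id, {col: None for col in columns})
        record[name] = value
    return wide


long_rows = [(1, "color", "red"), (1, "size", "M"), (2, "color", "blue")]
wide_rows = pivot_long_to_wide(long_rows)
print(wide_rows)

# A Spark SQL counterpart, with the attribute list as an assumption:
# SELECT * FROM attrs
# PIVOT (MAX(attribute_value) FOR attribute_name IN ('color', 'size'))
```

Product 2 has no size row, so its size column stays empty, which is the same NULL a SQL PIVOT would produce.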
Question 9:
A compliance team requires that all accesses to a sensitive customer table are recorded with the querying user's identity and timestamp. Which Unity Catalog capability provides this automatically?
A. Unity Catalog audit logging automatically records all data access events including user identity, action, target object, and timestamp without additional configuration
B. The engineer must enable per-table access logging using ALTER TABLE SET TBLPROPERTIES ('audit.enabled' = 'true')
C. DESCRIBE HISTORY on the table records read events alongside write operations for complete audit coverage
D. Access logging requires deploying a custom Spark listener that records query events to a monitoring table
Answer: A
Explanation:
Option A (Correct): Unity Catalog provides comprehensive, built-in Audit Logging. It automatically records all events, including SELECT (data access) and DDL/DML operations, within the account. Each log entry captures critical metadata, including the identity of the user, the timestamp of the request, the specific object accessed, and the outcome of the request. This requires no manual configuration or per-table setup; it is a native feature of the platform.
Option B (Incorrect): There is no audit.enabled table property in Unity Catalog. Audit logging is a platform-level service managed by the account administrator via diagnostic settings (exported to Azure Monitor or other logging sinks), not a per-table configuration.
Option C (Incorrect): While DESCRIBE HISTORY is a feature of Delta Lake, it only tracks write operations (e.g., inserts, updates, deletes) recorded in the Delta table's transaction log. It does not record SELECT queries or read events. Therefore, it is insufficient for auditing data access by users.
Option D (Incorrect): Deploying a custom Spark listener is an unnecessary and fragile approach. It is not part of the standard architecture, difficult to maintain, and would not integrate seamlessly with the centralized governance model provided by Unity Catalog.
Note:
To ensure compliance, data engineers and administrators should focus on how these logs are consumed rather than how they are generated:
Centralized Governance: Audit logs from Unity Catalog are typically streamed to Azure Log Analytics or Azure Storage, where they can be queried using Kusto Query Language (KQL) to generate compliance reports.
Scope: The logs cover the entire metastore, providing a unified view of access across all workspaces associated with that Unity Catalog metastore.
Question 10:
A data engineer needs to read binary files such as images from ADLS Gen2 using Auto Loader for a computer vision pipeline. Which format option reads file contents as binary data?
A. Use cloudFiles.format = 'bytes' with a custom deserialization schema for binary data
B. Set cloudFiles.format to 'image' which decodes common image formats into pixel arrays automatically
C. Read file paths using Auto Loader in text mode and load binary content separately using Python's open() function
D. Set cloudFiles.format to binaryFile which reads each file as a row containing the path, modification time, length, and binary content
Answer: D
Explanation:
D. Set cloudFiles.format to binaryFile which reads each file as a row containing the path, modification time, length, and binary content
This is correct because in Auto Loader (Spark Structured Streaming), the binaryFile format reads binary files (images, PDFs, etc.) directly from ADLS Gen2. Each file becomes a row with metadata columns (path, modificationTime, length) and the binary content in the content column, which is ideal for computer vision pipelines.
Incorrect:
A. Use cloudFiles.format = 'bytes' with a custom deserialization schema for binary data
Incorrect. bytes is not a valid cloudFiles.format option in Auto Loader. Auto Loader supports formats like parquet, json, csv, text, binaryFile, etc., but not bytes.
B. Set cloudFiles.format to 'image' which decodes common image formats into pixel arrays automatically
Incorrect. While Spark does have an image data source, it is not directly supported as a cloudFiles.format option in Auto Loader for incremental binary file ingestion. Auto Loader focuses on file discovery and streaming, not inline image decoding.
C. Read file paths using Auto Loader in text mode and load binary content separately using Python's open() function
Incorrect. This is inefficient and not recommended in Databricks. Using Python's open() on driver or executor nodes bypasses Auto Loader's optimized binary handling, causes performance bottlenecks, and violates the declarative pattern of Spark DataFrames.
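The Auto Loader configuration from Option D might look like the sketch below. The names AUTO_LOADER_OPTIONS, BINARY_FILE_COLUMNS, and read_binary_stream are hypothetical, the source path is a placeholder to be supplied by the caller, and the function assumes a Databricks runtime where a spark session and the cloudFiles source are available. The function is only defined, not called, so the block itself runs anywhere.

```python
# Option that tells Auto Loader to treat each file as raw bytes.
AUTO_LOADER_OPTIONS = {"cloudFiles.format": "binaryFile"}

# Columns the binaryFile source produces for each file.
BINARY_FILE_COLUMNS = ("path", "modificationTime", "length", "content")


def read_binary_stream(spark, source_path: str):
    """Return a streaming DataFrame with one row per binary file.

    `spark` is the session a Databricks notebook provides; `source_path`
    is a placeholder for an ADLS Gen2 location chosen by the caller.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**AUTO_LOADER_OPTIONS)
        .load(source_path)
        .select(*BINARY_FILE_COLUMNS)
    )


print(AUTO_LOADER_OPTIONS)

# In a Databricks notebook, with a path of your own:
# images = read_binary_stream(spark, source_path)
```

Downstream steps in a computer vision pipeline would decode the content column, for example inside a pandas UDF.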