Microsoft DP-203 Exam Details
Microsoft DP-203 Exam Course Outline
Microsoft provides a course outline for the DP-203 exam that covers the major sections to focus on during preparation. The topics are:
Topic 1: Design and Implement Data Storage
Design a data storage structure
- design an Azure Data Lake solution (Microsoft Documentation: Azure Data Lake Storage Gen2)
- recommend file types for storage (Microsoft Documentation: Example scenarios)
- recommend file types for analytical queries (Microsoft Documentation: Query data in Azure Data Lake using Azure Data Explorer)
- design for efficient querying (Microsoft Documentation: Design for querying, Design Guidelines)
- design for data pruning (Microsoft Documentation: Dynamic file pruning)
- design a folder structure that represents the levels of data transformation (Microsoft Documentation: Copying and transforming data in Azure Data Lake Storage Gen2)
- design a distribution strategy (Microsoft Documentation: Designing distributed tables)
- design a data archiving solution
Design a partition strategy
- design a partition strategy for files (Microsoft Documentation: Copy new files based on time partitioned file name using the Copy Data tool)
- design a partition strategy for analytical workloads (Microsoft Documentation: Best practices when using Delta Lake, Partitions in tabular models)
- design a partition strategy for efficiency/performance (Microsoft Documentation: Designing partitions for query performance)
- design a partition strategy for Azure Synapse Analytics (Microsoft Documentation: Partitioning tables)
- identify when partitioning is needed in Azure Data Lake Storage Gen2
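A time-partitioned folder layout is the usual way to enable file pruning in Data Lake Storage Gen2: queries that filter on date can skip whole folders. As a rough sketch (the storage account, container, and table names below are hypothetical placeholders), a Hive-style daily partition path can be built like this:

```python
from datetime import date

def partition_path(base: str, table: str, d: date) -> str:
    """Build a Hive-style, time-partitioned folder path for a daily load.

    Engines such as Spark and Synapse serverless SQL can prune whole
    folders when a query filters on year/month/day.
    """
    return f"{base}/{table}/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}"

# The account and container names are placeholders, not real resources.
print(partition_path("abfss://raw@contosolake.dfs.core.windows.net", "sales", date(2024, 3, 15)))
```

Keeping the `key=value` naming convention lets most engines discover the partition columns automatically.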
Design the serving layer
- design star schemas (Microsoft Documentation: Overview of Star schema)
- design slowly changing dimensions
- design a dimensional hierarchy (Microsoft Documentation: Hierarchies in tabular models)
- design a solution for temporal data (Microsoft Documentation: Temporal tables in Azure SQL Database and Azure SQL Managed Instance)
- design for incremental loading (Microsoft Documentation: Incrementally load data from a source data store to a destination data store, Load data from Azure SQL Database to Azure Blob storage using the Azure portal)
- design analytical stores (Microsoft Documentation: Selecting an analytical data store in Azure, Azure Cosmos DB analytical store)
- design metastores in Azure Synapse Analytics and Azure Databricks (Microsoft Documentation: Azure Synapse Analytics shared metadata tables)
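A Type 2 slowly changing dimension keeps history by closing the current row and appending a new version when a tracked attribute changes. A minimal sketch of that logic in plain Python (the `customer_id` and `city` columns are illustrative, not from the exam outline):

```python
from datetime import date

def apply_scd2(dimension, incoming, load_date):
    """Type 2 SCD: expire the current row and append a new version on change."""
    for row in dimension:
        if row["customer_id"] == incoming["customer_id"] and row["is_current"]:
            if row["city"] == incoming["city"]:
                return dimension          # attribute unchanged: nothing to do
            row["is_current"] = False     # close out the old version
            row["valid_to"] = load_date
            break
    dimension.append({**incoming, "valid_from": load_date,
                      "valid_to": None, "is_current": True})
    return dimension

dim = []
dim = apply_scd2(dim, {"customer_id": 1, "city": "Oslo"}, date(2024, 1, 1))
dim = apply_scd2(dim, {"customer_id": 1, "city": "Bergen"}, date(2024, 6, 1))
print(len(dim))  # two versions: one expired, one current
```

In a warehouse the same pattern is normally expressed with a MERGE statement plus `valid_from`/`valid_to`/`is_current` columns.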
Implement physical data storage structures
- implement compression (Microsoft Documentation: Data compression Overview)
- implement partitioning (Microsoft Documentation: Overview of Data partitioning strategies)
- implement sharding (Microsoft Documentation: What is Sharding pattern, Adding a shard using Elastic Database tools)
- implement different table geometries with Azure Synapse Analytics pools (Microsoft Documentation: Defining Spatial Types – geometry (Transact-SQL), Table data types for dedicated SQL pool (formerly SQL DW) in Azure Synapse Analytics)
- implement data redundancy (Microsoft Documentation: Overview of Azure Storage redundancy, Process of how storage account is replicated)
- implement distributions (Microsoft Documentation: Distributions overview, Table distribution Examples)
- implement data archiving (Microsoft Documentation: Archive on-premises data to the cloud, Rehydrate blob data)
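A hash-distributed table in a dedicated SQL pool assigns each row to one of 60 distributions by hashing the distribution column, so equal keys are always co-located. The sketch below only mimics the idea with MD5 (the pool's internal hash function is not public); the takeaway is why a skewed or low-cardinality key makes a poor distribution column:

```python
import hashlib

def assign_distribution(key, distributions: int = 60) -> int:
    """Map a distribution-column value to one of N distributions.

    Illustration only: dedicated SQL pools use an internal deterministic
    hash, not MD5 -- the point is that equal keys always land together.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % distributions

# Equal keys land in the same distribution, which is what makes joins
# and aggregations on the distribution column avoid data movement.
print(assign_distribution("customer-42") == assign_distribution("customer-42"))
```

If most rows share one key value, every one of them lands in the same distribution and a single node does all the work.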
Implement logical data structures
- build a temporal data solution (Microsoft Documentation: Creating a system-versioned temporal table)
- build a slowly changing dimension
- build a logical folder structure
- build external tables (Microsoft Documentation: Using external tables with Synapse SQL, Create and alter external tables in Azure Storage or Azure Data Lake)
- implement file and folder structures for efficient querying and data pruning (Microsoft Documentation: Query multiple files or folders, Query folders and multiple files)
Implement the serving layer
- deliver data in a relational star schema
- deliver data in Parquet files (Microsoft Documentation: Parquet format in Azure Data Factory)
- maintain metadata (Microsoft Documentation: Preserve metadata and ACLs using copy activity in Azure Data Factory)
- implement a dimensional hierarchy (Microsoft Documentation: Create and manage hierarchies)
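Delivering data in a relational star schema means fact rows carry surrogate keys that resolve against dimension tables at query time. A toy version of that lookup in plain Python (the table contents and column names are invented for illustration):

```python
def star_join(fact_rows, dim_product):
    """Resolve surrogate keys against a dimension -- the core star-schema join."""
    by_key = {d["product_key"]: d for d in dim_product}
    return [{**f, "product_name": by_key[f["product_key"]]["name"]}
            for f in fact_rows]

fact = [{"product_key": 10, "qty": 3}, {"product_key": 11, "qty": 1}]
dim = [{"product_key": 10, "name": "Widget"}, {"product_key": 11, "name": "Gadget"}]
print(star_join(fact, dim))
```

In SQL this is simply `FROM fact JOIN dim_product ON fact.product_key = dim_product.product_key`; narrow integer surrogate keys keep that join cheap.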
Topic 2: Design and Develop Data Processing
Ingest and transform data
- transform data by using Apache Spark (Microsoft Documentation: Transform data in the cloud by using a Spark activity)
- transform data by using Transact-SQL (Microsoft Documentation: SQL Transformation)
- transform data by using Data Factory (Microsoft Documentation: Transform data in Azure Data Factory)
- transform data by using Azure Synapse Pipelines (Microsoft Documentation: Transform data using mapping data flows)
- transform data by using Stream Analytics
- cleanse data (Microsoft Documentation: Overview of Data Cleansing, Clean Missing Data module)
- split data (Microsoft Documentation: Split Data Overview, Split Data module)
- shred JSON
- encode and decode data
- configure error handling for the transformation (Microsoft Documentation: Handle SQL truncation error rows in Data Factory, Troubleshoot mapping data flows in Azure Data Factory)
- normalize and denormalize values (Microsoft Documentation: Overview of Normalize Data module, What is Normalize Data?)
- transform data by using Scala (Microsoft Documentation: Extract, transform, and load data by using Azure Databricks)
- perform data exploratory analysis (Microsoft Documentation: Query data in Azure Data Explorer Web UI)
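Shredding JSON means flattening nested documents into relational column/value pairs. Using only the standard library, a recursive flattener might look like this (the dotted column-name convention is one common choice, not a Microsoft-mandated format):

```python
import json

def shred(obj, prefix=""):
    """Recursively flatten ('shred') nested JSON into column/value pairs."""
    rows = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            rows.update(shred(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            rows.update(shred(v, f"{prefix}{i}."))
    else:
        rows[prefix.rstrip(".")] = obj   # scalar: emit a column
    return rows

doc = json.loads('{"order": {"id": 1, "items": [{"sku": "A"}, {"sku": "B"}]}}')
print(shred(doc))
```

In Synapse SQL the equivalent is done with `OPENJSON`/`JSON_VALUE`; in Spark, with `explode` and nested column selection.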
Design and develop a batch processing solution
- develop batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse Pipelines, PolyBase, and Azure Databricks (Microsoft Documentation: What is Batch processing, Choosing a batch processing technology in Azure, Process large-scale datasets)
- create data pipelines (Microsoft Documentation: Creating a pipeline, Build a data pipeline)
- design and implement incremental data loads (Microsoft Documentation: Load data from Azure SQL Database to Azure Blob storage)
- design and develop slowly changing dimensions
- handle security and compliance requirements (Microsoft Documentation: Azure security baseline for Batch, Azure Policy Regulatory Compliance controls)
- scale resources (Microsoft Documentation: Create an automatic formula for scaling compute nodes)
- configure the batch size (Microsoft Documentation: Selecting VM size and image for compute nodes)
- design and create tests for data pipelines
- integrate Jupyter/IPython notebooks into a data pipeline (Microsoft Documentation: Set up a Python development environment for Azure Machine Learning, Azure Machine Learning with Jupyter Notebooks)
- handle duplicate data (Microsoft Documentation: Handling duplicate data in Azure Data Explorer, Removing Duplicate Rows module)
- handle missing data (Microsoft Documentation: Cleaning Missing Data module)
- handle late-arriving data (Microsoft Documentation: Understand time handling in Azure Stream Analytics, Time Skew Policies)
- upsert data
- regress to a previous state (Microsoft Documentation: Monitor Batch solutions by counting tasks and nodes by state)
- design and configure exception handling (Microsoft Documentation: Azure Batch error handling and detection)
- configure batch retention (Microsoft Documentation: Azure Batch best practices)
- design a batch processing solution (Microsoft Documentation: Overview of Batch processing)
- debug Spark jobs by using the Spark UI (Microsoft Documentation: Debug Apache Spark jobs running on Azure HDInsight)
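Handling duplicate data in a batch load often comes down to keeping the latest version of each business key. A minimal sketch of that dedupe step (the `id` and `updated_at` column names are illustrative):

```python
def deduplicate(rows, key="id", version="updated_at"):
    """Keep only the most recent row per business key before loading."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])

batch = [{"id": 1, "updated_at": 2, "v": "new"},
         {"id": 1, "updated_at": 1, "v": "old"},
         {"id": 2, "updated_at": 1, "v": "only"}]
print(deduplicate(batch))
```

The same pattern appears in SQL as `ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC)` with a filter on row number 1.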
Design and develop a stream processing solution
- develop a stream processing solution by using Stream Analytics, Azure Databricks, and Azure Event Hubs (Microsoft Documentation: Stream processing with Azure Databricks, Stream data into Azure Databricks using Event Hubs)
- process data by using Spark structured streaming (Microsoft Documentation: What is Structured Streaming? Apache Spark Structured Streaming)
- monitor for performance and functional regressions (Microsoft Documentation: Stream Analytics job monitoring and process to monitor queries)
- design and create windowed aggregates (Microsoft Documentation: Stream Analytics windowing functions, Windowing functions)
- handle schema drift (Microsoft Documentation: Schema drift in mapping data flow)
- process time-series data (Microsoft Documentation: Time handling in Azure Stream Analytics, What is Time series solutions?)
- process data across partitions (Microsoft Documentation: Stream processing with Azure Stream Analytics, Optimize processing with Azure Stream Analytics using repartitioning)
- process data within one partition
- configure checkpoints/watermarking during processing (Microsoft Documentation: Checkpoint and replay concepts, Example of watermarks)
- scale resources (Microsoft Documentation: Streaming Units, Scale an Azure Stream Analytics job)
- design and create tests for data pipelines (Microsoft Documentation: Testing live data locally, Test an Azure Stream Analytics job)
- optimize pipelines for analytical or transactional purposes (Microsoft Documentation: Query parallelization in Azure Stream Analytics, Optimize processing with Azure Stream Analytics using repartitioning)
- handle interruptions (Microsoft Documentation: Stream Analytics job reliability during service updates)
- design and configure exception handling (Microsoft Documentation: Output error policy, User-defined functions in Azure Stream Analytics)
- upsert data (Microsoft Documentation: Azure Stream Analytics output to Azure Cosmos DB)
- replay archived stream data (Microsoft Documentation: Checkpoint and replay concepts)
- design a stream processing solution (Microsoft Documentation: Stream processing)
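Windowed aggregates and watermarks go together: a tumbling window buckets events by event time, and the watermark decides how late an event may arrive and still be counted. A simplified in-memory sketch of both ideas (real engines such as Stream Analytics and Spark Structured Streaming manage this state for you):

```python
def tumbling_counts(event_times, window_s, watermark_s):
    """Count events per tumbling window, dropping events behind the watermark.

    `event_times` are event-time seconds in arrival order; the watermark
    is `max event time seen - watermark_s`.
    """
    counts, max_seen = {}, float("-inf")
    for ts in event_times:
        max_seen = max(max_seen, ts)
        if ts < max_seen - watermark_s:
            continue  # arrived too late: behind the watermark, dropped
        window_start = (ts // window_s) * window_s
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts

# The event at t=1 arrives after t=12 with a 10 s watermark, so it is dropped,
# as is t=40 arriving after t=61.
print(tumbling_counts([0, 5, 12, 1, 61, 40], window_s=60, watermark_s=10))
```

Widening the watermark admits more late data at the cost of keeping window state open longer.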
Manage batches and pipelines
- trigger batches (Microsoft Documentation: Trigger a Batch job using Azure Functions)
- handle failed batch loads (Microsoft Documentation: Check for pool and node errors)
- validate batch loads (Microsoft Documentation: Error checking for job and task)
- manage data pipelines in Data Factory/Synapse Pipelines (Microsoft Documentation: Managing the mapping data flow graph)
- schedule data pipelines in Data Factory/Synapse Pipelines (Microsoft Documentation: Create a trigger)
- implement version control for pipeline artifacts (Microsoft Documentation: Source control in Azure Data Factory)
- manage Spark jobs in a pipeline (Microsoft Documentation: Monitor a pipeline)
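Handling failed batch loads usually involves a bounded retry with backoff before surfacing the failure to the pipeline's error path. A generic sketch (the attempt count and delays are arbitrary choices, not service defaults):

```python
import time

def run_with_retry(task, attempts=3, base_delay=0.01):
    """Re-run a failing batch task with exponential backoff, then re-raise."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise                      # retries exhausted: surface the failure
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"

print(run_with_retry(flaky_load))  # succeeds on the third attempt
```

Data Factory and Synapse Pipelines expose the same idea declaratively through an activity's retry count and retry interval settings.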
Topic 3: Design and Implement Data Security
Design security for data policies and standards
- design data encryption for data at rest and in transit (Microsoft Documentation: Azure Data Encryption at rest, Data in transit)
- design a data auditing strategy (Microsoft Documentation: Auditing for Azure SQL Database and Azure Synapse Analytics)
- design a data masking strategy (Microsoft Documentation: Overview of Dynamic data masking)
- design for data privacy
- design a data retention policy (Microsoft Documentation: Understand data retention in Azure Time Series Insights Gen1)
- design to purge data based on business requirements (Microsoft Documentation: Enable data purge, Overview of Data purge)
- design Azure role-based access control (Azure RBAC) and POSIX-like Access Control List (ACL) for Data Lake Storage Gen2 (Microsoft Documentation: Access control model in Azure Data Lake Storage Gen2, Access control lists (ACLs))
- design row-level and column-level security (Microsoft Documentation: Overview of Column-level security)
Implement data security
- implement data masking (Microsoft Documentation: SQL Database dynamic data masking with the Azure portal)
- encrypt data at rest and in motion (Microsoft Documentation: Transparent data encryption for SQL Database, SQL Managed Instance, and Azure Synapse Analytics)
- implement row-level and column-level security
- implement Azure RBAC (Microsoft Documentation: Azure portal for assigning an Azure role for access to blob and queue data)
- implement POSIX-like ACLs for Data Lake Storage Gen2 (Microsoft Documentation: PowerShell for managing directories and files in Azure Data Lake Storage Gen2)
- implement a data retention policy (Microsoft Documentation: Configuring retention in Azure Time Series Insights Gen1)
- implement a data auditing strategy (Microsoft Documentation: Auditing for Azure SQL Database and Azure Synapse Analytics)
- manage identities, keys, and secrets across different data platform technologies
- implement secure endpoints (private and public) (Microsoft Documentation: Private endpoints for Azure Storage, Azure SQL Managed Instance securely with public endpoints, Configure public endpoint)
- implement resource tokens in Azure Databricks (Microsoft Documentation: Authentication using Azure Databricks personal access tokens)
- load a DataFrame with sensitive information (Microsoft Documentation: Overview of DataFrames)
- write encrypted data to tables or Parquet files
- manage sensitive information (Microsoft Documentation: Explaining Security Control: Data Protection)
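Dynamic data masking rewrites sensitive column values for non-privileged readers while leaving the stored data unchanged. A partial-mask helper in plain Python, loosely modelled on SQL Database's custom-string masking function (real masking is applied by the database engine at query time, not in client code):

```python
def partial_mask(value: str, prefix: int, suffix: int, pad: str = "XXXX") -> str:
    """Keep the first `prefix` and last `suffix` characters; pad the middle.

    Loosely modelled on dynamic data masking's custom-string mask; values
    too short to expose anything safely are fully replaced by the pad.
    """
    if len(value) <= prefix + suffix:
        return pad
    return value[:prefix] + pad + value[len(value) - suffix:]

# A card-number style mask exposing only the last four digits.
print(partial_mask("4111111111111111", prefix=0, suffix=4))  # XXXX1111
```

The design point worth remembering for the exam: masking limits exposure in query results, but it is not encryption and does not protect the data at rest.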
Topic 4: Monitor and Optimize Data Storage and Data Processing
Monitor data storage and data processing
- implement logging used by Azure Monitor (Microsoft Documentation: Overview of Azure Monitor Logs, Collecting custom logs with Log Analytics agent in Azure Monitor)
- configure monitoring services (Microsoft Documentation: Monitoring Azure resources with Azure Monitor, Define Enable VM insights)
- measure performance of data movement (Microsoft Documentation: Overview of Copy activity performance and scalability)
- monitor and update statistics about data across a system (Microsoft Documentation: Statistics in Synapse SQL, UPDATE STATISTICS)
- monitor data pipeline performance (Microsoft Documentation: Monitor and Alert Data Factory by using Azure Monitor)
- measure query performance (Microsoft Documentation: Query Performance Insight for Azure SQL Database)
- monitor cluster performance (Microsoft Documentation: Monitor cluster performance in Azure HDInsight)
- understand custom logging options (Microsoft Documentation: Collecting custom logs with Log Analytics agent in Azure Monitor)
- schedule and monitor pipeline tests (Microsoft Documentation: Monitor and manage Azure Data Factory pipelines by using the Azure portal and PowerShell)
- interpret Azure Monitor metrics and logs (Microsoft Documentation: Overview of Azure Monitor Metrics, Define Azure platform logs)
- interpret a Spark directed acyclic graph (DAG)
Optimize and troubleshoot data storage and data processing
- compact small files (Microsoft Documentation: Explain Auto Optimize)
- rewrite user-defined functions (UDFs) (Microsoft Documentation: Process of modifying User-defined Functions)
- handle skew in data (Microsoft Documentation: Resolve data-skew problems by using Azure Data Lake Tools for Visual Studio)
- handle data spill
- tune shuffle partitions
- find shuffling in a pipeline
- optimize resource management
- tune queries by using indexers (Microsoft Documentation: Automatic tuning in Azure SQL Database and Azure SQL Managed Instance)
- tune queries by using cache (Microsoft Documentation: Performance tuning with a result set caching)
- optimize pipelines for analytical or transactional purposes (Microsoft Documentation: What is Hyperspace?)
- optimize pipeline for descriptive versus analytical workloads (Microsoft Documentation: Optimize Apache Spark jobs in Azure Synapse Analytics)
- troubleshoot a failed Spark job (Microsoft Documentation: Troubleshoot Apache Spark by using Azure HDInsight, Troubleshoot a slow or failing job on an HDInsight cluster)
- troubleshoot a failed pipeline run (Microsoft Documentation: Troubleshoot pipeline orchestration and triggers in Azure Data Factory)
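Data skew shows up when one key receives a disproportionate share of rows, so a single worker becomes the bottleneck during a shuffle. A common mitigation is key salting: spread the hot key over several artificial sub-keys, aggregate per sub-key, then combine the partial results. A sketch of the salting step (the bucket count is an arbitrary tuning choice):

```python
import random

def salt_key(key: str, buckets: int = 8) -> str:
    """Append a random bucket id so a hot key's rows spread across workers.

    A later aggregation step strips the suffix and combines partial results.
    """
    return f"{key}_{random.randrange(buckets)}"

# 1000 rows for one hot key now shuffle into up to 4 sub-keys.
salted = {salt_key("hot-customer", buckets=4) for _ in range(1000)}
print(sorted(salted))
```

The trade-off is an extra aggregation stage, which is usually far cheaper than one straggler task holding up the whole job.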
DP-203: Data Engineering on Microsoft Azure
Links: Exam | Certification
Level: Advanced
Job roles: Data Engineer
Technologies & products: Azure, Azure Portal, Storage, Virtual Machines, Cloud Shell, Data Factory, Synapse Analytics, Databricks, Stream Analytics
Topics:
- Design and implement data storage
- Design and develop data processing
- Design and implement data security
- Monitor and optimize data storage and data processing
Description:
A candidate for the Azure Data Engineer Associate certification should have subject matter expertise integrating, transforming, and consolidating data from various structured and unstructured data systems into structures that are suitable for building analytics solutions.
Responsibilities for this role include helping stakeholders understand the data through exploration, building and maintaining secure and compliant data processing pipelines by using different tools and techniques. This professional uses various Azure data services and languages to store and produce cleansed and enhanced datasets for analysis.
An Azure data engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints. This professional deals with unanticipated issues swiftly and minimizes data loss. An Azure data engineer also designs, implements, monitors, and optimizes data platforms to meet the data pipeline needs.
A candidate for this certification must have solid knowledge of data processing languages, such as SQL, Python, or Scala, and they need to understand parallel processing and data architecture patterns.
Videos:
- DP-203 | Data Tech
- How to PASS Exam DP-203 Azure Data Engineer Associate in 18 hours | E Learning Free Channel
- Prepare for Microsoft Certification: Azure Data Engineer Associate (DP-203)
- DP-203: Data Engineering on Microsoft Azure
- DP203 Microsoft Azure Data Engineer Associate Certification Training
Exam Study Guides:
- DP-203 Exam Study Guide (Data Engineering on Microsoft Azure)
- How to Pass Microsoft's Azure Data Engineer Certification - DP-203 Exam Guide
- Exam DP-203: Data Engineering on Microsoft Azure - Testprep Training
- Data Engineering on Microsoft Azure [DP-203] Self Study Exam Guide
- CertLibrary's Data Engineering on Microsoft Azure (DP-203) Exam
- Exam DP-203: Microsoft Azure Data Engineer Associate Crash Course
LinkedIn:
- Microsoft Azure Data Engineering (DP-203): 1 Designing and Implementing Data Storage
- Microsoft Azure Data Engineering (DP-203): 2 Design and Develop Data Processing
Other courses:
- Udemy: DP-203 - Data Engineering on Microsoft Azure 2021
- Coursera: Microsoft Azure Data Engineering Associate DP-203 Exam Prep Specialization
- Pluralsight: Microsoft Exam DP-203 Data Engineering on Microsoft Azure