Building Modern Data Applications Using Databricks Lakehouse: Develop, optimize, and monitor data pipelines on Databricks

Girten, Will

  • Publisher: Packt Publishing
  • Publication date: 2024-10-31
  • List price: $1,840
  • Member price (95% of list): $1,748
  • Language: English
  • Pages: 246
  • Binding: Quality Paper (trade paper)
  • ISBN: 1801073236
  • ISBN-13: 9781801073233
  • Overseas import title (checked out separately)

Product Description

Get up to speed with the Databricks Data Intelligence Platform to build and scale modern data applications, leveraging the latest advancements in data engineering

Key Features:

- Learn how to work with real-time data using Delta Live Tables

- Unlock insights into the performance of data pipelines using Delta Live Tables

- Apply your knowledge to Unity Catalog for robust data security and governance

- Purchase of the print or Kindle book includes a free PDF eBook

Book Description:

The sheer number of tools in today's data engineering stack, combined with operational complexity, often overwhelms data engineers, leaving them spending more time maintaining complex data pipelines and less time gleaning value from their data. Guided by a lead specialist solutions architect at Databricks with 10+ years of experience in data and AI, this book shows you how the Delta Live Tables framework simplifies data pipeline development by allowing you to focus on defining input data sources, transformation logic, and output table destinations.

This book gives you an overview of the Delta Lake format, the Databricks Data Intelligence Platform, and the Delta Live Tables framework. It teaches you how to apply data transformations by implementing the Databricks medallion architecture and how to continuously monitor the data quality of your pipelines. You'll learn how to handle incoming data using the Databricks Auto Loader feature and automate real-time data processing using Databricks workflows. You'll also learn how to recover from runtime errors automatically.
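The medallion architecture mentioned above layers tables as bronze (raw), silver (cleaned), and gold (aggregated). As a rough, framework-free illustration of that flow, here is a plain-Python sketch; the table names, fields, and cleaning rules are hypothetical, and the book itself implements this with Delta Live Tables rather than bare functions:

```python
# Illustrative sketch of the medallion pattern (not the book's DLT code):
# bronze ingests records as-is, silver cleans and types them, gold aggregates.

def bronze(raw_records):
    """Land records untouched, tagging each with a (hypothetical) source."""
    return [dict(r, _source="events.json") for r in raw_records]

def silver(bronze_records):
    """Drop malformed rows and normalize types."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in bronze_records
        if r.get("user") and r.get("amount") is not None
    ]

def gold(silver_records):
    """Aggregate per-user totals for reporting."""
    totals = {}
    for r in silver_records:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

raw = [
    {"user": " Alice ", "amount": "10.5"},
    {"user": "bob", "amount": "2"},
    {"user": None, "amount": "99"},      # malformed: dropped in silver
    {"user": "alice", "amount": "4.5"},
]
print(gold(silver(bronze(raw))))  # {'alice': 15.0, 'bob': 2.0}
```

In Delta Live Tables each of these stages would instead be a declared table, with the framework managing orchestration and storage between layers.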

By the end of this book, you'll be able to build a real-time data pipeline from scratch using Delta Live Tables, leverage CI/CD tools to deploy data pipeline changes automatically across deployment environments, and monitor, control, and optimize cloud costs.

What You Will Learn:

- Deploy near-real-time data pipelines in Databricks using Delta Live Tables

- Orchestrate data pipelines using Databricks workflows

- Implement data validation policies and monitor/quarantine bad data

- Apply slowly changing dimension (SCD) Type 1 and Type 2 data to lakehouse tables

- Secure data access across different groups and users using Unity Catalog

- Automate continuous data pipeline deployment by integrating Git with build tools such as Terraform and Databricks Asset Bundles
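The validation-and-quarantine item above follows the expectation pattern: each record either satisfies all declared rules or is routed aside with the rules it violated. A minimal framework-free sketch, with hypothetical rule names and thresholds (the book implements this with DLT expectations):

```python
# Illustrative expectation/quarantine pattern (not the DLT API itself).
EXPECTATIONS = {
    "valid_amount": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    "has_user": lambda r: bool(r.get("user")),
}

def validate(records):
    """Split records into those passing every rule and a quarantine list."""
    passed, quarantined = [], []
    for r in records:
        failures = [name for name, rule in EXPECTATIONS.items() if not rule(r)]
        if failures:
            quarantined.append({"record": r, "failed": failures})
        else:
            passed.append(r)
    return passed, quarantined

good, bad = validate([
    {"user": "alice", "amount": 10},
    {"user": "", "amount": -5},   # fails both rules, quarantined
])
print(len(good), len(bad))  # 1 1
```

Keeping the failed rule names alongside each quarantined record makes downstream monitoring and reprocessing straightforward.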

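The SCD Type 2 handling listed above can likewise be sketched without any framework: rather than overwriting a changed dimension row (Type 1), a new version is appended and the previous version is closed with an end date. Column names here are illustrative, not the book's schema:

```python
import datetime

def scd2_upsert(dim_rows, key, new_attrs, today):
    """Close the open row for `key` (if its attributes changed) and append a new version."""
    current = next(
        (r for r in dim_rows if r["key"] == key and r["end_date"] is None), None
    )
    if current and current["attrs"] == new_attrs:
        return dim_rows  # no change: keep the current version open
    if current:
        current["end_date"] = today  # expire the old version
    dim_rows.append(
        {"key": key, "attrs": new_attrs, "start_date": today, "end_date": None}
    )
    return dim_rows

dim = []
scd2_upsert(dim, "cust-1", {"city": "Austin"}, datetime.date(2024, 1, 1))
scd2_upsert(dim, "cust-1", {"city": "Denver"}, datetime.date(2024, 6, 1))
print(len(dim))  # two versions: Austin (closed) and Denver (open)
```

The start/end dates make point-in-time queries possible, which is the reason to prefer Type 2 over Type 1 when history matters.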
Who this book is for:

This book is for data engineers looking to streamline data ingestion, transformation, and orchestration tasks. Data analysts responsible for managing and processing lakehouse data for analysis, reporting, and visualization will also find this book beneficial. Additionally, DataOps/DevOps engineers will find this book helpful for automating the testing and deployment of data pipelines, optimizing table tasks, and tracking data lineage within the lakehouse. Beginner-level knowledge of Apache Spark and Python is needed to make the most out of this book.

Table of Contents

- An Introduction to Delta Live Tables

- Applying Data Transformations Using Delta Live Tables

- Managing Data Quality Using Delta Live Tables

- Scaling DLT Pipelines

- Mastering Data Governance in the Lakehouse with Unity Catalog

- Managing Data Locations in Unity Catalog

- Viewing Data Lineage Using Unity Catalog

- Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform

- Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment

- Monitoring Data Pipelines in Production
