Data Engineering for Machine Learning Pipelines: From Python Libraries to ML Pipelines and Cloud Platforms

Narayanan, Pavan Kumar

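  • Publisher: Apress
  • Publication date: 2024-09-28
  • List price: $2,320
  • VIP price: $2,204 (5% off)
  • Language: English
  • Pages: 636
  • Binding: Quality Paper - also called trade paper
  • ISBN: 9798868806018
  • ISBN-13: 9798868806018
  • Related categories: Python Programming Language, Machine Learning
  • Overseas special-order title (requires separate checkout)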

Product Description

This book covers modern data engineering functions and important Python libraries to help you develop state-of-the-art ML pipelines and integration code.

The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and CuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability. The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows.

What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool; it is a career catalyst and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world.

What You Will Learn

  • Elevate your data wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and CuDF at unprecedented speeds
  • Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines, and master the art of workflow orchestration to streamline your engineering projects
  • Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure

Who This Book Is For

Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists

About the Author

Pavan Kumar Narayanan has an extensive and diverse career in the information technology industry, with a primary focus on the data engineering and machine learning domains. Throughout his professional journey, he has consistently delivered solutions in environments characterized by heterogeneity and complexity. His experience spans a broad spectrum, encompassing traditional data warehousing projects following waterfall methodologies and extending to contemporary integrations that involve APIs and message-based systems. Pavan has made substantial contributions to large-scale data integrations for applications in data science and machine learning. At the forefront of these endeavors, he has played a key role in delivering sophisticated data products and solutions, employing a versatile mix of both traditional and agile approaches. Currently employed with Ether Infinitum LLC, Sheridan, WY, Pavan Kumar Narayanan continues to bring his wealth of experience to the forefront of the data engineering and machine learning landscape.
