Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up Using PySpark (Paperback)
Tentative Chinese title: 使用 Pyspark 的數據算法:擴展的食譜與設計模式
Parsian, Mahmoud
- Publisher: O'Reilly
- Publication date: 2022-05-17
- List price: $2,780
- Sale price: $2,641 (5% off)
- Language: English
- Pages: 435
- Binding: Trade paperback (quality paper)
- ISBN: 1492082384
- ISBN-13: 9781492082385
Related categories:
Spark, Algorithms-data-structures, Design Pattern
In stock, ships immediately (stock: 1)
Description
Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark.
In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms using the PySpark driver and shell script.
With this book, you will:
- Learn how to select Spark transformations for optimized solutions
- Explore powerful transformations and reductions, including reduceByKey(), combineByKey(), and mapPartitions() (see the sketch after this list)
- Understand data partitioning for optimized queries
- Build and apply a model using PySpark design patterns
- Apply motif-finding algorithms to graph data
- Analyze graph data by using the GraphFrames API
- Apply PySpark algorithms to clinical and genomics data
- Learn how to use and apply feature engineering in ML algorithms
- Understand and use practical and pragmatic data design patterns
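To make a few of these transformations concrete, here is a minimal, self-contained PySpark sketch (not taken from the book; the tiny dataset is invented for illustration) showing reduceByKey(), combineByKey(), and mapPartitions() on a small pair RDD:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey(): merge all values for a key with a single function.
sums = pairs.reduceByKey(lambda x, y: x + y)

# combineByKey(): build a (sum, count) accumulator per key, then derive the mean.
sum_count = pairs.combineByKey(
    lambda v: (v, 1),                          # createCombiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # mergeCombiners
)
means = sum_count.mapValues(lambda p: p[0] / p[1])

# mapPartitions(): one function call per partition instead of per element.
def partition_sums(iterator):
    yield sum(v for _, v in iterator)

print(sums.collect())    # [('a', 4), ('b', 6)] (order may vary)
print(means.collect())   # [('a', 2.0), ('b', 3.0)] (order may vary)
print(pairs.mapPartitions(partition_sums).collect())
spark.stop()
```

The practical difference: reduceByKey() requires the merged value to have the same type as the inputs, combineByKey() lets the accumulator be a different type (here a (sum, count) pair), and mapPartitions() amortizes per-element overhead, such as opening a connection, across a whole partition.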
From the Preface
Spark has become the de facto standard for large-scale data analytics. I have been using and teaching Spark since its inception nine years ago, and I have seen tremendous improvements in Extract, Transform, Load (ETL) processes, distributed algorithm development, and large-scale data analytics. I started using Spark with Java, but I found that while the code is pretty stable, you have to write long lines of code, which can become unreadable. For this book, I decided to use PySpark (a Python API for Spark) because it is easier to express the power of Spark in Python: the code is short, readable, and maintainable. PySpark is powerful but simple to use, and you can express any ETL or distributed algorithm in it with a simple set of transformations and actions.
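A minimal sketch of that short, readable style (illustrative only, not an example from the book; it assumes a local text file named input.txt exists): the classic word count is just a chain of transformations followed by a single action.

```python
from pyspark.sql import SparkSession

# Illustrative sketch: assumes a local file "input.txt" exists.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("input.txt")   # load
    .flatMap(lambda line: line.split())        # transformation: tokenize
    .map(lambda word: (word, 1))               # transformation: pair each word with 1
    .reduceByKey(lambda x, y: x + y)           # transformation: sum counts per word
)
print(counts.collect())                        # action: trigger the computation
spark.stop()
```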
Why I Wrote This Book
This is an introductory book about data analysis using PySpark. The book consists of a set of guidelines and examples intended to help software and data engineers solve data problems in the simplest possible way. As you know, there are many ways to solve any data problem: PySpark enables us to write simple code for complex problems. This is the motto I have tried to express in this book: keep it simple and use parameters so that your solution can be reused by other developers. My aim is to teach readers how to think about data and understand its origins and final intended form, as well as showing how to use fundamental data transformation patterns to solve a variety of data problems.
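One way to picture the "use parameters" motto is a small hypothetical helper (not from the book) whose behavior is driven entirely by its arguments, so other developers can reuse it on any (key, value) RDD:

```python
from pyspark.sql import SparkSession

def top_n_per_key(rdd, n):
    """Return the n largest values for each key in an RDD of (key, value) pairs."""
    return rdd.groupByKey().mapValues(lambda vals: sorted(vals, reverse=True)[:n])

spark = SparkSession.builder.appName("reusable-helper").getOrCreate()
data = spark.sparkContext.parallelize([("x", 5), ("x", 9), ("x", 1), ("y", 7), ("y", 3)])
print(top_n_per_key(data, 2).collect())   # e.g. [('x', [9, 5]), ('y', [7, 3])]
spark.stop()
```

Because both the input RDD and n are parameters, the same function serves "top 3 products per region" and "top 10 genes per sample" without modification.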
Who This Book Is For
To use this book effectively, it will be helpful to know the basics of the Python programming language, such as how to use conditionals (if-then-else), iterate through lists, and define and call functions. However, if your background is in another programming language (such as Java or Scala) and you do not know Python, you will still be able to use this book, as I have provided a reasonable introduction to Spark and PySpark.
This book is primarily intended for people who want to analyze large amounts of data and develop distributed algorithms using the Spark engine and PySpark. I have provided simple examples showing how to perform ETL operations and write distributed algorithms in PySpark. The code examples are written in such a way that you can cut and paste them to get the job done easily.
About the Author
Mahmoud Parsian, Ph.D. in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. For the past 15 years, he has been involved in Java server-side development, databases, MapReduce, Spark, PySpark, and distributed computing. Dr. Parsian currently leads Illumina's Big Data team, which is focused on large-scale genome analytics and distributed computing using Spark and PySpark. He leads and develops scalable regression algorithms and DNA sequencing pipelines using Java, MapReduce, PySpark, Spark, and open source tools. He is the author of Data Algorithms (O'Reilly, 2015), PySpark Algorithms (Amazon.com, 2019), JDBC Recipes (Apress, 2005), and JDBC Metadata Recipes (Apress, 2006). Dr. Parsian is also an Adjunct Professor at Santa Clara University, where he teaches Big Data Modeling and Analytics and Machine Learning in the MSIS program using Spark, PySpark, Python, and scikit-learn.