Big Data Processing with Apache Spark: Efficiently tackle large datasets and big data analysis with Spark and Python

Manuel Ignacio Franco Galeano

Product Description

No need to spend hours ploughing through endless data – let Spark, one of the fastest big data processing engines available, do the hard work for you.

Key Features

  • Get up and running with Apache Spark and Python
  • Integrate Spark with AWS for real-time analytics
  • Apply processed data streams to Apache Spark's machine learning APIs

Book Description

Processing big data in real time is challenging because of the demands of scalability, information consistency, and fault tolerance. This book teaches you how to use Spark to make your overall analytical workflow faster and more efficient. You'll explore the core concepts and tools of the Spark ecosystem, such as Spark Streaming and its API, the machine learning extension, and Structured Streaming.

You'll begin by learning the fundamentals of data processing with the Resilient Distributed Dataset (RDD), SQL, Dataset, and DataFrame APIs. After grasping these fundamentals, you'll move on to the Spark Streaming API to consume data in real time from TCP sockets, and you'll integrate Amazon Web Services (AWS) for stream consumption.
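
To give a feel for those fundamentals, here is a minimal PySpark sketch of the RDD, DataFrame, and SQL APIs. It is illustrative rather than taken from the book: the application name, sample data, and column names are assumptions, and it only needs a local PySpark installation.

    # A minimal sketch of the RDD, DataFrame, and SQL APIs
    # (illustrative; app name, data, and column names are made up).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("fundamentals-sketch")
             .master("local[*]")
             .getOrCreate())
    sc = spark.sparkContext

    # RDD API: a classic word count over an in-memory collection.
    lines = sc.parallelize([
        "spark makes big data simple",
        "spark streaming processes data in real time",
    ])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.collect())

    # DataFrame and SQL APIs: turn the word counts into a table and query it.
    df = counts.toDF(["word", "freq"])
    df.createOrReplaceTempView("word_counts")
    spark.sql("SELECT word, freq FROM word_counts ORDER BY freq DESC").show()

    spark.stop()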

By the end of this book, you'll not only understand how to use Spark's machine learning extension and Structured Streaming, but you'll also be able to apply Spark to your own upcoming big data projects.
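
As an illustration of the streaming side, the following sketch uses Structured Streaming to consume lines from a TCP socket and keep a running word count. It is the standard socket word-count pattern rather than the book's own code; the host and port are placeholders, and the socket needs a feeder such as `nc -lk 9999` running before the job starts.

    # A minimal Structured Streaming sketch: read lines from a TCP socket
    # and maintain a running word count, printing results to the console.
    # Host and port are placeholders; feed the socket with e.g. `nc -lk 9999`.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = (SparkSession.builder
             .appName("socket-stream-sketch")
             .master("local[*]")
             .getOrCreate())

    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()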

What you will learn

  • Write your own Python programs that can interact with Spark
  • Implement data stream consumption using Apache Spark
  • Recognize common operations in Spark to process known data streams
  • Integrate Spark streaming with Amazon Web Services (AWS)
  • Create a collaborative filtering model with the MovieLens dataset (see the sketch after this list)
  • Apply processed data streams to Spark's machine learning APIs

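As a taste of the collaborative-filtering material, here is a sketch that trains Spark ML's ALS estimator on MovieLens-style ratings. The file path and schema follow the common ratings.csv layout (userId, movieId, rating, timestamp) and are assumptions, not the book's exact code.

    # A sketch of collaborative filtering with Spark ML's ALS estimator.
    # The path and schema follow the usual MovieLens ratings.csv layout;
    # adjust them to wherever your local copy of the dataset lives.
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = (SparkSession.builder
             .appName("movielens-als-sketch")
             .master("local[*]")
             .getOrCreate())

    ratings = (spark.read
                    .option("header", True)
                    .option("inferSchema", True)
                    .csv("ml-latest-small/ratings.csv"))  # hypothetical local path

    train, test = ratings.randomSplit([0.8, 0.2], seed=42)

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
    model = als.fit(train)

    predictions = model.transform(test)
    rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                               predictionCol="prediction").evaluate(predictions)
    print(f"Test RMSE: {rmse:.3f}")

    # Top-5 movie recommendations for every user.
    model.recommendForAllUsers(5).show(truncate=False)

    spark.stop()

Setting coldStartStrategy="drop" removes users and items that were unseen during training before evaluation, which keeps the RMSE computation well defined.
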
Who this book is for

Big Data Processing with Apache Spark is for you if you are a software engineer, architect, or IT professional who wants to explore distributed systems and big data analytics. Although you don't need any knowledge of Spark, prior experience of working with Python is recommended.

Table of Contents

  1. Introduction to Spark Distributed Processing
  2. Introduction to Spark Streaming
  3. Spark Streaming Integration with AWS
  4. Spark Streaming, ML, and Windowing Operations
