Supervised Machine Learning for Text Analysis in R

Emil Hvitfeldt, Julia Silge

Product Description

Text data is important for many domains, from healthcare to marketing to the digital humanities, but specialized approaches are necessary to create features for machine learning from language. Supervised Machine Learning for Text Analysis in R explains how to preprocess text data for modeling, train models, and evaluate model performance using tools from the tidyverse and tidymodels ecosystem. Models like these can be used to make predictions for new observations, to understand what natural language features or characteristics contribute to differences in the output, and more. If you are already familiar with the basics of predictive modeling, use the comprehensive, detailed examples in this book to extend your skills to the domain of natural language processing.

This book provides practical guidance and directly applicable knowledge for data scientists and analysts who want to integrate unstructured text data into their modeling pipelines. Learn how to use text data for both regression and classification tasks, and how to apply more straightforward algorithms like regularized regression or support vector machines as well as deep learning approaches. Natural language must be dramatically transformed to be ready for computation, so we explore typical text preprocessing and feature engineering steps like tokenization and word embeddings from the ground up. These steps influence model results in ways we can measure, both in terms of model metrics and other tangible consequences such as how fair or appropriate model results are.

About the Authors

Emil Hvitfeldt is a clinical data analyst working in healthcare and an adjunct professor at American University, where he teaches statistical machine learning with tidymodels. He is also an open source R developer and the author of the textrecipes package.

Julia Silge is a data scientist and software engineer at RStudio PBC where she works on open source modeling tools. She is an author, an international keynote speaker and educator, and a real-world practitioner focusing on data analysis and machine learning practice.
