Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science

Riezler, Stefan, Hagmann, Michael

  • Publisher: Springer
  • Publication date: 2024-06-10
  • List price: $1,940
  • VIP price: $1,843 (5% off)
  • Language: English
  • Pages: 168
  • Binding: Hardcover - also called cloth, retail trade, or trade
  • ISBN: 3031570642
  • ISBN-13: 9783031570643
  • Related categories: Text-mining, Data Science
  • Overseas special-order title (requires separate checkout)

Product Description

This book introduces empirical methods for machine learning with a special focus on applications in natural language processing (NLP) and data science. The authors present problems of validity, reliability, and significance and provide common solutions based on statistical methodology to solve them. The book focuses on model-based empirical methods where data annotations and model predictions are treated as training data for interpretable probabilistic models from the well-understood families of generalized additive models (GAMs) and linear mixed effects models (LMEMs). Based on the interpretable parameters of the trained GAMs or LMEMs, the book presents model-based statistical tests such as a validity test that allows for the detection of circular features that circumvent learning. Furthermore, the book discusses a reliability coefficient using variance decomposition based on random effect parameters of LMEMs. Lastly, a significance test based on the likelihood ratios of nested LMEMs trained on the performance scores of two machine learning models is shown to naturally allow the inclusion of variations in meta-parameter settings into hypothesis testing, and further facilitates a refined system comparison conditional on properties of input data. The book is self-contained with an appendix on the mathematical background of generalized additive models and linear mixed effects models as well as an accompanying webpage with the related R and Python code to replicate the presented experiments. The second edition also features a new hands-on chapter that illustrates how to use the included tools in practical applications.

About the Authors

Stefan Riezler has been a full professor in the Department of Computational Linguistics at Heidelberg University, Germany, since 2010, and is also co-opted in Informatics at the Department of Mathematics and Computer Science. He received his Ph.D. (with distinction) in Computational Linguistics from the University of Tübingen in 1998, conducted post-doctoral work at Brown University in 1999, and spent a decade in industry research (Xerox PARC, Google Research). His research focuses on interactive machine learning for natural language processing problems, especially in the application areas of cross-lingual information retrieval and statistical machine translation. He serves on the editorial boards of the main journals of the field, Computational Linguistics and Transactions of the Association for Computational Linguistics, and is a regular member of the program committees of various natural language processing and machine learning conferences. He has published more than 100 journal and conference papers in these areas. He also conducts interdisciplinary research as a member of the Interdisciplinary Center for Scientific Computing (IWR), for example on the early prediction of sepsis using machine learning and natural language processing techniques.

Michael Hagmann has been a graduate research assistant in the Department of Computational Linguistics at Heidelberg University, Germany, since 2019. He received an M.Sc. in Statistics (with distinction) from the University of Vienna, Austria, in 2016, and a Ph.D. in Computational Linguistics from Heidelberg University in 2023. He received an award for the best Master's thesis in Applied Statistics from the Austrian Statistical Society. He has worked as a medical statistician at the medical faculty of Heidelberg University in Mannheim, Germany, and in the section for Medical Statistics at the Medical University of Vienna, Austria. His research focuses on statistical methods for data science and, recently, NLP. He has published more than 50 papers in journals for medical research and mathematical statistics.
