Statistical Significance Testing in Information Retrieval: Theory and Practice (Synthesis Lectures on Information Concepts, Retrieval, and Services)

Ben Carterette

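  • Publisher: Morgan & Claypool
  • Publication date: 2018-08-30
  • List price: $1,620
  • VIP price: $1,539 (95% of list price)
  • Language: English
  • Pages: 120
  • Binding: Paperback
  • ISBN: 1627055274
  • ISBN-13: 9781627055277
  • Overseas special-order title (must be checked out separately)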

Product Description

The past 20 years have seen a great improvement in the rigor of information retrieval experimentation, due primarily to two factors: high-quality, public, portable test collections such as those produced by TREC (the Text REtrieval Conference), and the increased practice of statistical hypothesis testing to determine whether measured improvements can be ascribed to something other than random chance. Together these create a very useful standard for reviewers, program committees, and journal editors; work in information retrieval (IR) increasingly cannot be published unless it has been evaluated using a well-constructed test collection and shown to produce a statistically significant improvement over a good baseline. But, as the saying goes, any tool sharp enough to be useful is also sharp enough to be dangerous. Statistical tests of significance are widely misunderstood. Most researchers and developers treat them as a “black box”: evaluation results go in and a p-value comes out. But because significance is such an important factor in determining what research directions to explore and what is published, using p-values obtained without thought can have consequences for everyone doing research in IR. Ioannidis has argued that the main consequence in the biomedical sciences is that most published research findings are false; could that be the case in IR as well? Our goal with this work is to help researchers and developers gain a better understanding of how tests work and how they should be interpreted, so that they can both use them more effectively in their day-to-day work and better interpret them when reading the work of others. We will do this primarily with three tools: (a) mathematical analysis; (b) simulation; and (c) experimentation with TREC data. Because TREC data is available, IR as a field is uniquely positioned to evaluate significance testing in the presence of a wide variety of “failed” experiments.
