Lucene in Action

Erik Hatcher, Otis Gospodnetic

  • Publisher: Manning
  • Publication date: 2004-12-01
  • List price: $1,710
  • VIP price: $1,625 (95% of list price)
  • Language: English
  • Pages: 456
  • Binding: Paperback
  • ISBN: 1932394281
  • ISBN-13: 9781932394283
  • Related categories: Full-text search
  • Superseded edition

Description:

Lucene is a gem in the open-source world--a highly scalable, fast search engine. It delivers performance and is disarmingly easy to use. Lucene in Action is the authoritative guide to Lucene. It describes how to index your data, including types you definitely need to know such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, filtering, and highlighting search results.

Lucene powers search in surprising places--in discussion groups at Fortune 100 companies, in commercial issue trackers, in email search from Microsoft, in the Nutch web search engine (that scales to billions of pages). It is used by diverse companies including Akamai, Overture, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others.

Adding search to your application can be easy. With many reusable examples and good advice on best practices, Lucene in Action shows you how. And if you would like to search through Lucene in Action over the Web, you can do so using Lucene itself as the search engine--take a look at the authors' awesome Search Inside solution. Its results page resembles Google's and provides a novel yet familiar interface to the entire book and book blog.

 

Table of Contents:

foreword xvii
preface xix

acknowledgments xxii
about this book xxv

Part 1 Core Lucene 1

1 Meet Lucene 3
1.1 Evolution of information organization and access 4
1.2 Understanding Lucene 6
What Lucene is 7
What Lucene can do for you 7
History of Lucene 9
Who uses Lucene 10
Lucene ports: Perl, Python, C++, .NET, Ruby 10
1.3 Indexing and searching 10
What is indexing, and why is it important? 10
What is searching? 11
1.4 Lucene in action: a sample application 11
Creating an index 12
Searching an index 15
1.5 Understanding the core indexing classes 18
IndexWriter 19
Directory 19
Analyzer 19
Document 20
Field 20
1.6 Understanding the core searching classes 22
IndexSearcher 23
Term 23
Query 23
TermQuery 24
Hits 24
1.7 Review of alternate search products 24
IR libraries 24
Indexing and searching applications 26
Online resources 27
1.8 Summary 27
 
2 Indexing 28
2.1 Understanding the indexing process 29
Conversion to text 29
Analysis 30
Index writing 31
2.2 Basic index operations 31
Adding documents to an index 31
Removing Documents from an index 33
Undeleting Documents 36
Updating Documents in an index 36
2.3 Boosting Documents and Fields 38
2.4 Indexing dates 39
2.5 Indexing numbers 40
2.6 Indexing Fields used for sorting 41
2.7 Controlling the indexing process 42
Tuning indexing performance 42
In-memory indexing: RAMDirectory 48
Limiting Field sizes: maxFieldLength 54
2.8 Optimizing an index 56
2.9 Concurrency, thread-safety, and locking issues 59
Concurrency rules 59
Thread-safety 60
Index locking 62
Disabling index locking 66
2.10 Debugging indexing 66
2.11 Summary 67
 
3 Adding search to your application 68
3.1 Implementing a simple search feature 69
Searching for a specific term 70
Parsing a user-entered query expression: QueryParser 72
3.2 Using IndexSearcher 75
Working with Hits 76
Paging through Hits 77
Reading indexes into memory 77
3.3 Understanding Lucene scoring 78
Lucene, you got a lot of ’splainin’ to do! 80
3.4 Creating queries programmatically 81
Searching by term: TermQuery 82
Searching within a range: RangeQuery 83
Searching on a string: PrefixQuery 84
Combining queries: BooleanQuery 85
Searching by phrase: PhraseQuery 87
Searching by wildcard: WildcardQuery 90
Searching for similar terms: FuzzyQuery 92
3.5 Parsing query expressions: QueryParser 93
Query.toString 94
Boolean operators 94
Grouping 95
Field selection 95
Range searches 96
Phrase queries 98
Wildcard and prefix queries 99
Fuzzy queries 99
Boosting queries 99
To QueryParse or not to QueryParse? 100
3.6 Summary 100
 
4 Analysis 102
4.1 Using analyzers 104
Indexing analysis 105
QueryParser analysis 106
Parsing versus analysis: when an analyzer isn’t appropriate 107
4.2 Analyzing the analyzer 107
What’s in a token? 108
TokenStreams uncensored 109
Visualizing analyzers 112
Filtering order can be important 116
4.3 Using the built-in analyzers 119
StopAnalyzer 119
StandardAnalyzer 120
4.4 Dealing with keyword fields 121
Alternate keyword analyzer 125
4.5 “Sounds like” querying 125
4.6 Synonyms, aliases, and words that mean the same 128
Visualizing token positions 134
4.7 Stemming analysis 136
Leaving holes 136
Putting it together 137
Hole lot of trouble 138
4.8 Language analysis issues 140
Unicode and encodings 140
Analyzing non-English languages 141
Analyzing Asian languages 142
Zaijian 145
4.9 Nutch analysis 145
4.10 Summary 147
 
5 Advanced search techniques 149
5.1 Sorting search results 150
Using a sort 150
Sorting by relevance 152
Sorting by index order 153
Sorting by a field 154
Reversing sort order 154
Sorting by multiple fields 155
Selecting a sorting field type 156
Using a nondefault locale for sorting 157
Performance effect of sorting 157
5.2 Using PhrasePrefixQuery 157
5.3 Querying on multiple fields at once 159
5.4 Span queries: Lucene’s new hidden gem 161
Building block of spanning, SpanTermQuery 163
Finding spans at the beginning of a field 165
Spans near one another 166
Excluding span overlap from matches 168
Spanning the globe 169
SpanQuery and QueryParser 170
5.5 Filtering a search 171
Using DateFilter 171
Using QueryFilter 173
Security filters 174
A QueryFilter alternative 176
Caching filter results 177
Beyond the built-in filters 177
5.6 Searching across multiple Lucene indexes 178
Using MultiSearcher 178
Multithreaded searching using ParallelMultiSearcher 180
5.7 Leveraging term vectors 185
Books like this 186
What category? 189
5.8 Summary 193
 
6 Extending search 194
6.1 Using a custom sort method 195
Accessing values used in custom sorting 200
6.2 Developing a custom HitCollector 201
About BookLinkCollector 202
Using BookLinkCollector 202
6.3 Extending QueryParser 203
Customizing QueryParser’s behavior 203
Prohibiting fuzzy and wildcard queries 204
Handling numeric field-range queries 205
Allowing ordered phrase queries 208
6.4 Using a custom filter 209
Using a filtered query 212
6.5 Performance testing 213
Testing the speed of a search 213
Load testing 217
QueryParser again! 218
Morals of performance testing 220
6.6 Summary 220

Part 2 Applied Lucene 221

7 Parsing common document formats 223
7.1 Handling rich-text documents 224
Creating a common DocumentHandler interface 225
7.2 Indexing XML 226
Parsing and indexing using SAX 227
Parsing and indexing using Digester 230
7.3 Indexing a PDF document 235
Extracting text and indexing using PDFBox 236
Built-in Lucene support 239
7.4 Indexing an HTML document 241
Getting the HTML source data 242
Using JTidy 242
Using NekoHTML 245
7.5 Indexing a Microsoft Word document 248
Using POI 249
Using TextMining.org’s API 250
7.6 Indexing an RTF document 252
7.7 Indexing a plain-text document 253
7.8 Creating a document-handling framework 254
FileHandler interface 255
ExtensionFileHandler 257
FileIndexer application 260
Using FileIndexer 262
FileIndexer drawbacks, and how to extend the framework 263
7.9 Other text-extraction tools 264
Document-management systems and services 264
7.10 Summary 265
 
8 Tools and extensions 267
8.1 Playing in Lucene’s Sandbox 268
8.2 Interacting with an index 269
lucli: a command-line interface 269
Luke: the Lucene Index Toolbox 271
LIMO: Lucene Index Monitor 279
8.3 Analyzers, tokenizers, and TokenFilters, oh my 282
SnowballAnalyzer 283
Obtaining the Sandbox analyzers 284
8.4 Java Development with Ant and Lucene 284
Using the <index> task 285
Creating a custom document handler 286
Installation 290
8.5 JavaScript browser utilities 290
JavaScript query construction and validation 291
Escaping special characters 292
Using JavaScript support 292
8.6 Synonyms from WordNet 292
Building the synonym index 294
Tying WordNet synonyms into an analyzer 296
Calling on Lucene 297
8.7 Highlighting query terms 300
Highlighting with CSS 301
Highlighting Hits 303
8.8 Chaining filters 304
8.9 Storing an index in Berkeley DB 307
Coding to DbDirectory 308
Installing DbDirectory 309
8.10 Building the Sandbox 309
Check it out 310
Ant in the Sandbox 310
8.11 Summary 311
 
9 Lucene ports 312
9.1 Ports’ relation to Lucene 313
9.2 CLucene 314
Supported platforms 314
API compatibility 314
Unicode support 316
Performance 317
Users 317
9.3 dotLucene 317
API compatibility 317
Index compatibility 318
Performance 318
Users 318
9.4 Plucene 318
API compatibility 319
Index compatibility 320
Performance 320
Users 320
9.5 Lupy 320
API compatibility 320
Index compatibility 322
Performance 322
Users 322
9.6 PyLucene 322
API compatibility 323
Index compatibility 323
Performance 323
Users 323
9.7 Summary 324
 
10 Case studies 325
10.1 Nutch: “The NPR of search engines” 326
More in depth 327
Other Nutch features 328
10.2 Using Lucene at jGuru 329
Topic lexicons and document categorization 330
Search database structure 331
Index fields 332
Indexing and content preparation 333
Queries 335
JGuruMultiSearcher 339
Miscellaneous 340
10.3 Using Lucene in SearchBlox 341
Why choose Lucene? 341
SearchBlox architecture 342
Search results 343
Language support 343
Reporting Engine 344
Summary 344
10.4 Competitive intelligence with Lucene in XtraMind’s XM-InformationMinder™ 344
The system architecture 347
How Lucene has helped us 350
10.5 Alias-i: orthographic variation with Lucene 351
Alias-i application architecture 352
Orthographic variation 354
The noisy channel model of spelling correction 355
The vector comparison model of spelling variation 356
A subword Lucene analyzer 357
Accuracy, efficiency, and other applications 360
Mixing in context 360
References 361
10.6 Artful searching at Michaels.com 361
Indexing content 362
Searching content 367
Search statistics 370
Summary 371
10.7 I love Lucene: TheServerSide 371
Building better search capability 371
High-level infrastructure 373
Building the index 374
Searching the index 377
Configuration: one place to rule them all 379
Web tier: TheSeeeeeeeeeeeerverSide? 383
Summary 385
10.8 Conclusion 385
 
appendix A Installing Lucene 387
appendix B Lucene index format 393
appendix C Resources 408
index 415