Lucene in Action
Erik Hatcher, Otis Gospodnetic
- Publisher: Manning
- Publication date: 2004-12-01
- List price: $1,710
- VIP price: 5% off, $1,625
- Language: English
- Pages: 456
- Binding: Paperback
- ISBN: 1932394281
- ISBN-13: 9781932394283
Related categories:
Full-text search engines
Superseded edition
Product description:
Lucene is a gem in the open-source world--a highly scalable, fast search engine. It delivers performance and is disarmingly easy to use. Lucene in Action is the authoritative guide to Lucene. It describes how to index your data, including types you definitely need to know such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, filtering, and highlighting search results.
Lucene powers search in surprising places--in discussion groups at Fortune 100 companies, in commercial issue trackers, in email search from Microsoft, in the Nutch web search engine (that scales to billions of pages). It is used by diverse companies including Akamai, Overture, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others.
Adding search to your application can be easy. With many reusable examples and good advice on best practices, Lucene in Action shows you how. And if you would like to search through Lucene in Action over the Web, you can do so using Lucene itself as the search engine--take a look at the authors' awesome Search Inside solution. Its results page resembles Google's and provides a novel yet familiar interface to the entire book and book blog.
Table of Contents:
foreword xvii
preface xix
acknowledgments xxii
about this book xxv
Part 1 Core Lucene 1
- 1 Meet Lucene 3
- 1.1 Evolution of information organization and access 4
- 1.2 Understanding Lucene 6
- What Lucene is 7
- What Lucene can do for you 7
- History of Lucene 9
- Who uses Lucene 10
- Lucene ports: Perl, Python, C++, .NET, Ruby 10
- 1.3 Indexing and searching 10
- What is indexing, and why is it important? 10
- What is searching? 11
- 1.4 Lucene in action: a sample application 11
- Creating an index 12
- Searching an index 15
- 1.5 Understanding the core indexing classes 18
- IndexWriter 19
- Directory 19
- Analyzer 19
- Document 20
- Field 20
- 1.6 Understanding the core searching classes 22
- IndexSearcher 23
- Term 23
- Query 23
- TermQuery 24
- Hits 24
- 1.7 Review of alternate search products 24
- IR libraries 24
- Indexing and searching applications 26
- Online resources 27
- 1.8 Summary 27
- 2 Indexing 28
- 2.1 Understanding the indexing process 29
- Conversion to text 29
- Analysis 30
- Index writing 31
- 2.2 Basic index operations 31
- Adding documents to an index 31
- Removing Documents from an index 33
- Undeleting Documents 36
- Updating Documents in an index 36
- 2.3 Boosting Documents and Fields 38
- 2.4 Indexing dates 39
- 2.5 Indexing numbers 40
- 2.6 Indexing Fields used for sorting 41
- 2.7 Controlling the indexing process 42
- Tuning indexing performance 42
- In-memory indexing: RAMDirectory 48
- Limiting Field sizes: maxFieldLength 54
- 2.8 Optimizing an index 56
- 2.9 Concurrency, thread-safety, and locking issues 59
- Concurrency rules 59
- Thread-safety 60
- Index locking 62
- Disabling index locking 66
- 2.10 Debugging indexing 66
- 2.11 Summary 67
- 3 Adding search to your application 68
- 3.1 Implementing a simple search feature 69
- Searching for a specific term 70
- Parsing a user-entered query expression: QueryParser 72
- 3.2 Using IndexSearcher 75
- Working with Hits 76
- Paging through Hits 77
- Reading indexes into memory 77
- 3.3 Understanding Lucene scoring 78
- Lucene, you got a lot of ‘splainin’ to do! 80
- 3.4 Creating queries programmatically 81
- Searching by term: TermQuery 82
- Searching within a range: RangeQuery 83
- Searching on a string: PrefixQuery 84
- Combining queries: BooleanQuery 85
- Searching by phrase: PhraseQuery 87
- Searching by wildcard: WildcardQuery 90
- Searching for similar terms: FuzzyQuery 92
- 3.5 Parsing query expressions: QueryParser 93
- Query.toString 94
- Boolean operators 94
- Grouping 95
- Field selection 95
- Range searches 96
- Phrase queries 98
- Wildcard and prefix queries 99
- Fuzzy queries 99
- Boosting queries 99
- To QueryParse or not to QueryParse? 100
- 3.6 Summary 100
- 4 Analysis 102
- 4.1 Using analyzers 104
- Indexing analysis 105
- QueryParser analysis 106
- Parsing versus analysis: when an analyzer isn’t appropriate 107
- 4.2 Analyzing the analyzer 107
- What’s in a token? 108
- TokenStreams uncensored 109
- Visualizing analyzers 112
- Filtering order can be important 116
- 4.3 Using the built-in analyzers 119
- StopAnalyzer 119
- StandardAnalyzer 120
- 4.4 Dealing with keyword fields 121
- Alternate keyword analyzer 125
- 4.5 “Sounds like” querying 125
- 4.6 Synonyms, aliases, and words that mean the same 128
- Visualizing token positions 134
- 4.7 Stemming analysis 136
- Leaving holes 136
- Putting it together 137
- Hole lot of trouble 138
- 4.8 Language analysis issues 140
- Unicode and encodings 140
- Analyzing non-English languages 141
- Analyzing Asian languages 142
- Zaijian 145
- 4.9 Nutch analysis 145
- 4.10 Summary 147
- 5 Advanced search techniques 149
- 5.1 Sorting search results 150
- Using a sort 150
- Sorting by relevance 152
- Sorting by index order 153
- Sorting by a field 154
- Reversing sort order 154
- Sorting by multiple fields 155
- Selecting a sorting field type 156
- Using a nondefault locale for sorting 157
- Performance effect of sorting 157
- 5.2 Using PhrasePrefixQuery 157
- 5.3 Querying on multiple fields at once 159
- 5.4 Span queries: Lucene’s new hidden gem 161
- Building block of spanning, SpanTermQuery 163
- Finding spans at the beginning of a field 165
- Spans near one another 166
- Excluding span overlap from matches 168
- Spanning the globe 169
- SpanQuery and QueryParser 170
- 5.5 Filtering a search 171
- Using DateFilter 171
- Using QueryFilter 173
- Security filters 174
- A QueryFilter alternative 176
- Caching filter results 177
- Beyond the built-in filters 177
- 5.6 Searching across multiple Lucene indexes 178
- Using MultiSearcher 178
- Multithreaded searching using ParallelMultiSearcher 180
- 5.7 Leveraging term vectors 185
- Books like this 186
- What category? 189
- 5.8 Summary 193
- 6 Extending search 194
- 6.1 Using a custom sort method 195
- Accessing values used in custom sorting 200
- 6.2 Developing a custom HitCollector 201
- About BookLinkCollector 202
- Using BookLinkCollector 202
- 6.3 Extending QueryParser 203
- Customizing QueryParser’s behavior 203
- Prohibiting fuzzy and wildcard queries 204
- Handling numeric field-range queries 205
- Allowing ordered phrase queries 208
- 6.4 Using a custom filter 209
- Using a filtered query 212
- 6.5 Performance testing 213
- Testing the speed of a search 213
- Load testing 217
- QueryParser again! 218
- Morals of performance testing 220
- 6.6 Summary 220
Part 2 Applied Lucene 221
- 7 Parsing common document formats 223
- 7.1 Handling rich-text documents 224
- Creating a common DocumentHandler interface 225
- 7.2 Indexing XML 226
- Parsing and indexing using SAX 227
- Parsing and indexing using Digester 230
- 7.3 Indexing a PDF document 235
- Extracting text and indexing using PDFBox 236
- Built-in Lucene support 239
- 7.4 Indexing an HTML document 241
- Getting the HTML source data 242
- Using JTidy 242
- Using NekoHTML 245
- 7.5 Indexing a Microsoft Word document 248
- Using POI 249
- Using TextMining.org’s API 250
- 7.6 Indexing an RTF document 252
- 7.7 Indexing a plain-text document 253
- 7.8 Creating a document-handling framework 254
- FileHandler interface 255
- ExtensionFileHandler 257
- FileIndexer application 260
- Using FileIndexer 262
- FileIndexer drawbacks, and how to extend the framework 263
- 7.9 Other text-extraction tools 264
- Document-management systems and services 264
- 7.10 Summary 265
- 8 Tools and extensions 267
- 8.1 Playing in Lucene’s Sandbox 268
- 8.2 Interacting with an index 269
- lucli: a command-line interface 269
- Luke: the Lucene Index Toolbox 271
- LIMO: Lucene Index Monitor 279
- 8.3 Analyzers, tokenizers, and TokenFilters, oh my 282
- SnowballAnalyzer 283
- Obtaining the Sandbox analyzers 284
- 8.4 Java Development with Ant and Lucene 284
- Using the <index> task 285
- Creating a custom document handler 286
- Installation 290
- 8.5 JavaScript browser utilities 290
- JavaScript query construction and validation 291
- Escaping special characters 292
- Using JavaScript support 292
- 8.6 Synonyms from WordNet 292
- Building the synonym index 294
- Tying WordNet synonyms into an analyzer 296
- Calling on Lucene 297
- 8.7 Highlighting query terms 300
- Highlighting with CSS 301
- Highlighting Hits 303
- 8.8 Chaining filters 304
- 8.9 Storing an index in Berkeley DB 307
- Coding to DbDirectory 308
- Installing DbDirectory 309
- 8.10 Building the Sandbox 309
- Check it out 310
- Ant in the Sandbox 310
- 8.11 Summary 311
- 9 Lucene ports 312
- 9.1 Ports’ relation to Lucene 313
- 9.2 CLucene 314
- Supported platforms 314
- API compatibility 314
- Unicode support 316
- Performance 317
- Users 317
- 9.3 dotLucene 317
- API compatibility 317
- Index compatibility 318
- Performance 318
- Users 318
- 9.4 Plucene 318
- API compatibility 319
- Index compatibility 320
- Performance 320
- Users 320
- 9.5 Lupy 320
- API compatibility 320
- Index compatibility 322
- Performance 322
- Users 322
- 9.6 PyLucene 322
- API compatibility 323
- Index compatibility 323
- Performance 323
- Users 323
- 9.7 Summary 324
- 10 Case studies 325
- 10.1 Nutch: “The NPR of search engines” 326
- More in depth 327
- Other Nutch features 328
- 10.2 Using Lucene at jGuru 329
- Topic lexicons and document categorization 330
- Search database structure 331
- Index fields 332
- Indexing and content preparation 333
- Queries 335
- JGuruMultiSearcher 339
- Miscellaneous 340
- 10.3 Using Lucene in SearchBlox 341
- Why choose Lucene? 341
- SearchBlox architecture 342
- Search results 343
- Language support 343
- Reporting Engine 344
- Summary 344
- 10.4 Competitive intelligence with Lucene in XtraMind’s XM-InformationMinder™ 344
- The system architecture 347
- How Lucene has helped us 350
- 10.5 Alias-i: orthographic variation with Lucene 351
- Alias-i application architecture 352
- Orthographic variation 354
- The noisy channel model of spelling correction 355
- The vector comparison model of spelling variation 356
- A subword Lucene analyzer 357
- Accuracy, efficiency, and other applications 360
- Mixing in context 360
- References 361
- 10.6 Artful searching at Michaels.com 361
- Indexing content 362
- Searching content 367
- Search statistics 370
- Summary 371
- 10.7 I love Lucene: TheServerSide 371
- Building better search capability 371
- High-level infrastructure 373
- Building the index 374
- Searching the index 377
- Configuration: one place to rule them all 379
- Web tier: TheSeeeeeeeeeeeerverSide? 383
- Summary 385
- 10.8 Conclusion 385
appendix A Installing Lucene 387
appendix B Lucene index format 393
appendix C Resources 408
index 415