The Research And Application Of Unstructured Data Processing Technology

Posted on: 2012-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: S N Wen
GTID: 2178330335960594
Subject: Communication and Information System
Abstract/Summary:
Full-text search is a technique for searching the complete text of documents stored on a computer. It can dramatically improve information retrieval from large amounts of data. The technology underlying text search engines has advanced considerably in the past decade, and many excellent full-text search tools are now freely available on the Internet. This thesis mainly uses two of them: Microsoft Windows Indexing Service and Apache Lucene.

Windows Indexing Service is a basic service of Microsoft Windows NT that extracts content from files and builds an indexed catalog to support efficient and rapid searching. It can extract content from a document of any format for which a corresponding filter is available. In this thesis we use Indexing Service to build a distributed information retrieval system that lets a user search the files on every indexing host with a single query. The system consists of three programs. The Server program hides all of the indexing hosts, so that from the user's point of view there is only one server host. The Indexing Service Management program adds scopes to and deletes scopes from a catalog, which is the basic unit of Indexing Service. The Query program is mainly responsible for querying the local Indexing Service. The system has two aims: to respond as quickly as possible, and to be robust enough to handle some common exceptions.

Lucene is a powerful Java search library that makes it easy to add search to any application. In recent years Lucene has become exceptionally popular and is now the most widely used information retrieval library: it powers the search features behind many Web sites and desktop tools. In this thesis we use Lucene to implement a simple text classification program based on the vector space model. The program represents each document by its term vector, which is one of Lucene's features, and illustrates the basic principles used in text classification.
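
The vector space model mentioned above can be illustrated with a minimal sketch. This is not the thesis's program and does not use Lucene's API: it is plain Python in which whitespace tokenization, the summed per-class centroid, and the function names (`term_vector`, `cosine`, `train`, `classify`) are all assumptions made for illustration. The abstract does not say which classifier the author built on top of the term vectors, so a nearest-centroid rule with cosine similarity is used here as one common choice.

```python
import math
from collections import Counter


def term_vector(text):
    """Return a term-frequency vector (term -> count) for a text.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labeled_docs):
    """Sum the term vectors of each class into one centroid per class."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(term_vector(text))
    return centroids


def classify(centroids, text):
    """Return the label whose centroid is most similar to the text."""
    vec = term_vector(text)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Hypothetical toy training set, invented for this example.
training = [
    ("sports", "the team won the football match"),
    ("sports", "the player scored a goal in the match"),
    ("computing", "the index stores terms for fast search"),
    ("computing", "the search engine builds an inverted index"),
]
centroids = train(training)
print(classify(centroids, "a search over the index"))  # prints "computing"
```

In a Lucene-based version the term-frequency vectors would come from the index rather than from `term_vector`, and a practical classifier would usually remove stop words and weight terms (for example with TF-IDF) before comparing vectors.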
Keywords/Search Tags: full-text information retrieval, Indexing Service, massive data, text classification