The Research And Application Of Unstructured Data Processing Technology

Posted on:2012-09-08

Degree:Master

Type:Thesis

Country:China

Candidate:S N Wen

Full Text:PDF

GTID:2178330335960594

Subject:Communication and Information System

Abstract/Summary:

Full-text search is a technique for searching the documents stored on a computer. It can dramatically improve information retrieval from large amounts of data. The technology behind text search engines has advanced considerably over the past decade, and many excellent full-text search tools are now freely available on the Internet. We mainly use two of them here: Microsoft Windows Indexing Service and Apache Lucene.

Windows Indexing Service is a basic service of Microsoft Windows NT that extracts content from files and builds an indexed catalog to support fast, efficient searching. Indexing Service can extract content from a document of any format, provided a corresponding filter is available. In this paper we use Indexing Service to build a distributed information retrieval system that lets a user search the files on every indexing host with a single query. The system consists of three programs. The Server program hides the indexing hosts, so that from the user's point of view there is only one server host. The Indexing Service Management program adds scopes to and deletes scopes from a catalog, which is the basic unit of Indexing Service. The query program is mainly responsible for querying the local Indexing Service. The system has two aims: to respond as quickly as possible, and to be robust enough to handle common exceptions.

Lucene is a powerful Java search library that makes it easy to add search to any application. In recent years Lucene has become exceptionally popular and is now the most widely used information retrieval library: it powers the search features behind many Web sites and desktop tools. In this paper we use Lucene to implement a simple text classification program based on the vector space model. The program represents each document by its term vector, which is one of Lucene's features, and it illustrates the basic principles used in text classification.
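The vector space model mentioned in the abstract can be illustrated with a short sketch. The code below is not the thesis program and does not use Lucene; it is a minimal Python illustration that assumes whitespace tokenization, raw term frequencies as vector weights, and one summed "centroid" vector per category. The function names and the sample training data are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector (term -> count) from a text.

    Assumption: lowercase whitespace tokenization stands in for a real analyzer.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labeled_docs):
    """Sum the term vectors of each category into one centroid vector."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(term_vector(text))
    return centroids


def classify(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = term_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training data, for illustration only.
TRAINING = [
    ("search", "full text search index query retrieval"),
    ("sports", "football match goal team player"),
]

centroids = train(TRAINING)
```

With these sample categories, `classify("query the index", centroids)` selects the `search` category, because the document shares terms only with that centroid. In a Lucene-based version, the term vectors would presumably be read from the index rather than computed by hand.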

Keywords/Search Tags:

full-text information retrieval, Indexing Service, massive data, text classification

Related items

1	Research And Implementation Of Distribute Massive Text Data Index And Retrieval System
2	Massive Data Storage And Full-text Search
3	The Research And Implementation Of Full-text Retrieval System Based On Lucene
4	Full-Text Search Technology Research And Application In "2008 Olympic Games" Multi-Language System
5	The Design And Implementation Of Full-text Indexing System Base On COM Technology
6	Chinese Full Text Retrieval Based On SQL Server 2000
7	Research On The Distributed Indexing Platform And Information Filter In Distributed Full-text Retrieval System
8	Information Retrieval Oriented Text Classification Technology Research
9	Research On Optimization Of Indexing Algorithm In Full-text Retrieval
10	The Design And Implement Of Fast Indexing Files Structure And Full-text's Information Retrieval System