Full-text Search For The Modern Chinese Text Processing, Automatic Word Generic System

Posted on:2007-01-07

Degree:Master

Type:Thesis

Country:China

Candidate:S He

Full Text:PDF

GTID:2208360185476761

Subject:Linguistics and Applied Linguistics

Abstract/Summary:

Full-text retrieval is an important information retrieval technology. It is a powerful tool for handling unstructured data and one of the key technologies behind search engines. This thesis studies Chinese full-text retrieval in depth, with an emphasis on its application. It also discusses how to use new techniques to optimize the structure of a retrieval system, improve its performance and efficiency, speed up searching, and adapt to the current development of the web.

Full-text retrieval is an I/O-intensive application, and earlier systems were built on top of relational databases. This thesis discusses the drawbacks and deficiencies of that approach in light of its characteristics. Because a development platform for full-text retrieval is currently lacking, the thesis introduces Lucene, a full-text search engine toolkit. Lucene is powerful yet compact, which makes it convenient to embed in applications. It is widely used around the world, and professional companies such as IBM also use its core code. As open-source software, Lucene offers an excellent opportunity to study key search engine technology, and it is worthwhile to analyze it and carry out secondary development on it.

A key step in Chinese information processing is automatic Chinese word segmentation and part-of-speech (POS) tagging. Compared with other languages, these tasks present particular difficulties in Chinese. Specifically, this thesis carries out research in the following respects:

1) We analyze and compare the structure of several common electronic dictionaries, and then improve the flexibility and adaptability of our system with a dual-dictionary strategy that combines a core dictionary with a specialized (domain) dictionary.
2) We discuss ambiguity in Chinese word segmentation. All phrases with crossing ambiguity or combinatorial ambiguity can be found in a single pass by a strategy that identifies ambiguous phrases from a directed word-segmentation graph. Using the unambiguous form, we can give the correct segmentation for ambiguous phrases.
3) Based on role tagging, we automatically recognize unknown words, including Chinese personal names, place names, and transliterated foreign names.
4) We analyze and process numeric words and reduplicated compounds.
5) We resolve ambiguity in segmentation and tagging using a combined automatic Chinese word segmentation system based on a hidden Markov model.
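
To make items 1 and 2 above more concrete, the following Python sketch shows one way a two-dictionary lookup and a directed word-segmentation graph could fit together. It is an illustration under stated assumptions, not the thesis's implementation: the dictionary contents, the function names (`build_word_graph`, `crossing_ambiguities`, `all_segmentations`), and the `max_len` limit are all hypothetical, and only crossing ambiguity is detected here.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical toy dictionaries; the abstract does not list the real entries.
CORE_DICT: Set[str] = {"研究", "研究生", "生命", "的", "起源"}
DOMAIN_DICT: Set[str] = {"全文检索"}


def build_word_graph(text: str, dictionaries: Sequence[Set[str]],
                     max_len: int = 4) -> Dict[int, List[int]]:
    """Map each start offset to the end offsets of words found in any dictionary."""
    graph: Dict[int, List[int]] = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, min(len(text), i + max_len) + 1)
                if any(text[i:j] in d for d in dictionaries)]
        # Fall back to a single character so every offset has an outgoing edge.
        graph[i] = ends or [i + 1]
    return graph


def crossing_ambiguities(text: str,
                         graph: Dict[int, List[int]]) -> List[Tuple[str, str]]:
    """Return pairs of multi-character words whose spans partially overlap."""
    edges = [(i, j) for i, ends in graph.items() for j in ends if j - i > 1]
    return [(text[i:j], text[k:l])
            for (i, j) in edges for (k, l) in edges
            if i < k < j < l]


def all_segmentations(text: str, graph: Dict[int, List[int]],
                      start: int = 0) -> List[List[str]]:
    """Enumerate every path through the graph, i.e. every candidate segmentation."""
    if start == len(text):
        return [[]]
    results: List[List[str]] = []
    for end in graph[start]:
        for rest in all_segmentations(text, graph, end):
            results.append([text[start:end]] + rest)
    return results


if __name__ == "__main__":
    sentence = "研究生命的起源"
    word_graph = build_word_graph(sentence, [CORE_DICT, DOMAIN_DICT])
    print(crossing_ambiguities(sentence, word_graph))
    for candidate in all_segmentations(sentence, word_graph):
        print("/".join(candidate))
```

In this sketch the domain dictionary is simply consulted alongside the core dictionary, so domain terms can be added or swapped without touching the core entries. Choosing among the candidate segmentations is left out; the abstract attributes that step to a hidden Markov model.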

Keywords/Search Tags:

Full-text Retrieval, Lucene, automatic Chinese word segmentation, part-of-speech tagging

Related items

1	The Research On A Lucene-based Full-text Retrieval Model
2	Lucene Chinese Word Segmentation Applied Research, Research Document Full-text Retrieval System
3	Research Of Search Engine Key Technique And Optimize Performance
4	Research On The Methods Of Automatic Correction Of Chinese Word Segmentation And Part-of-Speech Tagging
5	Application Study Of Lucene Full-text Retrieval On The Network Education Platform
6	Research And Application Of Full-text Retrieval Technology Based On Lucene
7	Development And Maintenance Of Full-text Retrieval Web System Based On Lucene
8	Full-text Retrieval Of Distributed Geological Survey Data Based On Lucene
9	Research And Application Of Lucene Full-text Retrieval Technology In Patent Information Service Platform
10	Research On Chinese Word Segmentation And Part-of-speech Tagging Based On Deep Learning Methods