A General-Purpose Modern Chinese Text Processing and Automatic Word Segmentation System for Full-Text Search

Posted on: 2007-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: S He
Full Text: PDF
GTID: 2208360185476761
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
Full-text retrieval is an important information retrieval technology. It is a powerful tool for handling unstructured data and one of the key technologies behind search engines. This thesis studies Chinese full-text retrieval in depth, with particular attention to its application. It also discusses how to use new techniques to optimize the structure of a retrieval system, improve its performance and efficiency, speed up search, and keep pace with the development of the web.

Full-text retrieval is an I/O-intensive application. Earlier systems were built on top of relational databases. This thesis discusses the drawbacks and deficiencies of that approach in light of the characteristics of full-text retrieval. Because a dedicated development platform for full-text retrieval is currently lacking, the thesis introduces Lucene, a full-text search engine toolkit. Lucene is powerful yet small and compact, which makes it convenient to embed in applications. It is now used around the world, and professional companies such as IBM also use its core code. As open-source software, Lucene offers an excellent opportunity to study the key technologies of search engines, and it is worthwhile to analyze it and carry out secondary development on it.

A key step in Chinese information processing is automatic word segmentation and part-of-speech (POS) tagging. Compared with other languages, both tasks pose particular difficulties for Chinese. Specifically, this thesis carries out research in the following respects:

1) We analyze and compare the structure of several commonly used electronic dictionaries, and improve the flexibility and adaptability of our system by implementing a double-dictionary strategy that combines a core dictionary with a specialized (professional) dictionary.
2) We discuss ambiguity in Chinese word segmentation. All phrases with crossing ambiguity and combinatorial ambiguity can be found in a single pass, using a strategy that identifies ambiguous phrases from a directed graph of word segmentation. Using the non-ambiguous form, we can give the correct segmentation result for ambiguous phrases.
3) Based on role tagging, we automatically recognize unknown words, including Chinese personal names, place names, and transliterated foreign names.
4) We analyze and process numeral words and reduplicated compounds.
5) We eliminate ambiguity in segmentation and tagging using a combined automatic Chinese word segmentation system based on a hidden Markov model.
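
The double-dictionary strategy and the directed word segmentation graph mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the dictionary entries, the maximum word length, and the function names `build_word_graph`, `all_segmentations`, and `crossing_pairs` are all assumptions made for illustration. The sketch only shows how a graph built from two dictionaries exposes both kinds of ambiguity; it does not attempt the disambiguation step.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionaries (assumption: the thesis does not list its entries).
CORE_DICT: Set[str] = {"研究", "研究生", "生命", "命", "的", "起源", "全文", "检索"}
DOMAIN_DICT: Set[str] = {"全文检索"}
MAX_WORD_LEN = 4  # assumed upper bound on word length for this toy example


def build_word_graph(text: str, *dictionaries: Set[str]) -> Dict[int, List[int]]:
    """Map each start offset to the end offsets of dictionary words found there.

    A position where no dictionary word starts falls back to a
    single-character edge, so every position has at least one outgoing edge.
    """
    graph: Dict[int, List[int]] = {}
    for start in range(len(text)):
        longest = min(len(text), start + MAX_WORD_LEN)
        ends = [end for end in range(start + 1, longest + 1)
                if any(text[start:end] in d for d in dictionaries)]
        graph[start] = ends or [start + 1]
    return graph


def all_segmentations(text: str, graph: Dict[int, List[int]]) -> List[List[str]]:
    """Enumerate every path from offset 0 to len(text) in the word graph."""
    results: List[List[str]] = []

    def walk(pos: int, path: List[str]) -> None:
        if pos == len(text):
            results.append(path)
            return
        for end in graph[pos]:
            walk(end, path + [text[pos:end]])

    walk(0, [])
    return results


def crossing_pairs(text: str, graph: Dict[int, List[int]]) -> List[Tuple[str, str]]:
    """Return pairs of candidate words that overlap without one containing the other."""
    edges = [(s, e) for s in sorted(graph) for e in graph[s]]
    return [(text[a:b], text[c:d])
            for (a, b) in edges for (c, d) in edges
            if a < c < b < d]
```

With these toy dictionaries, the string 研究生命的起源 yields two paths through the graph (研究 / 生命 / 的 / 起源 and 研究生 / 命 / 的 / 起源), and `crossing_pairs` reports the overlapping candidates 研究生 and 生命 as a crossing ambiguity. The string 全文检索 has no crossing pair but still has two paths, because the specialized dictionary holds the whole string while the core dictionary holds its two parts; that is the combinatorial case.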
Keywords/Search Tags: Full-text Retrieval, Lucene, automatic Chinese word segmentation, part-of-speech tagging