A Study Of Chinese Text Preprocessing For Information Retrieval

Posted on:2009-10-25

Degree:Master

Type:Thesis

Country:China

Candidate:J F He

Full Text:PDF

GTID:2208360245961767

Subject:Computer application technology

Abstract/Summary:

The Web is becoming a global repository of human knowledge and culture, in which ideas and information can be shared on an unprecedented scale. With the rapid spread of the Internet, the volume of information shared online has grown exponentially, and how to handle it has become an important research topic. Information retrieval research helps people find the information they need effectively. Information retrieval covers many kinds of content, of which text is the most common. Text preprocessing is therefore needed to improve query accuracy, system efficiency, and space utilization. This thesis studies text preprocessing techniques in the context of information retrieval.
First, we introduce the relevant text preprocessing techniques, including text representation, Chinese word segmentation, part-of-speech (POS) tagging, and indexing words selection.
Then we study the techniques used in this thesis. Ambiguity is the biggest problem in Chinese word segmentation. About 90 percent of ambiguities can be resolved with grammatical knowledge; only a small share depends on semantics or pragmatics. This thesis therefore combines Chinese word segmentation with POS tagging and uses dynamic programming in the process, which helps resolve ambiguity. Overlapping words and unknown words are then recognized among the fragments produced by the rough segmentation. In the vector space model of information retrieval, a text is represented by a vector composed of words and their weights. It is therefore important both to make the vector express as much of the text's content as possible and to reduce the dimension of the vector space.
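The dynamic-programming segmentation mentioned above can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes a plain unigram model over a tiny hypothetical lexicon (`LEXICON`, with made-up frequencies) and leaves out the POS tagging that the thesis combines with segmentation. The names `segment`, `UNKNOWN_LOG_PROB`, and `MAX_WORD_LEN` are likewise illustrative.

```python
import math

# Hypothetical toy lexicon: word -> relative frequency (illustrative values only).
LEXICON = {
    "研究": 0.02,
    "研究生": 0.01,
    "生命": 0.015,
    "生": 0.003,
    "命": 0.002,
    "的": 0.05,
    "起源": 0.01,
}
UNKNOWN_LOG_PROB = math.log(1e-8)  # assumed penalty for a single unlisted character
MAX_WORD_LEN = 4                   # assumed upper bound on word length


def segment(text):
    """Return the highest-probability segmentation of text under a unigram model."""
    n = len(text)
    # best[i] = (best log-probability of segmenting text[:i], start index of the last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in LEXICON:
                score = math.log(LEXICON[word])
            elif len(word) == 1:
                score = UNKNOWN_LOG_PROB  # fall back to a single unknown character
            else:
                continue  # multi-character strings outside the lexicon are not words
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]


print(segment("研究生命的起源"))
```

With these toy frequencies, the ambiguous prefix is split as 研究 / 生命 rather than 研究生 / 命, because the first split has the higher total log-probability. A system that also scores POS sequences, as the thesis describes, would add a tag dimension to the table.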
To make the term vector both expressive and compact, this thesis proposes an indexing words selection algorithm that takes into account word frequency, word position, and a word's relation to important sentences, among other factors. A worked example demonstrates the validity of the algorithm.
Finally, we use these algorithms to design a preprocessing system for text-based information retrieval. First, the system splits the text into sentences at punctuation marks and assigns each sentence a weight according to its position. Second, it processes each sentence, breaking sentences that contain other punctuation into short sentences and handling that special punctuation with dedicated methods. Third, it segments and POS-tags each short sentence. Fourth, it recognizes overlapping words and unknown words among the fragments produced by the rough segmentation. Finally, it selects the indexing words using the algorithm proposed in this thesis.
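As a rough illustration of position-aware indexing words selection, the sketch below scores each word by summing the weights of the sentences it occurs in, so that frequency and sentence position both contribute. The abstract does not give the thesis's formula, so the weights in `sentence_weight`, the stopword filter, and the function names are all assumptions made for this example.

```python
from collections import defaultdict


def sentence_weight(index, total):
    """Hypothetical position weights: first and last sentences count more."""
    if index == 0:
        return 2.0
    if index == total - 1:
        return 1.5
    return 1.0


def select_index_terms(sentences, stopwords=frozenset(), top_k=3):
    """Rank words by accumulated sentence weight.

    sentences: list of token lists (text already segmented into words).
    Returns the top_k (word, score) pairs, highest score first;
    ties are broken alphabetically by word.
    """
    scores = defaultdict(float)
    total = len(sentences)
    for i, tokens in enumerate(sentences):
        weight = sentence_weight(i, total)
        for token in tokens:
            if token not in stopwords:
                # Each occurrence adds the weight of the sentence it appears in.
                scores[token] += weight
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


doc = [
    ["information", "retrieval", "of", "text"],
    ["text", "preprocessing", "of", "words"],
    ["information", "retrieval", "system"],
]
print(select_index_terms(doc, stopwords={"of"}, top_k=2))
```

Keeping only the top-ranked words is one simple way to reduce the dimension of the term vector; the cutoff `top_k` is an arbitrary choice here.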

Keywords/Search Tags:

Information Retrieval, text preprocessing, Chinese word segmentation, POS tagging, Indexing words selection

Related items

1	Full-text Search For The Modern Chinese Text Processing, Automatic Word Generic System
2	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
3	Chinese Pos Tagging Study
4	Research And Implementation Of Chinese Word Segmentation System For Enterprise Information Retrieval
5	Study On Efficient Indexing For Large Scale Chinese Text Retrieval Systems
6	Study Of Text Categorization And Image Restoration In Modern Information Retrieval
7	Research And Application On Chinese Automatic Word Segmentation In Full Text Retrieval
8	Research On Chinese Word Segmentation Algorithm Based On News Text
9	Lucene Chinese Word Segmentation Applied Research, Research Document Full-text Retrieval System
10	Chinese Word Segmentation Method Based On Dictionary And Statistics Of The Words