
A Study of Chinese Text Preprocessing for Information Retrieval

Posted on: 2009-10-25
Degree: Master
Type: Thesis
Country: China
Candidate: J F He
Full Text: PDF
GTID: 2208360245961767
Subject: Computer application technology
Abstract/Summary:
The Web is becoming a global repository of human knowledge and civilization, in which ideas and information can be shared on an unprecedented scale. With the rapid spread of the Internet, the volume of shared information has grown exponentially, and handling this vast amount of information has become an important research topic. Research in information retrieval helps people find the information they need effectively. Information retrieval covers many kinds of content, of which text is the most common. Text preprocessing is therefore needed to improve query accuracy, system efficiency, and space utilization. This thesis studies text preprocessing techniques in the context of information retrieval. First, we introduce the relevant techniques of text preprocessing, including text representation, Chinese word segmentation, part-of-speech (POS) tagging, and indexing word selection. Then we examine the techniques used in this thesis. Ambiguity is the biggest problem in Chinese word segmentation. About 90 percent of ambiguities can be resolved with grammatical knowledge, and only a small share depends on semantics or pragmatics. This thesis therefore combines Chinese word segmentation with POS tagging and uses dynamic programming in the process, which helps resolve ambiguity. Overlapping words and unknown words are then recognized among the fragments produced by the rough segmentation. In the vector space model of information retrieval, a text is represented by a vector composed of words and their weights. Two questions are therefore important: how to make the vector express more of the text's content, and how to reduce the dimensionality of the vector space.
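The role of dynamic programming in segmentation can be illustrated with a minimal sketch. It is not the thesis's algorithm: the thesis combines segmentation with POS tagging, whereas this example assumes only a toy word-frequency lexicon (`LEXICON`, with made-up values) and picks the segmentation with the highest total log-probability. The names `segment`, `LEXICON`, and `UNKNOWN_LOG_PROB` are hypothetical.

```python
import math

# Hypothetical toy lexicon: word -> relative frequency (illustrative values only).
LEXICON = {
    "研究": 0.020,
    "研究生": 0.005,
    "生命": 0.010,
    "命": 0.001,
    "起源": 0.008,
}
# Assumed penalty for a single character that is not in the lexicon.
UNKNOWN_LOG_PROB = math.log(1e-8)
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(sentence):
    """Return the segmentation with the highest total log-probability."""
    n = len(sentence)
    # best[i] = (score, start): best score for sentence[:i] and the index
    # where the last word of that best segmentation begins.
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = sentence[start:end]
            if word in LEXICON:
                log_p = math.log(LEXICON[word])
            elif len(word) == 1:
                log_p = UNKNOWN_LOG_PROB  # fall back to a single unknown character
            else:
                continue  # multi-character strings outside the lexicon are skipped
            score = best[start][0] + log_p
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    return words[::-1]
```

With this lexicon, `segment("研究生命起源")` yields `["研究", "生命", "起源"]` rather than the competing reading that begins with "研究生", because the first reading has the higher total score. Adding POS information, as the thesis does, would change the scoring function but not the overall table-filling structure.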
To address these questions, this thesis proposes an indexing word selection algorithm that takes into account word frequency, word position, and a word's relation to important sentences, among other factors. An example is given to demonstrate the validity of the algorithm. Finally, we use these algorithms to design a text preprocessing system for information retrieval. First, the system splits the text into sentences at punctuation marks and assigns each sentence a weight according to its position in the text. Second, each sentence is processed: a sentence that contains other punctuation is broken into short sentences, and special punctuation is handled with special methods. Third, each short sentence is segmented and POS-tagged. Fourth, overlapping words and unknown words are recognized among the fragments produced by the rough segmentation. Finally, the indexing words are selected with the algorithm proposed in this thesis.
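The idea of weighting words by frequency and sentence position can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm proposed in the thesis: the weights in `position_weight` are invented, the input is assumed to be already segmented into tokens, and the relation to important sentences is reduced to the sentence weight alone. The function names are hypothetical.

```python
from collections import defaultdict


def position_weight(index, total):
    """Assumed weights: the first and last sentences count more than the rest."""
    if index == 0:
        return 2.0
    if index == total - 1:
        return 1.5
    return 1.0


def select_index_terms(sentences, top_k=3):
    """Rank words by position-weighted frequency.

    sentences: list of token lists, one per (already segmented) sentence.
    Returns the top_k (word, score) pairs.
    """
    scores = defaultdict(float)
    total = len(sentences)
    for i, tokens in enumerate(sentences):
        weight = position_weight(i, total)
        for token in tokens:
            # Each occurrence contributes the weight of its sentence.
            scores[token] += weight
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

Keeping only the top-scoring words is one simple way to reduce the dimensionality of the text vector while retaining the words most likely to reflect its content.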
Keywords/Search Tags:Information Retrieval, text preprocessing, Chinese word segmentation, POS tagging, Indexing words selection