
The Description Of Text's Feature Based On Semanteme Concept

Posted on: 2006-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: G Yu
Full Text: PDF
GTID: 2168360155972929
Subject: Computer application technology
Abstract/Summary:
Text feature description is fundamental to NLP, document categorization and clustering, Chinese information retrieval, personalized services, and related tasks. It is concerned with the methods and models that best represent a document's topic. A feature description should summarize the content of the document from one aspect, and the model should also be easy for a computer to process. Currently, the vector space model (VSM) is widely used. The VSM represents a document by several feature words and their weights. In this model, two factors affect the precision of the description: the choice of feature words and the method used to compute their weights. Most research focuses on these two points, aiming to summarize the documents' topics and reflect their implicit information. Approaches that use statistics and information entropy to select feature words and compute their weights have improved, to some extent, the precision with which the VSM describes a document. However, few methods reflect the semantics of the feature terms. This thesis mainly discusses how to reflect the semantic information of the VSM's terms, from the following two aspects: (I) Considering that context has a great impact on the correct sense of a word, we improve on TF-IDF, the most widely used method for computing term weights. Our method is based on word co-occurrence. It retains the information carried by TF-IDF and also reflects the impact of the specific context on the words' semantics. (II) For comparing the similarity of texts, we abandon purely mathematical methods (e.g. Euclidean distance, the cosine of the angle between vectors, the Bayes algorithm, K-means and so on). Instead, we first compute the similarity between the terms of the two vectors and then compute the maximum-weight matching between the two vectors. Finally, we sum the similarities of the matched pairs, taking the terms' weights into account as well.
The advantage of our method is that it considers the semantics of the terms and avoids the need for explicit disambiguation and normalization. Finally, we construct a classifier to compare our method with others. Our experiments show that our method improves precision and recall to some extent. Although our research is aimed at personalized services, it can also be applied to Chinese information retrieval and NLP.
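
The matching-based comparison in aspect (II) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact scoring formula, so the way term weights enter the score here (product of the two weights, normalized by the vector norms) is an assumption, as are the names `matching_similarity`, `term_sim` and the toy similarity table `SIM`. The matching is found by brute force over permutations, which is only practical for short feature vectors.

```python
import math
from itertools import permutations

# Hypothetical term-to-term similarity table (illustration only, not from the thesis).
SIM = {
    ("car", "automobile"): 0.9,
    ("engine", "motor"): 0.8,
    ("car", "motor"): 0.3,
    ("engine", "automobile"): 0.2,
}

def term_sim(a, b):
    """Similarity of two terms in [0, 1]; identical terms score 1.0."""
    if a == b:
        return 1.0
    return SIM.get((a, b), SIM.get((b, a), 0.0))

def cosine(u, v):
    """Plain cosine similarity between two {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def matching_similarity(u, v, sim):
    """Match terms of u and v so that total term similarity is maximal,
    then sum the matched pairs' similarities, scaled by the term weights.
    The weighting and normalization are assumptions made for this sketch."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    small, large = list(u.items()), list(v.items())
    nu = math.sqrt(sum(w * w for _, w in small))
    nv = math.sqrt(sum(w * w for _, w in large))
    if not nu or not nv:
        return 0.0
    best_pairs, best_total = [], -1.0
    # Brute-force maximum-weight bipartite matching (fine for tiny vectors).
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(i, j, sim(small[i][0], large[j][0])) for i, j in enumerate(perm)]
        total = sum(s for _, _, s in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    score = sum(s * small[i][1] * large[j][1] for i, j, s in best_pairs)
    return score / (nu * nv)

if __name__ == "__main__":
    doc_a = {"car": 0.8, "engine": 0.6}
    doc_b = {"automobile": 0.8, "motor": 0.6}
    print(cosine(doc_a, doc_b))                         # no shared terms
    print(matching_similarity(doc_a, doc_b, term_sim))  # related terms are matched
```

In this toy case the two vectors share no terms, so the cosine is zero, while the matching-based score is positive because related terms are paired. With an exact-match `term_sim`, the score under these assumptions reduces to the ordinary cosine.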
Keywords/Search Tags: feature description, standardization, word disambiguation, term weighting, word co-occurrence, sememe, similarity computing, matching