The Description Of Text's Feature Based On Semanteme Concept

Posted on:2006-07-04

Degree:Master

Type:Thesis

Country:China

Candidate:G Yu

Full Text:PDF

GTID:2168360155972929

Subject:Computer application technology

Abstract/Summary:

Describing the features of a text is fundamental work for natural language processing (NLP), document categorization and clustering, Chinese information retrieval, personalized services, and so on. It focuses on methods and models that represent a document's topic well. A feature description should summarize the content of the document in some respect, and the model should also be easy for a computer to process. Currently, the vector space model (VSM) is widely used. The VSM uses several feature words and their weights to represent a document. In this model, two factors affect the precision of the description: the choice of feature words and the method of computing their weights. Most research focuses on these two points, aiming to summarize the documents' topics and reflect their implicit information. Methods that use statistics and information entropy to choose feature words and compute their weights have improved the VSM's precision in describing documents to some extent, but few methods reflect the semantics of the feature terms. This thesis mainly discusses how to reflect the semantic information of the VSM's terms, from the following two aspects: (I) Considering that context has a great impact on a word's intended meaning, we improve on TF-IDF, the most widely used method for computing term weights. Our method is based on word co-occurrence. It retains the information in TF-IDF and also reflects the impact of the specific context on a word's semantics. (II) For comparing the similarity of texts, we abandon purely mathematical methods (e.g. the Euclidean distance, the cosine of the angle between vectors, the Bayes algorithm, K-means and so on). Instead, we first compute the similarity between the terms of the two vectors and then compute the maximum-weight matching of the two vectors. Finally, we compute the sum of the matched pairs' similarities, taking the terms' weights into account.
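The weighting idea in (I) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formula, so the blend below is an assumption made only for illustration: each term's TF-IDF weight is mixed with the co-occurrence-weighted average TF-IDF of the terms that appear near it. The function names and the `window` and `alpha` parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            t: (c / len(doc)) * math.log(n_docs / df[t])
            for t, c in tf.items()
        })
    return weights


def cooccurrence(doc, window=2):
    """Count how often two distinct terms occur within `window` tokens of each other."""
    counts = defaultdict(Counter)
    for i, t in enumerate(doc):
        for u in doc[i + 1 : i + 1 + window]:
            if u != t:
                counts[t][u] += 1
                counts[u][t] += 1
    return counts


def context_weights(docs, window=2, alpha=0.5):
    """Assumed blend (not the thesis's formula): mix each term's TF-IDF weight
    with the co-occurrence-weighted mean TF-IDF of its neighbouring terms."""
    result = []
    for doc, base in zip(docs, tf_idf(docs)):
        co = cooccurrence(doc, window)
        adjusted = {}
        for t, wt in base.items():
            total = sum(co[t].values())
            context = sum(n * base[u] for u, n in co[t].items()) / total if total else 0.0
            adjusted[t] = (1 - alpha) * wt + alpha * context
        result.append(adjusted)
    return result
```

With `alpha=0.0` the sketch reduces to plain TF-IDF, so `alpha` controls how much the surrounding context is allowed to shift a term's weight.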
The advantage of our method is that it considers the terms' semantics and avoids the need for disambiguation and normalization. Finally, we construct a classifier to compare our method with others. Experiments show that our method improves precision and recall to some extent. Although our research is aimed at personalized services, it can also be applied to Chinese information retrieval and NLP.
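The matching-based comparison in (II) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the term-similarity function `toy_sim` and its lookup table are hypothetical stand-ins for a sememe-based measure, the way weights enter the final score (product of the two weights, normalized by the vector lengths) is assumed, and the matching is found by brute force, which is only practical for a handful of terms.

```python
import math
from itertools import permutations

# Hypothetical similarity table standing in for a sememe-based measure.
SIM_TABLE = {frozenset(["car", "automobile"]): 0.9}


def toy_sim(t, u):
    """Toy term similarity in [0, 1]: 1 for identical terms, else a table lookup."""
    return 1.0 if t == u else SIM_TABLE.get(frozenset([t, u]), 0.0)


def best_matching(terms_a, terms_b, term_sim):
    """Return the one-to-one pairing (term_a, term_b) with the largest total similarity."""
    swapped = len(terms_a) > len(terms_b)
    short, long_ = (terms_b, terms_a) if swapped else (terms_a, terms_b)
    best_pairs, best_total = [], -1.0
    # Brute force: try every assignment of the shorter side to the longer side.
    for chosen in permutations(long_, len(short)):
        total = sum(term_sim(t, u) for t, u in zip(short, chosen))
        if total > best_total:
            best_total, best_pairs = total, list(zip(short, chosen))
    # Restore (term from a, term from b) order if the sides were swapped.
    return [(u, t) for t, u in best_pairs] if swapped else best_pairs


def matching_similarity(vec_a, vec_b, term_sim):
    """Sum the matched pairs' similarities, scaled by both term weights (assumed form)."""
    pairs = best_matching(list(vec_a), list(vec_b), term_sim)
    score = sum(term_sim(t, u) * vec_a[t] * vec_b[u] for t, u in pairs)
    norm = math.sqrt(sum(w * w for w in vec_a.values())) * \
        math.sqrt(sum(w * w for w in vec_b.values()))
    return score / norm if norm else 0.0
```

With an exact-match term similarity this score coincides with the cosine measure, so the semantic table is what lets related but non-identical terms such as "car" and "automobile" contribute. For larger vectors, the brute-force search would normally be replaced by the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.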

Keywords/Search Tags:

feature description, normalization, word sense disambiguation, term weighting, word co-occurrence, sememe, similarity computing, matching

Related items

1	Study On How To Describe Characters Of Web Pages In Personalization Service
2	Context Computing Applications, Word Disambiguation
3	Research On Word Sense Disambiguation Based On GCN Model
4	Computation Of Word Similarity And Its Application In Question Answering System
5	Research Of Word Sense Disambiguation Based On Hybrid Features And Rules
6	Research On Chinese Word Sense Disambiguation Method Based On Graph Model
7	Research On Domain-Specific Term Extraction Based On Semi-Supervised Learning
8	Research On Word Sense Disambiguation And Keyword Expansion In Question Answering System
9	Research On Word Sense Disambiguation Method Based On Word Embedding
10	Research Of Similarity Based On Relative Word Frequency