
Semantic Similarity Calculation Text Field Vector Space Model

Posted on: 2014-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: G Tang
Full Text: PDF
GTID: 2268330401954096
Subject: Computer system architecture
Abstract/Summary:
Psychologically, similarity is the reaction that two objects produce to some stimulus, and the degree of similarity (DOS) quantifies it. In the real world, text is the most important and popular information carrier. Automatic text similarity computation is a key issue in information processing and is widely used in data mining, information retrieval, machine translation, text classification, information filtering, and so on. Because of the complexity and diversity of Chinese, the vector space model (VSM) has some drawbacks in similarity computation, and computing the DOS of Chinese text has become one of the hottest topics in computer science. To improve the performance of similarity computation, we propose an efficient algorithm for calculating the similarity of Chinese text. The work of this thesis is as follows:
1. We study and analyze existing DOS algorithms, especially those based on the vector space model and on ontologies.
2. Based on the vector space model and Chinese semantic analysis, we propose a new algorithm for calculating the DOS of Chinese text that takes both corpus statistics and semantic analysis into account. In the new algorithm, the "semantic topic" is used as the vector space dimension, which integrates semantic factors into the text representation vector. This helps bridge the gap between the VSM text representation and the "real" characteristics of the text.
3. In preprocessing, to reduce the dimensionality of the vectors, we propose a dedicated algorithm to filter synonyms, which makes the computation more efficient.
4. To evaluate the algorithm, we build a text classification system. The classification results show that the new algorithm is more accurate than traditional VSM. All training and test sets come from the Sogou corpus.
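The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea: words are mapped to a shared "semantic topic" label before the vectors are built, so that synonyms fall on the same dimension, and the vectors are then compared by cosine similarity. The `TOPIC_OF` table, the function names, the English tokens (standing in for segmented Chinese text), and the use of raw counts as weights are all assumptions made for illustration, not the thesis's actual method.

```python
import math
from collections import Counter

# Hypothetical synonym table mapping each word to a "semantic topic" label.
# The entries are invented for illustration; a real system would derive
# them from a thesaurus.
TOPIC_OF = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "fast": "SPEED",
    "quick": "SPEED",
}


def topic_vector(tokens):
    """Count occurrences per topic; words not in the table keep their own dimension."""
    return Counter(TOPIC_OF.get(tok, tok) for tok in tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(a, b):
    """Traditional VSM baseline: one dimension per distinct word."""
    return cosine(Counter(a), Counter(b))


def topic_similarity(a, b):
    """Topic-based variant: synonyms are merged into one dimension first."""
    return cosine(topic_vector(a), topic_vector(b))


doc_a = ["fast", "car"]
doc_b = ["quick", "automobile"]
print(word_similarity(doc_a, doc_b))   # no shared words, so the score is zero
print(topic_similarity(doc_a, doc_b))  # identical topic vectors, so the score is close to 1
```

In this toy case the two documents share no surface words, so the word-level vectors are orthogonal, while the topic-level vectors coincide. Merging synonyms also reduces the number of dimensions (four words become two topics here), which is the efficiency effect that item 3 describes.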
Keywords/Search Tags: Text Similarity, Vector Space Modeling, Semantic Analysis, TongYiCiCiLi