A Combined Measure For Text Semantic Similarity

Posted on: 2014-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: H D Li
Full Text: PDF
GTID: 2268330422951617
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of artificial intelligence and natural language processing, text similarity calculation has become a core module of many applications, such as semantic disambiguation, information extraction, information retrieval, text classification, automatic question answering, and data mining. Similarity measures have evolved from word co-occurrence and grammatical structure to semantics, which motivates the search for accurate and efficient semantic similarity techniques. Most existing semantic similarity algorithms are either statistical or rule-based, and operate on ontology dictionaries or other knowledge bases. Rule-based methods usually rely on a dictionary, an ontology tree or graph, or the number of co-occurring attributes, while statistical methods may or may not use a knowledge base. Because a statistical method that uses a knowledge base incorporates more comprehensive knowledge and can reduce knowledge noise, it usually performs better than other existing methods. Nevertheless, because items are unevenly distributed in a knowledge base, the semantic similarity results for low-frequency words are usually poor.

To address this issue, this thesis presents a combined measure for semantic similarity calculation. First, we study existing statistical methods that are based on ontology dictionary rules and corpora, and compare their advantages and disadvantages. We then propose a method that combines rules and statistical measures for word-level semantic similarity calculation; it uses the English and Chinese Wikipedia databases and the HowNet semantic dictionary to build an Explicit Semantic Analysis model. To address the sample imbalance issue, an improved algorithm based on stop word distributions is also proposed. For sentence-level semantic similarity computation, syntactic information, edit distance, and semantic similarity are combined to improve performance.

The combined calculation method proposed in this thesis is verified by experiments on standard English and Chinese corpora, where it achieves the best results among all compared methods. The combined semantic similarity method can be applied directly to applications such as general-purpose automatic question answering systems.
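
As a rough illustration of the sentence-level idea described above, the sketch below blends an edit-distance similarity with a cosine similarity over concept vectors in the style of Explicit Semantic Analysis. It is not the thesis's actual method: the function names, the linear weighting with `alpha`, and the toy concept vectors are all assumptions made for illustration, and the syntactic component mentioned in the abstract is left out.

```python
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to [0, 1]; identical sequences score 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def sentence_vector(tokens, concept_vectors):
    """Sum the per-word concept vectors; unknown words contribute nothing."""
    total = {}
    for token in tokens:
        for concept, weight in concept_vectors.get(token, {}).items():
            total[concept] = total.get(concept, 0.0) + weight
    return total


def combined_similarity(s1, s2, concept_vectors, alpha=0.5):
    """Hypothetical linear blend of semantic and edit-distance similarity."""
    semantic = cosine(sentence_vector(s1, concept_vectors),
                      sentence_vector(s2, concept_vectors))
    return alpha * semantic + (1 - alpha) * edit_similarity(s1, s2)


# Toy concept vectors (made up; a real model would derive them from Wikipedia).
toy_vectors = {
    "cat": {"Animal": 1.0, "Pet": 0.5},
    "dog": {"Animal": 1.0, "Pet": 0.6},
}
score = combined_similarity(["the", "cat"], ["the", "dog"], toy_vectors)
```

In this sketch the edit-distance term rewards shared surface form and word order, while the concept-vector term lets different but related words still contribute to the score; how the thesis actually weights or combines the components is not stated in the abstract.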
Keywords/Search Tags: Semantic similarity, Combination of rule and statistical measure, Stop word, Sentence level semantic similarity