Chinese Text Similarity Research Based On Semantic And Text Structure

Posted on: 2016-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhong
Full Text: PDF
GTID: 2308330461452256
Subject: Software engineering
Abstract/Summary:
With the spread of computers and the rapid development of networks, the number of electronic documents on the Internet is growing at an unprecedented rate, and the way people acquire knowledge has changed profoundly. Faced with this volume of information, finding relevant content quickly has become very important. Effective text similarity computation plays a vital role in information processing and is widely applied in text clustering, information retrieval, question answering, duplicate web page detection, text classification, and many other fields.
At present, most text similarity algorithms are designed for English text. Although they measure English text similarity well, they cannot effectively handle synonymy, polysemy, and other natural language problems in Chinese text. Many Chinese researchers have proposed similarity measures for Chinese text, but problems remain. For example, the context-framework-based method proposed by Jin Yaogong takes more of the text's semantic information into account but ignores its structural information. In addition, characteristics of Chinese text itself, such as the lack of explicit spaces between words, polysemy, synonymy, and mixed tendency, all make Chinese text similarity measurement more difficult.
This thesis studies existing Chinese text similarity methods from the combined perspectives of theory, method, and application, and analyzes their concrete use in text clustering, classification, and retrieval. On this basis it improves the traditional algorithms to raise the effectiveness and accuracy of text similarity computation, so that the similarity between texts can be compared more accurately and related applications receive theoretical and decision support.
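
The synonym problem mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the tokens, the `SYNONYMS` table, and the function name `cosine_similarity` are all hypothetical, and the input is assumed to be already segmented into words, since Chinese text has no explicit word boundaries. A plain bag-of-words cosine measure treats synonyms as unrelated terms, while mapping them to a shared form first recovers the similarity.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, pre-segmented sentences with the same meaning.
doc_a = ["我", "喜欢", "电脑"]    # "I like computers"
doc_b = ["我", "喜爱", "计算机"]  # same meaning, different word choices

# Hypothetical synonym table standing in for a semantic dictionary.
SYNONYMS = {"喜爱": "喜欢", "计算机": "电脑"}

# Surface-form comparison: only "我" is shared.
baseline = cosine_similarity(doc_a, doc_b)

# Comparison after mapping each synonym to a shared form.
doc_b_mapped = [SYNONYMS.get(t, t) for t in doc_b]
with_synonyms = cosine_similarity(doc_a, doc_b_mapped)

print(f"surface forms: {baseline:.3f}, with synonym mapping: {with_synonyms:.3f}")
```

In this toy case the two sentences share only one surface term, so the baseline score is low even though the meaning is the same. A real system would need a much larger semantic resource plus a word segmentation step.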
Based on the characteristics of Chinese text, this thesis proposes two novel text similarity measures: CST-TS, a text similarity algorithm based on concept subtrees, and GM-TS, a text similarity algorithm based on a graph model. CST-TS combines statistical methods with a semantic dictionary. Following the idea of a concept tree, it builds sets of subtrees, finds the matching subtrees between texts, and uses the matched subtrees to measure text similarity. The algorithm also improves the performance of the similarity measure by reducing the dimensionality of the keyword vector space. Although CST-TS improves measurement precision, it does not consider the hierarchical structure of the text itself. GM-TS represents a text as a graph, which retains as much of the text's semantic and structural information as possible. It uses the similarity of graph nodes to ensure accuracy from the semantic perspective, achieving the effect of a semantic-understanding model, and it also takes the structural information of the text into account, which further improves measurement precision. In terms of computational efficiency, because GM-TS must measure similarity over the graph structure, its time complexity is higher than that of the semantic-understanding model, but the increase is not significant, and the gain in measurement precision is clear.
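
The abstract does not give the details of GM-TS, so the following is only a rough sketch of the general idea of comparing texts as graphs, not the thesis's algorithm. It assumes pre-segmented tokens, takes nodes to be terms (optionally mapped to concepts through a hypothetical `concept_of` table), takes edges to be adjacent term pairs, and combines node overlap (semantics) with edge overlap (structure) using an arbitrary weight `alpha`. All names and the Jaccard-based scoring are illustrative assumptions.

```python
def build_graph(tokens, concept_of=None):
    """Return (nodes, edges) for a token sequence.

    Nodes are concepts; edges link concepts that appear next to each other.
    `concept_of` is a hypothetical term-to-concept mapping (may be empty).
    """
    concept_of = concept_of or {}
    concepts = [concept_of.get(t, t) for t in tokens]
    nodes = set(concepts)
    # Undirected edges between consecutive, distinct concepts.
    edges = {frozenset((a, b)) for a, b in zip(concepts, concepts[1:]) if a != b}
    return nodes, edges


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def graph_similarity(tokens_a, tokens_b, concept_of=None, alpha=0.5):
    """Weighted mix of node overlap (semantic) and edge overlap (structural)."""
    nodes_a, edges_a = build_graph(tokens_a, concept_of)
    nodes_b, edges_b = build_graph(tokens_b, concept_of)
    return alpha * jaccard(nodes_a, nodes_b) + (1 - alpha) * jaccard(edges_a, edges_b)


# Hypothetical pre-segmented texts: same terms, different order.
text_a = ["文本", "相似度", "计算"]
text_b = ["计算", "文本", "相似度"]

same_order = graph_similarity(text_a, text_a)
reordered = graph_similarity(text_a, text_b)
print(f"identical: {same_order:.3f}, reordered: {reordered:.3f}")
```

Because the two example texts contain the same terms, a pure bag-of-words measure would score them as identical. The edge term lowers the score when the arrangement differs, which is one simple way structural information can enter the measure.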
Keywords/Search Tags: Text Similarity, Information Retrieval, Concept Subtree, Graph Model, Characteristic Words