Research on WordNet-Based Chinese-English Cross-Language Text Similarity Measurement

Posted on: 2012-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: W L He
GTID: 2218330362959375
Subject: Communication and Information System
Abstract/Summary:
Text similarity measurement has long been an important topic in natural language processing. It is widely used in information retrieval, text mining, plagiarism detection, and related tasks. Its essence is to quantify the similarity between two texts. Most existing research focuses on similarity measurement for monolingual texts. However, with the rapid development of the Internet, we face real-time information in many languages from all over the world. Cross-language similarity search and cross-language plagiarism detection have therefore become highly valued by research organizations. With cross-language similarity search, a text in one language can be used to retrieve related texts in other languages. With cross-language plagiarism detection, we can judge whether an article is likely a translation of another article written in a different language. The technology underlying these applications is cross-language text similarity measurement.

This paper focuses on cross-language text similarity measurement, namely algorithms that quantify the similarity between texts in different languages. Most existing cross-language similarity algorithms cannot perform well on accuracy, efficiency, availability, and extensibility at the same time. We propose a novel approach to cross-language similarity measurement.

The innovations of this paper are as follows:

1) We built a language-independent semantic mid-layer based on WordNet and implemented a semantic hashing of nouns on that mid-layer. The semantic hashing preserves the semantic distance property of words: it guarantees a positive correlation between the difference of the semantic hash values of two words and the semantic distance between them. By projecting different languages onto a unified semantic mid-layer, we can convert the preprocessed texts of different languages into feature sequences of semantic hashes. This makes it easy to calculate cross-language text similarity on the mid-layer.

2) We propose a novel approach to noun sense disambiguation based on concept correlation, which is applied when obtaining the feature sequences of texts. Unlike existing algorithms, we extend the notion of semantic distance by defining a semantic density for a group of word senses, which quantifies the correlation among the senses in the group. We disambiguate noun senses after converting this correlation into semantic density. In addition, with the help of the semantic hashing mentioned above, we greatly reduce the time complexity of calculating semantic density and of the whole disambiguation algorithm.

3) We propose a feature filtering algorithm based on sense frequency. We approximate sense frequency by sense depth in WordNet. With the help of semantic hashing, the filtering is implemented as an efficient bitwise operation on the semantic hashes, which rules out high-frequency features.

Finally, we evaluated the proposed algorithms with experiments on Chinese and English texts. The design of our algorithm gives it good availability and scalability, so it is not difficult to apply it to other languages that have a WordNet, although we carried out experiments only on Chinese and English. We tested our noun sense disambiguation algorithm on SemCor, and the results show that its accuracy compares well with other non-statistical approaches. We also tested our cross-language similarity measurement algorithm on a small Chinese-English parallel corpus that we built ourselves. The approach shows quite good accuracy: it yields an accuracy of 71.7% for the first 10 items in cross-language similarity search. The effectiveness of the proposed feature filtering algorithm is also verified by experiments.
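
The abstract does not spell out how the semantic hash is constructed, so the following Python sketch is only one plausible reading, not the thesis's actual method. It assumes a prefix-code hash over a toy hypernym tree: each level of the tree contributes a fixed-width child index, so the tree distance between two senses can be recovered from their hashes alone, and shallow (general) senses can be dropped with a single bitwise mask. The tree `PARENT`, the constants `BITS_PER_LEVEL` and `MAX_DEPTH`, and all function names are hypothetical and chosen for illustration.

```python
# Toy hypernym tree (hypothetical data, not real WordNet): child -> parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
    "puppy": "dog",
}

BITS_PER_LEVEL = 4  # assumption: at most 15 children per node
MAX_DEPTH = 4       # assumption: at most 4 levels below the root
LEVEL_MASK = (1 << BITS_PER_LEVEL) - 1


def build_hashes(parent):
    """Assign each node a prefix code: one non-zero child index per level."""
    children = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)

    hashes = {}

    def visit(node, prefix, depth):
        # Left-align the path code so unused deeper levels are zero.
        hashes[node] = prefix << (BITS_PER_LEVEL * (MAX_DEPTH - depth))
        for index, child in enumerate(sorted(children.get(node, [])), start=1):
            visit(child, (prefix << BITS_PER_LEVEL) | index, depth + 1)

    for root in children[None]:
        visit(root, 0, 0)
    return hashes


def level_codes(h):
    """Split a hash into its per-level child indices, top level first."""
    return [
        (h >> (BITS_PER_LEVEL * (MAX_DEPTH - 1 - i))) & LEVEL_MASK
        for i in range(MAX_DEPTH)
    ]


def tree_distance(h1, h2):
    """Path length between two nodes, computed from their hashes only."""
    c1, c2 = level_codes(h1), level_codes(h2)
    depth1 = sum(1 for c in c1 if c)
    depth2 = sum(1 for c in c2 if c)
    common = 0
    for a, b in zip(c1, c2):
        if a == 0 or a != b:
            break
        common += 1
    return depth1 + depth2 - 2 * common


def is_too_general(h, min_depth):
    """True if the node is shallower than min_depth (one bitwise AND)."""
    low_mask = (1 << (BITS_PER_LEVEL * (MAX_DEPTH - min_depth + 1))) - 1
    return (h & low_mask) == 0


HASHES = build_hashes(PARENT)

# Keep only senses at depth 2 or deeper, treating shallow senses as
# high-frequency features (an approximation, as in the abstract).
KEPT = sorted(w for w, h in HASHES.items() if not is_too_general(h, 2))
```

In this toy setting, nodes that share a longer path prefix have hashes that agree in more high-order bits, which is one way a hash difference could correlate with semantic distance. A real system would derive the tree from WordNet hypernym relations and would need to handle senses with multiple hypernyms, which this sketch ignores.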
Keywords/Search Tags: text similarity measurement, cross-language similarity measurement, word sense disambiguation, noun sense disambiguation