Research Of Cross-Language Text Correlation Detection Technology

Posted on: 2015-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: Z Peng
GTID: 2298330434954131
Subject: Information and Communication Engineering
Text similarity detection has long been an important task in natural language processing (NLP). Research on monolingual text similarity detection algorithms has matured, but with the development of cross-language information retrieval and increasingly close international academic exchange, the number of cross-language similar texts keeps growing. Measuring the correlation between cross-language texts has therefore become particularly important.

In this thesis we summarize the methods of monolingual text similarity detection, and we study cross-language information retrieval and several popular algorithms for cross-language text correlation detection. Given that cross-language plagiarism in our country occurs mainly between English and Chinese, our main work is as follows.

First, because cross-language text correlation detection relies on machine translation to retrieve the candidate set of similar texts quickly, we explore the feasibility of using mainstream machine translation tools in English-Chinese text correlation detection. We run machine translation at different text granularities and analyze the experimental results for each. Based on these results, we propose a binary text and sentence-based algorithm that achieves appropriate precision and recall and is also more efficient. Finally, we develop a system that retrieves the candidate set quickly by combining this algorithm with Minwise Hash.

Second, because the performance of the cross-language text correlation detection algorithm CL-ESA depends mainly on the index document collection, we propose a clustering-based algorithm to assist in building the index documents. The algorithm applies clustering while the index document collection is being built, so that the documents are more distinct from one another and of higher quality. Experimental results show that our algorithm not only raises the recall of CL-ESA but also improves its running time.
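The abstract names Minwise Hash as the component used to retrieve the candidate set quickly, but gives no implementation details. The following is a minimal Python sketch of the general MinHash technique only, not the thesis's system. It assumes the texts have already been machine-translated into one language and can be split on whitespace; the function names, the 3-word shingle size, the 64 hash functions, and the use of seeded BLAKE2b hashes are all illustrative assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a whitespace-tokenized text.

    Assumption: the text is already translated into a whitespace-delimited
    language, so simple splitting is an acceptable tokenizer.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _seeded_hash(item, seed):
    """Hash a shingle to a 64-bit integer; each seed acts as one hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """Return the MinHash signature: the minimum hash value per hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical inputs: a source sentence and a translated suspicious sentence.
    source = "text similarity detection is an important task in language processing"
    translated = "text similarity detection is an important problem in language processing"
    sa, sb = shingles(source), shingles(translated)
    print("exact:", exact_jaccard(sa, sb))
    print("estimated:", estimated_jaccard(minhash_signature(sa), minhash_signature(sb)))
```

Because signatures have a fixed length, comparing two documents costs the same regardless of document size, which is what makes this kind of sketch useful for a fast candidate filter ahead of a more expensive detailed comparison.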
Keywords: cross-language correlation detection, machine translation, CL-ESA algorithm, text clustering