Building Comparable Corpora Based On Cross-language Text Similarity Metrics

Posted on: 2017-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: Bah Mamadou Oury
GTID: 2348330488985687
Subject: Computer application technology
Abstract/Summary:
Comparable corpora play an increasingly important role in Natural Language Processing. However, there is no universally accepted definition of cross-language document similarity, which makes it difficult to design evaluation schemes for related research. Building on a stacked Auto-Encoder neural network and on Machine Translation, this paper implements two models for computing cross-language text similarity.

One of the main difficulties of cross-language processing is that the source language and the target language occupy different language spaces. Traditional information retrieval models usually convert a document into a high-dimensional vector under certain assumptions, so that computing text similarity reduces to computing vector similarity. In the cross-language setting, however, the language gap means that vectors derived from texts in different languages lie in different vector spaces. Computing the similarity of vectors from two different vector spaces under a uniform index therefore becomes an important task.

To solve this problem, this paper presents two ways of mapping two different language spaces into a unified language space and then applies a traditional similarity measure to the resulting vectors, which yields the similarity of the cross-language texts. The two methods are based on machine translation and on the SAE (Stacked Auto-Encoder), respectively. The machine translation method transforms texts in different languages into the same language, thereby unifying the language space. The SAE-based approach transforms the source-language text and its cross-language counterpart into an intermediate language space, which likewise unifies the language space.

Experimental results show that the SAE-based approach is less sensitive to changes in language and uses fewer resources, whereas the accuracy of the Machine Translation based method depends on the underlying Machine Translation system and is greatly affected when the language changes. Finally, on the basis of these two models, this paper implements a complete system for computing cross-language text similarity.
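
The translation-based route described above (map both texts into one language, vectorize, then compare vectors) can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy lexicon stands in for a real machine translation system, the bag-of-words representation and cosine measure are assumed choices for the "traditional similarity computing method", and every name below is hypothetical. The SAE-based route would replace the translation step with a learned mapping into an intermediate space and is not shown.

```python
import math
from collections import Counter

# Hypothetical toy French-to-English lexicon standing in for a real
# machine translation system (an assumption for illustration only).
TOY_LEXICON = {
    "le": "the", "chat": "cat", "mange": "eats", "poisson": "fish",
}

def translate(tokens, lexicon):
    """Map each source token into the target language; unknown tokens pass through."""
    return [lexicon.get(tok, tok) for tok in tokens]

def bag_of_words(tokens):
    """Represent a document as a sparse term-frequency vector."""
    return Counter(tokens)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

def cross_language_similarity(source_text, target_text, lexicon=TOY_LEXICON):
    """Translate the source text, then compare both texts in the target-language space."""
    src_tokens = translate(source_text.lower().split(), lexicon)
    tgt_tokens = target_text.lower().split()
    return cosine_similarity(bag_of_words(src_tokens), bag_of_words(tgt_tokens))
```

Without the translation step, the French and English sentences share no vocabulary and their vectors score zero, which is the language gap the abstract refers to. After translation both texts live in the same vocabulary space and the ordinary cosine measure applies.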
Keywords/Search Tags: Machine Translation, Stacked Auto-Encoder, Information Retrieval, Cross-language Document Similarity