Applied Research Of Chinese-Korean Cross-Language Text Similarity Calculation

Posted on: 2022-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: L J Li
Full Text: PDF
GTID: 2518306338956229
Subject: Computer technology
Abstract/Summary:
Cross-language text similarity measurement is of great significance in multilingual natural language processing. With the development of information technology and artificial intelligence, the expansion of information resources has been accompanied by a diversification of the languages in which those resources are written. The Chinese nation consists of 56 ethnic groups, and ethnic minority languages contribute to the diversity of written language in China. A large volume of text in ethnic minority languages has entered the Internet, enriching the diversity of Internet resources. To manage, mine, and use minority-language resources effectively and to break down cross-language barriers, cross-language text similarity measurement has become a fundamental topic in multilingual text information processing.

This dissertation studies a cross-language text similarity measurement method based on a Chinese-Korean parallel corpus. Building on text representations derived from cross-language word embeddings, it uses the co-occurrence correlation between terms in the two languages to capture the connection between them and applies that correlation to cross-language text similarity calculation. First, nearly 30,000 abstracts of Chinese-Korean scientific and technological literature were collected. After processing, 160,000 sentence-aligned Chinese-Korean pairs were obtained, and word alignment information was extracted from the sentence pairs, yielding a parallel corpus that is sentence-aligned in form and word-aligned in content for training bilingual word embedding models. Second, a bilingual word embedding model was trained on this corpus, giving word representations of the two languages in the same embedding space. Text vectors were obtained by TF-IDF weighting of the word vectors, and vector-based cross-language text similarity was computed with cosine similarity. In addition, using the co-occurrence of Chinese and Korean terms in the parallel corpus, a method was proposed to measure the correlation strength of bilingual feature terms in the sense of co-occurrence. On this basis, a co-occurrence correlation model was designed, and an improved cross-language text similarity method was constructed by combining it with the similarity based on vector representation.

Finally, a prototype cross-language text retrieval system was designed and implemented on the Django framework. The system consists of three parts: a text retrieval module, a back-end management module, and a database module. In testing, every module functioned as intended. Cross-language retrieval tests indicate that the proposed method improves on the method based on text vector representation alone by 9% and performs well in cross-language text feature representation and text similarity measurement. The prototype can carry out Chinese-Korean cross-language text retrieval.
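The similarity pipeline described in the abstract can be illustrated with a small sketch. It is not the thesis implementation. The TF-IDF weighted text vector and the cosine similarity follow the abstract. The abstract does not state the co-occurrence measure or how the two scores are combined, so the Dice coefficient, the best-match averaging in `cooccurrence_score`, and the interpolation weight `alpha` are assumptions made only for illustration. The embedding values and token lists are toy data, and all function names are hypothetical.

```python
import math
from collections import Counter

# Toy shared-space embeddings (made-up values; a real model would be
# trained on the Chinese-Korean parallel corpus).
EMB = {
    "网络": [0.90, 0.10, 0.00], "检索": [0.10, 0.90, 0.10],
    "네트워크": [0.88, 0.12, 0.02], "검색": [0.12, 0.85, 0.10],
    "요리": [0.00, 0.10, 0.95],
}

def idf(term, docs):
    """Smoothed inverse document frequency over a list of token lists."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def text_vector(tokens, docs, emb):
    """TF-IDF weighted average of the word vectors of in-vocabulary tokens."""
    tf = Counter(t for t in tokens if t in emb)
    vec = [0.0] * len(next(iter(emb.values())))
    total = 0.0
    for term, count in tf.items():
        w = (count / len(tokens)) * idf(term, docs)
        total += w
        for i, x in enumerate(emb[term]):
            vec[i] += w * x
    return [v / total for v in vec] if total else vec

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def dice_table(pairs):
    """ASSUMPTION: Dice coefficient as the co-occurrence strength of each
    (zh, ko) term pair, counted over aligned sentence pairs."""
    zh_n, ko_n, joint = Counter(), Counter(), Counter()
    for zh_sent, ko_sent in pairs:
        zh_set, ko_set = set(zh_sent), set(ko_sent)
        zh_n.update(zh_set)
        ko_n.update(ko_set)
        joint.update((z, k) for z in zh_set for k in ko_set)
    return {(z, k): 2.0 * c / (zh_n[z] + ko_n[k]) for (z, k), c in joint.items()}

def cooccurrence_score(zh_tokens, ko_tokens, dice):
    """ASSUMPTION: mean, over Chinese terms, of the best Dice match in the Korean text."""
    zh_terms, ko_terms = set(zh_tokens), set(ko_tokens)
    if not zh_terms or not ko_terms:
        return 0.0
    best = [max(dice.get((z, k), 0.0) for k in ko_terms) for z in zh_terms]
    return sum(best) / len(best)

def combined_similarity(zh_tokens, ko_tokens, zh_docs, ko_docs, emb, dice, alpha=0.5):
    """ASSUMPTION: linear interpolation of the vector and co-occurrence scores."""
    vec_sim = cosine(text_vector(zh_tokens, zh_docs, emb),
                     text_vector(ko_tokens, ko_docs, emb))
    return alpha * vec_sim + (1 - alpha) * cooccurrence_score(zh_tokens, ko_tokens, dice)

# Toy parallel corpus of pre-tokenized sentence pairs (Chinese, Korean).
pairs = [(["网络", "检索"], ["네트워크", "검색"]), (["网络"], ["네트워크"]), (["检索"], ["검색"])]
zh_docs = [zh for zh, _ in pairs]
ko_docs = [ko for _, ko in pairs] + [["요리"]]
dice = dice_table(pairs)

query = ["网络", "检索"]
related = combined_similarity(query, ["네트워크", "검색"], zh_docs, ko_docs, EMB, dice)
unrelated = combined_similarity(query, ["요리"], zh_docs, ko_docs, EMB, dice)
print(f"related={related:.3f} unrelated={unrelated:.3f}")
```

In this toy setup the Korean text that shares aligned terms with the Chinese query scores higher than the unrelated one. With real data, tokenization (Chinese word segmentation, Korean morphological analysis) and the value of `alpha` would need to be chosen and tuned separately.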
Keywords/Search Tags:Chinese-Korean Cross-language text similarity, parallel corpus, bilingual word embedding, bilingual word co-occurrence, Chinese-Korean cross-language text retrieval