
Research On Topic Discovery And Topic Representation Methods For Chinese-Vietnamese Bilingual News

Posted on: 2023-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: L J Xia
GTID: 2568306797982629
Subject: Software engineering
Abstract/Summary:
Against the background of the Belt and Road Initiative, exchanges between China and Vietnam are increasingly close, and the two countries share a growing number of news topics of common concern. When a news event occurs, the news media of both countries cover many of these shared topics. A timely understanding of these topics and their main content is of great practical value for promoting exchange and cooperation between China and Vietnam. However, there is a semantic gap between Chinese and Vietnamese, and the complex correlations across languages and documents make it difficult to generate concise and correct topic representations. This thesis therefore focuses on the tasks of topic discovery and topic representation for Chinese-Vietnamese bilingual news. The specific research contents are as follows:

(1) A topic task corpus of Chinese-Vietnamese news is constructed. Chinese-Vietnamese news data is the basis and premise of the related research, but there are currently no public datasets suitable for Chinese-Vietnamese topic discovery and topic representation tasks. This thesis therefore collects relevant Chinese and Vietnamese news data from major information websites using web crawlers. In line with the research tasks of this thesis, labeling rules are formulated and 10 topic sets are constructed, including 7664 Chinese texts, 5184 Vietnamese texts, and a total of 12858 news texts. On this basis, a Chinese-Vietnamese bilingual news topic representation dataset is annotated for the 10 topic sets, yielding a Chinese-Vietnamese bilingual news topic task corpus that supports the follow-up research in this thesis.

(2) A method for topic discovery in Chinese-Vietnamese bilingual news based on generative adversarial networks is proposed. The Chinese-Vietnamese bilingual news topic discovery task aims to group Chinese and Vietnamese news texts that describe the same topic into the same topic cluster. However, Vietnamese is a low-resource language: Chinese-Vietnamese bilingual dictionaries and aligned parallel sentences are scarce, and existing multilingual pre-trained language models generalize poorly to low-resource languages and do not perform well on downstream tasks such as news topic analysis. To address this problem, this thesis proposes a topic discovery method for Chinese-Vietnamese bilingual news based on generative adversarial networks (GANs). The method incorporates bilingual topic information and uses topic-constrained Chinese-Vietnamese bilingual word vectors, pre-trained with a GAN, to map Chinese and Vietnamese news texts into the same semantic space. The K-means clustering algorithm is then applied to the texts in both languages so that news texts describing the same topic are clustered together, which realizes the discovery of Chinese-Vietnamese bilingual news topics. Experiments show that the proposed method improves the F1 score over the baseline model by 4% on average.

(3) A method for Chinese-Vietnamese bilingual news topic representation based on heterogeneous graphs is proposed. Chinese-Vietnamese bilingual news topic representation aims to generate concise sentences that correctly describe a topic from the Chinese and Vietnamese news texts covering that topic. It can be regarded as a text generation task in a multilingual, multi-document scenario. Because the relationships among multiple languages and multiple documents are complex, existing topic generation models struggle to model the relationships between multilingual texts, so the generated topic representation deviates from the topic described in the original texts. To address this problem, this thesis proposes a multilingual topic representation method that models the complex relationships among multilingual, multi-document texts through heterogeneous graphs and incorporates topic knowledge. The method first represents the Chinese-Vietnamese bilingual news texts as a heterogeneous graph containing sentence nodes and entity nodes, and models the complex associations among languages and documents with a graph attention network (GAT). A topic encoder then encodes the topic words into clues for generating the topic representation, and constraints are incorporated on the decoder side to generate correct topic representations. Experimental results show that the ROUGE score of the proposed method is up to 6% higher than that of the baseline methods.

(4) A prototype system for Chinese-Vietnamese bilingual news topic discovery and topic representation is designed and implemented. The prototype system integrates the research work above and implements three functional modules: news collection, hot topics, and news retrieval. It aims to provide users with a visual platform for understanding hot news topics of common concern to China and Vietnam more quickly.
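The clustering step of the topic discovery method in (2) can be illustrated with a minimal sketch. It assumes that documents in both languages have already been tokenized and that a shared cross-lingual word-vector table is available. The `WORD_VECTORS` table, its toy two-dimensional values, the averaging in `embed`, and the farthest-point initialization in `kmeans` are all illustrative assumptions, not the thesis's implementation; the GAN-based, topic-constrained pre-training of the word vectors is not reproduced here.

```python
import math

# Hypothetical toy cross-lingual word vectors (assumption for illustration).
# In the thesis, such vectors come from topic-constrained GAN pre-training.
WORD_VECTORS = {
    "贸易": [0.90, 0.10], "经济": [0.80, 0.20],           # Chinese: trade, economy
    "thương_mại": [0.85, 0.20], "kinh_tế": [0.80, 0.25],  # Vietnamese: trade, economy
    "足球": [0.10, 0.90], "比赛": [0.20, 0.80],           # Chinese: football, match
    "bóng_đá": [0.15, 0.85], "trận_đấu": [0.25, 0.80],    # Vietnamese: football, match
}

def embed(tokens):
    """Represent a document as the average of its known word vectors."""
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vecs:
        raise ValueError("no known tokens in document")
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def kmeans(points, k, iters=20):
    """Plain K-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

docs = [
    ["贸易", "经济"],           # Chinese news about trade
    ["thương_mại", "kinh_tế"],  # Vietnamese news about trade
    ["足球", "比赛"],           # Chinese news about football
    ["bóng_đá", "trận_đấu"],    # Vietnamese news about football
]
labels = kmeans([embed(d) for d in docs], k=2)
```

Because both languages share one vector space in this sketch, the Chinese and Vietnamese trade documents receive the same cluster label, as do the two football documents. The quality of such clusters depends entirely on how well the bilingual word vectors are aligned, which is the part the thesis addresses with GAN pre-training.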
Keywords/Search Tags: News text, Cross-language, Topic discovery, Topic representation, Chinese-Vietnamese bilingual