With the popularization of Internet applications and the deepening of international cooperation, people hope that the information they obtain is no longer limited to their own language. Traditional monolingual information retrieval and text classification methods can no longer meet these retrieval needs, and cross-language text classification emerged to address this problem. Cross-language text classification aims to classify multilingual text and to share training data between different languages. Existing cross-language methods focus on language-space conversion, such as the parallel-corpus, machine-translation, dictionary, and word-embedding approaches, but these suffer from topic drift, translation noise, polysemy, and other issues. In view of these shortcomings, this thesis conducts in-depth research on multilingual representation, proposes a multilingual segment representation model and a cross-language text classification model based on segment semantics, and verifies the correctness and effectiveness of the models through experiments. The main work and innovations of the thesis are as follows:

(1) To resolve the incompatibility of feature spaces across languages, a multilingual segment representation model based on fine-tuning a pre-trained language model is proposed. Most existing text representation models are trained on a single language, while universal language models pre-trained on multilingual corpora lack knowledge of topic-specific semantics and of the semantic correspondence between languages. To this end, based on the general language model and the characteristics of the cross-language text classification problem, this thesis proposes four fine-tuning methods: translation language modeling, monolingual text classification, mixed-language text classification, and cross-language sentence-pair classification. These enable the model to learn the correspondence between semantically equivalent content in different languages and to blur the differences and boundaries between languages.

(2) A cross-language text classification model based on segment semantics is proposed. Existing text classification mostly operates at the word level, which is limited by the correctness of word segmentation and by the polysemy of words. To this end, this thesis proposes a text classification model based on segment semantics. The model first divides the text into segments and uses the multilingual segment representation model to generate segment embedding vectors; it then uses a bidirectional LSTM network to extract the bidirectional sequence features of the text. Based on the attention mechanism, a segment semantic weight calculation method is proposed: all segments are semantically weighted to obtain the segment-level semantic features of the text. Finally, these three features of the text are combined for cross-language text classification.

(3) Two Chinese-English bilingual parallel corpora with different alignment levels were constructed, and three experiments were conducted on them to verify the effectiveness of the proposed models. The experimental results show that the fine-tuned multilingual segment representation model improves multilingual classification to a certain extent, and that the segment-semantics classification model performs well on both multilingual and cross-language text classification tasks. It reaches 90.35% accuracy on the multilingual classification task and 91.18% on the cross-language classification task.
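The cross-language sentence-pair classification fine-tuning in (1) can be illustrated by how its training data might be built from a parallel corpus. This is a minimal sketch under assumed conventions, not the thesis procedure: aligned pairs are labeled positive, and mismatched pairs sampled from the same corpus are labeled negative. The function name and the labeling scheme are hypothetical.

```python
import random

def build_sentence_pairs(parallel, seed=0):
    """Build sentence-pair classification examples from a parallel
    corpus [(src, tgt), ...]: aligned pairs get label 1, and each
    source is also paired with a randomly chosen wrong target (label 0).
    """
    rng = random.Random(seed)
    targets = [t for _, t in parallel]
    pos = [(s, t, 1) for s, t in parallel]          # semantically aligned
    neg = []
    for s, t in parallel:
        wrong = rng.choice([x for x in targets if x != t])
        neg.append((s, wrong, 0))                   # deliberately mismatched
    return pos + neg

# toy Chinese-English parallel corpus (illustrative sentences only)
parallel = [("你好，世界", "hello, world"),
            ("早上好", "good morning"),
            ("晚安", "good night")]
pairs = build_sentence_pairs(parallel)
```

Training a multilingual encoder to separate label 1 from label 0 pairs pushes semantically equivalent sentences in the two languages toward the same region of the representation space.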
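The segment semantic weighting step in (2) can be sketched as an attention pooling over segment embedding vectors. The following numpy illustration uses random stand-ins for the learned projection `W`, bias `b`, and query vector `u`, and assumed dimensions; it shows only the shape of the computation, not the thesis implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(segment_embs, W, b, u):
    """Compute attention weights over segments and return the
    weighted sum as a document-level semantic feature.

    segment_embs: (n_segments, d) segment embedding matrix
    W, b, u:      learned projection, bias, and query (stand-ins here)
    """
    proj = np.tanh(segment_embs @ W + b)   # (n, d_a) projected segments
    scores = proj @ u                      # (n,) relevance score per segment
    alpha = softmax(scores)                # attention weights, sum to 1
    return alpha, alpha @ segment_embs     # (d,) weighted document vector

rng = np.random.default_rng(0)
n_segments, d, d_a = 4, 8, 6               # assumed toy dimensions
segs = rng.normal(size=(n_segments, d))
alpha, doc_vec = attention_pool(segs,
                                rng.normal(size=(d, d_a)),
                                rng.normal(size=d_a),
                                rng.normal(size=d_a))
```

In the full model this pooled vector would be combined with the forward and backward sequence features produced by the bidirectional LSTM before the final classification layer.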