| Natural language processing (NLP) uses computers to process the semantic information carried by text. However, computers cannot understand natural language the way humans do. Before a computer can handle it, natural language must first be expressed in a mathematical form that captures basic semantics. Because it is the input to many high-level NLP tasks, text representation is one of the most important basic NLP tasks. The mathematical tools used to represent natural language are language models. Among the various forms of language model, text embedding models represent language units of different granularity as fixed-length vectors of continuous real numbers. Human languages are numerous and usually use different scripts. We expect computers not only to handle text in a single language, but also to understand text in multiple languages and to find the semantic similarity between texts in different languages. The topic of this thesis came from requirements in my actual work: cross-language similar keyword recommendation, text retrieval, and classification. The pain point of these tasks is that traditional retrieval technology can only retrieve texts containing the exact keywords of the user query; it cannot automatically expand to related or synonymous keywords, nor can it handle expressions of the same thing in different languages. These application scenarios can all be reduced to one problem: using multilingual text representations to capture semantic similarity. Semantic similarity between words can be used to recommend similar keywords, and semantic similarity between long texts such as paragraphs and chapters can be used for retrieval and classification. Existing research on this subject has some problems: most of it focuses on bilingual models and relies heavily on high-quality parallel corpora. These shortcomings make existing methods difficult to put into practice in actual production. To meet these 
applications and solve the existing problems, this thesis proposes a method that unifies multiple pairs of bilingual parallel corpora into the same semantic space, thereby making them comparable. The general approach of this thesis is to first obtain several monolingual models, one per language, and then merge them through parallel corpora. In this process, two models are obtained: a multilingual embedding model based on a pseudo-monolingual corpus and a multilingual embedding model based on multiple pairs of bilingual parallel corpora. The method also handles vocabulary that falls outside the bilingual corpus dictionary (out-of-vocabulary, OOV, words). Finally, this thesis trains two multilingual embedding models, verifies them experimentally, and applies them to the production scenario originally proposed. In the experiments, cross-language word similarity recommendation reaches an accuracy of over 63.5%. The model trained on the trilingual parallel corpus reaches an average of 69%, and the highest accuracy for bilingual similarity recommendation is 85.7%. Using the word embedding model trained in this thesis in a multi-label IPC classification test on patent texts, classification at the IPC section level reached a precision of 78.3%, a recall of 63.6%, and an F value of 70.2%; classification at the IPC subclass level reached a precision of 65.6%, a recall of 29.7%, and an F value of 40.9%. When this embedding model is used for indexing and compared with the traditional retrieval tool Solr, the recall is comparable, and the recall of the text vectors in this thesis is better than that of Solr retrieval in chemistry and chemical engineering, a relatively large subfield in the sample. These data show that the method used in this thesis is feasible and effective. Its application in actual projects also demonstrates the usability of the method described in this thesis. |
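
As a minimal illustration of two ideas in the abstract, the sketch below shows (a) how similar keywords could be recommended across languages once all words live in one shared embedding space, using cosine similarity, and (b) how the reported F values relate to the reported classification rate (read here as precision) and recall through the harmonic mean. The vocabulary, the 3-dimensional vectors, and the function names `recommend` and `f_measure` are hypothetical and chosen only for illustration; they are not the thesis model, its data, or its training procedure.

```python
import numpy as np


def recommend(word, vocab, matrix, k=2):
    """Return the k words closest to `word` by cosine similarity.

    Assumes every row of `matrix` is the embedding of the word at the
    same index in `vocab`, and that all languages share one space.
    """
    idx = vocab.index(word)
    # Normalise rows so that a dot product equals cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[idx]
    # Sort by descending similarity and skip the query word itself.
    order = [i for i in np.argsort(-scores) if i != idx][:k]
    return [(vocab[i], float(scores[i])) for i in order]


def f_measure(precision, recall):
    """Balanced F value: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy vectors (NOT taken from the thesis).
vocab = ["en:battery", "de:Batterie", "fr:batterie", "en:polymer", "de:Polymer"]
matrix = np.array([
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.02],
    [0.85, 0.15, 0.05],
    [0.10, 0.90, 0.10],
    [0.12, 0.88, 0.08],
])

neighbours = recommend("en:battery", vocab, matrix, k=2)
section_f = f_measure(0.783, 0.636)   # section-level figures from the abstract
subclass_f = f_measure(0.656, 0.297)  # subclass-level figures from the abstract
```

With these toy vectors, the two nearest neighbours of `en:battery` are its German and French counterparts rather than the polymer terms. The two `f_measure` calls give roughly 0.702 and 0.409, which agrees with the F values of 70.2% and 40.9% stated in the abstract.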