Research And Application Of Multilingual Text Embedding Model

Posted on:2021-03-04

Degree:Master

Type:Thesis

Country:China

Candidate:Z Ren

GTID:2518306503499394

Subject:Computer technology

Abstract/Summary:

Natural language processing (NLP) uses computers to process the semantic information carried by text. Computers, however, cannot understand natural language the way humans do: before a computer can handle even basic semantics, the text must first be expressed in mathematical form. Text representation serves as the input to many higher-level NLP tasks, which makes it one of the most important foundational tasks in the field. The mathematical tools used to represent natural language are language models. Among the various forms of language model, text embedding models represent language units of different granularity as fixed-length vectors of continuous real numbers.

Human languages are highly diverse and often use different scripts. We expect computers not only to handle text in a single language, but also to understand text in multiple languages and to measure the semantic similarity between texts written in different languages. The topic of this thesis comes from requirements in my own work: cross-language similar-keyword recommendation, text retrieval, and text classification. The pain point in these tasks is that traditional retrieval technology only retrieves texts containing the exact keywords of the user's query. It cannot automatically expand the query to related or synonymous keywords, nor can it handle expressions of the same concept in different languages. These application scenarios all reduce to one problem: using multilingual text representations to capture semantic similarity. Semantic similarity between words can be used to recommend similar keywords, and semantic similarity between longer texts such as paragraphs and chapters can be used for retrieval and classification.

Existing research on this subject has two main shortcomings: most of it focuses on bilingual models, and it relies heavily on high-quality parallel corpora. These shortcomings make existing methods difficult to put into practice in production. To meet these application needs and address these problems, this thesis proposes a method that unifies multiple pairs of bilingual parallel corpora in a single semantic space, so that their representations become comparable. The general approach is to first obtain several monolingual models, one per language, and then merge them using parallel corpora. This process yields two models: a multilingual embedding model based on a pseudo-monolingual corpus, and a multilingual embedding model based on multiple pairs of bilingual parallel corpora. The approach also handles out-of-vocabulary (OOV) words that do not appear in the bilingual corpora.

Finally, the thesis trains these two multilingual embedding models, verifies them experimentally, and applies them to the production scenarios described above. In the experiments, cross-language similar-word recommendation reaches an accuracy of over 63.5%. The model trained on the trilingual parallel corpus reaches an average of 69%, and the highest accuracy for bilingual similarity recommendation is 85.7%. In a multi-label IPC classification test on patent texts using the word embedding model trained in this thesis, classification at the IPC section level reaches a precision of 78.3%, a recall of 63.6%, and an F-value of 70.2%; at the IPC subclass level, precision is 65.6%, recall is 29.7%, and the F-value is 40.9%. When the embedding model is used for indexing and compared with the traditional retrieval tool Solr, recall is comparable overall, and the text-vector recall is better than Solr's in chemistry and chemical engineering, a relatively large subfield in the sample. These results show that the method is feasible and effective, and its use in real projects further demonstrates its practicality.
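
The F-values reported above for the IPC classification test can be cross-checked against the figures quoted alongside them. The sketch below assumes that F denotes the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly, and that the first figure of each triple (78.3 and 65.6) is a precision. The function name `f1` and the dictionary layout are illustrative and are not taken from the thesis.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F %) as quoted in the abstract.
# Assumption: the first value of each triple is a precision.
reported = {
    "IPC top level": (78.3, 63.6, 70.2),
    "IPC subclass": (65.6, 29.7, 40.9),
}

for level, (p, r, f_reported) in reported.items():
    print(f"{level}: computed F1 = {f1(p, r):.1f}%, reported F = {f_reported}%")
```

Under these assumptions, both computed values round to the reported figures, which is consistent with reading the three numbers at each level as precision, recall, and F1.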

Keywords/Search Tags:

Multi-language text embedding, sentence-level parallel corpus, skip-gram

Related items

1	Unsupervised Extractive Text Summarization Using Sentence Embedding
2	Parallel Corpus Mining System Based On Cross-lingual Sentence Embedding
3	Research On Jointly Learning Word Embeddings And Latent Topics In Text
4	Research On Sentence Alignment Method Based On Cross-lingual Word Embeddings
5	Research On The Construction Of Ancient English Parallel Corpus Based On Multi-Level Automatic Alignment
6	Improved Sentence Embedding Based On BERT And Prompt-learning
7	Research On Language Identification Of Social Media Short Text Based On N-Gram Vector Feature
8	A Study And Implementation Of Document Clustering Based On Word Embedding
9	Web-oriented Multilingual Parallel Sentence Pairs Mining Techniques
10	Language Independent Text Categorization