The Key Technologies Of Representation Of Tibetan Word Vector

Posted on: 2019-07-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z J Cai
Full Text: PDF
GTID: 1368330578464336
Subject: Computer Science and Technology
Abstract/Summary:
The vector representation of language units is fundamental work in machine learning. Its goal is to represent language units in an optimized vector form so that computers can understand natural language better. In recent years, with the development of neural network technology, vector representation has come to play an important role in natural language processing. Representations of words, sentences and documents in English and Chinese have achieved fruitful results and are widely used. Vector representation of Tibetan language units, by contrast, is still at an exploratory, initial stage. Research on it has theoretical significance and wide application value, both for analyzing the features of the Tibetan language and for applying deep learning techniques to Tibetan.

This dissertation draws on word vector representation techniques for English and Chinese and combines them with the characteristics of Tibetan to study the key technologies of Tibetan word vectors: Tibetan character component decomposition, Tibetan text segmentation, Tibetan word vector evaluation, and Tibetan word vector representation. The main work includes:

(1) Decomposition of Tibetan character components

Components are the smallest units of the Tibetan language and carry a wealth of meaning information, but Tibetan text is input into the computer as whole characters. To get at the meaning of a component, the whole character must be decomposed into its components. This dissertation summarizes the structure of Tibetan characters and the principles of character formation, and classifies Tibetan character forms. On this basis, a decomposition model and algorithm for Tibetan character components are designed. Taking a statistical analysis of the distribution of Tibetan character forms as an example, the validity of the component decomposition algorithm is verified, and the general distribution law of Tibetan character forms is obtained.

(2) Tibetan text segmentation

Words are the most basic processing unit in natural language processing. Tibetan text is a sequence of characters with no separation mark between words, so text segmentation is needed. After analyzing the current state of Tibetan text segmentation and its main problems, this dissertation proposes rule-based text and block segmentation schemes. For sentence segmentation, a Tibetan sentence block segmentation algorithm based on a critical library is designed. For block segmentation, the construction method of the main dictionary library is analyzed, and the following algorithms are designed: an index query algorithm, a compact word recognition and restoration algorithm, a multi-strategy tightening lattice recognition algorithm, a package algorithm for unregistered word recognition, and a local high-frequency-word-priority algorithm for ambiguity resolution.

(3) Tibetan word vector evaluation

The goal of word vector evaluation is to assess the performance of a word vector model. It includes internal task evaluation and external task evaluation. Internal task evaluation assesses vector models with word similarity, relevance, and analogy evaluation sets, and is the most widely used method for word vector evaluation. Since the study of Tibetan word vectors is in its infancy, there has been no evaluation set for assessing the performance of Tibetan word vectors. This dissertation draws on the construction methods of English and Chinese word vector evaluation sets and designs a construction scheme for similarity and relevance evaluation sets for Tibetan word vectors. Based on this scheme, the Tibetan word similarity evaluation set TWordSim215 and the relevance evaluation set TWordRel215 are established, and their validity is verified.

(4) Tibetan word vector representation

Traditional neural network models treat words as atomic objects and build word representations from contextual information. Fusing sub-word-level information captures the meaning of words better. Considering the characteristics of Tibetan, this dissertation proposes a component-based Tibetan word vector model and a Tibetan word vector representation model that fuses component and character information. The component-based model builds vectors from the component information of characters and words. It better reflects the positional features of components and the rules for adding components, and it has achieved good results in spelling checking of Tibetan characters. The model that fuses component and character information incorporates the positional information of components and characters into the word vector representation, and so improves significantly on traditional methods.
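
The internal evaluation described in (3) can be illustrated with a minimal sketch. It assumes the convention commonly used with WordSim-style similarity sets: score each word pair by the cosine similarity of its vectors, then report the Spearman rank correlation between model scores and human scores. The abstract does not state which correlation measure the dissertation uses, so this choice is an assumption. The function names, the placeholder tokens, the toy vectors, and the human scores below are hypothetical and are not taken from TWordSim215.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_similarity(vectors, eval_set):
    """Correlate model similarities with human scores.

    Pairs containing a word that has no vector are skipped.
    """
    model_scores, human_scores = [], []
    for word_a, word_b, human in eval_set:
        if word_a in vectors and word_b in vectors:
            model_scores.append(cosine(vectors[word_a], vectors[word_b]))
            human_scores.append(human)
    return spearman(model_scores, human_scores)


# Hypothetical toy data: placeholder tokens, not real evaluation entries.
toy_vectors = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
    "w4": [0.5, 0.5],
}
toy_eval_set = [("w1", "w2", 9.0), ("w1", "w4", 5.0), ("w1", "w3", 1.0)]

score = evaluate_similarity(toy_vectors, toy_eval_set)
print(f"Spearman correlation on toy data: {score:.3f}")
```

On this toy data the model ranks the three pairs in the same order as the human scores, so the correlation is close to 1. A rank correlation is a natural fit here because human ratings and cosine values live on different scales, and only the ordering of pairs is compared.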
Keywords/Search Tags:Natural Language Processing, Neural Network, Tibetan, Distributed Representation, Word Vector