
Study on Efficient Representation Learning Algorithms of Natural Language

Posted on: 2022-02-12   Degree: Master   Type: Thesis
Country: China   Candidate: H H Song   Full Text: PDF
GTID: 2518306536963699   Subject: Computer Science and Technology
Abstract/Summary:
The development of natural language processing technology has profoundly changed the design of intelligent software, e.g., search engines and machine translation, and representation learning is an important driving force behind it. According to the semantic level expressed and the learning algorithm used, representations fall into two classes: basic representations and compound representations. The representative basic representations are word representations, which are regarded as the smallest semantic units; in general, they are generated directly from large-scale unlabeled corpora. Compound representations are learned from basic representations and express higher-level semantics; sentence representations belong to this class. Currently, the high-dimensional floating-point representations generated with deep learning techniques perform well in many tasks. However, given the need to run intelligent applications in resource-constrained environments, general basic representations show shortcomings such as large memory footprints and difficulty of application, and as the scale of data grows, compound representations exhibit low time efficiency in tasks such as clustering. It is therefore valuable to study efficient learning algorithms for both basic and compound representations. This thesis takes words and sentences as its objects of study and proposes two efficient representation learning algorithms, summarized as follows.

First, to address the inefficiency of word representations, this thesis proposes an efficient word representation learning algorithm based on right triangle similarity transformations (RTST). RTST first samples orthogonal vector pairs from the original word representation matrix and feeds them into a Siamese neural network that reduces their dimensionality. Subtracting the two vectors of each pair, both before and after dimensionality reduction, yields hypotenuse vectors that, together with the orthogonal pairs, form right triangles. Minimizing the mean squared error between the cosine angles of these triangles yields a neural-network-based similarity transformation that keeps the local space unchanged and preserves the order of the vector norms. In addition, the network's activation function is used to guide the binarization of the vectors. Extensive experimental results show that RTST and its components are effective: visual analysis qualitatively shows that the generated word representations exhibit patterns, and efficiency analysis quantitatively shows that the learned representations achieve high space utilization. A minimal sketch of this procedure is given below.
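The following Python (PyTorch) sketch illustrates the RTST idea under stated assumptions: the abstract does not specify the network architecture, the orthogonal-pair sampling strategy, or the exact form of the loss, so the single shared linear layer, the tanh activation (which pushes outputs toward +/-1 to guide binarization), and names such as SiameseReducer and rtst_loss are hypothetical.

import torch
import torch.nn as nn

class SiameseReducer(nn.Module):
    # Shared-weight encoder mapping high-dimensional word vectors to a low
    # dimension; tanh pushes outputs toward +/-1 to guide binarization.
    # (The actual architecture is not specified in the abstract.)
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.tanh(self.fc(x))

def cos_angles(a, b):
    # Cosines of the two non-right angles of the triangle whose legs are the
    # (near-)orthogonal pair a, b and whose hypotenuse vector is h = a - b.
    h = a - b
    cos_at_a = torch.cosine_similarity(a, h, dim=-1)   # angle at the tip of a
    cos_at_b = torch.cosine_similarity(b, -h, dim=-1)  # angle at the tip of b
    return cos_at_a, cos_at_b

def rtst_loss(a, b, za, zb):
    # Mean squared error between the triangle's cosine angles before (a, b)
    # and after (za, zb) dimensionality reduction.
    ca, cb = cos_angles(a, b)
    cza, czb = cos_angles(za, zb)
    return ((ca - cza) ** 2 + (cb - czb) ** 2).mean()

# Hypothetical training loop over sampled orthogonal pairs (a, b):
#   model = SiameseReducer(in_dim=300, out_dim=64)
#   opt = torch.optim.Adam(model.parameters(), lr=1e-3)
#   for a, b in orthogonal_pair_batches:  # sampling strategy not specified
#       loss = rtst_loss(a, b, model(a), model(b))
#       opt.zero_grad(); loss.backward(); opt.step()
# Binary codes are then obtained as torch.sign(model(word_vectors)).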
Second, to address the inefficiency of sentence representations, this thesis proposes an efficient sentence representation learning algorithm based on "construction-decomposition" (C-D). Anchor vectors are a central concept in C-D and are expected to be consistent, unbiased, and complete. C-D is a batch learning algorithm: it first calculates the similarity matrix between a batch of sentence representations and the anchor vectors, and then decomposes this matrix to learn efficient sentence representations, with the anchor vectors processed by PCA serving as the fixed factor of the similarity matrix. The decomposition is solved by a discrete coordinate descent algorithm: the two-valued optimal solution for each dimension is derived mathematically, then one value is updated while the others are held fixed, cycling through all dimensions. This thesis studies and analyzes the algorithm both theoretically and experimentally. Results show that the algorithm and its components are effective: a case study qualitatively evaluates sentences and concludes that the learned representations encode rich semantic information, and the efficiency analysis quantitatively shows that the generated sentence representations achieve high space utilization and fast speed. A minimal sketch of the construction and decomposition steps follows.
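The NumPy sketch below illustrates the construction and decomposition steps under assumptions: cosine similarity for the construction step, a plain SVD-based PCA of the anchors, and binary codes in {-1, +1} fitted by minimizing ||S - B P||_F^2 are all illustrative choices. The abstract confirms only the overall structure (a similarity matrix, a fixed PCA-processed anchor factor, and discrete coordinate descent with a two-valued per-dimension optimum); the function names are hypothetical.

import numpy as np

def construct_similarity(X, anchors):
    # Construction step: cosine similarity between a batch of sentence
    # vectors X (n x d) and anchor vectors (m x d). Cosine is an
    # illustrative choice; the abstract does not name the measure.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    An = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return Xn @ An.T                                   # S: (n, m)

def pca_anchors(anchors, r):
    # Project the anchors to r dimensions with PCA; the transposed result
    # P (r x m) serves as the fixed factor of the decomposition S ~ B @ P.
    C = anchors - anchors.mean(axis=0)
    _, _, vt = np.linalg.svd(C, full_matrices=False)
    return (C @ vt[:r].T).T                            # P: (r, m)

def decompose(S, P, n_sweeps=5):
    # Decomposition step: fit binary codes B in {-1, +1}^(n x r) with P
    # fixed, minimizing ||S - B @ P||_F^2 by discrete coordinate descent.
    # For dimension k, with residual R = S - sum_{j != k} outer(B[:, j], P[j]),
    # the two-valued optimum is B[:, k] = sign(R @ P[k]).
    n, r = S.shape[0], P.shape[0]
    B = np.sign(np.random.randn(n, r))                 # random +/-1 init
    for _ in range(n_sweeps):                          # cyclic sweeps
        for k in range(r):
            R = S - B @ P + np.outer(B[:, k], P[k])    # residual without dim k
            b = np.sign(R @ P[k])
            b[b == 0] = 1                              # break ties away from 0
            B[:, k] = b
    return B

# Hypothetical usage for one batch:
#   S = construct_similarity(X_batch, anchors)
#   P = pca_anchors(anchors, r=64)
#   B = decompose(S, P)   # efficient binary sentence codes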
Keywords/Search Tags: Word Representations, Sentence Representations, Manifold Learning, Siamese Neural Networks, Anchor Vectors