
Research On Chinese Word Segmentation Method Based On Word Embedding

Posted on: 2018-01-06  Degree: Master  Type: Thesis
Country: China  Candidate: L Liu  Full Text: PDF
GTID: 2348330542487331  Subject: Computer Science and Technology
Abstract/Summary:
Growing demand for computer automation has driven the emergence of artificial intelligence. Natural language processing is one of the most important research areas in artificial intelligence, and much remains to be done on the efficiency and accuracy of processing natural language. Its fundamental first step is word segmentation. Among the world's mainstream natural languages, Chinese, unlike languages such as English, is written as sequences of characters without word delimiters, which makes word processing and recognition difficult. Because the language combines a long history with contemporary usage, ambiguity and obscurity are common. Nonetheless, the efficiency and accuracy of segmentation can still be improved. Traditional word segmentation methods require feature templates to be designed manually, which costs a lot of time, and the word representations used in the segmentation model are high-dimensional, which causes the curse of dimensionality. In addition, traditional word representation methods ignore the influence and relations between words and suffer from the word gap problem, so the effectiveness of the segmentation model is relatively low. Word embedding methods based on neural language models can combat the curse of dimensionality and reduce the time spent on manual labeling, but their effectiveness and accuracy can still be improved.

After reviewing the history and development of research in China and abroad and studying the relevant theory, this thesis focuses on the following work. For Chinese character representation, it proposes a word embedding training method that incorporates Chinese character-root information. The method uses Chinese character morphology and the semantic features carried by character roots, combines them with context information, and produces a set of word embeddings. A neural network architecture is used to train these embeddings, mapping large amounts of data into a lower-dimensional space. The thesis compares traditional word segmentation models with neural-network-based models, examines the advantages of using word embeddings, and on that basis sets its research goal. The proposed algorithm targets the difficulties of Chinese word segmentation: it uses a neural network to train distributed representations of words by mapping a large amount of data into a vector space of fixed dimensionality. The distributed representations can significantly reduce the computation time of the previous algorithm and improve efficiency. The trained representations are then used for the subsequent word segmentation. A segmentation architecture is constructed to compute the matching probabilities of words and produce the segmentation result.

The architecture is trained and then evaluated on a test set. Finally, the experimental results are compared and analyzed to verify the feasibility and superiority of the algorithm.
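The contrast between high-dimensional traditional representations and fixed-dimensionality embeddings that include character-root information can be illustrated with a small sketch. This is not the thesis's training method, which the abstract does not specify. The toy character-to-root table, the embedding size `DIM`, the vocabulary size `ASSUMED_VOCAB_SIZE`, the random initialisation, and the choice to combine character and root vectors by addition are all assumptions made for illustration.

```python
import random

# Hypothetical toy data: each character paired with an assumed character root.
CHAR_TO_ROOT = {"河": "氵", "湖": "氵", "松": "木", "林": "木"}

DIM = 4                    # assumed embedding size (fixed, independent of vocabulary)
ASSUMED_VOCAB_SIZE = 5000  # hypothetical vocabulary size for the one-hot baseline


def one_hot(index, size):
    """Sparse baseline: the vector length grows with the vocabulary."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def make_table(keys, dim, rng):
    """Randomly initialised lookup table; real training would adjust these values."""
    return {k: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for k in keys}


def embed(char, char_table, root_table):
    """Dense vector for a character: its own vector plus its root's vector.

    Addition is only one simple way to combine the two (an assumption here).
    """
    char_vec = char_table[char]
    root_vec = root_table[CHAR_TO_ROOT[char]]
    return [c + r for c, r in zip(char_vec, root_vec)]


rng = random.Random(0)
char_table = make_table(sorted(CHAR_TO_ROOT), DIM, rng)
root_table = make_table(sorted(set(CHAR_TO_ROOT.values())), DIM, rng)

sparse = one_hot(2, ASSUMED_VOCAB_SIZE)
dense = embed("河", char_table, root_table)
print(len(sparse), len(dense))
```

In this sketch, characters that share a root, such as 河 and 湖, share one component of their vectors, which is one way root information could tie related characters together before any context-based training is applied.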
Keywords/Search Tags: Chinese word segmentation, Statistical language model, Neural network, Word embedding