Research On Word-vector-representation-based New Word Discovery And Named Entity Recognition

Posted on: 2018-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Du
Full Text: PDF
GTID: 2348330512483237
Subject: Engineering
Abstract/Summary:
Mining structured data is relatively mature, but mining unstructured data still faces many challenges. Text is an important kind of unstructured data, and text mining faces a series of problems of its own, including Chinese word segmentation, named entity recognition, entity relation extraction, semantic understanding, and sentiment analysis. Word segmentation is the basic step in Chinese text mining. Because people constantly create new words, and no lexicon can include them all, new words cause segmentation errors, and such errors often lead to incorrect named entity recognition. New word recognition has therefore become a difficult bottleneck problem in text mining.

In recent years, word vector representations obtained by training neural network language models have proved able to capture semantic relationships between words. Inspired by this, this thesis introduces word vector representations into Chinese new word discovery and proposes an unsupervised new word discovery method that combines word vectors with n-grams. First, we train a neural network language model and map each word to a high-dimensional space. We compare the effect of word vectors obtained from the Skip-gram and CBOW models on the new word results and find that the Skip-gram model achieves better results. Second, if several adjacent words often appear together in different word sequences, they are likely to be related; based on association rule mining, we design an efficient n-gram mining algorithm. Third, we use the extracted n-grams as candidate strings and prune them with the trained word vectors to obtain the new words. We design the pruning algorithm, compare the effect of different vector similarity measures on the final results, and find that cosine similarity performs best. We also compare the method with other new word discovery methods, and the comparison confirms its validity. Finally, as a further step, we use a conditional random field to classify the discovered new words and obtain named entities.

The main contributions of this thesis are twofold. First, it introduces word vectors into Chinese new word recognition and combines word vectors with n-grams to propose a new unsupervised new word recognition algorithm. Second, it applies a conditional random field to the discovered new words to identify named entities, offering a new approach to named entity recognition.
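The candidate-then-prune pipeline described in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the abstract does not give the mining or pruning rules, so plain frequency counting stands in for the association-rule-based n-gram mining, and the pruning rule (keep a candidate only if every pair of adjacent words has cosine similarity at or above a threshold) is an assumption. The function names, the `min_count` and `threshold` parameters, the romanized toy tokens, and the two-dimensional toy vectors are all hypothetical; in practice the vectors would come from a trained Skip-gram or CBOW model.

```python
from collections import Counter
from math import sqrt


def mine_ngrams(sentences, n=2, min_count=2):
    """Count contiguous word n-grams over segmented sentences and keep
    those seen at least min_count times (a simple stand-in for the
    association-rule-based mining mentioned in the abstract)."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_candidates(candidates, vectors, threshold=0.5):
    """Assumed pruning rule: keep a candidate only if all of its words have
    vectors and every adjacent pair is at least `threshold` similar."""
    kept = []
    for gram in candidates:
        if any(word not in vectors for word in gram):
            continue
        sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
        if min(sims) >= threshold:
            kept.append("".join(gram))
    return sorted(kept)


# Toy, pre-segmented sentences (romanized tokens, purely illustrative).
sentences = [
    ["wo", "yong", "wei", "xin"],
    ["wei", "xin", "hen", "hao"],
    ["wo", "hen", "hao"],
]

# Hypothetical 2-d word vectors standing in for trained embeddings.
vectors = {
    "wei": [1.0, 0.1],
    "xin": [0.9, 0.2],
    "hen": [1.0, 0.0],
    "hao": [0.0, 1.0],
}

candidates = mine_ngrams(sentences, n=2, min_count=2)
new_words = prune_candidates(candidates, vectors, threshold=0.5)
print(new_words)
```

In this toy run, both `("wei", "xin")` and `("hen", "hao")` are frequent enough to become candidates, but only the first survives pruning because the toy vectors for `hen` and `hao` are orthogonal. Swapping `cosine` for another similarity measure is the kind of comparison the abstract reports.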
Keywords/Search Tags:natural language processing, new word discovery, word vector, named entity recognition, n-gram