Research On Word-vector-representation-based New Word Discovery And Named Entity Recognition

Posted on:2018-03-12

Degree:Master

Type:Thesis

Country:China

Candidate:Y Du

Full Text:PDF

GTID:2348330512483237

Subject:Engineering

Abstract/Summary:

Mining structured data is a relatively mature area of data mining, but mining and analyzing unstructured data still faces many challenges. Text is an important kind of unstructured data, and text mining involves a series of problems such as Chinese word segmentation, named entity recognition, entity relation extraction, semantic understanding, and sentiment analysis. Word segmentation is the basic step in Chinese text mining. Because people constantly create new words and these cannot all be included in a lexicon, segmentation errors occur, and such errors often lead to incorrect named entity recognition. New word recognition has therefore become a difficult bottleneck problem in text mining.

In recent years, word vector representations obtained by training neural network language models have proved able to capture the semantic relationships between words well. Inspired by this, this thesis introduces word vector representations into Chinese new word discovery and proposes an unsupervised new word discovery method that combines word vectors with n-grams. First, we train a neural network language model and map words to a high-dimensional space. We compare the effect of word vectors obtained from the Skip-gram model and the CBOW model on the new word results and find that the Skip-gram model achieves better results. Second, if several adjacent words often appear together in different word sequences, they must be related; based on an association rule algorithm, we design an efficient n-gram mining algorithm. Third, we use the extracted n-grams as candidate strings and prune them with the trained word vectors to obtain the new words. We design the pruning algorithm and compare the effect of different vector similarity measures on the final results, finding that cosine similarity performs best. We also compare the method with other new word discovery methods and confirm its validity. Finally, we carry out some further work on the discovered new words: we use a conditional random field to classify them and obtain named entities.

The main contributions of this thesis are twofold. First, it introduces word vectors into Chinese new word recognition and combines word vectors with n-grams to propose a new unsupervised new word recognition algorithm. Second, it applies a conditional random field to the discovered new words to identify named entities, which offers a new approach to named entity recognition.
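The abstract does not spell out the mining or pruning rules, so the following is only a minimal sketch of the general idea under stated assumptions, not the thesis's algorithm. It assumes the text is already segmented into tokens, mines frequent n-grams in an Apriori-like way (an n-gram is counted only if both of its (n-1)-gram parts are frequent), and keeps a candidate only if every pair of adjacent tokens has cosine similarity above a threshold. The function names, the thresholds, and the hand-written two-dimensional vectors (standing in for trained Skip-gram embeddings) are all hypothetical.

```python
import math
from collections import Counter


def frequent_ngrams(sentences, max_n=3, min_count=2):
    """Apriori-style mining: count an n-gram only if both of its
    (n-1)-gram sub-sequences were frequent in the previous pass."""
    frequent = {}
    prev = None  # frequent (n-1)-grams from the previous pass
    for n in range(1, max_n + 1):
        counts = Counter()
        for tokens in sentences:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if prev is None or (gram[:-1] in prev and gram[1:] in prev):
                    counts[gram] += 1
        prev = {g for g, c in counts.items() if c >= min_count}
        if not prev:
            break
        if n > 1:  # single tokens are not new-word candidates
            frequent.update({g: counts[g] for g in prev})
    return frequent


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune(candidates, vectors, threshold=0.5):
    """Assumed pruning rule: keep a candidate if every adjacent token
    pair is at least `threshold` similar; skip out-of-vocabulary ones."""
    kept = []
    for gram in candidates:
        if any(tok not in vectors for tok in gram):
            continue
        sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
        if min(sims) >= threshold:
            kept.append("".join(gram))
    return kept


# Toy segmented corpus in which the unseen word "微信" was split in two.
sentences = [
    ["我", "用", "微", "信", "聊天"],
    ["微", "信", "很", "方便"],
    ["我", "很", "喜欢", "聊天"],
    ["我", "很", "忙"],
]
# Hand-written stand-ins for trained word vectors (illustrative only).
vectors = {
    "微": [1.0, 0.1],
    "信": [0.9, 0.2],
    "我": [0.1, 1.0],
    "很": [1.0, -0.1],
}

candidates = frequent_ngrams(sentences)
new_words = prune(candidates, vectors)
print(new_words)
```

On this toy corpus both "微 信" and "我 很" are frequent bigrams, but only the first survives pruning, because the illustrative vectors for "我" and "很" are orthogonal. With real data the vectors would come from a trained model, and the threshold would need tuning.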

Keywords/Search Tags:

natural language processing, new word discovery, word vector, named entity recognition, n-gram

Related items

1	Study On Chinese Named Entity Recognition
2	Research On Named Entity Recognition For Science And Technology Terms Based On Dependent Entity Word Vector
3	Research On Chinese Named Entity Recognition Based On Deep Learning
4	Research On Named Entity Recognition Based On Neural Network Ensemble
5	Research On Entity Extraction In Signal Processing Based On Dependency Word Vector
6	A Study On Chinese Named Entity Recognition
7	Research On Chinese Named Entity Recognition Based On Feature Enhancement
8	Research On Jointly Learning Word Embeddings And Latent Topics In Text
9	Research And Implementation Of Mining Bilingual Named Entities From Large-Scale Web Pages
10	Domain Adaptation Research And Application Of Named Entity Recognition