
Research On Named Entity Recognition For Chinese Weibo Text

Posted on: 2022-11-01   Degree: Master   Type: Thesis
Country: China   Candidate: K Liu   Full Text: PDF
GTID: 2518306752454304   Subject: Master of Engineering
Abstract/Summary:
Named entity recognition (NER) is the task of identifying entities with specific meanings in text. In the Internet era, social media platforms such as Weibo continuously generate large amounts of text, and applying NER to this text has considerable practical value. NER for Chinese Weibo text still faces several difficulties:

1. Weibo text contains irregular wording and many colloquialisms, and recognition accuracy on such text remains very low.
2. Compared with English, Chinese words lack obvious boundary information, so Chinese NER usually relies on character-level embeddings. As a result, a wealth of word-level vocabulary information goes unused.
3. Traditional NER methods parallelize poorly and train slowly on massive microblog data.

To address these problems, this thesis proposes a text normalization method suited to Weibo text and an NER model that incorporates dictionary information. The main work and contributions are as follows:

1. To address the many irregular words in Weibo text, a method for identifying non-standard words based on multiple statistics is proposed. Word vectors are generated with distributed word-vector techniques, and the cosine distance between word vectors is used as the similarity criterion to construct a normalization dictionary. Non-standard words are then replaced to normalize the text. Experiments show that this method effectively identifies and replaces non-standard words in Weibo text, which improves NER accuracy: under the BiLSTM-CRF framework, recognition accuracy on the normalized Weibo text increases by 5.2%.
2. To address the inability of traditional NER methods to use vocabulary information, the SoftLexicon dictionary integration method is applied on top of the BERT pre-trained model to incorporate vocabulary information, and different dictionary integration methods are compared experimentally. Results on the Weibo dataset show that, compared with the Lattice-LSTM model, this improved dictionary fusion method raises the F1 value by 0.64% and improves training efficiency by more than 50%.
3. To address the low training speed and complex internal structure of traditional NER methods, a bidirectional QRNN network is used in place of the BiLSTM network to parallelize feature extraction. Experiments show that, compared with the BiLSTM-CRF network, the BiQRNN-CRF network shortens training time by more than 50% while achieving the same recognition accuracy.
4. To compensate for the accuracy loss caused by fixed BERT parameters, and to let the model attend to contextual semantics during sequence labeling, a self-attention mechanism is added after the neural network layer to assign greater weight to the words that matter most for label prediction. Experiments on the Weibo dataset show that adding a self-attention layer improves recognition performance by 0.62% when the rest of the model structure is exactly the same.
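The dictionary-construction step in the first contribution can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the similarity threshold of 0.8, and the toy three-dimensional vectors are all assumptions made for illustration, and the list of non-standard words is taken as given rather than detected with the multiple-statistics method the abstract mentions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_normalization_dict(nonstandard, standard, vectors, threshold=0.8):
    """Map each non-standard word to its most similar standard word.

    A pair is kept only if its cosine similarity reaches `threshold`
    (an assumed cut-off; the abstract does not state one).
    """
    mapping = {}
    for word in nonstandard:
        if word not in vectors:
            continue
        best, best_sim = None, threshold
        for cand in standard:
            if cand not in vectors:
                continue
            sim = cosine_similarity(vectors[word], vectors[cand])
            if sim >= best_sim:
                best, best_sim = cand, sim
        if best is not None:
            mapping[word] = best
    return mapping


def normalize_tokens(tokens, mapping):
    """Replace non-standard tokens using the normalization dictionary."""
    return [mapping.get(tok, tok) for tok in tokens]


# Toy, hand-made vectors (hypothetical; real vectors would come from a
# word-embedding model trained on Weibo text).
toy_vectors = {
    "神马": [0.9, 0.1, 0.0],   # slang spelling
    "什么": [1.0, 0.0, 0.0],   # standard form
    "北京": [0.0, 1.0, 0.0],   # unrelated standard word
}

norm_dict = build_normalization_dict(["神马"], ["什么", "北京"], toy_vectors)
normalized = normalize_tokens(["神马", "情况"], norm_dict)
```

In this toy setup the slang token is mapped to the nearby standard form, and tokens without a dictionary entry pass through unchanged. The normalized token sequence would then be fed to the downstream tagger.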
Keywords/Search Tags: Long Short-Term Memory Network, Self-Attention, SoftLexicon, Deep Learning, Quasi-Recurrent Neural Network, Named Entity Recognition