
Research On Chinese Word Segmentation And Keyword Extraction Model Based On Deep Learning

Posted on: 2020-05-12    Degree: Master    Type: Thesis
Country: China    Candidate: D D Huang    Full Text: PDF
GTID: 2428330575456640    Subject: Mathematics
Abstract/Summary:
With artificial intelligence achieving breakthroughs in more and more fields, natural language processing based on deep learning has attracted the attention of many researchers. Word segmentation, one of the most fundamental tasks in Chinese natural language processing, has already produced a number of results. Automatic keyword extraction refers to extracting, from a document collection, the words that best represent its central theme; it is a necessary and key step for text mining tasks such as text summarization. From the perspective of machine learning, both Chinese word segmentation and automatic keyword extraction can be cast as sequence labeling tasks, in which every element of an observation sequence is assigned a specific label from a given label set. Because traditional sequence labeling methods rely on hand-crafted features and therefore lack wide applicability, this thesis exploits the ability of deep learning to learn task features automatically and applies deep learning techniques to both Chinese word segmentation and automatic keyword extraction.

For Chinese word segmentation, building on existing work, this thesis proposes a method that combines the BiLSTM-CRF (bidirectional long short-term memory network with conditional random field) model with an attention mechanism. The specific work is as follows: (1) The BiLSTM-CRF model adopted here has clear advantages for sequence labeling problems in natural language processing: the bidirectional LSTM preserves both the preceding context and the following context of each position in the text sequence, and the conditional random field then resolves the label bias problem in sequence labeling. (2) An attention mechanism is incorporated into the model; it measures the relevance between the inputs and outputs of the BiLSTM, and the resulting importance weights are used to build a global representation of the text that highlights how important each individual word is to the text as a whole. (3) The thesis examines the respective contributions of the forward and backward LSTMs to the whole network, and adjusts their weight matrices to further improve segmentation performance. (4) A denoising layer is added to the model to filter the information inside a fixed window, so that the context words in the input window appear only with a certain probability; this reduces the influence of noisy context words and thereby improves segmentation performance. Experiments with the improved Attention-BiLSTM-CRF model were conducted on the MSRA corpus, the PKU corpus, and the People's Daily 2014 public datasets. The results show that the improved model and training method can effectively address the word segmentation problem in Chinese natural language processing and improve accuracy.
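The following sketch (not taken from the thesis) illustrates the general shape of an Attention-BiLSTM-CRF tagger for character-level segmentation. The BMES tag set, the layer sizes, the form of the attention layer, and the use of the third-party pytorch-crf package are all assumptions made for illustration; the thesis's denoising layer and the separate weighting of the forward and backward LSTMs are not shown.

# Minimal sketch of an Attention-BiLSTM-CRF tagger (assumptions noted above).
# Requires PyTorch and the third-party pytorch-crf package (pip install pytorch-crf).
import torch
import torch.nn as nn
from torchcrf import CRF

class AttnBiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Bidirectional LSTM keeps both past and future context of each character.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                              bidirectional=True)
        # Additive attention: one importance score per position.
        self.attn = nn.Linear(hidden_dim, 1)
        self.emit = nn.Linear(hidden_dim * 2, num_tags)  # per-tag emission scores
        self.crf = CRF(num_tags, batch_first=True)       # models label dependencies

    def _features(self, char_ids):
        h, _ = self.bilstm(self.embed(char_ids))            # (B, T, H)
        weights = torch.softmax(self.attn(h), dim=1)         # (B, T, 1) importance weights
        context = (weights * h).sum(dim=1, keepdim=True)     # global text feature (B, 1, H)
        context = context.expand(-1, h.size(1), -1)          # broadcast to every position
        return self.emit(torch.cat([h, context], dim=-1))    # (B, T, num_tags)

    def loss(self, char_ids, tags, mask):
        # Negative log-likelihood under the CRF, averaged over the batch.
        return -self.crf(self._features(char_ids), tags, mask=mask, reduction='mean')

    def decode(self, char_ids, mask):
        # Viterbi decoding of the best BMES tag sequence per sentence.
        return self.crf.decode(self._features(char_ids), mask=mask)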
The classification formulation used by traditional supervised keyword extraction methods assumes that candidate words are independent of one another, so the relations between words, and with them the semantic information carried by the text structure, are lost. To address this deficiency, this thesis explores a new approach based on neural network feature learning: the keyword extraction task is recast as a sequence labeling problem, and a combination of a BiLSTM network and a CRF is used to perform automatic keyword extraction.

On top of the BiLSTM-CRF keyword extraction model, this thesis makes two improvements concerning text preprocessing and data annotation. First, guided by the distributional hypothesis (the semantics of a word is determined by its context), and facing the challenge that text preprocessing in the keyword extraction task is not accurate enough, a character-word combined pre-training method for vector representations is proposed, with the CBOW model as the baseline. The method brings the context words into the character representation space, so that the smoothing effect of the words allows characters to be modeled better. To give the representations richer semantics, and to carry distributional semantics down to the character level so as to optimize model performance, the thesis combines the character representations, which lack semantics but perform better, with the word representations, which carry semantic information; although a character by itself carries no semantic information, once the two are placed in the same semantic space and pre-trained jointly as character-word combinations, characters can be modeled more effectively.

Second, the thesis considers the fact that annotators may introduce personal subjectivity or suffer lapses of attention during labeling, which leads to mislabeled and missing labels, and proposes a dictionary-based method for supplementing the training data so that annotation quality affects the data as little as possible. A variable p(w) is introduced to denote the ratio of the number of times a keyword w is marked in the annotation to the number of occurrences of w in the corpus. A threshold is set, p(w) is computed for all keywords, and the words whose ratio exceeds the threshold are filtered out as supplementary annotations and added to a supplementary dictionary. This automatically built dictionary is then used to supplement the labeled data and reduce labeling errors in the sample data, as sketched below.

The thesis compares jointly trained character-word representations with individually trained representations on a text classification task; the experiments show that the vectors obtained by joint character-word training have clear advantages in the keyword extraction task, demonstrating the effectiveness of the joint training. Meanwhile, a conditional random field with hand-constructed feature templates is used as a reference model, and comparative experiments are run on the labeled datasets. The results show that the BiLSTM-CRF keyword extraction model proposed in this thesis outperforms the CRF baseline, which further demonstrates the superiority of BiLSTM-CRF.
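As a concrete illustration of the supplementary-dictionary step, the sketch below computes the ratio p(w) described above and collects the words that exceed a threshold. The function names, the data layout (one token list and one set of annotated keywords per document), and the example threshold of 0.5 are assumptions for illustration only.

# Minimal sketch of the supplementary-dictionary heuristic (assumptions noted above).
from collections import Counter

def build_supplementary_dict(docs_tokens, keyword_labels, threshold=0.5):
    """Collect words whose annotation ratio p(w) exceeds the threshold.

    docs_tokens: list of token lists, one per document.
    keyword_labels: list of sets, the keywords annotators marked per document.
    """
    occurrences = Counter(tok for doc in docs_tokens for tok in doc)
    marked = Counter(tok for labels in keyword_labels for tok in labels)
    # p(w) = times w was marked as a keyword / times w occurs in the corpus
    return {w for w, k in marked.items()
            if occurrences[w] and k / occurrences[w] > threshold}

def supplement_labels(doc_tokens, annotated_keywords, supplementary_dict):
    """Re-label one document: add dictionary words the annotators missed."""
    return set(annotated_keywords) | (supplementary_dict & set(doc_tokens))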
Keywords/Search Tags: LSTM, CRF, attention mechanism, Chinese word segmentation, keyword extraction