Chinese Text Information Extraction Based On Deep Learning

Posted on: 2022-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Zhang
Full Text: PDF
GTID: 2518306497971589
Subject: Control Science and Engineering
Abstract/Summary:
Information extraction aims to extract specific factual information from text and use it to construct structured data. As one of the main research directions of natural language processing, it comprises three subtasks: named entity recognition, relation extraction, and event extraction. Information extraction based on deep learning has already achieved some results. However, most existing Chinese information extraction methods improve the model on top of word vector representations while overlooking the importance of Chinese text representation itself. Word-based models also depend heavily on the quality of word segmentation, and Chinese words are often ambiguous. To address these problems, we analyze the vector representation of Chinese text from a linguistic perspective, study relation extraction and event extraction with deep learning methods, and propose an entity relation extraction model and an emergency event extraction model. The main research contents are as follows:

(1) To address the complex structure of the Chinese language, we study the semantic vector representations of characters and words and several syntactic-structure vector representations, and illustrate the effect of each feature vector on the Chinese text representation. In addition, we introduce entity sense as external linguistic knowledge and apply it to relation extraction, where it provides supporting information for the entities marked in each Chinese sentence and helps reduce the interference of ambiguity. We propose a method for acquiring entity senses and, through comparative experiments, settle on cosine similarity as the algorithm for selecting the correct entity sense, which is then added to the Chinese text representation as an additional feature vector.

(2) To address Chinese word segmentation errors and polysemy, we use a triple composed of the character, word, and entity sense vector representations as input, and on this basis propose a Chinese entity relation extraction model that integrates character, word, and entity sense features. Four sub-models are constructed: an attention-based Bidirectional Long Short-Term Memory network (Att-BLSTM) that captures character features, a C-Att-BLSTM that captures word features, and two Att-BLSTMs that capture the sense features of e1 and e2, respectively. Through concatenation and linear weighted summation, these three levels of features are fused into a single vector, which is fed into a softmax layer for relation classification. Experiments on the public Chinese dataset San Wen show that the multi-feature fusion model achieves state-of-the-art results, and ablation experiments analyze the effectiveness of each feature and the advantage of multi-feature fusion.

(3) Chinese emergency event extraction helps improve people's ability to respond to changes in dangerous environments and therefore has high research value. We put forward a Chinese emergency event extraction model, named Conditional Random Field with Lattice Long Short-Term Memory Network (LSTM-CRF), which takes character vector and word vector representations as input. We preprocess Chinese emergency reports in XML format to obtain all sentences, and apply the BIO tagging scheme to obtain a character sequence and a label sequence for each sentence. Word2Vec is used to produce the character and word embeddings. On this input, we build a Lattice LSTM over the character-level cells to incorporate the available word information, and then use a CRF to capture the dependencies between characters. Experiments on the public CEC corpus show that the overall performance of the proposed model is better than that of other recent methods. The influence of external semantic features is also studied, and the results show that our model obtains its best results using simple character and word vectors.
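
The entity sense selection described in (1) can be illustrated with a minimal sketch. It assumes that the sentence context and each candidate sense have already been embedded as fixed-length vectors. The function names, the toy vectors, and the sense labels are hypothetical and are not taken from the thesis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def select_entity_sense(context_vec, candidate_senses):
    """Return the label of the candidate sense closest to the context.

    candidate_senses maps a sense label to its vector (hypothetical layout).
    """
    if not candidate_senses:
        raise ValueError("no candidate senses")
    return max(
        candidate_senses,
        key=lambda label: cosine_similarity(context_vec, candidate_senses[label]),
    )


# Toy data, invented for illustration only.
context = [0.9, 0.1, 0.0]
senses = {
    "apple (fruit)": [0.1, 0.9, 0.2],
    "Apple (company)": [0.8, 0.2, 0.1],
}
best_sense = select_entity_sense(context, senses)
```

With these toy vectors, the context is much closer in direction to the second candidate, so that sense is selected. In a real pipeline the selected sense vector would then be appended to the text representation as an extra feature.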
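
The BIO labeling step described in (3) can be sketched at the character level as follows. This is only an illustration under stated assumptions: the span format `(start, end, label)` with an exclusive end, the label names `LOC` and `TRIGGER`, and the example sentence are all hypothetical and do not reflect the actual CEC annotation scheme.

```python
def bio_tag(sentence, spans):
    """Label each character of `sentence` with a BIO tag.

    spans: list of (start, end, label) with `end` exclusive; spans are
    assumed not to overlap (hypothetical input layout).
    """
    tags = ["O"] * len(sentence)
    for start, end, label in spans:
        if not 0 <= start < end <= len(sentence):
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"  # first character of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining characters of the span
    return list(sentence), tags


# Toy sentence: "北京发生地震" ("An earthquake occurred in Beijing").
chars, tags = bio_tag("北京发生地震", [(0, 2, "LOC"), (4, 6, "TRIGGER")])
```

Each sentence then yields two aligned sequences of equal length, one of characters and one of labels, which is the form a character-based sequence labeling model with a CRF output layer would consume.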
Keywords/Search Tags: Deep Learning, Chinese Text Representation, Relation Extraction, Event Extraction