
Design And Implementation Of Information Extraction Based On Deep Learning

Posted on: 2020-10-04  Degree: Master  Type: Thesis
Country: China  Candidate: B C Ye
GTID: 2428330578958446  Subject: Electronic and communication engineering
Abstract/Summary:
Information refers to the objects transmitted and processed by systems such as audio, news, and communication systems; broadly, it covers everything that human society disseminates. As the mathematician Shannon said: "Information is something used to eliminate random uncertainty." The importance of information is self-evident. In the new century, with the rapid development of Internet technology, the volume of information carried as electronic text has grown exponentially, making it extremely difficult for users to find the important information in it. In the era of big data, how to intelligently help people obtain the information they need from massive data has become an important research topic, and information extraction technology emerged in response.

Information extraction is a text processing technology that extracts facts of specified types, such as entities, relationships, and events, from natural language text and outputs them as structured data. According to ACE (Automatic Content Extraction), information extraction mainly covers four areas: entity recognition, entity relationship extraction, coreference resolution, and event extraction. Among them, named entity recognition and entity relationship extraction are the most important. With the rapid development and wide application of the Internet, a typical information extraction task is to extract content of interest from massive semi-structured or even unstructured data and save it in structured form. Academic search, commodity search, text mining, and knowledge base construction all rely on information extraction.

Named entity recognition and entity relationship extraction are two important subtasks of information extraction. Named entity recognition refers to identifying entities with specific meanings in text, such as the names of people, places, and institutions, and other proper nouns. Relationship extraction refers to the intelligent
identification of triples, each consisting of a pair of entities and the relationship between them. Broadly speaking, entity relationship extraction subsumes the named entity recognition task. A joint framework for named entity recognition and relationship extraction detects entities and their categories and identifies the semantic relationships between them from text. This is an important problem in knowledge extraction and plays a vital role in automating knowledge base construction.

With the demand for information extraction rising sharply, named entity recognition and relationship extraction are of great significance, and scholars in China and abroad have studied both in depth. Research methods for named entity recognition fall into two main categories: traditional linear statistical models and neural network models. Research on entity relationship extraction falls into five main categories: hand-built rule-matching systems, semi-supervised methods, unsupervised methods, distant supervision methods, and supervised methods, also known as relationship classification methods. Supervised learning is the most commonly used of these.

Existing research shows that traditional methods require extensive manual feature extraction and have poor portability. Most entity recognition and entity relationship extraction work is based on existing public evaluation corpora, but these corpora may not meet researchers' needs, and most of the labeled training data is in English. The need for a model that can identify both Chinese and English entities and relationships, and for correspondingly labeled training data, is extremely
urgent. Moreover, traditional entity recognition and relationship extraction systems chain two independent subtasks in a pipeline. This separation makes each task easy to handle, but a pipeline cannot easily capture the correlation between the two subtasks: the performance of entity recognition can affect the performance of relationship classification, and errors tend to accumulate.

Rapidly developing deep learning algorithms provide theoretical and technical support for large-scale data processing, and different networks have different strengths on different types of tasks. Convolutional neural networks were first used to extract image features and in recent years have also performed well at feature extraction in natural language processing. Recurrent neural networks are widely used because they maintain long-term memory by training appropriate gate weights. The conditional random field is a probabilistic structured model for labeling or segmenting sequence data, which makes it well suited to text analysis where word order matters. Many other deep learning algorithms with excellent performance have also been widely studied and used in natural language processing.

In view of the above, this thesis uses deep learning to build models for entity recognition and entity relationship extraction. The main work includes:

1. The methods and theoretical foundations of named entity recognition are analyzed in detail, basic simulation experiments are run on each model to study its characteristics, and a multi-model fusion method for named entity recognition is proposed.
2. Several entity relationship extraction methods are analyzed and their advantages and disadvantages are pointed out, and a Chinese entity relationship extraction model based on a bidirectional GRU is designed. The model adds the attention
mechanism and combines word vector and position vector features to extract Chinese entity relationships. Experiments verify the feasibility of the model.
3. Based on the analysis of various entity recognition and entity relationship extraction models, a joint extraction model for entities and entity relationships is designed. This end-to-end encoder-decoder model adopts a new labeling strategy. In the data preprocessing stage, the words in a sentence are converted into word vectors and character vectors, which are concatenated and then encoded by a recurrent neural network. The decoding layer then performs entity sequence labeling with a conditional random field and extracts entity relationships with a convolutional neural network. The joint extraction model outperforms the traditional pipeline extraction model, and the new labeling strategy is superior to most other joint extraction models. In comparative experiments, the F1 score of the model is about 3% higher than that of the other models.
4. To address the data imbalance problem in the experimental data, a data preprocessing algorithm based on undersampling (the DPA algorithm) is designed. Experiments show that the DPA algorithm improves accuracy by 2%.
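
The abstract does not spell out how the DPA algorithm works, so the following is only a minimal sketch of plain random undersampling, the general family that item 4 says DPA is based on. It is not the thesis's own algorithm. It assumes the imbalance is between relationship classes; the function name undersample, the seed parameter, and the labels in the usage note are hypothetical and chosen purely for illustration.

```python
import random
from collections import defaultdict


def undersample(examples, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    Generic random undersampling (an assumption for illustration),
    not the DPA algorithm described in the thesis.
    """
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)

    if not by_label:
        return []

    # Every class is cut down to the size of the rarest class.
    target = min(len(items) for items in by_label.values())

    balanced = []
    for label, items in by_label.items():
        # rng.sample draws `target` distinct items without replacement.
        for example in rng.sample(items, target):
            balanced.append((example, label))

    # Shuffle so that classes are interleaved in the training data.
    rng.shuffle(balanced)
    return balanced
```

For example, calling undersample on ten sentences of which seven carry a hypothetical "no_relation" label and three carry "born_in" returns six (sentence, label) pairs, three per class. The trade-off of this simple approach is that it discards majority-class examples; a purpose-built algorithm such as DPA may choose which examples to drop more carefully.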
Keywords/Search Tags: Information extraction, Named entity recognition, Entity relationship extraction, Deep learning, Conditional random field