
Research On The Method Of Attribute Extraction In Tourism Field

Posted on: 2021-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Hu
Full Text: PDF
GTID: 2428330620976436
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer application technology and the rise of artificial intelligence, quickly and efficiently obtaining valuable data from the ever-growing volume of Internet data has become an important research problem in natural language processing. Attribute extraction, that is, entity attribute value extraction, automatically extracts the attribute values of entities from unstructured text or other data sources. It is the basis of natural language processing tasks such as information extraction, knowledge graph construction, and automatic question answering. This paper studies methods for extracting entity attributes in the tourism field. Based on the BLSTM-CRF model, we propose a new attribute extraction method that combines a residual convolutional neural network with a self-attention mechanism.

Most previous work on attribute extraction was based on the closed-world hypothesis, or relied on dictionaries or hand-crafted features. Such methods can only discover existing attributes, not new ones, and they incur high labor costs. In this paper, attribute extraction is transformed into a sequence labeling problem, and this formulation achieves strong experimental results. The main work of this paper is as follows.

First, using crawler technology, we crawled scenic-spot texts from encyclopedia websites and vertical tourism websites, then screened, sorted, and annotated them to build an attribute extraction dataset for the tourism field. The dataset is divided into a training set, a validation set, and a test set in proportions of 80%, 10%, and 10%, in preparation for the attribute extraction work.

Second, based on the BLSTM-CRF model, this paper proposes a ResCNN-based attribute extraction model. The model uses a convolutional neural network with residual learning to extract local features from the output of the pre-trained language model BERT, concatenates the extracted features with the output vectors of BERT, feeds them into the BLSTM to capture the contextual information of the text, and finally uses a CRF to learn the relationships between tags. Compared with a convolutional neural network without the residual operation, the convolutional neural network with the residual operation further enhances the vectorized representation of the text.

Third, again based on the BLSTM-CRF model, this paper proposes a hybrid model based on the self-attention mechanism. Self-attention ignores the distance between positions in the input text, computes dependencies directly, and learns the internal structural characteristics of sentences. The model applies self-attention after the BLSTM to process the hidden-layer vectors output by the BLSTM, and is then connected to a conditional random field to capture the internal dependencies among tags.

Finally, this paper combines the two models above into a model that fuses the residual convolutional neural network with the self-attention mechanism. The residual convolutional neural network enhances the vectorized representation of the word-embedded text, and the self-attention mechanism captures the long-distance dependencies of the text. Notably, this model has a self-attention layer after the residual convolutional neural network and the BLSTM network. The experimental results show that the self-attention mechanism is effective for both the residual convolutional neural network and the BLSTM. Compared with the baseline model, the fused model improves by 0.89% and 1.59% on the MSRA dataset and the CTFAE dataset, respectively.

The extracted attributes are then divided into single-value attributes and multi-value attributes according to their characteristics, and are fused using an attribute fusion method based on credibility calculation and a multi-value attribute fusion method, respectively. In the end, 4071 high-reliability triples are obtained.
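
To illustrate what recasting attribute extraction as sequence labeling and splitting the data 80%/10%/10% might look like, here is a minimal Python sketch. The abstract does not state the tagging scheme or the tokenization, so the character-level BIO tags, the attribute name `OPEN_TIME`, the example sentence, and the helper names `bio_tags` and `split_dataset` are all assumptions made for illustration, not the author's implementation.

```python
import random


def bio_tags(text, spans):
    """Label each character of `text` with a B-/I-/O tag.

    `spans` is a list of (start, end, attribute_name), with `end` exclusive.
    The BIO scheme and character granularity are assumptions.
    """
    tags = ["O"] * len(text)
    for start, end, attr in spans:
        tags[start] = f"B-{attr}"          # first character of the value
        for i in range(start + 1, end):
            tags[i] = f"I-{attr}"          # remaining characters of the value
    return tags


def split_dataset(samples, seed=0):
    """Shuffle and split samples into 80% train, 10% validation, 10% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains goes to test
    return train, val, test


# Hypothetical sentence and attribute value, for illustration only.
text = "Example Park opens at 08:00"
value = "08:00"
start = text.index(value)
tags = bio_tags(text, [(start, start + len(value), "OPEN_TIME")])

# Hypothetical dataset of 100 placeholder samples.
train, val, test = split_dataset(range(100))
```

Under this framing, a tagger such as BLSTM-CRF predicts one tag per position, and attribute values are recovered by reading off each `B-` tag together with the `I-` tags that follow it.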
Keywords/Search Tags:attribute extraction, ResCNN, self-attention, attribute fusion