Research On Extraction Of Attractions Attribute Relations Based On Encyclopedic And Vertical Website Data

Posted on: 2020-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: Q N Lv
GTID: 2428330596992275
Subject: Computer technology
Abstract/Summary:
With the rapid iteration of information technology, the volume of information on the Internet has been growing exponentially. Faced with this growth, how to extract useful information from the data has become a research focus in recent years, and information extraction technology has emerged in response. As a sub-area of information extraction, attribute extraction is an indispensable part of building a knowledge graph: it converts unstructured data into structured data. Attribute extraction has made progress, but there is still room to adapt attribute extraction algorithms to specific domains. This thesis focuses on attribute extraction in the domain of Inner Mongolia tourism, aiming to discover the relations between entities and attribute values in that domain and to transform the extraction results into structured data that can be stored for subsequent research. The main research contents are as follows:
(1) Construction of a corpus for the Inner Mongolia tourism domain. The Scrapy crawler framework is used to obtain the entry URLs of encyclopedic websites and vertical websites and to crawl their data. The Brat tool is configured to manually annotate the crawled corpus, and the annotations are converted into the BIO tagging scheme.
(2) Confirmation of the attributes that need to be labeled. The attribute extraction task is cast as a sequence labeling task, and two models are constructed, one based on CRF and one based on a neural network.
(3) Proposal of a neural network model based on Doc2vec. The model uses a BLSTM layer to capture the context and sequential information of the text, and a CRF layer to output the optimal tag sequence. A doc2vec vector is trained for each document, and the feasibility of the model is verified through experiments.
(4) A study of how training corpora of different granularities and different added features affect model performance. Granularity is either character-level or word-level. To find the model with the best annotation performance, radical features and doc2vec are added to the character-level model, and part-of-speech, character, and doc2vec features are added to the word-level model.
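
The conversion of manual annotations into the BIO tagging scheme mentioned in item (1) can be illustrated with a minimal sketch. It is not the thesis's actual code: the function name `spans_to_bio`, the `HEIGHT` label, and the sample sentence are hypothetical, and it assumes annotations arrive as character-offset spans (start, end-exclusive, label), similar in spirit to what an annotation tool such as Brat records. Tagging is done at character level, one of the two granularities the abstract compares.

```python
from typing import List, Tuple

# Hypothetical span type: (start offset, end offset exclusive, attribute label).
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Convert character-offset spans into character-level BIO tags.

    The first character of each span gets "B-<label>", the remaining
    characters get "I-<label>", and everything else is tagged "O".
    """
    tags = ["O"] * len(text)
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(text, tags))


if __name__ == "__main__":
    # Made-up example sentence; the attribute value "30m" spans offsets 7..10.
    for char, tag in spans_to_bio("height 30m", [(7, 10, "HEIGHT")]):
        print(repr(char), tag)
```

A word-level variant would apply the same B/I/O rule to tokens produced by a word segmenter instead of to single characters; the sketch rejects overlapping spans because a flat BIO sequence cannot represent them.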
Keywords/Search Tags: Attribute extraction, Tourism, Doc2vec, Sequence labeling, Recurrent neural network