Research On Extraction Of Attractions Attribute Relations Based On Encyclopedic And Vertical Website Data

Posted on:2020-05-27

Degree:Master

Type:Thesis

Country:China

Candidate:Q N Lv

GTID:2428330596992275

Subject:Computer technology

Abstract/Summary:

As information technology advances, the volume of information on the Internet has grown exponentially. How to extract useful information from this growing volume of data has become a research focus in recent years, and information extraction technology has emerged in response. As a sub-area of information extraction, attribute extraction is an indispensable part of building a knowledge graph: it converts unstructured data into structured data. Attribute extraction has made some progress, but there is still room to adapt attribute extraction algorithms to specific domains. This thesis focuses on attribute extraction in the domain of Inner Mongolia tourism. It aims to discover the relations between entities and attribute values in this domain and to transform the extraction results into structured data that can be stored for subsequent research. The main research contents are as follows:

(1) Construction of a corpus for the Inner Mongolia tourism domain. The Scrapy crawler framework is used to obtain the entry URLs of encyclopedic and vertical websites and to crawl their data. The Brat tool is configured to annotate the crawled corpus manually, and the annotations are then converted into the BIO tagging scheme.

(2) Determination of the attributes to be labeled. The attribute extraction task is recast as a sequence labeling task, and two models are constructed, one based on CRF and one based on a neural network.

(3) Proposal of a neural network model based on Doc2vec. The model uses a BLSTM layer to capture the context and sequential information of the text, and a CRF layer to output the optimal tag sequence. A Doc2vec vector is trained for each document, and the feasibility of the model is verified through experiments.

(4) Study of how the granularity of the training corpus and the addition of different features affect model performance. Two granularities are considered: character level and word level. To find the model with the best labeling performance, radical features and Doc2vec vectors are added to the character-level model, and part-of-speech, character, and Doc2vec features are added to the word-level model.
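The conversion from span annotations to BIO tags mentioned in step (1) can be sketched as follows. This is a minimal illustration, not the thesis's actual pipeline: it assumes character-level tokens and annotations given as (start, end, label) character offsets with an exclusive end, which is how Brat standoff files express spans. The function name `spans_to_bio`, the example sentence, and the labels `ATTRACTION` and `LOCATION` are hypothetical and do not come from the thesis.

```python
from typing import List, Tuple

# (start, end, label): character offsets with exclusive end,
# assumed here to mirror Brat standoff text-bound annotations.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Return one (character, tag) pair per character of `text`."""
    tags = ["O"] * len(text)  # everything starts outside any span
    for start, end, label in sorted(spans):
        if start < 0 or end > len(text) or start >= end:
            raise ValueError(f"invalid span: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first character of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the span
    return list(zip(text, tags))


if __name__ == "__main__":
    # Hypothetical example: "Zhaojun Tomb is located in Hohhot".
    sentence = "昭君墓位于呼和浩特"
    annotations = [(0, 3, "ATTRACTION"), (5, 9, "LOCATION")]
    for char, tag in spans_to_bio(sentence, annotations):
        print(char, tag)
```

Each output line pairs one character with its tag, which is the usual input layout for CRF and BLSTM-CRF sequence labelers. A word-level variant would tokenize first and assign tags per word rather than per character.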

Keywords/Search Tags:

Attribute extraction, Tourism, Doc2vec, Sequence labeling, Recurrent neural network

Related items

1	Research On The Proofreading Method Of Chinese Typos Based On Sequence Labeling Mode
2	Research On Object Extraction Of Automobile Product Based On Sequence Labeling
3	Research On Text Causality Extraction Based On Deep Learning And Sequence Labeling
4	Research On Recurrent Neural Network Based Dependency Parsing Model
5	A NLP-based Novel Character Attribute Extraction System
6	Research And Implementation Of Grammatical Error Correction Based On Recurrent Neural Network
7	Research And Implementation Of Entity Relation Extraction Algorithm In News Field Based On Distant Supervision And Sequence Labeling
8	Research On Adverse Drug Reactions Text Classification And Labeling Based On RNN
9	Aspect Level Sentiment Analysis For Intelligent Tourism
10	Research On Event Extraction Algorithm Based On Sequence Labeling Model