CNN-based Text Record Segmentation And Named Attribute Values Recognition

Posted on: 2019-12-29 | Degree: Master | Type: Thesis
Country: China | Candidate: M Hu | Full Text: PDF
GTID: 2428330545451225 | Subject: Software engineering
Abstract/Summary:
Information extraction is the automatic extraction of structured information from semi-structured or unstructured data. It is an important supporting technique for many fields, such as text understanding, information retrieval, question answering, and knowledge graph construction. This thesis focuses on a common problem in open information extraction: Text Record Segmentation and Named Attribute Values Recognition (Tesa Rec). The problem is to segment the attribute values contained in an unstructured text record that has no explicit delimiters, and then to recognise and label them.

A fairly common approach to this problem is machine learning, either supervised, with human-driven training, or unsupervised, with training provided by some form of pre-existing data source. Among the supervised approaches, the dominant one employs statistical models such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF) to learn a segmentation model for a given domain. Unsupervised approaches instead use pre-existing datasets to alleviate the need for manually labelled training data: they take advantage of known values of a given attribute to train a model that recognises values of this attribute in an input text record. However, all supervised approaches require a large labelled training set, which may be infeasible in some domains. The unsupervised approaches, in turn, have two main problems: (1) they assume that attribute values follow a single total order across the input texts; (2) the matching function used in some methods performs poorly.

To solve these problems, we introduce a novel unsupervised approach based on pre-existing data and a Convolutional Neural Network (CNN)-based model. We make full use of the CNN's ability to extract and combine features, and we combine the CNN with a probabilistic model to build a complete, high-performance extraction model. The details are as follows:

(1) We focus on the Tesa Rec problem. We introduce existing Tesa Rec methods and analyse their advantages and disadvantages. We also review related work on deep learning in related fields.

(2) We propose a novel CNN-based Tesa Rec method. We design a greedy probabilistic labelling algorithm that finds the most probable segmentation and labelling of the input text by considering the segmentation and labelling as a whole. Finally, we deploy a Bidirectional Positioning and Sequencing Model (BPSM), learned on demand from the test data, to further adjust problematic labelled segments. Our method effectively solves the problems of traditional methods, improves the extraction quality of state-of-the-art approaches by more than 10%, and is also efficient.

(3) We build a system that trains our CNN model automatically and applies our method to input text. The system makes it more convenient to train the model on other datasets. In addition, each extraction step of our method can be inspected in the system.

We demonstrate the effectiveness and usability of the proposed methods on three real-world datasets. Our empirical study shows that the proposed CNN-based Tesa Rec method outperforms state-of-the-art Tesa Rec methods, achieving higher precision and efficiency.
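
The abstract does not spell out the greedy probabilistic labelling step, so the following is only a minimal sketch of one way such a step could work, not the thesis's actual algorithm. It assumes a scoring function that returns, for a candidate segment and an attribute label, a score in [0, 1]. Here a toy scorer built from known attribute values stands in for the CNN. The names `make_toy_scorer` and `greedy_segment`, the `max_len` parameter, the tie-breaking rule, and the sample data are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Span = Tuple[int, int, str, float]  # (start, end_exclusive, label, score)
Scorer = Callable[[List[str], str], float]


def make_toy_scorer(known_values: Dict[str, Set[str]]) -> Scorer:
    """Stand-in for a trained classifier (assumption, not the thesis's CNN).

    Scores a candidate segment by the fraction of its tokens that occur
    in known values of the given attribute.
    """
    vocab = {
        label: {tok for value in values for tok in value.lower().split()}
        for label, values in known_values.items()
    }

    def score(tokens: List[str], label: str) -> float:
        hits = sum(1 for tok in tokens if tok.lower() in vocab[label])
        return hits / len(tokens)

    return score


def greedy_segment(tokens: List[str], labels: List[str],
                   score: Scorer, max_len: int = 3) -> List[Span]:
    """Greedily pick non-overlapping labelled spans, best score first."""
    candidates: List[Span] = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            segment = tokens[start:end]
            best_label = max(labels, key=lambda lab: score(segment, lab))
            best_score = score(segment, best_label)
            if best_score > 0:  # spans with no evidence are never labelled
                candidates.append((start, end, best_label, best_score))

    # Highest score first; on ties prefer longer spans, then earlier ones.
    candidates.sort(key=lambda c: (-c[3], -(c[1] - c[0]), c[0]))

    covered = [False] * len(tokens)
    chosen: List[Span] = []
    for start, end, label, s in candidates:
        if not any(covered[start:end]):
            chosen.append((start, end, label, s))
            for i in range(start, end):
                covered[i] = True
    return sorted(chosen)  # back into reading order


# Hypothetical pre-existing data: known values per attribute.
known = {
    "name": {"pizza hut", "burger king"},
    "street": {"12 main street", "5 oak avenue"},
    "city": {"boston", "new york"},
}
tokens = "pizza hut 12 main street boston".split()
result = greedy_segment(tokens, list(known), make_toy_scorer(known))
for start, end, label, _ in result:
    print(label, "->", " ".join(tokens[start:end]))
```

Because the sketch ranks all candidate spans over the whole record before committing to any of them, it does not rely on attributes appearing in a fixed order. Tokens that match no known value are simply left unlabelled.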
Keywords/Search Tags:Information Extraction, Deep Learning, Text Segmentation, Named Attribute Values Recognition