Semi-structured Data Cleaning Technology Based On Active Learning Method

Posted on: 2018-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: X M Yu
GTID: 2348330536481916
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of the Internet has produced a large amount of data. By structure, this data can be divided into highly structured data, semi-structured data, and raw text. Structured data has a complete logical structure and descriptive information, so it can be used widely. Raw text contains little directly usable information and must go through complex computation before it can be used. Semi-structured data lies between the two and is an extremely common data type on the Internet. It has a certain structure, but that structure varies greatly, because the records are separated by complex and differing delimiters, and it usually cannot be described in a fixed form. How to analyze semi-structured data has therefore attracted attention. This thesis studies the cleaning of massive semi-structured data, with the aim of identifying the valuable information so that semi-structured data can be put to use. The goal is to normalize massive semi-structured data, analyze the attribute of each field, and finally produce two-dimensional structured data with attribute annotations. Such structured data greatly simplifies subsequent analysis. To this end, the thesis presents the following three methods for the massive semi-structured data cleaning problem:

(1) A parallel parsing method based on double buffering. It uses a double-buffered message queue and a thread pool to speed up what would otherwise be serial parsing, and it resolves the task backlog caused by inconsistent parsing speeds in parallel parsing.

(2) An attribute set recognition method based on regular expressions. Regular expressions identify the attribute of each field in the data, and the attribute set is identified from the attribute positions and the overall structure of the data. A data normalization algorithm based on row-and-column statistics is also proposed: it counts the number and positions of the attributes, compares the statistics with the complete attribute set, and determines the column of each field, forming structured data with attribute labels.

(3) An active learning method to improve the accuracy of attribute recognition. The structured data with annotated attributes is used as the training set, and a classification model is built with the C4.5 algorithm. The attribute recognition accuracy of the learned model is then further improved by a classifier optimization method based on active learning. An uncertainty sampling algorithm based on a voting mechanism is proposed: the samples most likely to affect the accuracy of the classifier are selected for labeling, and the classification model is updated.

Together these form an efficient, accurate, and highly available data cleaning approach, which is reported to raise the data cleaning success rate to 95% or more.
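Method (1) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the class name `DoubleBuffer`, the functions `parse` and `run`, and the comma-separated record format are all assumptions made for illustration. A producer thread fills one buffer while the consumer swaps buffers and hands the filled one to a thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DoubleBuffer:
    """Two lists: producers append to the front, the consumer drains the back."""

    def __init__(self):
        self._front = []
        self._back = []
        self._lock = threading.Lock()

    def put(self, item):
        with self._lock:
            self._front.append(item)

    def swap(self):
        """Exchange the buffers under the lock and return the filled one."""
        with self._lock:
            self._front, self._back = self._back, self._front
            return self._back


def parse(line):
    # Hypothetical parse step: split a comma-separated record into fields.
    return line.strip().split(",")


def run(lines, workers=4):
    buf = DoubleBuffer()
    done = threading.Event()

    def produce():
        for line in lines:
            buf.put(line)
        done.set()

    producer = threading.Thread(target=produce)
    producer.start()
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            # Read the flag before swapping so the final batch is never missed.
            finished = done.is_set()
            batch = buf.swap()
            results.extend(pool.map(parse, batch))  # map keeps input order
            batch.clear()  # the drained list becomes the next empty front
            if finished:
                break
    producer.join()
    return results


print(run(["a,b", "c,d", "e"]))
```

The consumer loop polls without sleeping, which keeps the sketch short; a real system would more likely block on a condition variable until the front buffer has data.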
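Method (2) can be sketched in the same spirit. The patterns in `PATTERNS` and the helper names `recognize` and `column_attributes` are hypothetical, since the abstract does not list the actual rules. Each field is labeled by the first regular expression that matches it in full, and each column position takes the attribute that occurs there most often, a simple stand-in for the row-and-column statistics described above.

```python
import re
from collections import Counter

# Hypothetical attribute patterns; the thesis does not list its actual rules.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "integer": re.compile(r"-?\d+"),
}


def recognize(field):
    """Return the first attribute whose pattern matches the whole field."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(field.strip()):
            return name
    return "text"


def column_attributes(rows):
    """Label each column position with its most frequent attribute."""
    width = max(len(row) for row in rows)
    labels = []
    for col in range(width):
        counts = Counter(recognize(row[col]) for row in rows if col < len(row))
        labels.append(counts.most_common(1)[0][0])
    return labels


rows = [
    ["2018-10-21", "42", "alice@example.com"],
    ["2018-10-22", "n/a", "bob@example.com"],
    ["2018-10-23", "17", "carol@example.com"],
]
print(column_attributes(rows))  # ['date', 'integer', 'email']
```

The dirty value `n/a` in the second column is outvoted by the two integers, so the column is still labeled `integer`.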
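The vote-based uncertainty sampling in method (3) resembles query-by-committee selection, which can be sketched as follows. This is an assumption about the general technique, not the thesis algorithm: the committee here is three toy rules standing in for trained classifiers such as C4.5 trees, and `vote_entropy` and `select_queries` are illustrative names. Samples on which the committee disagrees most are the ones sent for labeling.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes; 0 means full agreement."""
    total = len(votes)
    return -sum((n / total) * math.log2(n / total) for n in Counter(votes).values())


def select_queries(committee, pool, k):
    """Return the k pool samples with the highest vote entropy."""
    scored = [
        (vote_entropy([clf(x) for clf in committee]), i)
        for i, x in enumerate(pool)
    ]
    # Highest disagreement first; ties keep the original pool order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [pool[i] for _, i in scored[:k]]


# Hypothetical committee: toy rules standing in for trained classifiers.
committee = [
    lambda s: "integer" if s.isdigit() else "text",
    lambda s: "integer" if s[:1].isdigit() else "text",
    lambda s: "integer" if any(c.isdigit() for c in s) else "text",
]
pool = ["123", "abc", "12ab", "ab12"]
print(select_queries(committee, pool, 2))  # ['12ab', 'ab12']
```

After the selected samples are labeled, they would be added to the training set and the classifiers retrained, which is the model update step the abstract refers to.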
Keywords/Search Tags: Active Learning, Regular expression, Data Cleaning, Semi-structured, Double buffering queue