Study Of Web Data Extraction Based On PAT And MLN

Posted on:2013-01-11

Degree:Master

Type:Thesis

Country:China

Candidate:X J Liu

Full Text:PDF

GTID:2248330362473859

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet and its applications, the amount of data on the Internet has grown dramatically, turning web pages into a huge database that contains a large amount of potentially useful information. How to extract the data that Internet users care about from this database has drawn a lot of attention. The task of web data extraction is to extract data from these semi-structured web pages and convert it into a structured form for subsequent applications.

Many researchers have devoted considerable effort to web data extraction and have developed a variety of web data extraction systems. These systems use various methods and technologies, including specially designed development languages, natural language processing, machine learning, pattern mining, and ontology technology, among others. Each system has its own features and advantages, but they share a significant constraint: each can extract data only from certain types of web pages. For example, methods based on pattern mining can only extract the data enclosed by pairs of tags on pages with a clear structure, and cannot extract information from within the text between pairs of tags.

A target page usually contains a large number of target data items that appear as a contiguous, repeated pattern with little variation overall. Based on this characteristic, this thesis presents an information extraction method based on the Patricia tree (PAT tree) and the Markov logic network (MLN). The main idea is first to select the optimal block pattern from the potential patterns generated by the PAT tree's ability to discover frequent patterns, and then to use a Markov logic network to carry out a more refined extraction on the data blocks extracted with that block pattern.

The detailed process of the proposed method is as follows. The first step is to convert the whole page into a token string that contains only structure tags, ignoring all text in the web page. Because each token has a fixed-length binary code, the token string is converted into a binary string called a sistring (the semi-infinite string of the page) according to the correspondence between each token and its binary code. Then a Patricia tree of the page is constructed with the Patricia tree algorithm, which is commonly used to discover potential high-frequency patterns. After the candidate patterns have been filtered from the potential patterns by a few rules, the optimal pattern is determined from the target data areas that the user labels in the training pages. With the optimal pattern, the target data blocks are obtained wherever a data area matches the pattern. Then, according to the structural features of the target data blocks, atomic predicates and first-order logic formulas are proposed to construct a Markov logic network, and the weight of each formula is learned from a large training set through suitable weight-learning and inference algorithms. Finally, the target data is extracted through the query predicate.

The method proposed in this thesis overcomes the shortcomings of traditional methods. It not only deals effectively with web pages that have a clear structure, but also handles pages that contain a large amount of text. Experiments on two different kinds of datasets show that the method is more effective than traditional data extraction methods in both cases.
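
The first stage described above (tag-only token string, fixed-length binary codes, discovery of repeated patterns) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the regular expression, and the code assignment are assumptions made for illustration, and a plain counter over token n-grams stands in for the Patricia tree, which would index the same repeated prefixes of suffixes more compactly.

```python
import re
from collections import Counter

# Hypothetical tag matcher: captures an optional "/" and the tag name.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def tokenize(html):
    """Keep only structure tags; all text between tags is dropped."""
    return [f"<{slash}{name.lower()}>" for slash, name in TAG_RE.findall(html)]

def encode(tokens):
    """Give every distinct token a fixed-length binary code (assumed scheme)
    and concatenate the codes into one binary string."""
    vocab = sorted(set(tokens))
    width = max(1, (len(vocab) - 1).bit_length())
    codes = {tok: format(i, f"0{width}b") for i, tok in enumerate(vocab)}
    return codes, "".join(codes[tok] for tok in tokens)

def frequent_patterns(tokens, min_count=2, max_len=8):
    """Count every token sequence of length 1..max_len at every start
    position and keep those seen at least min_count times.
    This is a stand-in for PAT-tree pattern discovery, not a PAT tree."""
    counts = Counter()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            counts[tuple(tokens[start:start + length])] += 1
    return {pat: n for pat, n in counts.items() if n >= min_count}

if __name__ == "__main__":
    page = ("<ul><li><b>A</b> 1</li><li><b>B</b> 2</li>"
            "<li><b>C</b> 3</li></ul>")
    tokens = tokenize(page)
    codes, bits = encode(tokens)
    patterns = frequent_patterns(tokens)
    print(bits)
    print(patterns[("<li>", "<b>", "</b>", "</li>")])  # 3 repeated records
```

In this toy page the record pattern `<li> <b> </b> </li>` occurs three times, so it would be a natural candidate block pattern. Choosing the optimal pattern from user-labeled training pages and the later Markov logic network stage are outside the scope of this sketch.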

Keywords/Search Tags:

Web data extraction, PAT tree, Markov logic network, Pattern mining

Related items

1	Research On Some Problems Of Statistical Relational Learning
2	Research On Markov Logic Networks And Its Application
3	Logic-based Frequent Sequential Pattern Mining Algorithm
4	The Research On The Related Problems Of Association Rule Mining
5	Study On Association Rules Mining Algorithm Based On FP-tree
6	Research On Frequent Pattern Of Tree Data
7	The Analysis, Based On Data Mining Algorithms For Frequent Pattern Tree
8	Identification Of Complex Names Of Cambodian Institutions Based On Markov Logic Network
9	XML Data Mining Based On Frequent Pattern Tree
10	Research On Mining Algorithms Of Maximal Frequent Item Sets