
Study Of Web Data Extraction Based On PAT And MLN

Posted on: 2013-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: X J Liu
Full Text: PDF
GTID: 2248330362473859
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet and its applications, the amount of data on the Internet has grown dramatically, turning the web into a huge database that contains a large amount of potentially useful information. How to extract the data that Internet users care about from this large database has drawn considerable attention. The task of web data extraction is to extract data from these semi-structured web pages and convert it into a structured form for subsequent applications.

Many researchers have devoted considerable effort to web data extraction and have developed a variety of web data extraction systems. These systems use various methods and technologies, including specially designed development languages, natural language processing, machine learning, pattern mining, and ontology technology. Each system has its own features and advantages, but they share a significant constraint: each can extract data only from certain types of web pages. For example, methods based on pattern mining can only extract the data between pairs of tags from clearly structured pages, and cannot extract information from within the text between pairs of tags.

A target page usually contains a large amount of target data arranged in a continuous pattern that varies little overall. Based on this characteristic, this paper presents an information extraction method based on the Patricia tree (PAT tree) and the Markov logic network. The main idea is first to obtain the optimal block pattern from the potential patterns generated by the PAT tree's ability to discover frequent patterns, and then to carry out a more refined extraction with a Markov logic network on the data blocks extracted by that block pattern.

The detailed process of the proposed method is as follows. The first step is to convert the whole page into a token string that contains only structure tags, ignoring all text tags in the web page.
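The tokenization step can be illustrated with a minimal Python sketch. The abstract does not say exactly which tags count as structure tags, so the set `TEXT_LEVEL_TAGS` below is an assumption made for illustration, as are the names `StructureTokenizer` and `page_to_tokens`. The sketch drops both text content and the assumed text-level tags and keeps one token per remaining tag.

```python
from html.parser import HTMLParser

# Assumption: tags treated as text-level and therefore ignored.
# The thesis does not list them; this set is illustrative only.
TEXT_LEVEL_TAGS = frozenset(
    {"a", "b", "i", "u", "em", "strong", "span", "font", "small", "big"}
)


class StructureTokenizer(HTMLParser):
    """Collects one token per structure tag and ignores all text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded; only the tag name becomes a token.
        if tag not in TEXT_LEVEL_TAGS:
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag not in TEXT_LEVEL_TAGS:
            self.tokens.append(f"</{tag}>")

    # handle_data is not overridden, so text content is dropped.


def page_to_tokens(html):
    """Convert an HTML page into a list of structure-tag tokens."""
    parser = StructureTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = "<ul><li><b>Title A</b> 12.50</li><li><b>Title B</b> 8.00</li></ul>"
tokens = page_to_tokens(page)
print(tokens)
```

Two list items with different text produce the same token sequence, which is what lets repeated record layouts show up as repeated patterns in the later steps.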
Since each token has a fixed-length binary code, the token string is converted, according to the correspondence between tokens and their binary codes, into a binary string called the sistring (semi-infinite string) of the page. A Patricia tree of the page is then built with the Patricia tree algorithm, which is commonly used to discover potential high-frequency patterns. After candidate patterns have been filtered from the potential patterns by a few rules, the optimal pattern is determined when the user labels the target data area in the training pages. With the optimal pattern, the target data blocks are obtained wherever a data area matches the pattern.

Then, according to the structural features of the target data blocks, atomic predicates and first-order logic formulas are defined to construct a Markov logic network, and the weight of each formula is learned from a large training set through suitable weight-learning and inference algorithms. Finally, the target data is extracted through the query predicate.

The method proposed in this paper overcomes the shortcomings of traditional methods. It not only handles clearly structured web pages effectively, but can also handle pages that contain a large amount of textual information. Experiments on two different kinds of datasets show that the method is more effective than traditional data extraction methods in both cases.
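The binary encoding and frequent-pattern discovery steps described above can be sketched in Python as follows. This is not the thesis's implementation: a plain dictionary count over token sequences stands in for the Patricia tree, which would find the same repeated prefixes far more efficiently. The function names, the sample token list, and the thresholds `min_len` and `min_count` are assumptions for illustration.

```python
from collections import defaultdict


def build_codebook(tokens):
    """Assign every distinct token a binary code of the same fixed width."""
    vocab = sorted(set(tokens))
    # Smallest width that can distinguish all tokens (at least 1 bit).
    width = max(1, (len(vocab) - 1).bit_length())
    codebook = {tok: format(i, f"0{width}b") for i, tok in enumerate(vocab)}
    return codebook, width


def encode(tokens, codebook):
    """Concatenate the fixed-length codes into one binary string."""
    return "".join(codebook[tok] for tok in tokens)


def sistrings(bits, width):
    """Token-aligned semi-infinite strings: one suffix per token position."""
    return [bits[i:] for i in range(0, len(bits), width)]


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Count token sequences occurring at least min_count times.

    Stand-in for the PAT tree: quadratic in the number of tokens,
    so suitable for small examples only.
    """
    counts = defaultdict(int)
    n = len(tokens)
    for start in range(n):
        for length in range(min_len, n - start + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


# Hypothetical token string for a page with two table rows.
tokens = ["<table>", "<tr>", "<td>", "</td>", "</tr>",
          "<tr>", "<td>", "</td>", "</tr>", "</table>"]

codebook, width = build_codebook(tokens)
bits = encode(tokens, codebook)
suffixes = sistrings(bits, width)
patterns = repeated_patterns(tokens)

longest = max(patterns, key=len)
print(width, len(bits), len(suffixes))
print(longest, patterns[longest])
```

In this example the longest repeated pattern is the row layout `<tr> <td> </td> </tr>`, which plays the role of a candidate block pattern. Choosing the optimal pattern from user-labeled training pages and the Markov logic network stage are not covered by this sketch.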
Keywords/Search Tags: Web data extraction, PAT tree, Markov logic network, Pattern mining