An Approach To The Key Problems Of Web Information Extraction Based On Prefix Expression

Posted on: 2011-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: L Sun
Full Text: PDF
GTID: 2178330305460305
Subject: Computer application technology
Abstract/Summary:
The rapid development of the World Wide Web has led to a rapid expansion of Web data. Given the massive amount of Web data, the phenomenon of "rich data, poor information" is attracting more and more attention, and information extraction technology has emerged to address it. The Web information extraction methods currently in use target particular sites and generate wrappers manually, so they are not portable and cannot adapt to changes in Web page structure. To extract information automatically, this thesis presents a new Web information extraction method based on prefix expressions, which works on Web pages of the same domain, the same level, and the same kind. The main work of this thesis is as follows:

(1) Propose and implement a Web noise removal method based on the comparison of DOM trees. First, two random pages are compared to find candidate noise nodes. Second, more pages are compared to filter out false noise nodes. Finally, the noise set is identified by checking the location of each noise node. The program can then use the noise set to remove the noise nodes from Web pages, which improves its efficiency and accuracy.

(2) Propose and implement a Web information extraction method based on prefix expressions. First, some sample pages are selected at random and a prefix expression queue is generated for each page. Second, the final queue is identified by comparing the weights of the different queues. Finally, information is extracted with the help of the final queue. Obtaining the prefix expression queue requires no user participation, which increases the automation of the program.

The proposed method does not require any prior knowledge of the target pages or their structure, such as page layout, page style, or page subject. It does not require users to provide special training samples or source code annotations, because it selects random samples instead. It does not require any user participation when extracting information. To some extent, these features increase the automation of the program and improve its robustness and extensibility.
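The noise removal idea in (1) can be illustrated with a minimal Python sketch. The abstract does not give the algorithm's details, so everything below is an assumption rather than the thesis's implementation: each page is reduced to (location, text) pairs, where the location is the tag path from the root; a node counts as a noise candidate when the same pair appears in two pages; and each further page removes the candidates it does not contain. The names text_nodes, build_noise_set, and remove_noise are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextNodeCollector(HTMLParser):
    """Collect (location, text) pairs; location is the tag path from the root."""

    def __init__(self):
        super().__init__()
        self.path = []      # stack of currently open tags
        self.nodes = set()  # {(location, text)}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            self.nodes.add(("/".join(self.path), text))


def text_nodes(html):
    """Return the set of (location, text) pairs found in one page."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def build_noise_set(pages):
    """pages: HTML strings of the same domain, level, and kind (assumed)."""
    if len(pages) < 2:
        raise ValueError("need at least two pages to compare")
    parsed = [text_nodes(page) for page in pages]
    # Step 1: nodes shared by two pages are candidate noise nodes.
    candidates = parsed[0] & parsed[1]
    # Step 2: every further page filters out candidates it does not contain.
    for nodes in parsed[2:]:
        candidates &= nodes
    # Step 3 (location check) is folded into the key: a candidate only
    # survives if it has the same text at the same location in every page.
    return candidates


def remove_noise(html, noise_set):
    """Return the page's text nodes that are not in the noise set."""
    return text_nodes(html) - noise_set
```

In this sketch a navigation link that repeats on every sampled page ends up in the noise set, while text that matches by chance on only two pages is filtered out once a third page is compared. A fuller version would compare DOM subtrees rather than single text nodes.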
Keywords/Search Tags: Web information extraction, prefix expression, wrapper, crawler algorithm, Web noise removal