An Approach To The Key Problems Of Web Information Extraction Based On Prefix Expression

Posted on:2011-04-21

Degree:Master

Type:Thesis

Country:China

Candidate:L Sun

Full Text:PDF

GTID:2178330305460305

Subject:Computer application technology

Abstract/Summary:

The rapid development of the World Wide Web has led to a rapid expansion of Web data. Given the massive volume of this data, the phenomenon of "rich data, poor information" is attracting more and more attention, and information extraction technology has emerged to address it. The Web information extraction methods currently in use target particular sites and rely on manually generated wrappers, so they port poorly to other sites and cannot adapt to changes in Web page structure. To address these problems and extract information automatically, this thesis presents a new Web information extraction method based on prefix expressions, which works on Web pages of the same domain, the same level, and the same kind. The main work of this thesis is as follows:

(1) A Web noise removal method based on the comparison of DOM trees is proposed and implemented. First, two randomly chosen pages are compared to find candidate noise nodes. Second, more pages are compared to filter out false noise nodes. Finally, the noise set is identified by checking the location of each noise node. With the help of the noise set, the program can then remove every noise node from Web pages, which improves its efficiency and accuracy.

(2) A Web information extraction method based on prefix expressions is proposed and implemented. First, some random sample pages are selected and a prefix expression queue is generated for each page. Second, the final queue is identified by comparing the weights of the different queues. Finally, information is extracted with the help of the final queue. Obtaining the prefix expression queue requires no user participation, which increases the degree of automation of the program.

The proposed method does not require any prior knowledge of the target pages or their structure, such as page layout, page style, or page subject. It does not require users to provide special training samples or source code annotations, because it selects random samples instead. It also requires no user participation when extracting information. To some extent, these features increase the automation of the program and improve its robustness and extensibility.
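The noise removal idea in item (1) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes that a "noise node" is a text node whose DOM path and text are identical on every sampled page, and it folds the location check of the final step into the path key. The names `PathTextCollector`, `text_nodes`, `noise_set`, and `remove_noise` are hypothetical, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextCollector(HTMLParser):
    """Collect (DOM path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.pairs = set()  # (path, text) pairs seen so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.add(("/".join(self.stack), text))


def text_nodes(html):
    """Return the set of (path, text) pairs found in one page."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


def noise_set(pages):
    """Assumed definition: nodes identical in path and text on every page."""
    # Step 1: compare two pages to get candidate noise nodes.
    candidates = text_nodes(pages[0]) & text_nodes(pages[1])
    # Step 2: compare more pages to filter out false candidates.
    for page in pages[2:]:
        candidates &= text_nodes(page)
    return candidates


def remove_noise(html, noise):
    """Return the page's text nodes with the noise nodes removed."""
    return text_nodes(html) - noise
```

In this sketch, a navigation link that appears at the same path on all sampled pages ends up in the noise set, while a sidebar label that matches on only the first two pages is dropped from the candidates once a third page is compared.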

Keywords/Search Tags:

Web information extraction, prefix expression, wrapper, crawler algorithm, Web noise removal

Related items

1	A Web News Extraction Method Based On Filtering Noise Wrapper
2	Research For Information Extraction Based On Wrapper Model Algorithm
3	Algorithm Research For Text Information Extraction Based On Wrapper Model
4	Web Page Attribute Extraction Method Research
5	Research And Implementation Of Page Object Extraction Model For Vertical Search Engine
6	The Research And Implementation Of Web Information Extraction System Based On The Regular Expression
7	Algorithm Of Poisson-Noise Removing Based On ICA And Its Application On CT Imaging
8	Research On Removing Noise For Iris Image
9	Research On Automatic And Efficient Technologies For Web Information Extraction