Research On Web Information Extraction Tool

Posted on:2012-06-18

Degree:Master

Type:Thesis

Country:China

Candidate:H W Liang

GTID:2218330374453430

Subject:Computer software and theory

Abstract/Summary:

As computers have become more widespread, more and more people browse information on the Internet. People now use the Internet in daily life, work, and business, and the Web has become an important way of obtaining information. Web pages contain text, images, video, music, and so on. Different people are interested in different information, and the content a reader cares about is usually surrounded by content the reader does not care about, which distracts attention and makes pages inconvenient to read.
This paper presents a DOM-based Web information extraction method that filters out the information in a web page that the user is not interested in, leaving only the information the user wants. Rather than searching directly for the information of interest, the method deletes the information that is not of interest. First, working in the Eclipse development environment, we use the open-source HTML parser NekoHTML to parse a web page into a DOM tree. We then use a depth-first search to recursively traverse every node of the DOM tree and determine whether the node contains information of interest. Nodes that contain information of interest are preserved; nodes that do not are deleted. The extraction algorithm is implemented in Java, and the graphical user interface is developed with JSP and Servlets. Users choose the information they want through the graphical interface, and the extraction algorithm, based on that choice, deletes the information they are not interested in and returns only the information they are interested in.
The paper first introduces the purpose of studying Web information extraction tools, then analyzes the advantages and disadvantages of 11 types of Web information extraction technology, describes web page types and web page composition, introduces the DOM tree and the open-source HTML parser NekoHTML, and finally designs the Web information extraction algorithm and completes the implementation of the Web information extraction tool.
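
The delete-what-is-unwanted traversal described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how a node is judged uninteresting, so the rule used here (an element whose `class` attribute contains a name the user rejected) is a hypothetical stand-in for the user's choice in the graphical interface. The sketch also swaps NekoHTML for the JDK's built-in XML parser so that it runs without extra libraries, which means the input must be well-formed XHTML. The names `DomPruner`, `isUnwanted`, `prune`, and `extract` are illustrative only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomPruner {

    // Hypothetical interest rule (an assumption, not from the thesis):
    // an element is unwanted if any of its class names was rejected by the user.
    static boolean isUnwanted(Node node, Set<String> unwantedClasses) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        String classAttr = ((Element) node).getAttribute("class");
        for (String token : classAttr.trim().split("\\s+")) {
            if (unwantedClasses.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Depth-first traversal: delete unwanted children, recurse into the rest.
    static void prune(Node node, Set<String> unwantedClasses) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // read before a possible removal
            if (isUnwanted(child, unwantedClasses)) {
                node.removeChild(child);        // drops the whole subtree
            } else {
                prune(child, unwantedClasses);
            }
            child = next;
        }
    }

    // Parse well-formed XHTML, prune it, and serialize what remains.
    static String extract(String xhtml, Set<String> unwantedClasses) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        prune(doc, unwantedClasses);

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.setOutputProperty(OutputKeys.INDENT, "no");
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class=\"ad\">Buy now</div>"
                + "<div class=\"article\"><p>Main text</p>"
                + "<span class=\"ad\">Sponsored</span></div>"
                + "</body></html>";
        // Prints the page with both "ad" elements removed.
        System.out.println(extract(page, Set.of("ad")));
    }
}
```

Removing a node discards its whole subtree in one step, so the traversal never descends into content the user has already rejected. With a lenient HTML parser such as NekoHTML in place of the JDK parser, the same `prune` method should apply unchanged to the resulting `org.w3c.dom` tree, though element-name casing and similar parser details would need checking.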

Keywords/Search Tags:

HTML, Information Extraction, DOM, NekoHTML, Web Page

Related items

1	Research And Application On The Technology Of Web Information Extraction Based On The HTML
2	Research On Content Extraction In HTML Web Pages Based Multi-Features
3	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
4	Based On The Html Pages Of Web Information Extraction
5	Research On The HTML And PDF Information Extraction Technology Based XML
6	The Technology Of Web Information Extraction Based On HTML Parser
7	Design And Implementation Of A Conventional Template About Page Extraction
8	The Implementation And Application Of Extracting Structured Data From Web Pages
9	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
10	Research On Multi-page Special Web Page Text Extraction And Merging Technology