Design And Develop A Common HTML Content Parse System Based On Jsoup

Posted on:2016-04-21

Degree:Master

Type:Thesis

Country:China

Candidate:K Mao

Full Text:PDF

GTID:2308330482481348

Subject:Software engineering

Abstract/Summary:

With the rapid development of Internet technology, the number of web pages is growing quickly. Against this background, applications such as intelligent web information retrieval, automatic document summarization, and public opinion analysis have emerged. These technologies generally require acquiring and analyzing massive numbers of web pages, and they normally use web crawlers to fetch the original pages from the network. The fetched pages contain not only the information the user is interested in but also a variety of noise, such as advertising links, tag information, navigation links, and comments. This noise greatly reduces retrieval efficiency and people's reading efficiency. How to correctly and efficiently extract the main article from flexible, semi-structured, heterogeneous HTML source files is becoming more and more important in Internet-based data mining, information retrieval, and other fields.

This thesis describes the design and implementation of a general-purpose web page text extraction system based on an algorithm that updates and compares parent node weights. The system is built on Jsoup, an excellent HTML parsing tool, and on a noise reduction process. After identifying information nodes through stop-word analysis and link density analysis, the system uses the parent node weight updating and comparing algorithm to calculate the weight of each node and find the best subtree in the DOM. The system uses a B/S (Browser/Server) architecture, is developed in IntelliJ IDEA for both the front-end interface and the back-end service, and uses H2 as the back-end database management system. It is divided into four core modules: document format processing, document cutting and noise reduction, text node judgment, and text format output. The system achieves high accuracy in a range of HTML content extraction tests, especially on news sites.
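The abstract does not give the scoring formulas, so the following is only a minimal sketch of how stop-word analysis, link density, and a parent weight comparison could fit together. Everything in it is an assumption for illustration: the `Node` class is a hypothetical stand-in for a parsed DOM element (it is not a Jsoup class), the stop-word list is a toy, and the weight rule (stop words in non-link text, scaled by one minus link density) is one plausible choice, not the thesis's actual algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ContentExtractorSketch {

    /** Hypothetical stand-in for a DOM element (not a Jsoup class). */
    public static final class Node {
        public final String tag;
        public final String ownText;              // text directly inside this node
        public final List<Node> children = new ArrayList<>();
        public double weight;                     // filled in by bestSubtree()

        public Node(String tag, String ownText) {
            this.tag = tag;
            this.ownText = ownText;
        }

        public Node add(Node child) {
            children.add(child);
            return this;
        }

        boolean isLink() {
            return tag.equals("a");
        }
    }

    // Toy stop-word list (assumption); a real list would be far larger.
    static final Set<String> STOP_WORDS =
            Set.of("the", "a", "of", "and", "to", "in", "is", "was");

    static int stopWordCount(String text) {
        int count = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (STOP_WORDS.contains(word)) count++;
        }
        return count;
    }

    /** Text length of the subtree; with linksOnly, counts only text inside <a>. */
    static int textLength(Node n, boolean linksOnly, boolean insideLink) {
        boolean inLink = insideLink || n.isLink();
        int len = (!linksOnly || inLink) ? n.ownText.length() : 0;
        for (Node c : n.children) len += textLength(c, linksOnly, inLink);
        return len;
    }

    /** Share of the subtree's text that sits inside links (0.0 to 1.0). */
    static double linkDensity(Node n) {
        int total = textLength(n, false, false);
        return total == 0 ? 0.0 : (double) textLength(n, true, false) / total;
    }

    /** Stop words found in the non-link text of the subtree. */
    static int rawScore(Node n) {
        if (n.isLink()) return 0;
        int score = stopWordCount(n.ownText);
        for (Node c : n.children) score += rawScore(c);
        return score;
    }

    /** Post-order walk: weight every node and keep the heaviest; ties go to the deeper node. */
    public static Node bestSubtree(Node n, boolean insideLink) {
        boolean inLink = insideLink || n.isLink();
        n.weight = inLink ? 0.0 : rawScore(n) * (1.0 - linkDensity(n));
        Node best = n;
        for (Node c : n.children) {
            Node candidate = bestSubtree(c, inLink);
            if (candidate.weight >= best.weight) best = candidate;
        }
        return best;
    }

    /** Small hand-built page: navigation, an article, and a footer link. */
    public static Node samplePage() {
        Node nav = new Node("nav", "")
                .add(new Node("a", "Home"))
                .add(new Node("a", "News"))
                .add(new Node("a", "Contact us"));
        Node article = new Node("article", "")
                .add(new Node("p", "The parser was built to extract the main text of a page."))
                .add(new Node("p", "It is tested in the news domain and the results hold."));
        Node footer = new Node("footer", "").add(new Node("a", "About the site"));
        return new Node("body", "").add(nav).add(article).add(footer);
    }

    public static void main(String[] args) {
        Node best = bestSubtree(samplePage(), false);
        System.out.println(best.tag); // prints "article"
    }
}
```

On the sample page, the `body` node is penalized for the link text in its navigation and footer, so the `article` subtree ends up with the highest weight. In a Jsoup-based system the same walk would run over parsed `Element` objects instead of the hand-built `Node` tree.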

Keywords/Search Tags:

Web page text extraction, parent node weight updating and comparing algorithm, word segmentation, link density, B/S architecture

Related items

1	Research On WEB Page Classification Algorithms Based On Text Semantic Graph
2	Research On Chinese Text Categorization Algorithms Based On Technology Text
3	The Design And Implementation Of Text Topic Key Word Processing System Based Chinese Word Segmentation
4	Research On Web Page Content Extraction Based On Hadoop
5	Visual Web Page Information Extraction And Text Feature Word Extraction Technology Research
6	Research And Implement On The Related Algorithms Of Chinese Text Classification
7	Method Of Webpage Keyword Extraction Based On Word Span
8	WEB Mining System
9	Research On Chinese Text Similarity Detection Technology Based On Word Weight Analysis
10	The Design And Implement Of Web Page Automatic Categorization And Storage Management System