Design and Development of a General HTML Content Parsing System Based on Jsoup

Posted on: 2016-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: K Mao
Full Text: PDF
GTID: 2308330482481348
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, the number of web pages is growing quickly. Against this background, applications such as intelligent web information retrieval, automatic document summarization, and public opinion analysis have emerged. These technologies generally require the acquisition and analysis of massive numbers of web pages, and they normally use a web crawler to fetch the original pages. The fetched pages contain not only the information the user is interested in but also a variety of noise, such as advertising links, tag information, navigation links, and comments. This noise greatly reduces the efficiency of both retrieval and reading. How to correctly and efficiently extract the main article text from flexible, semi-structured, heterogeneous HTML source files is becoming more and more important in Internet-based data mining, information retrieval, and other fields.

This thesis describes the design and implementation of a general-purpose web page text extraction system based on an algorithm that updates and compares parent node weights. The system is built on Jsoup, an excellent HTML parsing tool, and on a noise reduction process. After identifying information nodes through stop word analysis and link density analysis, the system uses the parent node weight update-and-compare algorithm to calculate the weight of each node and select the best subtree in the DOM. The system uses a B/S (Browser/Server) architecture, was developed with IntelliJ IDEA for both the front-end interface and the back-end service, and uses H2 as the back-end database management system. It is divided into four core modules: a document format processing module, a document cutting and noise reduction module, a text node judgment module, and a text format output module. In tests on a variety of HTML pages, the system extracted content with high accuracy, especially on news sites.
Keywords/Search Tags: Web page text extraction, parent node weight update-and-compare algorithm, Word segmentation, link density, B/S architecture
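
The abstract names the steps of the extraction pipeline (stop word analysis, link density analysis, then updating and comparing parent node weights) but does not give the weighting formula. The Java sketch below is therefore only an illustration under stated assumptions, not the thesis's algorithm. It scores each text node by its stop word count, discounts it by link density, adds the score to the parent and half of it to the grandparent, and then picks the node with the highest accumulated weight. The `Node` class is a hypothetical stand-in for a DOM element so the example runs without dependencies; in the described system the tree would come from a Jsoup-parsed document. The stop word list, the scoring rule, and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ContentExtractorSketch {

    // Assumed, deliberately tiny stop word list; a real system would load a full one.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and", "to", "in", "is");

    // Hypothetical stand-in for a DOM element (not Jsoup's own Element class).
    static final class Node {
        final String tag;
        final String ownText;                 // text directly inside this node
        final List<Node> children = new ArrayList<>();
        Node parent;
        double weight;                        // accumulated from scored descendants

        Node(String tag, String ownText) { this.tag = tag; this.ownText = ownText; }

        Node add(Node child) { child.parent = this; children.add(child); return this; }

        // Number of text characters in this subtree.
        int textLength() {
            int n = ownText.length();
            for (Node c : children) n += c.textLength();
            return n;
        }

        // Number of text characters that sit inside <a> elements in this subtree.
        int linkTextLength() {
            if (tag.equals("a")) return textLength();
            int n = 0;
            for (Node c : children) n += c.linkTextLength();
            return n;
        }

        // Link density: share of the subtree's text that is link text (0.0 to 1.0).
        double linkDensity() {
            int total = textLength();
            return total == 0 ? 0.0 : (double) linkTextLength() / total;
        }
    }

    static int stopWordCount(String text) {
        int count = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (STOP_WORDS.contains(word)) count++;
        }
        return count;
    }

    // Assumed scoring rule: prose-like text (many stop words, few links) scores high.
    static double score(Node n) {
        if (n.ownText.isEmpty() || n.tag.equals("a")) return 0.0;
        return (1 + stopWordCount(n.ownText)) * (1.0 - n.linkDensity());
    }

    // "Update" step: push each node's score to its parent and, halved, to its grandparent.
    static void propagate(Node n) {
        double s = score(n);
        if (s > 0 && n.parent != null) {
            n.parent.weight += s;
            if (n.parent.parent != null) n.parent.parent.weight += s / 2;
        }
        for (Node c : n.children) propagate(c);
    }

    // "Compare" step: walk the tree and keep the node with the largest weight.
    static Node pickBest(Node n, Node best) {
        if (n.weight > best.weight) best = n;
        for (Node c : n.children) best = pickBest(c, best);
        return best;
    }

    // Call once per freshly built tree, because weights accumulate in the nodes.
    static Node bestSubtree(Node root) {
        propagate(root);
        return pickBest(root, root);
    }

    static Node sampleTree() {
        Node nav = new Node("nav", "")
            .add(new Node("a", "Home")).add(new Node("a", "News")).add(new Node("a", "Sports"));
        Node article = new Node("article", "")
            .add(new Node("p", "The system is based on the analysis of the DOM tree."))
            .add(new Node("p", "It removes noise and keeps the main text of the page."));
        Node footer = new Node("footer", "").add(new Node("a", "Contact"));
        return new Node("html", "").add(new Node("body", "").add(nav).add(article).add(footer));
    }

    public static void main(String[] args) {
        System.out.println(bestSubtree(sampleTree()).tag); // prints "article"
    }
}
```

In this toy tree the navigation block has a link density of 1.0 and contributes nothing, while the two paragraphs push their scores into `article`, which therefore outweighs `body`. Halving the grandparent's share is one simple way to keep the root from always winning; the actual system may balance parent and child weights differently.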