Research On Webpage Cleaning Technology About Web Data Fusion

Posted on:2015-07-16

Degree:Master

Type:Thesis

Country:China

Candidate:Y L He

Full Text:PDF

GTID:2298330434455049

Subject:Information and Communication Engineering

Abstract/Summary:

Users often take longer than expected to find information because of duplicate webpages and noise on the network. It is therefore necessary to clean a webpage before it is rendered to the user.

A new method is proposed to extract the main content of a webpage using the hierarchical structure of the DOM tree and feature information. Because all operations are performed on the DOM tree, the method preserves the structural information of the Web content completely, so the result can be integrated directly with subsequent applications. Because all text in the document is contained in leaf nodes, statistics calculated over the leaf nodes alone can be more accurate.

To make the feature string more representative, a new method based on the idea of "first segmentation, then extraction" is discussed. DTSF, an improved version of the TSF algorithm, is used to divide a document into independent, topically coherent segments. DTSF needs no user involvement: it dynamically designates the block size by roughly dividing the text into several groups, and it identifies topic boundaries automatically. The feature string extracted from each segment follows the change of sub-topic to some extent and represents the webpage content more completely.

Because the simHash algorithm performs well in both space and time, this thesis uses it to generate a fingerprint for each topic segment. The Hamming distance between corresponding segments of two documents is the basis for judging similarity. Before it is calculated, the webpage library is filtered by the number of topic segments and the length of the text, which reduces the number of webpages that need to be retrieved and improves retrieval efficiency.

The research in this thesis allows an application to take the main content of webpages as its processing objects and to avoid processing duplicate webpages and unrelated content within a webpage. This saves storage space, improves retrieval performance, and reduces the time and space overhead of subsequent processing. The efficiency and accuracy of the entire Web fusion system will be greatly improved.
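The duplicate-detection step described above (one simHash fingerprint per topic segment, a cheap pre-filter, then a Hamming-distance comparison of corresponding segments) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the 64-bit fingerprint size, the MD5-based token hash, the whitespace tokenizer, and the thresholds `max_len_ratio` and `max_distance` are all assumptions chosen for illustration, and the topic segments are assumed to be already available as plain strings (the DTSF segmentation itself is not shown).

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """simHash of a token list; each token is weighted by its frequency."""
    weights = [0] * bits
    for token, count in Counter(tokens).items():
        # Token hash: first bits/8 bytes of MD5 (an assumption, any stable hash works).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += count if (h >> i) & 1 else -count
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_candidate(segs_a, segs_b, max_len_ratio=1.5):
    """Pre-filter on segment count and text length (thresholds are hypothetical)."""
    if len(segs_a) != len(segs_b):
        return False
    len_a = sum(len(s) for s in segs_a)
    len_b = sum(len(s) for s in segs_b)
    if min(len_a, len_b) == 0:
        return False
    return max(len_a, len_b) / min(len_a, len_b) <= max_len_ratio


def is_duplicate(segs_a, segs_b, max_distance=3):
    """Compare corresponding segment fingerprints only if the pre-filter passes."""
    if not is_candidate(segs_a, segs_b):
        return False
    fps_a = [simhash(s.lower().split()) for s in segs_a]
    fps_b = [simhash(s.lower().split()) for s in segs_b]
    return all(hamming(x, y) <= max_distance for x, y in zip(fps_a, fps_b))
```

In this sketch the pre-filter rejects a pair before any fingerprint is computed, which is where the saving in retrieval work would come from; the actual filtering rules used in the thesis are not given in the abstract.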

Keywords/Search Tags:

Webpage cleaning, Words/leafs ratio, Duplicate Webpage, Topic segmentation, Hierarchical retrieval

Related items

1	The Research And Design Of Network Information Monitoring And Analysis System
2	Reserch And Implementation Of Webpage Cleaning Algorithm Based On Visual Information
3	Design And Implementation Of Webpage Tampering Monitoring System
4	How Website Complexity And The Ratio Of Picture To Text Impact Webpage Aesthetic And Usability
5	Research On Topical Webpage Denoising Based On Improved DOM Tree
6	Organization Entity Information Extractor From Webpage Base On CRF
7	Distributed Retrieval System With Webpage Ranking Improvement Based On Lucene
8	The Research And Implementation Of The Searching System Based On Special Informations In The Internet Environment
9	Webpage Segmentation Algorithm Based On Planar Graphs
10	Research Of The Technologies In Identifying And Filtering Webpage Noise Information Based On The Proxy System