
Research On Webpage Cleaning Technology About Web Data Fusion

Posted on: 2015-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y L He
Full Text: PDF
GTID: 2298330434455049
Subject: Information and Communication Engineering
Abstract/Summary:
Users often take longer than expected to find information because of duplicate webpages and noise on the network. It is therefore necessary to clean a webpage before it is presented to the user.

A new method is proposed to extract the main content of a webpage using the hierarchical structure of the DOM tree together with feature information. Because all operations are performed on the DOM tree, the method fully preserves the structural information of the Web content, so the result can be integrated directly with subsequent applications. Because all text in the document is contained in the leaf nodes, statistics computed over the leaf nodes alone can be more accurate.

To make the feature string more representative, a new method based on the idea of "first segmentation, then extraction" is discussed. DTSF, an improved version of the TSF algorithm, is used to divide a document into independent, topically coherent segments. DTSF requires no user involvement: it dynamically determines the block size by roughly dividing the text into several groups, and it identifies topic boundaries automatically. The feature string extracted from each segment follows the change of sub-topic to some extent and represents the webpage content more completely.

Because the simHash algorithm performs well in both space and time, this thesis uses it to generate a fingerprint for each topic segment.
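As a rough illustration of per-segment fingerprinting, the sketch below computes a simHash value for each topic segment. It is not the thesis's implementation: whitespace tokenization, equal token weights, MD5 as the per-token hash, and the function names `simhash` and `segment_fingerprints` are all assumptions made for the example, and the thesis's own feature-string extraction is not reproduced here.

```python
import hashlib


def simhash(tokens, bits=64):
    """Return a `bits`-wide SimHash fingerprint for an iterable of tokens.

    Assumption: every token has weight 1 and is hashed with MD5.
    """
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each bit of the token hash votes +1 or -1 on that bit position.
        for i in range(bits):
            if (h >> i) & 1:
                weights[i] += 1
            else:
                weights[i] -= 1
    # A fingerprint bit is set where the positive votes outnumber the negative.
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def segment_fingerprints(segments):
    """Fingerprint each topic segment (a string) independently."""
    return [simhash(segment.split()) for segment in segments]
```

Fingerprinting each segment separately, rather than the whole page, is what lets a later comparison work segment by segment.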
Before calculating the Hamming distance between corresponding segments of two documents, which is the basis for judging similarity, the webpage library is filtered by the number of topic segments and the length of the text. This reduces the number of webpages that need to be retrieved and improves retrieval efficiency.

The work in this thesis allows application programs to treat the main content of webpages as their processing objects, avoiding the processing of duplicate webpages and of unrelated content within a webpage. This saves storage space, improves retrieval performance, and reduces the time and space overhead of subsequent processing. The efficiency and accuracy of the entire Web fusion system will be greatly improved.
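The two-stage retrieval described above (a cheap filter on segment count and text length, followed by Hamming distance on segment fingerprints) can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's actual rule: the record layout (`num_segments`, `text_len`, `fingerprints`), the requirement of equal segment counts, the thresholds `max_len_ratio=0.2` and `max_distance=3`, and the "all segments must match" decision are all hypothetical choices for illustration.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def is_candidate(query, page, max_len_ratio=0.2):
    """Coarse filter: same segment count and similar text length (assumed rule)."""
    if query["num_segments"] != page["num_segments"]:
        return False
    longer = max(query["text_len"], page["text_len"])
    if longer == 0:
        return True
    return abs(query["text_len"] - page["text_len"]) / longer <= max_len_ratio


def is_duplicate(query, page, max_distance=3):
    """Fine check: compare corresponding segment fingerprints pairwise."""
    if not is_candidate(query, page):
        return False
    pairs = zip(query["fingerprints"], page["fingerprints"])
    return all(hamming_distance(x, y) <= max_distance for x, y in pairs)
```

Running the coarse filter first means the bitwise comparison is only performed on pages that already agree in segment count and approximate length, which is where the reduction in retrieval work comes from.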
Keywords/Search Tags: Webpage cleaning, Words/leafs ratio, Duplicate Webpage, Topic segmentation, Hierarchical retrieval