
Research On Elimination Of Similar Web Pages Based On Text Structure And Extraction Of Long Sentences

Posted on: 2011-05-24 | Degree: Master | Type: Thesis
Country: China | Candidate: S Feng | Full Text: PDF
GTID: 2178360308458651 | Subject: Computer application technology
Abstract/Summary:
Research shows that similar web pages account for 29 percent of all web pages, and that identical pages account for 22 percent. According to a report released by the China Internet Network Information Center (CNNIC) in July 2005, when asked "what is the biggest problem in searching for information", 44.6 percent of web users chose "too much repetitious information", the most frequently selected answer in the survey. If search engines can detect these duplicate pages and remove them from their databases, they save storage space and can collect useful pages faster. The similarity between pages can also inform better search strategies and result-ranking algorithms. Detecting pages with similar content quickly and accurately has therefore become one of the key technologies for improving the quality of search engines.

By analyzing a large number of similar web pages, this thesis identifies two features:
① The text of a web page can be transformed into a text structure tree. The title is the root, and each paragraph is a node whose level corresponds to its place in the text structure.
② The text content of similar web pages changes easily, while the text structure usually stays the same or changes only slightly. Even in the worst case, pagination and reprinting, the structure does not change much: the resulting text structure tree is one or several branches of the original page's structure tree.

Based on these similarity and text-structure features, the thesis proposes a dynamic, layered, and robust algorithm for eliminating similar web pages. First, the method removes noise from the page. Second, it converts the text into a text structure tree using a tree-generation algorithm (each node of the tree is a paragraph of the text). Third, it extracts paragraphs from the tree dynamically, layer by layer, and passes them to the long-sentence extraction algorithm, which yields a fingerprint for each node and thus a layered fingerprint for the page. Finally, a layer-fingerprint similarity algorithm computes the similarity between web pages, so that both completely and partly similar pages can be detected. Dynamic, layered feature extraction and layer-fingerprint calculation keep the method efficient, and using long-sentence extraction to obtain the node fingerprints makes it robust.

Experiments indicate that the method achieves a higher recall rate than algorithms based on natural-paragraph signatures and on long-sentence extraction alone. It detects not only partly similar pages in which information has been added or deleted at the beginning or end of the text, but also pages with additions or deletions inside the text, as well as multi-page near-duplicates. The method therefore has promising application prospects and relatively high research value for filtering similar web pages.
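The abstract does not give implementation details, so the following is only a minimal Python sketch of the pipeline under several assumptions: paragraphs arrive as hypothetical `(level, text)` pairs with noise already removed, a node fingerprint is an MD5 hash of the `k` longest sentences of a paragraph, and similarity is the number of per-layer shared fingerprints divided by the node count of the smaller page. All names (`Node`, `build_tree`, `node_fingerprint`, `layer_fingerprints`, `similarity`) are illustrative and not taken from the thesis.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int
    children: list = field(default_factory=list)


def build_tree(title, paragraphs):
    """Build a text structure tree: the title is the root (level 0) and
    each (level, text) paragraph, with level >= 1, becomes a node."""
    root = Node(title, 0)
    stack = [root]
    for level, text in paragraphs:
        # Pop until the top of the stack is shallower than the new node;
        # that node is the parent.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(text, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def longest_sentences(text, k=2):
    """Return the k longest sentences (assumed splitting rule: ., !, ?, 。！？)."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    return sorted(sentences, key=len, reverse=True)[:k]


def node_fingerprint(text, k=2):
    """Hash only the long sentences, so edits to short ones are ignored."""
    joined = "|".join(longest_sentences(text, k))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def layer_fingerprints(root):
    """Map each tree level to the set of node fingerprints on that level."""
    layers = {}
    queue = [root]
    while queue:
        node = queue.pop(0)
        layers.setdefault(node.level, set()).add(node_fingerprint(node.text))
        queue.extend(node.children)
    return layers


def similarity(a, b):
    """Assumed measure: fingerprints shared layer by layer, divided by the
    fingerprint count of the smaller page (a containment score in [0, 1])."""
    la, lb = layer_fingerprints(a), layer_fingerprints(b)
    shared = sum(len(la[lvl] & lb[lvl]) for lvl in la.keys() & lb.keys())
    smaller = min(sum(len(s) for s in la.values()),
                  sum(len(s) for s in lb.values()))
    return shared / smaller if smaller else 0.0
```

Calling `similarity(build_tree(title_a, paragraphs_a), build_tree(title_b, paragraphs_b))` gives a score that a caller could compare against a threshold of their choosing. Because the score is normalized by the smaller page, a page that keeps only the leading part of another page's structure can still score highly, which loosely mirrors the pagination and reprinting case described above. This simple version does not re-align levels when a reprinted branch is promoted to a new root.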
Keywords/Search Tags:elimination of similar web pages, text structure tree, extraction of long sentences, layer fingerprint