
The Research About The Similar Content In Mobile Internet

Posted on: 2014-02-26 | Degree: Master | Type: Thesis
Country: China | Candidate: Q Chen | Full Text: PDF
GTID: 2268330422963267 | Subject: Communication and Information System
Abstract/Summary:
With the development of the Internet, the volume of information has grown explosively. Because of the many mirror sites, reprinted pages, and copied pages, the network is flooded with similar content. This content reduces the quality of search engine results, wastes hardware storage resources, and degrades the mobile user experience. With the development of the mobile internet in recent years, the problem has become increasingly serious.

Because research on similar content in the mobile internet is lacking, this thesis focuses on web content extraction and web page similarity calculation. For web content extraction, the thesis first compares statistics-based extraction, visual-block-based extraction, and other extraction techniques, and then proposes a topic-based similarity-block web content extraction technique. For web page similarity calculation, the thesis first compares vector-based, feature-based, text-structure-based, and semantics-based similarity measures, and then proposes a web page similarity measure based on feature words.

The topic-based similarity-block extraction technique relies on the similarity between the title and the content: it builds the HTML tree and extracts the text content of the web page. Experiments show that the algorithm is effective for complex web pages and achieves high precision.

The similarity algorithm first extracts the feature words of a web page, then uses locality-sensitive hashing and block searching to calculate the similarity between pages. The experiments show that the algorithm improves recall and precision on short-text pages, reduces complexity, and is suitable for large-scale data applications.
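The abstract does not say which locality-sensitive hash or block-search scheme the thesis uses. As a rough illustration only, the sketch below assumes SimHash fingerprints over feature words and a four-block index, a common pairing for near-duplicate page detection. The names feature_words, simhash, blocks and NearDuplicateIndex are hypothetical, and the frequency-based feature-word picker is a stand-in, not the thesis's extraction method.

```python
import hashlib
from collections import Counter, defaultdict

BITS = 64
BLOCKS = 4                      # assumption: 4 blocks of 16 bits each
BLOCK_BITS = BITS // BLOCKS


def feature_words(text, top_k=8):
    """Toy stand-in: the top_k most frequent tokens longer than 3 characters."""
    tokens = [t for t in text.lower().split() if len(t) > 3]
    return Counter(tokens).most_common(top_k)  # list of (word, count)


def simhash(weighted_words):
    """Build a 64-bit SimHash fingerprint from (word, weight) pairs."""
    totals = [0] * BITS
    for word, weight in weighted_words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each word votes +weight or -weight on every bit position.
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(BITS) if totals[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def blocks(fp):
    """Split a fingerprint into (block_index, block_value) keys."""
    mask = (1 << BLOCK_BITS) - 1
    return [(i, (fp >> (i * BLOCK_BITS)) & mask) for i in range(BLOCKS)]


class NearDuplicateIndex:
    """Block index: two fingerprints that differ in at most 3 bits must
    agree exactly on at least one of the 4 blocks (pigeonhole), so only
    pages sharing a block need a full Hamming comparison."""

    def __init__(self, max_distance=3):
        # The candidate guarantee above holds only for max_distance < BLOCKS.
        self.max_distance = max_distance
        self.buckets = defaultdict(set)
        self.fingerprints = {}

    def add(self, page_id, text):
        fp = simhash(feature_words(text))
        self.fingerprints[page_id] = fp
        for key in blocks(fp):
            self.buckets[key].add(page_id)

    def query(self, text):
        fp = simhash(feature_words(text))
        candidates = set()
        for key in blocks(fp):
            candidates |= self.buckets[key]
        return sorted(
            p for p in candidates
            if hamming(fp, self.fingerprints[p]) <= self.max_distance
        )
```

A caller would add each page's extracted text with add(page_id, text) and then call query(text) for a new page to get the ids of stored pages whose fingerprints lie within the distance threshold. The block lookup avoids comparing the new page against every stored fingerprint, which is one plausible reading of why the abstract describes the method as suitable for large-scale data.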
Keywords/Search Tags: mobile internet, web content extraction, web page similarity calculation, locality-sensitive hashing