Content Relevance-Driven Research and Implementation of Web Resource Outlier Mining

Posted on: 2011-06-29    Degree: Master    Type: Thesis
Country: China    Candidate: H Jin    Full Text: PDF
GTID: 2218330338466967    Subject: Computer application technology
Abstract/Summary:
With the rapid development and popularization of the Internet, people increasingly depend on the network to obtain information. As a massive information source, the Web can be seen as a huge database containing a wide variety of valuable information that is available to users all over the world. Because of the sheer volume of Web information, the dynamic and autonomous nature of Web resources, and the openness of information publishing, information overload and information pollution have become serious problems. Evaluating Web resources helps users acquire high-quality information quickly. This thesis studies Web resource outlier mining driven by content relevance, that is, how to obtain high-quality Web content data from the perspective of content relevance.

The content quality of Web resources is evaluated and quantified with a Web content outlier mining algorithm, and a prototype system is implemented. The system consists of two modules: Web text extraction, and outlier mining of Web text content quality. In the first module, which exploits the way content is concentrated in news pages, a statistics-based method using link density and link text density extracts the body of HTML pages, and the relevant content is consolidated into an XML page. In the second module, N-gram techniques are used to model each document in the document set, a text content outlier detection algorithm then detects outlying texts among the documents, and the reasonableness of the results is analyzed.

Experimental results show that the statistics-based link density and link text density method accurately extracts the body text of both Chinese and English pages, and that the distance-based text outlier detection algorithm effectively finds texts that are unrelated to the other texts of the same category. The experiments indicate that the Web resource content outlier mining system implemented in this thesis has some practical value.
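
The second module described above (N-gram document models followed by distance-based outlier detection) can be illustrated with a minimal Python sketch. The abstract does not give the N-gram order, the distance measure, or the scoring rule, so everything here is an assumption made for illustration: character trigrams, cosine distance between N-gram count vectors, and an outlier score equal to a document's mean distance to all other documents. The function names ngram_profile, cosine_distance, outlier_scores, and rank_outliers are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count overlapping character n-grams (assumed n=3) in a normalized text."""
    cleaned = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two n-gram count vectors."""
    # Counter returns 0 for n-grams that are missing from b.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def outlier_scores(documents, n=3):
    """Score each document by its mean distance to every other document."""
    profiles = [ngram_profile(doc, n) for doc in documents]
    scores = []
    for i, profile in enumerate(profiles):
        distances = [cosine_distance(profile, other)
                     for j, other in enumerate(profiles) if j != i]
        scores.append(sum(distances) / len(distances) if distances else 0.0)
    return scores


def rank_outliers(documents, n=3):
    """Return document indices ordered from most to least outlying."""
    scores = outlier_scores(documents, n)
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Made-up sample texts: three on one topic and one unrelated text.
    docs = [
        "The stock market rose sharply today as investors bought shares.",
        "Investors bought shares and the stock market rose again today.",
        "Shares rose today while the stock market attracted investors.",
        "Boil the pasta for nine minutes, then add garlic and olive oil.",
    ]
    for index in rank_outliers(docs):
        print(index, docs[index])
```

Character N-grams are used in this sketch because they need no word segmentation, which suits a system evaluated on both Chinese and English pages. Whether the thesis uses character-level or word-level N-grams is not stated in the abstract.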
Keywords/Search Tags: Web Content Quality, Web Content Extraction, Content Outlier Mining, DOM, VSM, N-gram