Web Page Information Filtering Method Research Based On Vector Space Model

Posted on: 2009-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: X T Wu
Full Text: PDF
GTID: 2178360242467489
Subject: E-commerce and logistics management
Abstract/Summary:
The development of the Internet drives the development and transformation of society, and the prevalence of electronic commerce is changing how people live. As electronic commerce grows rapidly, however, more and more security problems arise. Illegal web sites such as phishing sites, and unhealthy information such as superstition, pornography, violence, and anti-government content, threaten the content security of the electronic commerce environment. It is therefore necessary to filter unhealthy network information to keep that environment secure, healthy, and harmonious. Traditional filtering technology based on keywords and URLs cannot solve the problem effectively.

This paper surveys the current state of content security technology, applies an information filtering method based on content analysis to protect content security, and studies Chinese word segmentation, text representation, and feature extraction in information filtering. Because HTML tags can affect weight computation, an improved weighting method that combines TF-IDF with HTML tag weights is put forward.

To improve the accuracy of the web page information filtering system, this paper also studies content extraction from web pages. After analyzing the layout characteristics of Chinese web pages and the distribution of Chinese punctuation within them, it proposes a new content extraction method that recognizes the content of a web page from the number of Chinese punctuation marks and the ratio of the number of non-hyperlink characters to the number of characters contained in hyperlinks. Experimental results show that the method is accurate and suitable for most web sites.

Finally, this paper proposes and implements a new information filtering scheme that applies a two-level filtering strategy, combining URL-based filtering with content filtering. Content filtering runs only when the requested URL is in neither the white URL list nor the black URL list, and the URL lists are updated according to the result of the content filtering step. The scheme therefore has both the real-time behavior of URL filtering and the comprehensiveness of content filtering. The web page information filtering system captures HTTP packets using the Winsock 2 SPI, extracts web page content with the newly proposed method, and represents text with the vector space model. Experimental results show that the system has good filtering accuracy and performance.
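
The two-level strategy described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the class name `TwoLevelFilter`, the helper names, the 0.5 threshold, the use of whitespace tokenization with raw term counts (in place of Chinese word segmentation and the improved TF-IDF weights), and the cosine-similarity decision rule are all assumptions made for the example.

```python
import math
from collections import Counter


def to_vector(text):
    # Assumption: whitespace tokens and raw counts stand in for Chinese
    # word segmentation and TF-IDF/HTML-tag weighting.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Cosine of the angle between two sparse term-weight vectors (dicts).
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TwoLevelFilter:
    """Hypothetical two-level filter: URL lists first, content second."""

    def __init__(self, profile_text, threshold=0.5):
        self.white = set()   # URLs already judged acceptable
        self.black = set()   # URLs already judged unacceptable
        self.profile = to_vector(profile_text)  # vector of unwanted content
        self.threshold = threshold              # assumed cut-off value

    def allow(self, url, page_text):
        # Level 1: a URL already in either list is decided immediately.
        if url in self.white:
            return True
        if url in self.black:
            return False
        # Level 2: compare the page vector with the unwanted-content profile.
        score = cosine_similarity(to_vector(page_text), self.profile)
        blocked = score >= self.threshold
        # Feed the content decision back into the URL lists.
        (self.black if blocked else self.white).add(url)
        return not blocked
```

In this sketch a second request for the same URL is answered from the lists without touching the page text, which is the source of the real-time behavior the abstract attributes to URL filtering.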
Keywords/Search Tags: Content Security, Information Filtering, Web Page Content Extraction, Vector Space Model