
Research on Web Page Information Filtering Methods Based on the Vector Space Model

Posted on: 2011-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Li
GTID: 2178360305485333
Subject: Computer application technology
Abstract/Summary:
With the rapid development of network and information technology, users can quickly and easily access vast amounts of shared information over the network, while problems such as "information explosion", "information overload", and "information junk" are becoming increasingly serious. The volume of useless or harmful information far exceeds the volume of information people actually need, which causes considerable inconvenience. How to express user needs accurately, automatically select from a large-scale information stream the information that meets those needs, and filter out useless and harmful information so that people can use information resources more effectively has become a problem that must be solved. To address this problem, this thesis studies LAN-based information filtering. The approach can filter out undesirable web pages and can also filter web pages by topic of interest.

The thesis reviews the current state of web filtering and existing information filtering methods, discusses in detail the key technologies used to implement web page text filtering, and finally addresses adaptive learning of the user template. For web page filtering, the thesis adopts a tiered filtering strategy: it starts with IP-based filtering and keyword filtering of the data packets flowing through the gateway, and then focuses on content-based filtering and its implementation. Content-based filtering consists of two parts: processing of the network data and processing of the text information on web pages.
For network data processing, the thesis discusses WinPcap-based packet capture under Windows and protocol analysis of IP, TCP, and HTTP messages, which is used to discard packets that do not carry text/html data. It then proposes a linked-list-based packet reassembly algorithm to reconstruct the web page. For keyword-based filtering, it proposes an improved multi-keyword matching algorithm built on protocol analysis, which can greatly improve matching efficiency. For page text processing, the thesis uses the vector space model to represent text; because web page text is a special kind of document, it uses an improved vector space model to represent it. Web page text has a special structure, and its useful information lies mainly between certain tags. Features are therefore extracted in order according to a template, so that only the text that actually appears on the page is processed and handling of the entire document is avoided. This can greatly improve the efficiency of web page classification, particularly when the information stream contains many non-relevant documents and documents with large amounts of text. Finally, the thesis describes learning of the user template: an improved Rocchio algorithm is used to update the template and thereby improve filtering precision.
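As a rough illustration of the page-text processing described above, the following Python sketch builds a tag-weighted term vector for a page, scores it against a user template with cosine similarity, and updates the template with the classic Rocchio formula. The tag weights, the coefficients alpha, beta, and gamma, and all function names are assumptions made for illustration only; the improved vector space model and the improved Rocchio algorithm of the thesis are not specified in this abstract and are not reproduced here.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights: text inside these tags counts more than plain text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
# Text inside these tags is never shown to the reader, so it is ignored.
SKIPPED_TAGS = {"script", "style"}


class TagWeightedExtractor(HTMLParser):
    """Accumulates weighted term frequencies from the text between tags."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.vector = Counter()  # term -> weighted frequency

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.stack):
            return
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack), default=1.0)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.vector[term] += weight


def page_vector(html):
    """Return a term -> weight dict for the visible text of an HTML page."""
    parser = TagWeightedExtractor()
    parser.feed(html)
    parser.close()
    return dict(parser.vector)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rocchio_update(profile, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio update: move the template toward the mean of the
    relevant pages and away from the mean of the non-relevant pages."""
    updated = Counter({t: alpha * w for t, w in profile.items()})
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for t, w in doc.items():
                updated[t] += coeff * w / len(docs)
    # Terms whose weight drops to zero or below are removed from the template.
    return {t: w for t, w in updated.items() if w > 0}
```

In use, a page would be passed through `page_vector`, compared with the user template via `cosine`, and accepted or rejected against a similarity threshold chosen by the operator (the threshold is likewise an assumption); pages the user marks as relevant or non-relevant can then be fed back through `rocchio_update`.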
Keywords/Search Tags: vector space model, web extraction, WinPcap, document representation