
Chinese Web Text Filtering Technology Research

Posted on: 2011-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: L J Wang
Full Text: PDF
GTID: 2178360308981408
Subject: Computer application technology
Abstract/Summary:
As the Internet grows in popularity, people depend on it more and more. Its equality, openness, and lack of boundaries have also led to unrestrained abuse: the network is flooded with junk and sensitive information, and some of this "harmful information" threatens the physical and mental health of users, young students in particular. Helping users make convenient, effective use of available network resources and obtain useful information is therefore a research direction in information processing. Current web filtering systems mainly rely on URL filtering and keyword filtering, but these techniques fall short in both accuracy and speed. Improving both requires in-depth analysis of web page content. A web page is a structured document, and the DOM is a programming interface for flexibly manipulating HTML and XML documents. Based on a detailed analysis of web page structure, this thesis proposes parsing pages according to their structure and using the DOM to extract the text content of the different elements of the document. The thesis first describes the fundamentals of web information filtering, including its basic principles, the general processing flow of a filtering system, the classification of filtering systems, and their performance evaluation metrics. It then analyzes and discusses in depth the key technologies involved in web content filtering, mainly Chinese word segmentation, text feature extraction, the representation and updating of user interest models, and text filtering. To address the poor extraction performance of current web information extraction techniques, it proposes an adaptable information extraction method based on the HTML tree and content analysis.
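The idea of using the DOM to pull text out of different elements can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes well-formed XHTML input so that Python's standard `xml.dom.minidom` can parse it, and the function name `text_by_element`, the helper `_iter_text`, and the chosen tag list are hypothetical.

```python
from collections import defaultdict
from xml.dom import minidom


def _iter_text(node):
    """Yield the text of every text node below `node`, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child.data
        elif child.nodeType == child.ELEMENT_NODE:
            yield from _iter_text(child)


def text_by_element(xhtml, tags=("title", "h1", "p", "a")):
    """Group page text by the element it appears in.

    Assumption: `xhtml` is well-formed XML. Real-world HTML usually
    needs a tolerant HTML parser before a DOM can be built.
    """
    doc = minidom.parseString(xhtml)
    result = defaultdict(list)
    for tag in tags:
        for node in doc.getElementsByTagName(tag):
            text = "".join(_iter_text(node)).strip()
            if text:
                result[tag].append(text)
    return dict(result)


SAMPLE = (
    "<html><head><title>Campus news</title></head>"
    "<body><h1>Headline</h1>"
    "<p>First <a href='#'>link</a> paragraph.</p></body></html>"
)

extracted = text_by_element(SAMPLE)
```

Keeping the text grouped by element, rather than flattening the page into one string, preserves the structural information that a later filtering step can use.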
Because the standard vector space model ignores the weight that page structure gives to text, which lowers filtering performance, the thesis improves the vector space model's representation of the text vector; experimental results show that the improved model is better suited to filtering web page text. Building on this research, a prototype Chinese web filtering system was designed, and the thesis details its overall framework, functional modules, and main implementation methods. Finally, the system was tested, and the experiments show that it filters information well.
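The abstract does not give the improved weighting formula, so the following is only a sketch of one way a vector space model could take page structure into account. It assumes that each term's frequency is scaled by a weight for the element it occurs in, that text has already been segmented into space-separated words, and that a page is compared with a user interest profile by cosine similarity. The weights in `ELEMENT_WEIGHTS` and the names `weighted_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical element weights; the thesis's actual values are not given.
ELEMENT_WEIGHTS = {"title": 3.0, "h1": 2.0, "p": 1.0}


def weighted_vector(text_by_element, weights=ELEMENT_WEIGHTS):
    """Build a term vector where each occurrence counts by its element weight.

    Assumption: texts are already word-segmented and space-separated.
    """
    vec = Counter()
    for tag, texts in text_by_element.items():
        w = weights.get(tag, 1.0)  # unknown elements get a neutral weight
        for text in texts:
            for term in text.split():
                vec[term] += w
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


page = {"title": ["football"], "p": ["weather football"]}
profile = Counter({"football": 1.0})  # toy user interest model

structured = weighted_vector(page)
flat = weighted_vector(page, weights={})  # every element weighted 1.0

score_structured = cosine(structured, profile)
score_flat = cosine(flat, profile)
```

In this toy case the term that appears in the title dominates the structured vector, so the page scores closer to the profile than it does under flat term counts.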
Keywords/Search Tags:information extraction, web filtering, DOM tree, vector space model, information filtering