The Research Of Webpage Filtering Based On Concept Fusion

Posted on: 2014-01-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 1228330401463103
Subject: Information security
Abstract/Summary:
With the rapid development of information technology, the Internet plays an increasingly important role in daily communication. The growth of the web has been paralleled by the proliferation of harmful content on its pages, which makes filtering systems necessary for securing access to the Internet. Web filtering aims to prevent access to harmful web pages, and it generally depends on effective classification technology to analyze web page content intelligently. At present, there are two mainstream technologies for web page content filtering: text-based (TBIF) and image-based (IBIF) filtering. However, real web pages usually contain both images and text, and techniques based on image filtering or text filtering alone cannot achieve sound filtering performance. We therefore focus on the fusion of textual and image information and demonstrate that it improves filtering efficiency. Three factors are particularly important for performance: the effectiveness of features, the heterogeneity of multi-modal data, and the real-time requirement of filtering. Accordingly, this work addresses three problems, namely web page representation, heterogeneous feature fusion, and filtering speed, in order to improve the performance of web page fusion filtering. The main research work is as follows:

1) Web page filtering framework based on textual and image concept fusion

Web pages on the real Internet contain both text and images. A web page is usually represented by a single modality of information, but this approach can filter only part of the harmful information. Therefore, the fusion of text and images is a key technology for improving the accuracy of web page filtering.
We propose a web page concept fusion filtering framework based on both text and images to solve the heterogeneity problem in text and image fusion.

2) Meaningful string extraction algorithm for textual and image concept space

Accurate feature representation is a basic step in web page content filtering. Meaningful strings represent specific new words and phrases that are frequently used on the Internet and can be applied to optimize the text description model. Existing meaningful string extraction methods do not consider the correlation between strings, and in a fusion framework the heterogeneity between text and images is also an important factor for extracting meaningful strings. We therefore apply clustering to extract collections of strings and propose a meaningful string extraction algorithm for the textual and image concept space. Our results show that representing web page content with concepts optimizes the vector space model, and that the proposed method improves classification efficiency.

3) Multiple feature concept fusion based on Gaussian local multiple kernel weight

Feature fusion is a key step for accurate web page content filtering. Traditional feature fusion methods do not consider the potential correlation and heterogeneity between features. Building on multiple kernel learning theory, we propose a multiple feature concept fusion method based on Gaussian local multiple kernel weights (MLMKL). To capture the local information of multiple features in the uniform concept space, we obtain a local weight model by using a Gaussian model to simulate the data distribution. This yields different kernel weights for the multiple local features. MLMKL addresses the heterogeneity of feature fusion and, through the local multiple kernel weight model, the problem of effective description.
The results demonstrate that, compared with existing methods, the proposed MLMKL method achieves better accuracy and test speed.

4) Index filtering algorithm based on minimum enclosing circle

At present, web filtering is generally carried out with statistical pattern classification methods. Although these methods can achieve high classification accuracy, they become inefficient on very large-scale data. To address this problem, index technology has been proposed to speed up data queries through efficient partitioning of the data space. Traditional index building methods do not take into account the imbalanced data distribution on the real Internet. We therefore propose a new index filtering algorithm based on minimum enclosing circle area partition (MECI). Considering the imbalanced distribution of real network data, with much normal information and little harmful information, we use the smallest enclosing circle to divide the data area and thereby obtain the largest negative area. In this way, a high-performance index, the F-tree, is built specifically for content security filtering. Because the F-tree directs most normal data queries to the negative area with the highest probability, it improves filtering performance comprehensively.

In this paper, we proposed a concept fusion framework based on an analysis of existing feature fusion algorithms. We studied the following topics: the efficient representation of web page content, the efficient fusion of multi-modal information, and high-performance filtering technology. For these topics, we proposed efficient solutions that improve the accuracy and speed of web page filtering. Our research also provides a technical basis for managing and monitoring multi-modal web page content.
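
One possible reading of the minimum enclosing circle idea in item 4 can be sketched as follows. The sketch rests on assumptions that the abstract does not state: feature points are two-dimensional, and the circle is drawn around the (rare) harmful samples, so that everything outside it forms the negative area where a query can be passed as normal without running the full classifier. The circle is computed with the standard randomized incremental algorithm, not with the dissertation's own procedure; the names `min_enclosing_circle` and `route_query` are hypothetical, and the F-tree index itself is not modeled.

```python
import math
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Circle = Tuple[float, float, float]  # (center_x, center_y, radius)


def _contains(c: Circle, p: Point, eps: float = 1e-9) -> bool:
    """True if point p lies inside or on circle c (with a small tolerance)."""
    return math.dist((c[0], c[1]), p) <= c[2] + eps


def _from_two(a: Point, b: Point) -> Circle:
    """Circle whose diameter is the segment a-b."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, math.dist(a, b) / 2)


def _from_three(a: Point, b: Point, c: Point) -> Optional[Circle]:
    """Circumcircle of a, b, c; None if the points are collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0:
        return None
    sa, sb, sc = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (sa * (by - cy) + sb * (cy - ay) + sc * (ay - by)) / d
    uy = (sa * (cx - bx) + sb * (ax - cx) + sc * (bx - ax)) / d
    return (ux, uy, math.dist((ux, uy), a))


def min_enclosing_circle(points: List[Point], seed: int = 0) -> Circle:
    """Smallest circle enclosing all points (randomized incremental method)."""
    if not points:
        raise ValueError("at least one point is required")
    pts = list(points)
    random.Random(seed).shuffle(pts)  # random order gives expected linear time
    c: Optional[Circle] = None
    for i, p in enumerate(pts):
        if c is not None and _contains(c, p):
            continue
        c = (p[0], p[1], 0.0)  # p must lie on the boundary
        for j, q in enumerate(pts[:i]):
            if _contains(c, q):
                continue
            c = _from_two(p, q)  # p and q must lie on the boundary
            for r in pts[:j]:
                if not _contains(c, r):
                    c3 = _from_three(p, q, r)
                    if c3 is not None:  # guard against collinear triples
                        c = c3
    return c


def route_query(circle: Circle, x: Point) -> str:
    """Assumed routing rule: outside the circle is the negative (normal) area."""
    return "needs_classifier" if _contains(circle, x) else "normal"


# Hypothetical 2-D features of known harmful pages.
harmful = [(0, 0), (4, 0), (2, 1), (1, -1), (3, -0.5)]
circle = min_enclosing_circle(harmful)
print(route_query(circle, (10, 10)))  # far from all harmful samples
print(route_query(circle, (2, 0.5)))  # inside the enclosing circle
```

Under these assumptions the distance test costs one comparison per query, which is why a partition with a large negative area would let most normal traffic skip the expensive classification step.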
Keywords/Search Tags: web page filtering, concept fusion, meaningful string extraction, Gaussian weight model, minimum enclosing circle, index filtering