Research On URL-Pattern Based Algorithm For Web Page Classification

Posted on: 2017-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Yang
GTID: 2308330485453709
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet and the mobile Internet, organizing and managing the massive number of Web pages effectively has become a challenging task. As a fundamental part of Web mining, Web page classification plays an important role in many applications, such as search engines, topical crawlers, and the maintenance of directory sites.

Content-based Web page classification typically extracts features from the page content, hyperlink structure, neighboring pages, and so on, and then applies supervised learning methods. URL-based Web page classification, by contrast, relies only on the URLs of Web pages. Although Web page classification has been explored extensively, existing methods rely heavily on feature engineering, and model training usually takes too long. These methods are also too sensitive to noise to be used on noisy datasets. In addition, they do not take incremental learning into account, so they may not be suitable for online learning.

In this dissertation, we propose an efficient URL-pattern-based classification algorithm, named UPCA. From the training samples that share the same label, we construct a pattern tree and extract URL patterns from it. In this way, we obtain the main URL pattern corpus for a specific type of Web page; this corpus represents the structural characteristics of the URLs that belong to that type. To classify a new page, we only need to match its URL against the pattern corpus: if it matches, the page belongs to the type. In addition, we propose an efficient incremental pattern tree algorithm: when new training samples arrive, we update the existing pattern tree instead of rebuilding it. We also give upper and lower bounds on the influence of new training samples on the information entropy of a key.

Finally, experiments on real datasets demonstrate that UPCA achieves promising results in terms of both classification accuracy and computational efficiency, and that the proposed incremental pattern tree algorithm is applicable where the training data is frequently expanded.
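
The idea of learning URL patterns per label and classifying by matching can be illustrated with a small Python sketch. It is not the UPCA pattern tree from the dissertation, whose construction the abstract does not detail. It is a flat simplification under stated assumptions: each URL is decomposed into key-value pairs (host, path position, query parameter), keys whose values have high information entropy across the training set are replaced by a wildcard, and the resulting generalized URLs form the pattern corpus. The function names (`url_to_keys`, `learn_patterns`, `matches`), the `max_entropy` threshold, and the `example.com` URLs are all hypothetical, and incremental updating is not covered.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlsplit, parse_qsl

WILDCARD = "*"  # assumed marker for a key whose value is allowed to vary


def url_to_keys(url):
    """Decompose a URL into key -> value pairs (host, path positions, query params)."""
    parts = urlsplit(url)
    keys = {"host": parts.netloc.lower()}
    segments = [s for s in parts.path.split("/") if s]
    for i, seg in enumerate(segments):
        keys[f"path{i}"] = seg
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        keys[f"q:{name}"] = value
    return keys


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def learn_patterns(urls, max_entropy=1.0):
    """Build a pattern corpus from URLs that share one label.

    Keys whose value entropy exceeds max_entropy (an assumed threshold)
    are generalized to WILDCARD; all other keys keep their literal value.
    """
    decomposed = [url_to_keys(u) for u in urls]
    values_by_key = defaultdict(list)
    for keys in decomposed:
        for k, v in keys.items():
            values_by_key[k].append(v)
    volatile = {k for k, vals in values_by_key.items() if entropy(vals) > max_entropy}
    return {
        tuple(sorted((k, WILDCARD if k in volatile else v) for k, v in keys.items()))
        for keys in decomposed
    }


def matches(url, patterns):
    """Return True if the URL has the same keys as some pattern and agrees on literals."""
    keys = url_to_keys(url)
    for pattern in patterns:
        if len(pattern) != len(keys):
            continue
        if all(k in keys and (v == WILDCARD or keys[k] == v) for k, v in pattern):
            return True
    return False


# Hypothetical training URLs that all carry the same label.
train = [
    "http://example.com/news/2016/a1.html",
    "http://example.com/news/2016/b2.html",
    "http://example.com/news/2017/c3.html",
    "http://example.com/news/2017/d4.html",
]
patterns = learn_patterns(train, max_entropy=1.5)
print(matches("http://example.com/news/2016/zz.html", patterns))
print(matches("http://example.com/sports/2016/zz.html", patterns))
```

In this toy data the last path segment takes four distinct values, so its entropy is high and it is generalized, while the year segment takes only two values and stays literal. A real pattern tree would make such decisions hierarchically rather than once per key over the whole training set.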
Keywords/Search Tags: URL pattern, Web page classification, Web mining