
On Optimizing Label Weights Automatically For Web Texts Classification

Posted on: 2016-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: X D Zhong
Full Text: PDF
GTID: 2298330467494925
Subject: Computer application technology
Abstract/Summary:
In recent years, with the growth of the Internet and the arrival of the mobile Internet era, people’s work and daily life have become increasingly dependent on the Internet, which is now their main source of information. Efficient web data mining technologies are therefore necessary. Because web page classification is a supporting technology for web data mining, it has become an important research topic.

In this thesis, we first analyze how to extract the features of web pages effectively. We then explain why the weight coefficients assigned to different labels (HTML tags) should be tuned automatically. We also discuss the basic principles of several optimization algorithms and analyze their advantages and disadvantages in detail. Finally, we propose a scheme based on an improved differential evolution (DE) algorithm that selects label weights automatically. The specific work is as follows:

(1) To address the tendency of differential evolution to get stuck in local optima, we propose an improved algorithm. Compared with other optimization algorithms, differential evolution offers better efficiency and global optimization capability. Its local search ability is weak, however, which makes it prone to converging on a locally optimal solution. We therefore propose an improved differential evolution algorithm that strengthens the search ability of DE. Tests on benchmark functions verify that the proposed algorithm is more effective.

(2) To overcome the shortcomings of specifying label weight coefficients manually, we design and implement an improved differential evolution algorithm that searches for the optimal label weights automatically. Different HTML tags in a web page differ in how well they summarize the page content. To exploit this semi-structured nature of web pages when representing page text, different tags need different weight coefficients. Existing web page classification techniques rely on personal experience to set the label weights by hand, which is somewhat arbitrary and cannot adapt to changes in the sample set. An effective optimization algorithm for assigning the weights automatically is therefore necessary. We use the proposed optimization algorithm to search for the best weights for a set of labels. The experimental results demonstrate that the proposed algorithm takes full advantage of the characteristics of the sample set and further improves classification accuracy.

(3) We design a system for automatic web page classification training and prediction. During training, the system uses the proposed optimization algorithm to search for the label weights, so they no longer have to be specified by hand. The system consists of several components: HTML parsing, word segmentation, feature selection, feature representation, and classification model design.
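
The two ideas in items (1) and (2) can be illustrated with a minimal sketch. It is not the improved algorithm proposed in the thesis, whose details the abstract does not give. It shows the textbook DE/rand/1/bin scheme, plus a hypothetical helper, `weighted_term_counts`, that scales each term occurrence by the weight of the HTML tag it appears in. The function names, the parameter values (`f`, `cr`, `pop_size`), and the sphere benchmark are all assumptions chosen for illustration.

```python
import random


def weighted_term_counts(tagged_text, tag_weights, default_weight=1.0):
    """Count terms, scaling each occurrence by the weight of its HTML tag.

    tagged_text: list of (tag, text) pairs, e.g. [("title", "web mining")].
    tag_weights: dict mapping tag name to weight (hypothetical values).
    """
    counts = {}
    for tag, text in tagged_text:
        weight = tag_weights.get(tag, default_weight)
        for term in text.lower().split():
            counts[term] = counts.get(term, 0.0) + weight
    return counts


def differential_evolution(objective, bounds, pop_size=20, f=0.5, cr=0.9,
                           generations=200, seed=0):
    """Textbook DE/rand/1/bin minimizer (not the thesis's improved variant).

    Individuals are replaced in place as soon as a better trial is found.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated component
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    value = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    value = min(max(value, lo), hi)  # clip to the bounds
                else:
                    value = pop[i][j]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy one-to-one selection
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Simple benchmark function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    page = [("title", "Web mining"), ("p", "web page")]
    print(weighted_term_counts(page, {"title": 3.0}))
    print(differential_evolution(sphere, [(-5.0, 5.0)] * 3))
```

To search for label weights in this style, the objective would presumably map a weight vector (one component per tag) to a classification error measured on held-out pages, so that minimizing it favors weights suited to the sample set. That objective is an assumption here; the abstract does not state which fitness function the thesis uses.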
Keywords/Search Tags: Data Mining, Differential Evolution, Selection Strategy, Web Page Classification, Semi-Structure