
Research And Implementation Of Chinese Web Page Classification Technology Based On Self-learning Of Keywords

Posted on: 2018-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: L Ju
GTID: 2428330596453025
Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, to meet people's demand for information on the Internet, more than 4,500,000 websites have come to provide netizens with knowledge and information across many fields. Because there are so many websites, the content and form of web pages have become irregular. To classify the large number of Chinese web pages efficiently, it is not advisable to apply techniques designed for foreign-language web pages without modification. However, traditional Chinese web page classification based on machine learning cannot keep up with the changing, complex environment of the Internet. At this stage, improving the efficiency and accuracy of classification remains a thorny problem. This study addresses the problem of classifying Chinese web pages at large scale.

This thesis studies how to combine a variety of techniques to extract text content, structural information and link information from the source code of a web page. From the extracted information, it builds a feature model that represents the web page, called the Keywords model. This model is then used as the input to a modified convolutional neural network (CNN) that combines a CNN with a support vector machine (SVM), and the modified model outputs the category of the input web page. The main work of this thesis includes:

(1) Improved extraction of web page information and its conversion into the Keywords model. By analyzing the source code of a web page, the method obtains the page's layout information. The page is divided into primary and secondary blocks according to the amount of information each block contains. Text content, hyperlinks and web labels are extracted from the primary blocks. After the text content is segmented into words and the other information is converted into Keywords-model form, the two parts are combined in a three-dimensional form to construct the final Keywords model for the input web page.

(2) Refinement of the web page classification algorithm based on self-learning. This part studies the application of convolutional neural networks to Chinese web page classification. Thanks to their local receptive fields, CNNs can learn deep information from web pages on their own, perceiving features from the local to the global level of the page and automatically obtaining higher-level class features. Weight sharing reduces the difficulty of training the network. By cascading a CNN and an SVM, the proposed technique lowers the dimensionality of the SVM input and improves classification accuracy. For training, a traditional CNN with the same structure is trained first, and the parameters obtained from it are then transferred to the modified model. The CNN-SVM model is then trained further on a set of samples to adjust its remaining parameters. This approach shortens the training cycle and reduces the consumption of computing resources.

(3) Implementation of a secondary channel based on feature augmentation. Because the Keywords model of some pages is sparse, this channel finds additional information for those pages. The matching URL links in the original page are followed, and the important information on the linked pages is used to expand the Keywords model of the original page. Alternatively, the title of the original page can be used as keywords for an Internet search, and the matching search results become part of the Keywords model of the original page. These operations are packaged as the secondary channel, which is a necessary process in the whole system.

(4) System implementation and performance testing. Building on the study of the Keywords model construction and the classification algorithm, this thesis implements a system that classifies Chinese web pages based on self-learning of keyword features. Extensive experiments are conducted on the performance of the Keywords model, the CNN-SVM model and the feature augmentation technique in the secondary channel. The results show that the proposed methods effectively improve classification performance and that the system can handle Chinese web page classification at large scale.
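
As a rough illustration of the secondary channel in item (3), the sketch below expands a sparse keyword set with terms taken from linked pages. It is not the thesis's implementation: the abstract does not specify the three-dimensional Keywords model, so a plain term-count mapping stands in for it, and the names `is_sparse` and `augment_keywords`, the sparsity threshold `min_terms` and the cutoff `top_k` are all hypothetical. Linked pages are assumed to have been fetched and reduced to term counts already, so no network access is shown.

```python
from collections import Counter


def is_sparse(keywords, min_terms=5):
    """Return True when a page has fewer than min_terms distinct keywords.

    min_terms is an assumed threshold; the abstract gives no sparsity criterion.
    """
    return len(keywords) < min_terms


def augment_keywords(keywords, linked_pages, min_terms=5, top_k=3):
    """Expand a sparse keyword Counter with the top_k terms of each linked page.

    keywords     -- Counter of term -> count for the original page (assumed form)
    linked_pages -- list of Counters, one per matched linked page
    Pages that are not sparse are returned unchanged (as a copy).
    """
    merged = Counter(keywords)
    if not is_sparse(keywords, min_terms):
        return merged
    for page in linked_pages:
        # Keep only the most frequent terms of the linked page so that
        # incidental words do not swamp the original page's keywords.
        for term, count in page.most_common(top_k):
            merged[term] += count
    return merged


if __name__ == "__main__":
    original = Counter({"football": 2})
    linked = [Counter({"football": 3, "league": 2, "score": 1})]
    print(augment_keywords(original, linked, min_terms=3, top_k=2))
```

Gating on sparsity keeps the extra work confined to pages that need it, which matches the abstract's description of the channel as serving only pages whose Keywords model is sparse. The title-search variant could reuse the same merge step with search results in place of linked pages.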
Keywords/Search Tags: web page classification, Keywords model, self-learning, convolutional neural networks, secondary channel