Research And Implementation Of Web Site Classification Based On Machine Learning

Posted on: 2018-11-26 | Degree: Master | Type: Thesis
Country: China | Candidate: M Y Cheng | Full Text: PDF
GTID: 2348330512488001 | Subject: Engineering
Abstract/Summary:
Harmful information has long been present on the Internet, and its volume keeps growing. Most of it is pornographic, but it also includes gambling, pyramid schemes, and other illegal content. The public has worked to clean up the Internet environment, and the state has introduced laws and regulations to govern it, yet harmful content keeps resurfacing and spreading. Many software- and hardware-based systems already block harmful information, but most of them work in isolation ("fragmented"), each maintaining its own, largely duplicated blacklist. The goal of the system described here is to provide common database support for such interception systems by proactively inspecting website content and building a shared database of harmful information. To this end, this thesis studies deep learning algorithms for image classification and text classification and applies them to the task of classifying harmful information. Deep learning algorithms learn to extract features automatically and achieve higher accuracy in image recognition than knowledge-engineering or statistical methods, which require features to be extracted by hand. For text classification, we propose a new method: the long text of a page is first truncated into short texts, each short text is classified, and the results are aggregated to obtain the proportion of pornographic text on the page. Finally, the model's pornography threshold is adjusted to the characteristics of different user groups, so that filtering meets the needs of each group.
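The text pipeline described above (truncate, classify each short text, aggregate, compare against an audience-specific threshold) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the segment length, the `keyword_stub` classifier (a stand-in for the trained short-text model), and the threshold values in `AUDIENCE_THRESHOLDS` are all assumptions chosen for the example.

```python
from typing import Callable, Dict, List


def split_into_segments(text: str, max_chars: int = 50) -> List[str]:
    """Truncate a long page text into consecutive short segments."""
    text = text.strip()
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def flagged_proportion(segments: List[str],
                       classify: Callable[[str], bool]) -> float:
    """Return the fraction of segments the classifier marks as harmful."""
    if not segments:
        return 0.0
    flagged = sum(1 for segment in segments if classify(segment))
    return flagged / len(segments)


def should_block(page_text: str,
                 classify: Callable[[str], bool],
                 threshold: float,
                 max_chars: int = 50) -> bool:
    """Block the page when the flagged proportion reaches the threshold."""
    segments = split_into_segments(page_text, max_chars)
    return flagged_proportion(segments, classify) >= threshold


def keyword_stub(segment: str) -> bool:
    """Hypothetical placeholder for a trained short-text classifier."""
    return "flagged" in segment.lower()


# Hypothetical per-audience thresholds; stricter groups use lower values.
AUDIENCE_THRESHOLDS: Dict[str, float] = {"children": 0.05, "general": 0.30}
```

With this layout, the same classified page can be blocked for one audience and allowed for another simply by looking up a different threshold, which is the adjustment the abstract describes.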
For image classification, deep convolutional models are the most effective. They have made great progress in recent years, and several types of model have been developed, such as linear, local bimodal, and local multi-branch types. In this thesis, we study the performance of these model types on the task of classifying harmful images and train the deep convolutional models by fine-tuning. The most suitable image classification algorithm is then selected according to each model's computational cost and accuracy. The system design takes full account of scalability and portability, and old or idle equipment can be used as worker nodes, which saves project funds. The system consists of five parts: a web crawler module, a text classification module, an image classification module, a data storage module, and a data display module. The web crawler, text classification, and image classification modules are the main research focus of this thesis.
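The selection step for the image classifier, which weighs computational cost against accuracy, can be illustrated with a small sketch. The abstract does not say how the two criteria are combined, so the rule used here (take the most accurate model within a cost budget and break ties by lower cost) is an assumption, and the `Candidate` type and `select_model` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A fine-tuned model with its measured accuracy and relative cost."""
    name: str
    accuracy: float  # fraction of correct predictions on a validation set
    cost: float      # relative computational cost, e.g. inference time


def select_model(candidates: List[Candidate],
                 max_cost: float) -> Optional[Candidate]:
    """Pick the most accurate candidate whose cost fits the budget.

    Ties on accuracy are broken in favour of the cheaper model.
    Returns None when no candidate fits the budget.
    """
    affordable = [c for c in candidates if c.cost <= max_cost]
    if not affordable:
        return None
    return max(affordable, key=lambda c: (c.accuracy, -c.cost))
```

A cost budget suits the setting described above, where old or idle equipment serves as worker nodes, but a weighted score over accuracy and cost would be an equally plausible reading.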
Keywords/Search Tags:Content Classification, Deep Learning, Fine-tune, Network Purification, Web Crawler