| With the development of Internet technology, the network has become the main source of information for individuals and enterprises. These rich and varied information resources bring convenience, but they also contain a large amount of harmful information, such as reactionary content, pornography, drugs, gambling, and advertising for illegally marketed products. This harmful information not only hinders the construction of a green and healthy Internet environment but also hampers the process of obtaining information. Because text accounts for a large proportion of network information, research on harmful text filtering technology is of practical significance for cleaning up network information as a whole, so that useful text information can be obtained quickly and effectively. Based on the Naïve Bayes algorithm with the Vector Space Model (VSM), this dissertation proposes a harmful text filtering technology for large volumes of mobile network information, and investigates and improves the methods and models it contains. Finally, harmful text filtering is implemented for the specified system. The main research work and contributions of this dissertation are as follows: (1) Taking VSM as the text representation method, the set of category central vectors is defined by improving the feature selection method. By optimizing the Naïve Bayes model, the classification algorithm for text filtering is trained, which lays the foundation for the subsequent work. (2) Based on the Naïve Bayes algorithm, a harmful text filtering technology that introduces the idea of hypothesis testing is proposed. First, the Ansj Chinese text segmentation method is used to segment Chinese text. Then, the VSM-based Naïve Bayes classification algorithm is combined with harmful text filtering. Finally, the set of classification thresholds is applied to filter harmful text. (3) A web crawler is written in Java. The Jsoup open-source HTML parser is used to analyze the page structure of each designated website and to collect corpus information. The information in the application system is then analyzed to screen the corpus, yielding the final corpus. (4) Eclipse is used to develop a test platform for the harmful text filtering technology based on the Naïve Bayes algorithm. The feasibility of the filtering technology is verified by a set of basic tests, and three sets of comparative tests further demonstrate the effectiveness of the other improvements in this technology. |
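
The filtering step in contribution (2) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the class name `NaiveBayesFilterSketch`, its methods, and the log-odds threshold rule are all hypothetical. The sketch assumes the text has already been segmented into tokens (the abstract uses Ansj for this), represents each document as a bag of term counts, and uses a standard two-class multinomial Naïve Bayes with add-one smoothing. The abstract does not specify the improved feature selection, the category central vectors, or the hypothesis-testing threshold, so none of them are modelled here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: two-class multinomial Naive Bayes with a decision threshold. */
class NaiveBayesFilterSketch {
    private final Map<String, Integer> harmfulCounts = new HashMap<>();
    private final Map<String, Integer> normalCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int harmfulTokens = 0;
    private int normalTokens = 0;
    private int harmfulDocs = 0;
    private int normalDocs = 0;

    /** Adds one pre-segmented training document to the chosen class. */
    void train(List<String> tokens, boolean harmful) {
        Map<String, Integer> counts = harmful ? harmfulCounts : normalCounts;
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum); // term frequency per class
            vocabulary.add(token);
        }
        if (harmful) {
            harmfulDocs++;
            harmfulTokens += tokens.size();
        } else {
            normalDocs++;
            normalTokens += tokens.size();
        }
    }

    /** Returns log P(harmful | doc) - log P(normal | doc) with add-one smoothing. */
    double logOdds(List<String> tokens) {
        if (harmfulDocs == 0 || normalDocs == 0) {
            throw new IllegalStateException("train at least one document per class first");
        }
        int v = vocabulary.size();
        // Class prior ratio estimated from document counts.
        double score = Math.log((double) harmfulDocs / normalDocs);
        for (String token : tokens) {
            double pHarmful = (harmfulCounts.getOrDefault(token, 0) + 1.0) / (harmfulTokens + v);
            double pNormal = (normalCounts.getOrDefault(token, 0) + 1.0) / (normalTokens + v);
            score += Math.log(pHarmful) - Math.log(pNormal);
        }
        return score;
    }

    /** Flags a document as harmful when its log-odds exceed the (assumed) threshold. */
    boolean isHarmful(List<String> tokens, double threshold) {
        return logOdds(tokens) > threshold;
    }
}
```

In this sketch, a threshold of 0 reduces to ordinary maximum-posterior classification. Raising the threshold makes the filter more conservative about labelling text as harmful, which is one plausible way a classification threshold could trade false positives against missed harmful text.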