Research On Network Unhealthy Information Identification Based On Rules And Statistics

Posted on: 2018-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: L F Liu
GTID: 2348330518482366
Subject: Computer system architecture
Abstract/Summary:
The rapid development of the Internet has had a great and far-reaching impact on society and on people's lives. As a carrier of information, the Internet has clear advantages over traditional paper media: it provides a high-quality platform for disseminating political, economic, cultural, and other information, and it creates a new way for people to communicate. Along with this convenience, however, the Internet also brings negative effects. In a virtual network environment each user is reduced to a string of virtual symbols, and information published through personal web pages, microblogs, WeChat public accounts, forums, and other online media carries a degree of uncertainty. Although many platforms apply pre-publication review and post-publication filtering, some users who hide their identity, or who have weak moral awareness or poor cultural literacy, still fill every corner of the network with false, pornographic, politically sensitive, fraudulent, superstitious, and other unhealthy information. Such information corrupts the social atmosphere, misleads the public, and seriously harms people's physical and mental health.
Microblogging is a social media platform with a huge user base, built on user relationships, for sharing, disseminating, and obtaining information. Messages posted by a user are pushed promptly through the client or the platform to followers, achieving real-time, fast dissemination. Followers can interact with the blogger by commenting, and can also forward, comment on, or bookmark posts, which widens the reach of the information and strengthens its influence. These same features have also made microblogging a hiding place for unhealthy information.
Microblogging has therefore become a subject of study for many scholars. To purify the network environment, keep minors away from unhealthy information, and give Internet users a good search experience, the release and spread of such information must be controlled, with appropriate measures taken to strengthen supervision and management.
To this end, this thesis takes the identification of unhealthy network information as its goal and carries out experimental research using existing Chinese text mining techniques. A crawler collects the comments and forwarded content that microblog users post on specific microblog texts, yielding the raw data. The feature set of each text is then extracted by removing irrelevant symbols, segmenting words, annotating dependency relations, and counting word frequencies. To improve the accuracy of word segmentation, the thesis designs a lexicon of unhealthy terms that contains basic vocabulary, synonyms, abbreviations, and the relatedness between words. A feature extraction algorithm is combined with dependency analysis to extract text features effectively, and a naive Bayes algorithm is used to build a text classification model. The model is applied to classifying user comments on microblogs, and the classifier is evaluated experimentally. Compared with the classifier before the improvement, the accuracy and recall of classification are clearly improved. Finally, the thesis summarizes the research, sets out its innovations and shortcomings, and identifies improvements to pursue in follow-up research.
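The pipeline described above (symbol removal, lexicon-based normalization, word frequency features, naive Bayes classification) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation. It assumes whitespace-tokenized English toy comments in place of Chinese word segmentation, omits dependency analysis entirely, and uses a tiny hypothetical lexicon. The names `LEXICON`, `extract_features`, and `NaiveBayes` are invented for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical lexicon: maps synonyms and abbreviations to one canonical term.
LEXICON = {"scam": "fraud", "swindle": "fraud", "frd": "fraud"}


def extract_features(text):
    """Remove irrelevant symbols, split on whitespace, normalize via the lexicon."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return [LEXICON.get(tok, tok) for tok in cleaned.split()]


class NaiveBayes:
    """Multinomial naive Bayes over word counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for tok in extract_features(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in extract_features(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
train_texts = [
    "send money now to claim your prize",
    "this swindle offer gives free money",
    "nice photo from the trip today",
    "thanks for sharing the news today",
]
train_labels = ["bad", "bad", "ok", "ok"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("Total FRD, they want your money!"))
print(model.predict("nice news today"))
```

In this sketch the lexicon lets the abbreviation "frd" and the synonym "swindle" count as evidence for the same feature, "fraud", which is one plausible way a lexicon of synonyms and abbreviations could help classification. For real Chinese text, a word segmenter would replace the whitespace split.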
Keywords/Search Tags: Feature Extraction, Text Categorization, Naive Bayes, Recognition of Bad Information