
Research On Automatic Recognition Of Uncivilized Micro-Blog Posts

Posted on: 2017-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: M Gao
GTID: 2308330488485688
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, social networks of all kinds are changing people's lifestyles. As a leading product of this kind, micro-blog further promotes interaction between individuals and the world. However, because micro-blog products have no access threshold, paid posters spread large amounts of unhealthy information and ordinary users post malicious, irresponsible remarks. This harms the physical and mental health of micro-blog users of all ages, and it damages the network environment as a whole and even the social order.

To purify the network environment and promote a harmonious online language culture, micro-blog information needs to be supervised, and the automatic recognition of uncivilized micro-blog posts plays an important role in that supervision. Achieving automatic recognition requires classifying micro-blog posts by their uncivilized tendency. The main work of this paper consists of the following two parts.

First, this paper proposes a method for constructing an uncivilized micro-blog corpus. Because no suitable micro-blog corpus exists, and in particular no collection of uncivilized micro-blog posts of sufficient scale, the paper uses the Tencent Micro-blog API to download posts from the public timeline and then extracts seed users from them. Starting from these seed users, user information and user posts can be collected on a large scale. Filtering rules remove posts that are meaningless for the research. To extract potentially uncivilized posts, a list of uncivilized seed words is established and matched against the posts, yielding a potentially uncivilized subset of the corpus that facilitates the subsequent work.
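
The filtering and seed-word matching steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in the thesis: the seed words, the minimum-length filter rule, and the names `is_meaningful` and `match_seed_words` are hypothetical placeholders, and posts are assumed to be plain strings that have already been downloaded. Substring matching is used because Chinese text carries no whitespace between words.

```python
# Hypothetical seed list; the actual list is not given in the abstract.
SEED_WORDS = {"idiot", "scum"}

MIN_LENGTH = 5  # assumed filter rule: drop very short posts


def is_meaningful(post: str) -> bool:
    """Assumed filtering rule: keep posts with enough non-whitespace text."""
    return len(post.strip()) >= MIN_LENGTH


def match_seed_words(posts, seed_words=SEED_WORDS):
    """Return (post, matched_words) pairs for posts containing a seed word."""
    candidates = []
    for post in posts:
        if not is_meaningful(post):
            continue  # removed by the filter rule
        lowered = post.lower()
        # Substring match, so no word segmentation is needed at this stage.
        hits = sorted(w for w in seed_words if w in lowered)
        if hits:
            candidates.append((post, hits))
    return candidates


posts = ["ok", "What a lovely day", "You are an IDIOT and scum"]
candidates = match_seed_words(posts)
```

Posts selected this way are only candidates; they would still need manual annotation before being used as training data.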
Finally, the uncivilized micro-blog corpus is constructed by indexing the parsed micro-blog data files.

Second, the paper proposes a method for automatically recognizing uncivilized micro-blog posts, whose key problem is the classification of short micro-blog texts. We choose the Naive Bayes classifier as the classification model and combine it with a word-based bigram model to segment texts and extract the key features of uncivilized language. After the corpus has been manually annotated, the Naive Bayes classifier is trained and applied to classification. Because the proportion of uncivilized micro-blog posts in the real network environment is unknown, we introduce the ratio of positive to negative samples as a parameter and adjust the training and test sets dynamically until classification precision reaches a local optimum. For uncivilized abbreviations that the model cannot recognize, we establish a list of uncivilized abbreviations based on the previous step and implement an abbreviation-based recognition method, which further improves recognition. At the end of the paper, examples illustrate how the automatic recognition system can be applied to micro-blog public opinion monitoring.
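
To make the classification step concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over word bigrams with add-one smoothing. It is a simplification and not the thesis's implementation: the bigrams here serve only as features, the input is assumed to be already segmented into whitespace-separated words, and the class name `BigramNaiveBayes`, the boundary markers, and the toy training data are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def bigrams(text):
    """Word bigrams of a whitespace-tokenized text, with boundary markers."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


class BigramNaiveBayes:
    """Multinomial Naive Bayes over word bigrams with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # bigram counts per class
        self.vocab = set()                           # all bigrams seen in training
        for text, label in zip(texts, labels):
            feats = bigrams(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods of the text's bigrams."""
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        counts = self.feature_counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        for feat in bigrams(text):
            score += math.log((counts[feat] + 1) / denom)
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda c: self.log_score(text, c))


# Toy, made-up training data (label 1 = uncivilized, 0 = civil).
train_texts = [
    "you are stupid",
    "shut up you fool",
    "have a nice day",
    "thank you very much",
]
train_labels = [1, 1, 0, 0]

model = BigramNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("you are stupid fool")
```

The class prior in `log_score` depends directly on how many positive and negative samples are in the training set, which is one reason the ratio of positive to negative samples matters when the real-world proportion is unknown.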
Keywords/Search Tags: micro-blog, text classification, naive Bayes, bigram language model