
Research On Automatic Recognition Of Chinese Short Text

Posted on: 2018-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Bi
GTID: 2348330515969714
Subject: Computer Science and Technology
Abstract/Summary:
Social platforms and instant messaging tools are becoming increasingly popular, and they mostly exchange information as short text because it is convenient, fast, efficient, and suited to today's fast-paced life. Short text here refers mainly to short messages, micro-blogs, commodity reviews, forum posts, and other text that is short or limited in word count. Short text carries a large amount of illegal information, such as spam messages, advertising micro-blogs, and fake reviews. Because short text is brief and is written in an open, nonstandard way, recognizing spam short text through binary classification runs into three problems: (1) the data set is noisy; (2) the training data set is unbalanced; (3) the feature vector is sparse and high-dimensional when a dictionary-based vector space model is used.

This thesis studies these three problems. The main work is as follows:
1) A preprocessing method suited to short text is proposed to standardize the data set. It mainly includes typo correction, conversion of traditional Chinese characters to simplified Chinese characters, conversion of uppercase letters to lowercase letters, and a uniform representation of equivalent information.
2) Feature items are abstracted from the text content (for example, its writing conventions and word characteristics) and from structural attributes outside the content, so that the feature vector is neither sparse nor high-dimensional.
3) An ensemble classification method, “Random Forest+Adaboost”, is proposed, in which the random forest serves as the base classifier of AdaBoost, to reduce the impact of noise in the data set.

Because short messages and commodity reviews are highly similar, they are selected as the research objects, and the method proposed in this thesis is used to study the recognition of spam short text. Experiments are carried out on the large short message data set provided by China Mobile and on the commodity review data set provided by COAE 2015 Task 4. The results show that the proposed method is effective and that the ensemble algorithm “Random Forest+Adaboost” has some advantage over other classification algorithms.
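
The preprocessing steps named in the abstract (traditional-to-simplified conversion, lowercasing, and a uniform representation of equivalent information) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the two-entry character table, the regular expressions, the `<url>` and `<phone>` placeholders, and the feature names in `structural_features` are all hypothetical choices made for illustration, and typo correction is omitted because the abstract does not describe how it is done.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny traditional-to-simplified table.
# A real system would need a complete conversion dictionary.
TRAD_TO_SIMP = {"費": "费", "電": "电"}

# Assumed patterns for "equivalent information": any URL becomes one token,
# and any run of 7 or more digits is treated as a phone number.
URL_RE = re.compile(r"(?:https?://|www\.)[a-z0-9./?=&_%:#-]+")
PHONE_RE = re.compile(r"\d{7,}")


def normalize(text: str) -> str:
    """Standardize one short text (illustrative pipeline only)."""
    # NFKC folds full-width letters, digits, and punctuation to ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Uppercase letters to lowercase letters.
    text = text.lower()
    # Traditional characters to simplified characters, one by one.
    text = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)
    # Uniform representation: replace concrete values with placeholders.
    text = URL_RE.sub("<url>", text)
    text = PHONE_RE.sub("<phone>", text)
    return text


def structural_features(raw: str) -> dict:
    """A few non-content attributes; the feature set is an assumption."""
    norm = normalize(raw)
    return {
        "length": len(raw),
        "has_url": int("<url>" in norm),
        "has_phone": int("<phone>" in norm),
        "exclamations": norm.count("!"),
    }


sample = "免費 CALL １３８００１３８０００ WWW.Example.COM！"
normalized = normalize(sample)
features = structural_features(sample)
```

Replacing concrete URLs and numbers with placeholders keeps the vocabulary small, which is one plausible way to limit the sparse, high-dimensional vectors the abstract mentions. The resulting features could then be passed to any classifier; for the ensemble itself, scikit-learn's `AdaBoostClassifier` accepts a base estimator, so a `RandomForestClassifier` could be supplied there, though the abstract does not say which tooling the thesis used.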
Keywords/Search Tags:short text, short message, commodity review, random forest, adaboost