Research On Short Text Categorization Based On Phrase-Like Repeat And Semi-Supervised Learning

Posted on: 2011-02-06 | Degree: Master | Type: Thesis
Country: China | Candidate: Y H Cai | Full Text: PDF
GTID: 2178360302993972 | Subject: Computer application technology
Abstract/Summary:
After a decade of rapid Internet development in China, the number of netizens has reached 338,000,000 and the penetration rate has reached 25.5%. The Web is now the most important way for people to acquire information. Short text arising from human interaction has become predominant in the Internet information flow. In the era of mobile communications, the mobile phone short message has become an indispensable part of people's lives. All kinds of topics and opinions are expressed in huge short text databases, and short text communication is drastically changing the pattern of information dissemination. Short text mining technologies may be widely used in topic tracking and detection, catchword analysis, public opinion forewarning, and other applications. Short text classification is the key technology of topic tracking and detection.

In view of the distinctive linguistic characteristics of short texts, this dissertation studies the key technologies of short text classification in depth. The main contributions are summarized as follows:

Firstly, a feature selection method based on Phrase-Like Repeat (PLR) is presented. After analyzing the limitations of the traditional text representation model, we propose the concept of PLR and a novel representation method based on it, which acquires terms by extracting from the text those PLRs that have strong representational power. The proposed representation method improves the completeness and independence of terms and overcomes the limitations of the VSM (Vector Space Model). Furthermore, we propose a feature selection method based on PLR. The experimental results demonstrate that the proposed method improves the quality of short text classification algorithms and reduces the feature dimension remarkably.

Secondly, a semi-supervised short text categorization method based on ensemble learning is presented. A feature selection ensemble algorithm based on EM is proposed to overcome the limitation of the attribute independence assumption and to improve the generalization ability of the semi-supervised EM algorithm. Experiments on a real corpus show that the proposed method exploits unlabeled data more effectively to enhance learning performance, and is superior to semi-supervised EM in both learning efficiency and classification generalization.

Thirdly, taking topic tracking as an example, the author describes an application of short text categorization based on PLR and semi-supervised learning, and presents a new BBS topic tracking method based on these techniques. The experimental results show that the new method performs well.
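As a rough illustration of the idea of using repeated phrase-like strings as terms, the sketch below extracts word sequences that recur across a small corpus and uses them as features. The abstract does not give the formal definition of PLR, so everything here is an assumption made for illustration: a "repeat" is taken to be a word n-gram that occurs in at least `min_count` documents, a shorter repeat is dropped when a longer repeat with the same document frequency contains it, and the function names `extract_repeats` and `vectorize` are hypothetical. Whitespace tokenization is also an assumption; Chinese text would need a word segmenter instead.

```python
from collections import Counter


def _ngrams(tokens, max_len):
    """Yield every word n-gram (as a tuple) with 1 <= n <= max_len."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def _contains(long_gram, short_gram):
    """True if short_gram occurs as a contiguous run inside long_gram."""
    n = len(short_gram)
    return any(long_gram[i:i + n] == short_gram
               for i in range(len(long_gram) - n + 1))


def extract_repeats(docs, min_count=2, max_len=3):
    """Return the set of maximal repeated n-grams (illustrative assumption,
    not the thesis's PLR definition)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each n-gram once per document (document frequency).
        doc_freq.update(set(_ngrams(doc.lower().split(), max_len)))

    repeats = {g for g, c in doc_freq.items() if c >= min_count}

    # Drop a repeat when a longer repeat with the same document
    # frequency contains it, so that only the fullest phrase remains.
    return {
        g for g in repeats
        if not any(len(h) > len(g)
                   and doc_freq[h] == doc_freq[g]
                   and _contains(h, g)
                   for h in repeats)
    }


def vectorize(doc, features):
    """Count occurrences of each selected repeat in one document."""
    max_len = max((len(f) for f in features), default=0)
    grams = _ngrams(doc.lower().split(), max_len)
    return Counter(g for g in grams if g in features)


docs = [
    "new iphone price drop today",
    "iphone price drop rumor",
    "weather is nice today",
]
features = extract_repeats(docs)
# sorted(features) -> [('iphone', 'price', 'drop'), ('today',)]
```

In this toy corpus the single words "iphone", "price", and "drop" are absorbed into the one phrase that always carries them, which loosely mirrors the abstract's claim that phrase-level terms are more complete and yield a much smaller feature space than individual words.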
Keywords/Search Tags: short text, text classification, semi-supervised learning, Phrase-Like Repeat, ensemble learning