Concept-based Short Text Classification

Posted on: 2017-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Cai
GTID: 2348330536953465
Subject: Software engineering
Abstract/Summary:
With the rapid growth of microblogs and user comments, a large amount of short text has appeared on the Internet, and this short text data contains a wealth of information and resources. How to manage and make full use of these information resources, and how to help people quickly find the information they need, has become a major challenge for information processing technology. Text classification is the key technology for organizing and processing massive document data. Currently, most text classification approaches are designed for long text. Unlike long text, however, short text has fewer keywords and sparser features, and its context and semantic information are incomplete and ambiguous, which makes it hard for traditional text representations to evaluate the usefulness of features for classification. Short text is usually terse and carries insufficient information, which makes classification difficult. We can, however, introduce information from an existing knowledge base to strengthen short text classification. Wikipedia is currently the largest high-quality, human-edited knowledge base, and short text classification would benefit if full use could be made of the information it contains. This thesis presents a new short text representation method based on Wikipedia concepts: it identifies the Wikipedia concepts mentioned in a short text, and then expands the related Wikipedia concepts together with the information in the short text into a feature vector representation.

On the other hand, as a supervised learning process, short text classification requires a large number of labeled samples as a training set, which incurs relatively high labor or economic costs. In traditional supervised learning problems, active learning tries to select the samples that most improve classification accuracy, in order to reduce the number of labeled samples needed. The most popular active learning method is uncertainty sampling. In short text classification, however, uncertainty sampling often fails to improve classification performance because it selects outlier samples. This thesis proposes a Top-k representative-based sample selection method, together with a greedy approximation algorithm for the resulting optimization problem. Experimental results show that the proposed method selects better training samples than the methods it is compared against, which can reduce the labeling workload for short text categorization.
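The abstract does not state the objective behind the Top-k representative selection, so the following is only a minimal sketch of one common way to pick representative samples greedily, not the thesis's actual algorithm. It assumes a coverage-style objective (each text is credited with its highest cosine similarity to any selected text) and uses plain bag-of-words vectors as a stand-in for the concept-expanded representation. The names `bag_of_words`, `cosine`, and `greedy_top_k` are hypothetical.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lower-cased term counts; a stand-in for a concept-expanded vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def greedy_top_k(texts, k):
    """Greedily pick up to k indices of texts that best cover the whole pool.

    Assumed objective (not taken from the thesis): the sum, over all texts,
    of the highest similarity to any selected text. Each round adds the
    candidate with the largest marginal gain in that sum.
    """
    vecs = [bag_of_words(t) for t in texts]
    n = len(vecs)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    covered = [0.0] * n  # best similarity of each text to the selected set so far
    selected = []
    for _ in range(min(k, n)):
        best, best_gain = None, -1.0
        for c in range(n):
            if c in selected:
                continue
            # Marginal gain: how much candidate c raises the coverage of each text.
            gain = sum(max(sim[c][j] - covered[j], 0.0) for j in range(n))
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
        covered = [max(covered[j], sim[best][j]) for j in range(n)]
    return selected
```

Under this assumed objective, a text that resembles many others is preferred in early rounds over an isolated outlier, which is the behavior the abstract contrasts with uncertainty sampling. In an active learning loop, the selected indices would be the texts sent to annotators for labeling.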
Keywords/Search Tags: short text classification, concept recognition, active learning