Concept-based Short Text Classification

Posted on: 2017-08-31

Degree: Master

Type: Thesis

Country: China

Candidate: Z W Cai

GTID: 2348330536953465

Subject: Software engineering

Abstract/Summary:

With the rapid growth of microblogs and user comments, a large amount of short text has appeared on the Internet, and this short text data contains a wealth of information and resources. How to manage and make full use of these information resources, and how to help people quickly find the information they need, has become a major challenge for information processing technology. Text classification is a key technology for organizing and processing massive document data. Currently, most text classification approaches are designed for long text. Unlike long text, however, short text has fewer keywords and sparser features, and its contextual and semantic information is incomplete and ambiguous, so with traditional text representations it is hard to evaluate how useful individual features are for classification. Short text is usually terse and carries insufficient information, which makes classification difficult. We can, however, introduce information from an existing knowledge base to strengthen short text classification. Wikipedia is currently the largest high-quality, human-edited knowledge base, and short text classification would benefit from making full use of the information it contains. This thesis presents a new Wikipedia concept-based representation method for short text: it identifies the Wikipedia concepts mentioned in a short text, then expands the feature vector representation with related Wikipedia concepts alongside the information in the short text itself.

In addition, as a supervised learning process, short text classification requires a large number of labeled samples as a training set, which incurs considerable labor or economic cost. In traditional supervised learning, active learning tries to select the samples that most improve classification accuracy, in order to reduce the number of labeled samples needed. The most popular active learning method is uncertainty sampling. However, uncertainty sampling often fails to improve performance in short text classification because it tends to select outlier samples. This thesis proposes a Top-k representative-based sample selection method and a greedy approximation algorithm to solve the resulting optimization problem. Experimental results show that the proposed method selects training samples better than the methods it is compared against in this thesis, which can reduce the labeling workload for short text classification.
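The abstract does not state the objective behind the Top-k representative selection, so the following is only a minimal sketch under assumptions, not the thesis's algorithm. It assumes each unlabeled short text is already a non-negative feature vector, scores a candidate set by how well it "covers" the pool (the sum, over all samples, of the best cosine similarity to any selected sample), and greedily adds the sample with the largest marginal gain. The names `cosine`, `greedy_top_k_representatives`, and `pool` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_top_k_representatives(vectors, k):
    """Greedily pick up to k indices whose samples best cover the pool.

    Assumed objective (not taken from the thesis): coverage of a set S is
    sum_i max_{j in S} cosine(vectors[i], vectors[j]).
    """
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    covered = [0.0] * n  # best similarity of each sample to the selected set
    selected = []
    for _ in range(min(k, n)):
        best_j, best_gain = None, -1.0
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j raises total coverage.
            gain = sum(max(sim[i][j] - covered[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        covered = [max(covered[i], sim[i][best_j]) for i in range(n)]
    return selected


# Toy pool: indices 0-2 form one cluster, 3-4 another, 5 is an isolated outlier.
pool = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
picked = greedy_top_k_representatives(pool, 2)
print(picked)
```

On this toy pool the two picks come from the two clusters, and the isolated sample at index 5 is skipped. That is the behavior the abstract contrasts with uncertainty sampling, which may favor outliers. A practical selector would likely combine representativeness with a classifier-based informativeness score; that combination is not shown here.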

Keywords/Search Tags:

Short text classification, concept recognition, active learning

Related items

1	Research On Distributed Classification Methods For Short Text Data Streams
2	Research On Short Text Classification Based On Its Own Features
3	Research On Short Text Classification Method Based On Contextual Feature Expression
4	Short Text Classification Based On Apriori Algorithm
5	A Study For Classifying Short Text In Social Network
6	Research On Chinese Short Text Classification Based On Hybrid Neural Network
7	Research On Chinese Text Classification Algorithm Based On Active Learning Approach
8	Research And Application Of Short Text Classification Algorithm Based On Deep Learning
9	Research On Short Text Classification Based On Deep Learning
10	The Design And Implement Of A Mongolian Text Classifier Based On Active Learning SVM