Research On Short Text Data Stream Classification Based On Feature Extension And Selection

Posted on: 2020-11-22
Degree: Master
Type: Thesis
Country: China
Candidate: L He
Full Text: PDF
GTID: 2428330575496924
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet and social media, large volumes of short text streams, such as news and Weibo posts, have become popular on social platforms. Unlike ordinary texts, these data are characterized by short length, weak signal, high volume, high velocity, concept drift, and the emergence of novel classes. Short text stream classification is hence a challenging and significant task for mining valuable information from data streams. Motivated by this, this dissertation focuses on the issues of high-dimensional and sparse features, concept drift, and novel class emergence in the classification of short text data streams. The main contributions are as follows:

(1) To handle high-dimensional features and concept drift, we propose a new feature extension approach for short text stream classification with the help of a large-scale semantic network obtained from a web corpus. First, additional semantic context based on the senses of the terms in short texts is drawn from the open semantic network to compensate for data sparsity. Then, all terms are disambiguated by their semantics to reduce the impact of noise. Finally, a concept-cluster-based concept drift detection method is proposed to track hidden concept drifts. Extensive studies show that, compared with several well-known concept drift detection methods for data streams, our approach detects concept drifts effectively, and that it handles short text streams effectively while maintaining efficiency compared with several state-of-the-art short text classification approaches.

(2) To handle sparse features and emerging novel labels, we propose a new short text stream classification approach based on feature selection and novel class detection. The proposed approach builds on feature extension with Probase. It then uses the Max-Relevance and Min-Redundancy (MRMR) mechanism to select an optimal feature subspace free of irrelevant and redundant features. Feature selection is thus carried out with respect to this new feature space. Finally, a novel label detection method is introduced to detect novel classes emerging in short text streams. Extensive experiments show that the proposed method detects novel classes in short text streams effectively and achieves better classification performance.
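As an illustration of the MRMR step mentioned in contribution (2), the following is a minimal sketch of the generic greedy MRMR criterion (relevance minus mean redundancy, both measured by empirical mutual information). It is not the thesis's implementation: the function names `mutual_information` and `mrmr_select`, the binary term-presence encoding, and the toy data are all assumptions made for this example.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy.

    features: dict mapping a feature name to its list of discrete values
    labels:   list of class labels, aligned with the feature values
    """
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected = []
    remaining = list(features)  # keeps insertion order for stable tie-breaking

    while remaining and len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 8 short texts, two classes, binary term presence.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "goal":   [1, 1, 1, 0, 0, 0, 0, 0],  # relevant to class 0
    "goals":  [1, 1, 1, 0, 0, 0, 0, 0],  # exact duplicate of "goal" (redundant)
    "market": [0, 0, 0, 0, 1, 1, 0, 0],  # weaker, but complementary
    "the":    [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}

selected = mrmr_select(toy_features, toy_labels, k=2)
print(selected)
```

On this toy data, ranking by relevance alone would keep both "goal" and its duplicate "goals", whereas the greedy MRMR score penalizes the duplicate and selects "market" as the second feature.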
Keywords/Search Tags: Short text streams, Feature extension, Feature selection, Concept drift, Novel label emerging