
Study On Key Techniques Of Text Content Classification And Topic Tracking

Posted on: 2009-12-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Z Wang
GTID: 1118360308478445
Subject: Computer applications
Abstract/Summary:
Nowadays the Internet has become an important tool for quickly acquiring and exchanging information, but it also brings new challenges: people urgently need efficient and accurate techniques to help them process the vast amount of information available. Text information processing technologies such as information retrieval, information filtering and classification, and topic detection and tracking have therefore emerged and are receiving increasing attention. Research on text content classification and topic tracking is currently a hotspot in natural language processing. Different applications and demands usually require deep analysis and processing of texts. In this dissertation we study the crucial problems of text content classification and topic tracking and propose corresponding solutions. Extensive experiments demonstrate the effectiveness of these solutions. Our main contributions are as follows.

(1) To improve the performance of text classification, we studied the ability of features to distinguish between categories and proposed a feature selection method based on discrimination capability. The method uses overall-divergence to measure how well each feature distinguishes between categories, and selects the features with good discrimination capability so as to enhance the discrimination capability of the classifier. Experimental results show that the proposed method significantly improves text classification performance on a confusion data set. On a commonly used data set, it achieves performance better than or comparable to the best feature selection method.

(2) To further improve the performance of text classification, we studied confusion class recognition to address the problem of confusion classes in text classification.
First, we proposed a confusion class recognition technique based on Classification Error Distribution (CED), which identifies the set of confusion classes among the predefined classes. To classify texts belonging to confusion classes effectively, we constructed a confusion class classifier with good discrimination capability, based on the overall-divergence feature selection method. We then designed and implemented a Two-Stage Classifier that integrates the initial classifier with the confusion class classifier; the results output by the two stages are combined to form the final output. Experimental results on the Newsgroup and 863 Chinese evaluation corpora show that the confusion class recognition and discrimination techniques significantly improve classification performance under the single-label, multi-class classification framework.

(3) We studied the key technologies in spam filtering. First, we looked for an efficient filtering algorithm with low computational cost and high speed. Second, because the characteristics of spam content may change quickly, we studied spam filtering with feedback and self-adaptive capabilities. We proposed a spam filtering technique based on two-layer content analysis, and designed and implemented the corresponding spam filter. The first layer performs fast content filtering, in which a Naive Bayesian classifier filters incoming emails. Suspected spam is then forwarded to the second layer, where a second-level content filtering module analyzes it further. To cope with rapidly changing spam content, we also proposed a spam filtering technique based on feedback learning and adaptive learning, and applied it to the preliminary-hearing/review collaborative spam filtering framework.
The system achieves good filtering performance on real-world corpora and in a real-time network environment.

(4) In topic tracking, a predefined topic often lacks a concrete and accurate description. To address this difficulty, we studied topic representation methods and proposed a multi-vector model. The model represents a text with multiple vectors, extracting important features from the text into a vector of their own, so as to improve the performance of Chinese topic tracking. Because named entities are very important for representing text content, a separate vector of named entities is extracted and used in topic tracking. Experimental results on the TDT4 Chinese corpus show that the multi-vector model improves the performance of the topic tracking system.

(5) For the topic-drifting phenomenon in the topic tracking task, we analyzed its causes and characteristics and proposed a time adaptive boosting model, which is based on the idea of adaptive boosting. We also proposed an adaptive technique based on active learning, which uses a stream-based active learning framework. Both methods are unsupervised, and both improve the adaptive learning ability of the topic model, in which the feature weights are tuned simultaneously. Based on the temporal characteristics of topics, the concept of a time factor is introduced into the tracking system. Experiments on the TDT4 Chinese corpus show that the two techniques partly solve the topic-drifting problem and improve topic tracking performance.

Currently, most text processing techniques based on content analysis assume that features are independent. Such assumptions do not hold in practical situations. In contrast, a Bayesian network makes only conditional independence assumptions, so interdependence information among features can be incorporated into the learning process.
In the following, we examine the applications of Bayesian networks to text categorization, information filtering, and topic tracking.
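
The two-layer spam filter of contribution (3), and the feature independence assumption discussed above, can be illustrated with a small sketch. This is not the dissertation's implementation: the class `NaiveBayesFilter`, the function `two_layer_filter`, the `suspect_threshold` parameter and the add-one smoothing are all assumptions made for illustration, and the second layer is left as a hypothetical callable because the abstract does not describe it.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocab = set()

    def train(self, tokens, label):
        """Add one labelled message; label must be 'spam' or 'ham'."""
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def spam_probability(self, tokens):
        """Return the estimated P(spam | tokens)."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            # The extra +1 reserves probability mass for unseen tokens.
            denom = total_words + len(self.vocab) + 1
            for tok in tokens:
                # Feature independence assumption: per-token likelihoods
                # are simply multiplied (added in log space).
                log_p += math.log((self.word_counts[label][tok] + 1) / denom)
            log_scores[label] = log_p
        # Normalise in log space to avoid underflow.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


def two_layer_filter(tokens, first_layer, second_layer, suspect_threshold=0.5):
    """Layer 1 is a fast Naive Bayes check; suspected spam goes to layer 2.

    second_layer is a hypothetical callable that takes the tokens and
    returns 'spam' or 'ham'. suspect_threshold is an assumed parameter.
    """
    if first_layer.spam_probability(tokens) < suspect_threshold:
        return "ham"
    return second_layer(tokens)
```

Under these assumptions, only messages the first layer finds suspicious incur the cost of the second analysis. Calling `train` again with user-corrected labels would be one simple way to approximate feedback learning, although the abstract does not say how the original system does this.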
Keywords/Search Tags: text content classification, topic tracking, spam filtering, feature selection, confusion class recognition, multi-vector model, topic drift