
Research On Text Classification And Its Related Technologies

Posted on: 2006-09-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R L Li
Full Text: PDF
GTID: 1118360155960406
Subject: Computer software and theory
Abstract/Summary:
With the rapid development and spread of the Internet, the volume of electronic text has grown enormously. Organizing and processing large amounts of document data, and finding the information a user is interested in quickly, accurately and completely, is a major challenge for information science and technology. As a key technology for organizing and processing large amounts of document data, text classification can largely resolve the problem of information disorder and helps users find the information they need quickly. Moreover, text classification has broad application prospects as the technical basis of information filtering, information retrieval, search engines, text databases, digital libraries and so on.

This dissertation studies text classification and its related technologies. Several methods and techniques are presented with the aim of improving speed, precision and stability. It also studies text genre classification, a new research field within text classification, and information filtering, an important application of text classification. The primary contributions are as follows.

(1) Selection of Training Samples
The selection of training samples has an important influence on classifier performance. Using atypical samples not only increases training time but also tends to introduce noise into the training set. The dissertation analyzes what constitutes a typical sample for KNN and presents a density-based sample selection method. The number of samples in the ε-neighborhood of a given sample is used to estimate the density of the region surrounding it, and the number of classes in that ε-neighborhood is used to judge whether the sample lies near a class border. The number of atypical samples is reduced by thinning out samples in high-density regions, while samples near class borders are retained in order to preserve classifier precision.

(2) Research on Chinese Text Classification Based on the Maximum Entropy Model
There are many differences between Chinese and English text classification, so the classification results also differ. The difference is especially marked for the maximum entropy model, because the entropy of Chinese is higher than that of English. Two methods of generating Chinese text features are used: word segmentation and n-grams. Absolute discounting is adopted to smooth the feature probabilities. The maximum entropy model is compared with Naive Bayes, KNN and SVM. Experimental results show that the maximum entropy model is not stable enough, so bagging is used to improve its stability.

(3) Using Hierarchical Classification to Improve the Performance of Flat...
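
As an illustration of the density-based sample selection outlined in item (1), the following is a minimal Python sketch, not the dissertation's implementation. It assumes Euclidean distance, a single greedy pass over the samples, and a threshold named `max_density`; these choices and both function names are hypothetical, since the abstract does not specify them.

```python
import math


def eps_neighbors(i, points, eps, active):
    """Indices of still-active samples within distance eps of sample i."""
    return [j for j in active
            if j != i and math.dist(points[i], points[j]) <= eps]


def select_samples(points, labels, eps, max_density):
    """Thin out dense single-class regions and keep class-border samples.

    Assumption: a sample is a border sample if its eps-neighborhood
    contains more than one class, and a region is "high density" if the
    neighborhood holds more than max_density other samples.
    """
    active = set(range(len(points)))
    for i in range(len(points)):
        neigh = eps_neighbors(i, points, eps, active)
        classes = {labels[j] for j in neigh} | {labels[i]}
        if len(classes) > 1:
            continue  # near a class border: always keep
        if len(neigh) > max_density:
            active.discard(i)  # interior sample in a dense region: drop
    return sorted(active)


# Toy data: a dense cluster of class "a" plus one a/b border pair.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (1.0, 0.0), (1.1, 0.0)]
labels = ["a", "a", "a", "a", "a", "b"]
kept = select_samples(points, labels, eps=0.3, max_density=1)
print(kept)
```

Density is counted only among the samples still retained, so a dense cluster is thinned rather than removed entirely. Which samples survive depends on the processing order, which is a simplification of this sketch.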
Keywords/Search Tags:Text Classification, Text Genre Classification, Information Filtering, Samples Selection, Maximum Entropy Model, Hierarchical Classification, N-Gram