Research On Chinese Text Classification Based On Sentence Ranking And Combinational Classification

Posted on: 2008-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: J G Lin
Full Text: PDF
GTID: 2178360245997842
Subject: Computer Science and Technology
Abstract/Summary:
Text classification (TC) is the task of automatically assigning category labels to natural language texts according to a predefined category set. It is an important research subject in computational linguistics and natural language processing (NLP), and one of the most widely used NLP technologies. Early work on text classification was mainly based on knowledge engineering, with rules created manually. With the development of machine learning and statistical natural language processing, machine-learning-based methods have been widely applied to TC with favorable results.
This thesis builds on machine learning algorithms for automatic text classification and is devoted to methods for improving the performance of a TC system. As groundwork, we implemented an integrated, configurable TC system using KNN and SVM with various feature space reduction methods. We then improve performance by modifying these TC methods. For each method and each corpus, we compared feature selection methods in detail and chose the best one before attempting any improvement, which makes the final improvements more meaningful.
The thesis proposes two main improvements. First, since the original text is semi-structured or unstructured, we use an unsupervised graph-based sentence ranking algorithm to rank the sentences. Once the sentences are ranked, we keep the top-ranked sentences, which are more informative than the rest, and discard the remaining redundant content, thus minimizing the overlap between documents. Furthermore, we adjust word weights according to the sentences' ranks, so that the essence of each document is highlighted and the differences between documents increase. We then use the KNN algorithm as the classifier, which gave satisfactory results.
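
The sentence ranking and weight adjustment steps can be illustrated with a minimal sketch. The abstract does not name the ranking algorithm, the sentence similarity measure, or the weighting formula, so everything here is an assumption: a PageRank-style iteration over a sentence graph with bag-of-words cosine edges, a `keep_ratio` cutoff for the top-ranked sentences, and term weights scaled by the normalized sentence score. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_bow(a, b):
    """Cosine similarity between two tokenised sentences (assumed edge weight)."""
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[w] * cb[w] for w in ca)
    den = math.sqrt(sum(c * c for c in ca.values())) * \
        math.sqrt(sum(c * c for c in cb.values()))
    return num / den if den else 0.0


def rank_sentences(sentences, damping=0.85, iters=50):
    """Score sentences with a PageRank-style iteration on the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    # Weighted, undirected graph: no self-loops.
    sim = [[0.0 if i == j else cosine_bow(sentences[i], sentences[j])
            for j in range(n)] for i in range(n)]
    out = [sum(row) for row in sim]  # total edge weight leaving each node
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


def weighted_term_vector(sentences, keep_ratio=0.5):
    """Keep the top-ranked sentences and weight each term by its sentence score."""
    scores = rank_sentences(sentences)
    if not scores:
        return {}
    n_keep = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_keep]
    max_score = max(scores)
    weights = Counter()
    for i in top:
        for word in sentences[i]:
            # Assumed adjustment: each occurrence counts as score / max_score.
            weights[word] += scores[i] / max_score
    return dict(weights)


doc = [
    ["the", "team", "won", "the", "match"],
    ["the", "team", "scored", "a", "goal", "in", "the", "match"],
    ["weather", "was", "sunny"],
    ["fans", "cheered", "the", "team"],
]
vector = weighted_term_vector(doc, keep_ratio=0.5)
```

In this toy document the third sentence shares no words with the others, so it receives only the baseline score and is dropped. The resulting weighted vector could then be passed to a KNN classifier in place of plain term frequencies.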
Second, because the vector space model (VSM) and Latent Semantic Indexing (LSI) perform differently in text representation and classification, we combine the two so as to draw on the advantages of both. A support vector machine (SVM) is used in this combinational classification. Finally, we also try combining the KNN algorithm and SVM on top of the combined VSM and LSI representation, which improves system performance while avoiding an excessive increase in system resource usage.
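
The abstract does not say how the VSM and LSI representations are combined, so the following is only a sketch of one possible scheme: a weighted average of the cosine similarity in the original term space and in a low-rank latent space. To stay dependency-free it swaps the SVM for a nearest-neighbour decision and approximates the LSI basis by power iteration with deflation. The names `lsi_basis`, `predict`, and the mixing weight `alpha` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0


def lsi_basis(X, k, iters=200):
    """Approximate the top-k right singular vectors of the doc-term matrix X."""
    n_terms = len(X[0])
    basis = []
    for _ in range(k):
        v = [1.0 + 0.1 * j for j in range(n_terms)]  # deterministic start vector
        converged = True
        for _ in range(iters):
            # One power-iteration step: w = X^T (X v).
            Xv = [dot(row, v) for row in X]
            w = [sum(X[i][j] * Xv[i] for i in range(len(X))) for j in range(n_terms)]
            # Deflation: remove components along directions already found.
            for b in basis:
                c = dot(w, b)
                w = [wj - c * bj for wj, bj in zip(w, b)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:  # X has fewer than k usable directions
                converged = False
                break
            v = [wj / norm for wj in w]
        if not converged:
            break
        basis.append(v)
    return basis


def project(x, basis):
    """Map a term vector into the latent (LSI) space."""
    return [dot(x, b) for b in basis]


def predict(query, X, labels, k=2, alpha=0.5):
    """Label of the training document with the best combined VSM + LSI score."""
    basis = lsi_basis(X, k)
    q_lsi = project(query, basis)
    best_label, best_score = None, float("-inf")
    for row, label in zip(X, labels):
        score = alpha * cosine(query, row) + \
            (1 - alpha) * cosine(q_lsi, project(row, basis))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy term order: ball, team, goal, stock, market, bank
train = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 1, 2],
]
labels = ["sports", "sports", "finance", "finance"]
sports_guess = predict([0, 1, 1, 0, 0, 0], train, labels)
finance_guess = predict([0, 0, 0, 1, 0, 1], train, labels)
```

Setting `alpha` to 1 reduces this to plain VSM matching and setting it to 0 reduces it to pure LSI matching, which makes it easy to compare the combined score against either representation alone.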
Keywords/Search Tags: Text Classification, Sentence Ranking, Weight Adjustment, Combinational Classification