Research On Chinese Text Classification Based On Sentence Ranking And Combinational Classification

Posted on:2008-05-27

Degree:Master

Type:Thesis

Country:China

Candidate:J G Lin

Full Text:PDF

GTID:2178360245997842

Subject:Computer Science and Technology

Abstract/Summary:

Text classification (TC) is the task of automatically assigning category labels to natural language texts according to a predefined set of categories. It is an important research topic in computational linguistics and natural language processing (NLP), and one of the most widely used NLP technologies. Early work on text classification was based mainly on knowledge engineering, with rules created manually. With the development of machine learning and statistical natural language processing, machine learning based methods have been widely applied to TC with favorable results.

This thesis builds on the use of machine learning algorithms to classify text automatically and is devoted to methods for improving the performance of TC systems. As groundwork, we implemented an integrated, configurable TC system using KNN and SVM together with various methods for reducing the feature space, and we improve its performance by modifying these TC methods. For each method and corpus, we compared feature selection methods in detail and chose the best one before attempting any improvement, which makes the final improvements more meaningful.

The thesis presents two main improvements. First, since the original text is semi-structured or unstructured, we use an unsupervised graph-based sentence ranking algorithm to rank the sentences. Once the sentences are ranked, we keep the top-ranked sentences, which are more informative than the others, and discard the remaining redundant content, thus minimizing the overlap between documents. Furthermore, we adjust word weights according to the ranks of their sentences, so that the essence of each document is highlighted and the differences between documents increase. We then use the KNN algorithm as the classifier, which gave satisfactory results.
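The sentence ranking and weight adjustment steps can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not name the ranking algorithm, so a TextRank-style weighted PageRank over word-overlap similarity is assumed here. The function names, the whitespace tokenizer (Chinese text would need a word segmenter), the damping factor, and the `keep_ratio` parameter are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: whitespace tokenization; Chinese text needs a word segmenter.
    return sentence.lower().split()


def similarity(a, b):
    """Word overlap normalized by sentence lengths (TextRank-style)."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids a zero denominator, since log(1) == 0
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank on the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(tokens)
    if n == 0:
        return []
    w = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]  # total edge weight leaving each sentence
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                scores[j] * w[j][i] / out[j] for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


def weighted_term_frequencies(sentences, keep_ratio=0.5):
    """Keep the top-ranked sentences and weight each word by its sentence score."""
    scores = rank_sentences(sentences)
    if not scores:
        return Counter()
    keep = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    best = scores[top[0]]
    weights = Counter()
    for i in top:
        for word in tokenize(sentences[i]):
            weights[word] += scores[i] / best  # best sentence contributes 1.0 per word
    return weights


sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock prices fell sharply today",
    "the cat sat near the fish",
]
scores = rank_sentences(sentences)
weights = weighted_term_frequencies(sentences, keep_ratio=0.5)
```

In this toy input the third sentence shares no words with the others, so it receives the lowest score and is dropped when only half of the sentences are kept. The resulting weighted term counts could then feed a KNN classifier in place of raw term frequencies.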
Second, given the differences in performance between the vector space model and Latent Semantic Indexing in text representation and classification, we use a method that combines the vector space model and Latent Semantic Indexing to gain the advantages of both. A support vector machine (SVM) is used in this combinational classification. Finally, we also try combining the KNN algorithm and SVM on top of the combined vector space model and Latent Semantic Indexing, improving system performance while avoiding an excessive increase in system resource usage.
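The abstract does not say how the two representations are combined. One possible reading, shown below purely as an assumption, is late fusion: a classifier trained on vector space model features and another trained on Latent Semantic Indexing features each produce per-category scores, which are normalized and averaged. The function names, the `vsm_weight` parameter, and the example scores are hypothetical.

```python
def normalize(scores):
    """Shift and scale a {label: score} mapping so its values sum to 1."""
    lowest = min(scores.values())
    shifted = {label: s - lowest for label, s in scores.items()}
    total = sum(shifted.values())
    if total == 0:
        # All scores equal: fall back to a uniform distribution.
        return {label: 1.0 / len(scores) for label in scores}
    return {label: s / total for label, s in shifted.items()}


def combine(vsm_scores, lsi_scores, vsm_weight=0.5):
    """Fuse two classifiers' scores by a weighted average (assumed scheme)."""
    a, b = normalize(vsm_scores), normalize(lsi_scores)
    fused = {label: vsm_weight * a[label] + (1 - vsm_weight) * b[label]
             for label in a}
    return max(fused, key=fused.get), fused


# Hypothetical decision scores from two classifiers for one document.
vsm = {"sports": 1.2, "finance": 0.4, "politics": -0.3}
lsi = {"sports": 0.1, "finance": 0.9, "politics": -0.5}

label_equal, fused_equal = combine(vsm, lsi)
label_vsm_heavy, _ = combine(vsm, lsi, vsm_weight=0.8)
```

With equal weights the fused decision follows the Latent Semantic Indexing classifier's stronger preference, while a larger `vsm_weight` shifts the decision toward the vector space model classifier. Other schemes, such as concatenating the two feature vectors before training a single SVM, are equally consistent with the abstract.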

Keywords/Search Tags:

Text Classification, Sentence Ranking, Weight Adjustment, Combinational Classification

Related items

1	Classification Of Chinese Text Subject Classification And Emotion Based On Machine Learning
2	Study And Application Of Deep Features Learning In Sentence-Level Text Classification
3	Chinese Opinion Sentence Extraction Based On SVM Classification
4	Research And Implementation Of Recognition And Classification Algorithm For Sentiment Texts
5	Research Of Weight Algorithm In KNN Text Classification
6	Research On The Method Of Categorizing Emotions In Comment Text Based On Sentence Pattern Rules And Machine Learning
7	The Study Of Chinese Text Classification Based On Web
8	Research On KNN Text Classification
9	Study Of Chinese Text Classification
10	Research And Implementation Of Chinese Text Classification, Feature Selection Method,