Research on Hierarchical Chinese Text Classification

Posted on: 2005-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: F Y Xu
Full Text: PDF
GTID: 2168360152967869
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
The exponential growth of online information has raised a new challenge for information retrieval. To make retrieval more efficient, the vast amount of data online should be classified automatically. In flat classification there may be a large number of categories, and building the model takes a long time. To reduce this cost, this thesis presents a hierarchical approach to the automatic classification of Chinese texts, together with improvements to term weighting and dimensionality reduction.

A text's class can be determined from a limited number of ranked terms, so feature selection has a great impact on classification performance. If the features selected for different classes differ greatly, a better classifier can be designed. It is therefore essential to weight important terms effectively. Meanwhile, dimensionality reduction is necessary to shrink the example space and save computation time.

The system takes a new approach to term weighting: it uses a weighting algorithm that combines the traditional IDF (inverse document frequency) with a new distribution information method, and it introduces a new concept, LFHW (Low Frequency but High Weight terms). In addition, a term importance test is proposed, applied after comprehensive term weighting and filtering, so that dimensionality is reduced without affecting classification precision.

The experimental results show that the hierarchical classification algorithm outperforms flat classification in both precision and speed. The improved algorithms for term weighting and dimensionality reduction also prove effective.
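As a rough illustration of the contrast between flat and hierarchical classification described above, the following Python sketch routes a document top-down through a two-level category tree using traditional IDF weighting. It is a minimal sketch under stated assumptions, not the thesis's system: the centroid/cosine classifier, the function names, and the toy data are all hypothetical; the tokens are assumed to be already word-segmented; and the distribution information method, LFHW terms, and term importance test are not reproduced because the abstract does not specify them.

```python
import math
from collections import Counter

def idf(docs):
    """Traditional IDF: log(N / df(t)) for every term t in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(n / c) for t, c in df.items()}

def tfidf(doc, idf_table):
    """Weight each term by term frequency times IDF (unseen terms get 0)."""
    tf = Counter(doc)
    return {t: tf[t] * idf_table.get(t, 0.0) for t in tf}

def centroid(vectors):
    """Average a list of sparse vectors into one class prototype."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(labelled, idf_table):
    """labelled: list of (tokens, label). Returns label -> centroid."""
    groups = {}
    for tokens, label in labelled:
        groups.setdefault(label, []).append(tfidf(tokens, idf_table))
    return {label: centroid(vs) for label, vs in groups.items()}

def predict(model, vec):
    return max(model, key=lambda label: cosine(vec, model[label]))

def train_hierarchy(data):
    """data: list of (tokens, (parent, child)). One model per tree node."""
    idf_table = idf([tokens for tokens, _ in data])
    top = train([(t, path[0]) for t, path in data], idf_table)
    children = {}
    for parent in top:
        subset = [(t, path[1]) for t, path in data if path[0] == parent]
        children[parent] = train(subset, idf_table)
    return idf_table, top, children

def classify(tokens, idf_table, top, children):
    """Pick the parent class first, then compare only that parent's children."""
    vec = tfidf(tokens, idf_table)
    parent = predict(top, vec)
    return parent, predict(children[parent], vec)

# Hypothetical toy corpus (pre-segmented tokens, two-level labels).
data = [
    (["football", "goal", "match"], ("sports", "football")),
    (["basketball", "dunk", "match"], ("sports", "basketball")),
    (["stock", "market", "shares"], ("finance", "stocks")),
    (["bank", "loan", "interest"], ("finance", "banking")),
]
idf_table, top, children = train_hierarchy(data)
result = classify(["goal", "match"], idf_table, top, children)
```

In this sketch each decision compares a document against only the categories at one node of the tree rather than against every leaf category at once, which is the kind of saving a hierarchical scheme is meant to offer over a flat one.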
Keywords/Search Tags: Automatic text classification, hierarchical, feature selection, low frequency but high weight terms, distribution information