Research On Blocking Reduction Strategies In Hierarchical Text Classification

Posted on: 2007-01-01  Degree: Master  Type: Thesis
Country: China  Candidate: X Hu
GTID: 2178360212465576  Subject: Computer software and theory
Abstract/Summary:
The proliferation of electronic information makes it hard for people to find what they want. To organize and manage information better, researchers have proposed text classification (TC), and in particular hierarchical text classification (HTC). Compared with the big-bang approach, the top-down approach adopted by existing hierarchical classification methods makes much better use of the information carried by the category structure. However, it suffers from blocking: a document wrongly rejected by an ancestor category cannot be classified into the correct leaf category. Because blocking lowers classification accuracy, blocking reduction strategies (BRS) have received wide attention in recent years. Building on previous work, this thesis studies in depth the BRS that adopt the threshold reduction approach.
First, the basic concepts of TC and HTC are summarized, and the effect of blocking on classification accuracy is analyzed. According to their characteristics, existing BRS are categorized into three approaches, namely the threshold reduction approach, the multiplicative approach, and the classifier committees approach, and their merits and demerits are analyzed. The differences and relationship between SCut in TC and the threshold reduction method are discussed. On this basis, a beam search based BRS is put forward, which replaces the threshold strategy used in the threshold reduction method with one that has fewer possible parameter values. Experimental results show that it reduces blocking and increases leaf-category recall and the overall system's F1M while keeping precision high.
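To make the notion of blocking concrete, the following is a minimal Python sketch of top-down classification with per-category thresholds. It is not the implementation used in the thesis: the `Node` class, the `classify_top_down` function, the toy hierarchy, and all scores and thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A category in a hypothetical class taxonomy."""
    name: str
    children: list = field(default_factory=list)


def classify_top_down(node, scores, thresholds):
    """Return the leaf categories a document reaches below `node`.

    scores[name] is the local classifier's score for the document and
    thresholds[name] is that classifier's acceptance threshold.
    """
    leaves = []
    for child in node.children:
        if scores[child.name] < thresholds[child.name]:
            # Rejected here: every category in this subtree is blocked.
            continue
        if child.children:
            leaves.extend(classify_top_down(child, scores, thresholds))
        else:
            leaves.append(child.name)
    return leaves


# Toy hierarchy (an assumption): root -> sports -> {soccer, tennis}, root -> finance -> {stocks}
root = Node("root", [
    Node("sports", [Node("soccer"), Node("tennis")]),
    Node("finance", [Node("stocks")]),
])

# Made-up scores for one document whose true leaf category is "soccer".
scores = {"sports": 0.45, "soccer": 0.9, "tennis": 0.1, "finance": 0.2, "stocks": 0.3}

strict = {name: 0.5 for name in scores}   # same threshold everywhere
reduced = dict(strict, sports=0.4)        # lower only the internal node's threshold

blocked = classify_top_down(root, scores, strict)
recovered = classify_top_down(root, scores, reduced)
print(blocked, recovered)
```

With the strict thresholds the document is rejected at "sports" and never reaches "soccer", although the leaf classifier scores it highly. Lowering the threshold of the internal classifier lets it through, which is the basic idea behind the threshold reduction approach. A lower threshold may also admit documents that do not belong to the subtree, so precision can suffer.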
Considering that, once the thresholds of the other classifiers are fixed, a change in the threshold of classifier Ci only influences the accuracy of the categories within its own domain, a predict based BRS is put forward. Experimental results show that it reduces blocking and increases leaf-category recall and the overall system's F1M, although precision decreases slightly. Drawing on the idea of PCut in TC and making use of the score distribution of the linear classifier, a probability density estimation based BRS is put forward. Experimental results show that it reduces blocking and increases leaf-category recall very well, but its low precision makes the overall system's F1M decrease.
After a detailed explanation of the three strategies, the standard hierarchical text classification (SHTC) method and the BRS adopting the threshold reduction approach are evaluated on the Reuters-21578 collection, and the experimental results of the probability density estimation based BRS are analyzed. Finally, an S-test is performed to compare the BRS adopting the threshold reduction approach against SHTC. The results show that, using a similar amount of computing time to the threshold reduction method, the predict based BRS sets the most suitable threshold for each inner classifier; it reduces blocking and also improves the accuracy of more leaf categories. We therefore conclude that the predict based BRS has the most significant effect on reducing blocking and improving accuracy.
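The trade-off between recall, precision, and F1M described above can be illustrated with a short Python sketch. It assumes that F1M denotes the macro-averaged F1 over leaf categories, which the abstract does not state explicitly. The function names and all counts are hypothetical and are not results from the thesis.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one category from its contingency counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(counts):
    """Unweighted mean of the per-category F1 values (assumed meaning of F1M)."""
    return sum(precision_recall_f1(*c)[2] for c in counts.values()) / len(counts)


# Made-up (tp, fp, fn) counts per leaf category, before and after
# lowering a threshold; these are not figures from the thesis.
before = {"soccer": (8, 2, 2), "tennis": (1, 0, 4)}
after = {"soccer": (8, 2, 2), "tennis": (3, 2, 2)}

f1m_before = macro_f1(before)
f1m_after = macro_f1(after)
print(round(f1m_before, 3), round(f1m_after, 3))
```

In this invented example the recall of "tennis" rises from 0.2 to 0.6 while its precision falls from 1.0 to 0.6, and the macro-averaged F1 still improves. If precision falls far enough, the same computation shows F1M decreasing, as reported for the probability density estimation based strategy.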
Keywords/Search Tags: text classification, class taxonomy, blocking, beam search, probability density estimation