Research On Application Of Fuzzy Theory In Text Classification

Posted on: 2012-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: J Lou
Full Text: PDF
GTID: 2178330338497697
Subject: Computer system architecture
Abstract/Summary:
Text classification is the process of assigning a digital text document to one or more predefined semantic categories according to its semantic content. With the growth of the World Wide Web and the rapid development of Internet technology, the increasing volume of digital text data has become more and more difficult to manage, and text classification has therefore gained significant attention. A high-precision method for automatic text classification is needed to reduce the negative impact of this information boom. As a key technology in document mining, information retrieval and web search, automatic text classification plays a significant role in these fields.

The currently dominant text classification approaches are based on machine learning, are mainly combined with statistical theory, and use the statistical properties of text features as their measure. Their core processes include text preprocessing, feature dimension reduction, weight calculation, classifier learning, classification results, and performance evaluation. Analysis reveals that the ambiguity of natural language makes the associations among features difficult to define clearly and to represent explicitly with statistical tools. As a result, fuzzy theory is introduced, and fuzzy concepts are used to represent the semantics of features and the properties of text categories. The classification result is no longer an absolute assignment to a certain category but a membership degree by which the category is decided. This leads to fuzzy text classification.

This thesis proposes a text classification approach based on a fuzzy relationship that represents the category of each semantic unit. The approach conforms to the characteristics of natural language and achieves higher classification precision. By defining membership functions for the term-text relationship and the term-class relationship, the test text and the categories can be represented as fuzzy sets. The membership degree between the test text and each category is then evaluated by computing the correlation coefficient of the fuzzy sets, which yields a fuzzy set of text categories, and the category of the test text is decided using the maximum membership principle.

In practice, a text may belong to more than one category, or its category may not be clearly decidable. Multi-label text classification concerns the determination of categories when one document may belong to more than one category. In this thesis, an improved multi-label text classification approach based on the aforementioned fuzzy relationship is proposed. After a multiple-category vector is used to represent the fuzzy association degree among categories, the scores relating the test text to the categories are recalculated. A heuristic approach that automatically finds a score threshold for each category is also provided. The test text is then labeled with every category whose score passes its threshold.

On a Chinese text classification system platform, the first experiment compares the fuzzy relationship method with a k-NN classifier using several performance indicators; the results show that precision is increased and the classification process is sped up to a considerable degree. In the same system environment, the second experiment shows that the multi-label fuzzy relationship method can obtain the correct categories, which demonstrates that the approach is effective and efficient.
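
The pipeline described in the abstract (fuzzy sets for texts and categories, a correlation score, the maximum membership principle, and per-category thresholds for the multi-label case) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual formulas, so everything concrete here is an assumption made for illustration: the term-text membership is taken as term frequency divided by the maximum term frequency, the term-class membership as the share of a class's training documents containing the term, and the correlation coefficient as a cosine-style ratio. The function names, the toy data, and the fixed thresholds are hypothetical. The recalculation of scores with a category association vector and the automatic threshold heuristic are not reproduced.

```python
from collections import Counter
from math import sqrt


def term_text_membership(tokens):
    """Fuzzy set of a text: term -> membership in [0, 1].
    Assumed form: term frequency divided by the largest term frequency."""
    counts = Counter(tokens)
    if not counts:
        return {}
    peak = max(counts.values())
    return {term: n / peak for term, n in counts.items()}


def term_class_membership(docs_by_class):
    """Fuzzy set of each class: term -> membership in [0, 1].
    Assumed form: share of the class's documents that contain the term."""
    class_sets = {}
    for label, docs in docs_by_class.items():
        doc_freq = Counter(term for doc in docs for term in set(doc))
        class_sets[label] = {term: n / len(docs) for term, n in doc_freq.items()}
    return class_sets


def correlation(a, b):
    """Cosine-style correlation of two fuzzy sets stored as dicts (assumed form)."""
    dot = sum(mu * b.get(term, 0.0) for term, mu in a.items())
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def category_memberships(tokens, class_sets):
    """Fuzzy set of categories for a test text: label -> membership degree."""
    text_set = term_text_membership(tokens)
    return {label: correlation(text_set, cs) for label, cs in class_sets.items()}


def classify_single(scores):
    """Maximum membership principle: pick the category with the highest degree."""
    return max(scores, key=scores.get)


def classify_multi(scores, thresholds):
    """Multi-label decision: keep every category whose score reaches its threshold.
    The thresholds are supplied by hand here, not found by the thesis's heuristic."""
    return sorted(label for label, s in scores.items() if s >= thresholds[label])


# Hypothetical toy training data: tokenized documents grouped by category.
train = {
    "sports": [["match", "team", "goal"], ["team", "coach", "win"]],
    "finance": [["stock", "market", "bank"], ["bank", "loan", "market"]],
}
class_sets = term_class_membership(train)
scores = category_memberships(["team", "goal", "market"], class_sets)

single_label = classify_single(scores)
multi_label = classify_multi(scores, {"sports": 0.3, "finance": 0.3})
```

In this toy setup the test text shares more heavily weighted terms with the sports category, so the single-label decision picks it, while a low enough threshold lets both categories through in the multi-label case. Any other membership function or fuzzy correlation measure could be substituted without changing the overall structure.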
Keywords/Search Tags: Text Classification, Fuzzy Theory, Membership Function, Fuzzy Relationship, Multi-label