
Automatic Chinese Text Categorization Based On Associate Rules

Posted on: 2008-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: K Yang
GTID: 2178360215490935
Subject: Computer system architecture
Abstract/Summary:
With the rapid development and spread of the Internet, the volume of electronic information keeps growing. In information science and technology, finding latent and interesting information and knowledge quickly, accurately and completely is a broad problem. Data mining (DM) technology is one way to address it. Research on DM, which covers association analysis, categorization analysis, cluster analysis, trend analysis, etc., has been pursued in depth since the concept was proposed in the 1990s. Structured data such as relational databases is the main object of DM research; in practice, however, most information exists as unstructured data. Mining unstructured information successfully is therefore becoming a new challenge.

Text is the most widely used form among common unstructured data such as text, images and video. It appears in digital libraries, product catalogs, newsgroups, medical reports, and organizational or personal homepages, and is widely applied in natural language understanding, text summarization, information extraction, information filtering, information retrieval, etc. It therefore has enormous commercial value.

This paper deals with Chinese text association categorization, taking text data as its research object. It mainly concerns text feature extraction, feature selection, feature vector space representation, text association analysis, and text association categorization. The author proposes a more efficient algorithm. The primary work includes:

① Research on feature selection and vector space representation
Currently, common methods of text representation include the Boolean matrix, the term frequency matrix, etc. Representing text vectors with a Boolean matrix has the advantages of a concise representation and high computational efficiency; its weak point is that it records only whether a feature appears, which leads to inaccuracy.
The term frequency matrix, in contrast, gives a more accurate representation, but it is less simple and requires more computation to form the vector space. This paper presents a more accurate approach that uses a feature weight threshold to form the text vector space, which improves the quality of Chinese text categorization.

② Research on text association mining
In text association mining, documents are always highly sparse, so traditional association mining algorithms such as Apriori are inefficient, and FP-growth recurses frequently. Moreover, traditional association mining requires the user to assign the minimum support threshold; in text association mining this takes repeated experiments, so the threshold is hard to determine. To address these weaknesses, this paper proposes the DL-COFI algorithm, which combines the compressed COFI-tree structure with dynamic adjustment: it determines the value L dynamically according to the scale of the training texts and uses the COFI algorithm for mining.

③ Improvement of the rule pruning and classification strategy
Traditional algorithms such as CBA and ARC fall short in pruning strategy and in categorization prediction. In pruning, many algorithms fail to reach a satisfactory result. This paper combines the advantages of two common pruning strategies and proposes the super-rule-J-Measure algorithm. In categorization prediction, CBA considers only the single most suitable rule, ARC considers only the sum of the confidences of the rules that cover documents of a class, and the method in [20] does not consider the influence of confidence and support. We therefore propose a CDD algorithm, which takes both influence factors into consideration.

Finally, compared with traditional algorithms, the proposed algorithm achieves better precision, recall and F1, which improves the quality and efficiency of categorization.
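
The trade-off between the Boolean and term frequency representations described in item ① can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the toy documents, and in particular `thresholded_vector` (which keeps a feature only when its relative frequency reaches an assumed `min_weight`) are hypothetical, since the abstract does not specify how the feature weight threshold is defined. Documents are assumed to be already segmented into terms, which Chinese text would require as a separate step.

```python
from collections import Counter


def build_vocabulary(docs):
    """Return a sorted list of all distinct terms in the tokenized documents."""
    return sorted({term for doc in docs for term in doc})


def boolean_vector(doc, vocab):
    """Boolean representation: 1 if the term appears in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def term_frequency_vector(doc, vocab):
    """Term frequency representation: raw count of each term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def thresholded_vector(doc, vocab, min_weight):
    """Hypothetical weight threshold: keep a term only if its relative
    frequency in the document is at least min_weight (an assumed weighting,
    not the one defined in the thesis)."""
    counts = Counter(doc)
    total = len(doc)
    if total == 0:
        return [0] * len(vocab)
    return [1 if counts[term] / total >= min_weight else 0 for term in vocab]


# Toy, pre-segmented documents (illustrative data only).
docs = [
    ["stock", "market", "stock", "price"],
    ["football", "match", "price"],
]

vocab = build_vocabulary(docs)
boolean_matrix = [boolean_vector(doc, vocab) for doc in docs]
tf_matrix = [term_frequency_vector(doc, vocab) for doc in docs]
thresholded_matrix = [thresholded_vector(doc, vocab, 0.3) for doc in docs]
```

In this sketch the Boolean row for the first document cannot distinguish "stock" (two occurrences) from "market" (one), while the term frequency row can; the thresholded row stays binary but drops terms whose weight falls below the chosen cutoff.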
Keywords/Search Tags: Text Mining, Feature Vector, Text Association Classification, Association Analysis