Text Mining Technology And Its Application In The Integrated Risk Information Network

Posted on: 2012-12-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Zhang
GTID: 1118330332994115
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology and the exponential growth of electronic text, finding useful knowledge in large amounts of data has become an important topic in data mining. This thesis is based on the National Science and Technology Planning Project of the "11th Five-year" Plan named "Key technology research and demonstration of Integrated Risk Guardians (No.2006BAD20B02)". To achieve intelligent acquisition and classification of Integrated Risk Information, several key text mining technologies are studied: the representation model, feature selection, text classification, and text association. On that basis, exploratory research is carried out that takes the characteristics of Integrated Risk Information into account. The main contributions are summarized as follows:

1. A representation model of Integrated Risk Information is proposed. The tf*idf weighting method based on the vector space model is analyzed first. To address its shortcoming of ignoring how features are distributed among classes, and treating Integrated Risk Information as web information, a weighting method is then proposed that jointly considers feature term frequency, inverse document frequency, the category weight of feature terms, and HTML tags. Experiments show that this method improves the performance of text categorization.

2. A text feature selection method based on the ReliefF algorithm and the RMI evaluation function is proposed. Traditional feature selection methods in text mining neglect the relevance between features, which leaves many redundant features in the selected subsets. To address this, a combined feature selection method is designed: irrelevant features are first removed by the ReliefF algorithm, and redundant features are then filtered out by the RMI evaluation function. Experiments show that this method removes redundant text features more effectively than the traditional methods.

3. A text classifier based on confidence attribute bagging is proposed. To address the problem that the weak classifiers in Bagging all carry the same weight, an improved Bagging algorithm is developed. The algorithm obtains additional training sets by re-sampling the attributes of the samples. A classification weight is calculated for each weak classifier, which is based on kNN, and the ensemble classification result is obtained by voting rules. The algorithm is used to design a text classifier that performs better than the Attribute Bagging algorithm.

4. A key-phrase extraction method based on gray relational analysis is proposed. The gray relational degree between a given key-phrase and the feature words is computed, and key-phrases are extracted on that basis. The main advantage of this method is that it applies equally to large and small numbers of samples, regardless of whether the samples are regular. It can therefore solve the problem that key-phrase extraction methods based on mathematical statistics ignore the contribution of low-frequency professional words.

5. The proposed algorithms are applied to the Integrated Risk Information Network. Based on focused crawler technology, the intelligent collection and classification of Integrated Risk Information is implemented and achieves better performance.
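The gray relational (associate) analysis named in contribution 4 can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so the code below uses the generic gray relational grade from gray system theory, not the author's exact method. The function name `gray_relational_grades`, the sample profiles, and the labels `word_a` and `word_b` are hypothetical, and the sequences are assumed to be already normalized to a comparable scale.

```python
from typing import List, Sequence


def gray_relational_grades(
    reference: Sequence[float],
    candidates: Sequence[Sequence[float]],
    rho: float = 0.5,
) -> List[float]:
    """Return the gray relational grade of each candidate against the reference.

    rho is the distinguishing coefficient; 0.5 is the conventional default.
    """
    if not candidates:
        return []
    if len(reference) == 0:
        raise ValueError("sequences must be non-empty")
    if any(len(c) != len(reference) for c in candidates):
        raise ValueError("every candidate must match the reference length")

    # Pointwise absolute differences between each candidate and the reference.
    deltas = [[abs(r - x) for r, x in zip(reference, c)] for c in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)

    # All candidates coincide with the reference: every grade is maximal.
    if d_max == 0:
        return [1.0] * len(candidates)

    grades = []
    for row in deltas:
        # Gray relational coefficient at each point, then its mean (the grade).
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades


# Hypothetical normalized profiles, for illustration only (not thesis data).
key_phrase = [0.9, 0.8, 0.7, 0.9]
feature_words = {
    "word_a": [0.8, 0.8, 0.6, 0.9],
    "word_b": [0.2, 0.5, 0.1, 0.3],
}
grades = gray_relational_grades(key_phrase, list(feature_words.values()))
ranked = sorted(zip(feature_words, grades), key=lambda p: p[1], reverse=True)
```

Here `ranked` orders the feature words by how closely their profiles track the key-phrase profile. Because the grade depends only on pointwise differences and not on a fitted distribution, it can be computed for a handful of samples as readily as for many, which is consistent with the advantage claimed in the abstract.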
Keywords/Search Tags: Integrated Risk Information, Representation model, Feature selection, Classifier, Key-phrase extraction