
Studies On Key Techniques Of Text Classification And Mining For Specific Domains

Posted on: 2010-12-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M H Hu
GTID: 1118360302477433
Subject: Computer software and theory
Abstract/Summary:
Large amounts of information are now stored and presented as electronic text, and this text needs to be organized and managed effectively and efficiently. Research on text classification has therefore attracted growing attention, and several recent breakthroughs have been applied in many different fields. This dissertation investigates the key techniques of text classification, analyzes the issues in government document classification, text sentiment analysis and patent mining, applies text classification techniques to these fields, and presents corresponding proposals, which are supported by a large amount of experimental data. The major work is summarized as follows.

(1) The assumption that words are mutually independent is widely used in text processing. Although it greatly simplifies processing, it does not hold in most cases. This dissertation applies independent component analysis (ICA) to text classification for the first time, extracts independent features for classification, and addresses the instability and slow convergence caused by the high dimensionality of the feature space and the sparsity of text data. The experimental results show that combining ICA with traditional feature selection methods leads to a significant improvement in text classification performance.

(2) Most government documents list keywords, and these keywords carry a large amount of category information that should be fully exploited for text classification. This dissertation employs the SKG model, which represents a document by conditional probabilities over the keyword space and thereby lowers the dimensionality of the text feature space. To address the shortage of keywords, this dissertation extends the keyword set of government documents with the KWB model, which automatically acquires words related to the keywords within a bootstrapping learning framework. The experiments show that this method makes full use of the classification information carried by the keywords and, as a result, improves classification performance.

(3) A framework that uses three kinds of training data to train the classifier is proposed to improve the accuracy of subjective sentence classification. Experiments on the public MPQA corpus show that this framework effectively improves the accuracy of subjective sentence judgment, even when the corpus contains very few indirect subjective sentences. Furthermore, an analysis technique based on weakly supervised learning is proposed to address insufficient training data and the difficulties of multi-entity sentiment analysis, and to perform entity feature identification and multi-entity polarity analysis at the sentence level. The experiments show that the accuracy is acceptable.

(4) After an intensive study of the NTCIR-7 patent mining training data and the key issues of the task, the kNN classifier is concluded to be the better choice for patent mining. A penalty factor is added to the ranking method to deal with the extremely imbalanced distribution of samples. Many similarity calculation methods are studied, and several ranking decision methods, such as Weak, NVote and WeakAver, are proposed or improved. System performance improves greatly when the log-linear and Rank-SVM models from machine learning are applied to fuse several systems into the final result list. The system ranked first in the NTCIR-7 evaluation.

In summary, through intensive theoretical analysis and a large number of experiments on key issues in text classification, such as feature extraction, classifier fusion and imbalanced sample distribution, this dissertation presents a series of new text classification methods that are supported by the experimental results. The algorithms and models presented here will be valuable for future studies in text classification and other areas of text processing.
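To make point (4) more concrete, the following is a minimal sketch of how a kNN ranking score might be combined with a penalty for over-represented categories. It is not the dissertation's method: the abstract does not give the form of the penalty factor, so the choice of dividing the summed neighbour similarity by the category size raised to a power `alpha` is an assumption made for illustration, as are the function names `cosine` and `rank_categories` and the whitespace tokenization.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_categories(query, training, k=3, alpha=0.5):
    """Rank categories for `query` using its k nearest training documents.

    `training` is a list of (text, label) pairs. Each neighbour votes for
    its label with its similarity to the query. The summed vote is divided
    by (category size) ** alpha. This penalty is a hypothetical choice
    (an assumption, not taken from the source) that damps categories
    with many training samples; alpha = 0 disables it.
    """
    q = Counter(query.split())
    sizes = Counter(label for _, label in training)

    # Score every training document and keep the k most similar ones.
    neighbours = sorted(
        ((cosine(q, Counter(text.split())), label) for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Accumulate similarity-weighted votes per category.
    votes = defaultdict(float)
    for sim, label in neighbours:
        votes[label] += sim

    # Apply the size-based penalty and sort categories by final score.
    penalised = {c: v / (sizes[c] ** alpha) for c, v in votes.items()}
    return sorted(penalised.items(), key=lambda item: item[1], reverse=True)
```

With `alpha=0` the function reduces to plain similarity-weighted kNN voting, so large categories tend to dominate the ranking simply because they supply more neighbours. Raising `alpha` shifts the ranking toward rare categories, which is the general effect the abstract attributes to its penalty factor.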
Keywords/Search Tags:text classification, independent feature extraction, Bootstrapping, text feature space conversion, entities identification and polarity analysis, patent mining