Feature Coupling Generalization And Its Application In Text Mining

Posted on: 2012-11-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y P Li
Full Text: PDF
GTID: 1118330365485882
Subject: Computer application technology
Abstract/Summary:
Text mining aims to automatically extract knowledge from plain natural-language text, helping people find useful information in large text corpora accurately and efficiently. With the rapid development of information science and the World Wide Web, text mining has become increasingly useful in practice. In this area, techniques based on supervised machine learning have been used with great success and have achieved good results in many experiments. Feature representation is one of the most important issues in machine learning and has a large impact on the performance of learning systems. However, in traditional supervised learning methods for text mining, the limited amount of training data can lead to a serious data sparseness problem in the feature space, where many low-frequency features cannot be used well because too little information about them is available. To address this problem, I develop a method that aims to convert these ignored features into effective ones so as to improve classification performance. I propose the Feature Coupling Generalization (FCG) framework, which creates new features from raw features based on feature co-occurrences in a large amount of unlabeled data and on the concept hierarchy of the raw features. The new features lead to a more informative and more general representation than the raw features. In this thesis, I discuss various factors that influence the performance of FCG and examine its performance in text mining tasks.
In this work, FCG is applied to three text mining tasks: named entity recognition, relation extraction, and text classification. In each task, I investigate the performance of the classical features and of the features derived from FCG, and examine the contribution of FCG and whether it overcomes the data sparseness problem. The experimental results show that FCG can make good use of the features ignored by supervised learning and improve the performance of classical methods.
In all tasks, FCG can use a huge amount of unlabeled data to generate new features, which is one of its advantages over other semi-supervised learning methods. Interestingly, I find that the new features generated by FCG, used on their own, perform at least as well as the classical features widely used in these tasks, which indicates that FCG provides an alternative approach to feature representation in machine learning. The results also show that the system based on FCG achieves state-of-the-art performance on public challenge datasets.
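The abstract does not give the FCG formulation itself, so the following is only a minimal illustrative sketch of the general idea of deriving new features from co-occurrences in unlabeled text. It assumes a simple setup that is not taken from the thesis: sparse raw features are single words, each is scored against a small set of hypothetical "anchor" words, and the score is a pointwise mutual information (PMI) estimate over documents. The function name `cooccurrence_features` and all parameters are hypothetical, and the concept hierarchy mentioned above is not modeled.

```python
import math
from collections import Counter
from itertools import product


def cooccurrence_features(unlabeled_docs, raw_features, anchor_features):
    """Map each raw feature to a dense vector of PMI scores against anchors.

    Assumption (not from the thesis): features are whitespace-separated
    lowercase tokens, and co-occurrence means "appear in the same document".
    """
    raw_set = set(raw_features)
    anchor_set = set(anchor_features)
    n_docs = len(unlabeled_docs)

    single = Counter()  # document frequency of each raw/anchor feature
    pair = Counter()    # document frequency of each (raw, anchor) pair

    for doc in unlabeled_docs:
        tokens = set(doc.lower().split())
        present_raw = tokens & raw_set
        present_anchor = tokens & anchor_set
        single.update(present_raw | present_anchor)
        pair.update(product(present_raw, present_anchor))

    generalized = {}
    for r in raw_features:
        vector = []
        for a in anchor_features:
            joint = pair[(r, a)]
            if joint == 0:
                # Simplification: unseen pairs get 0.0 rather than -infinity.
                vector.append(0.0)
            else:
                vector.append(math.log(joint * n_docs / (single[r] * single[a])))
        generalized[r] = vector
    return generalized


docs = ["kinase binds protein", "kinase protein", "apple fruit", "pear fruit"]
features = cooccurrence_features(docs, ["kinase", "apple"], ["protein", "fruit"])
```

In this toy setup, a word seen only a few times in labeled data still receives a dense vector estimated from the unlabeled corpus, which a downstream classifier could use in place of (or alongside) the sparse raw feature.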
Keywords/Search Tags:Text Mining, Machine Learning, Feature, Named Entity Recognition, Relation Extraction, Text Classification