Font Size: a A A

Research On Classification For Text With Natural Group

Posted on:2014-07-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:M LuFull Text:PDF
GTID:1268330425485894Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Text Categorization is a hot research problem in information retrieval and data mining. There are many famous applications of text categorization, such as email spam filtering, classification of blog posts and homepage finding. The target of text categorization is to classify a document into several predefined categories. Nowadays, machine learning is the key technique for automatic text categorization, which focuses on learning a classifier from the training data, and utilizes the learned classifier to make predictions for the test data.In many real-world applications, the data can usually be divided into several groups. The essence of the groups is the physical division of the data. It is also worth noting that the way to group data is intuitive and easily obtained without manual efforts, such as clustering and classifying. For instance, in the task of email spam filtering, the emails can be naturally grouped by email accounts. In the practical applications, the grouped data has the following characteristics:different text characteristic for different groups, different number of samples in each group, and the imbalanced data in some groups. However, the traditional methods ignore the above characteristics of grouped data, which exerts adverse effect on their task performance.To tackle the above problem, the paper conducts research on the naturally grouped data for text categorization, the target of which is to utilize the characteristics of grouped data to boost the classification performance. Note that the paper does not investigate the reason for data naturally grouping, leave alone discussing the issue of selecting the best one among many group attributes. The main contribution of the paper is listed as follows:Firstly, to make use of different data characteristic for different groups, the paper proposes an ensemble algorithm based on cooperative learning for text categorization, referred to as GroupEnsemble. To avoid the ignorance of the inductive information from some groups, the GroupEnsemble incorporates the inductive information from all groups by aggregating the group-specific models learned from all groups. To learn a generalized group-specific model for each group, the paper proposes a co-training method. The key hypothesis of such method is that all group-specific models share some part of model parameters. Under such setting, all group-specific models are learned from not only the group itself but also with the assistance of other groups. After learning these group-specific models, the label of the test data will be predicted by aggregating the probability of the label outputed by all group-specific models.Secondly, by incorporating the group correlation into GroupEnsemble, the paper proposes another ensemble method, referred to as Group-sim-Ensemble. The Group-sim-Ensemble algorithm embodies the group correlation into the loss function to guarantee that similar groups will have similar group-specific models. It seems that a more generalized group-specific will be learned for each group. Since there is no exisiting work to calculate the group correlation in advance, the paper treats the group correlation as parameters, which should be automatically learned from the data. At that moment, the parameters of these group-specific models and group correlation should be simultaneously learned from the same objective function. The paper utilizes the iteration optimization strategy to learn these parameters.Thirdly, to make full use of test data naturally grouping, the paper proposes another approach, referred to as Group-test-Ensemble. For each test data, the proposed algorithm assigns the most suitable combinational coefficients to aggregate these group-specific models. In order to assign large combinatiaonal weights for the similar groups of test data, the paper models the task of learning combinational coefficients as a ranking problem of emphasizing the top ranking order. The formulated ranking problem is solved by the cost-sensitive listwise ranking approach.To evaluate the effectiveness of the proposed algorithm, the proposed algorithm is applied to many text categorization tasks, such as email spam filtering, homepage finding and document retrieval. Experimental results on several benchmark datasets indicate that the proposed algorithm outperformed the baselines; especially the approaches which do not consider the characteristics of grouped data Apart from text categorization, our approach can be also applied to other tasks with grouped data.
Keywords/Search Tags:Text categorization, naturally grouped data, cooperative learning, groupcorrelation, ensemble learning
PDF Full Text Request
Related items