Research On Key Techniques And Applications In Text Classification

Posted on: 2016-04-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Z Feng
Full Text: PDF
GTID: 1108330482454740
Subject: Computer application technology
Abstract/Summary:
With the development of Internet technology, the number of electronic text documents keeps increasing. Organizing and managing these documents purely by hand consumes a great deal of manpower and time and is difficult to achieve in practice. Automatic text classification, a key technology for organizing and processing text data and a basic function of text information mining, has therefore become particularly important. It has attracted extensive attention from scholars and has very broad application prospects.

Text classification is now widely used in many fields, such as information filtering, spam classification, search engines, query intent prediction, topic detection and tracking, and text corpus construction. With it, users can easily locate the information they need, and the problem of disorganized data can be addressed. Automatic text classification is therefore increasingly tied to people’s work and life. The new challenge it faces is how to improve classification accuracy while meeting personalized requirements.

In this dissertation, we study the basic knowledge and relevant technologies of text classification and analyze the open issues in current text classification. To improve the speed, accuracy and personalization of text classification, we first propose a feature selection method that reduces the dimensionality of the feature space. Then, for several typical applications of text classification, we use user interest information obtained by analysis and mining to perform spam classification, gender classification and query intent classification. The research and innovation work comprises the following four points:

1. A novel feature selection method based on random walk and artificial bee colony

A feature selection method based on the random walk algorithm (called RWFS) is proposed, which reduces the dimension of the feature space without sacrificing classifier performance. Because many classifiers cannot handle high-dimensional feature spaces, noisy, irrelevant and redundant information must be filtered from the original feature space. First, an optimal feature selection method (called OPFS) is used to select some features from the training set. Second, redundant features are filtered by combining the random walk algorithm with a pre-determined threshold. Moreover, to search for the optimal threshold, an improved artificial bee colony method (called IMABC) is proposed for parameter optimization. In the experiments, classifiers are evaluated on four corpora: mini newsgroups, 20-Newsgroups, Reuters-21578 and WebKB. The experimental results show that the proposed method is superior to six typical feature selection methods and can greatly reduce the dimension of the vector space while maintaining classification accuracy as measured by the F1 measure.

2. Spam classification method based on active and incremental learning

To meet the personalized needs of users, we study a typical binary text classification problem, spam identification. To improve classification speed and satisfy personalized needs without seriously sacrificing email classification accuracy, the concept of term frequency based interest sets is introduced. Emails are classified by combining term frequency based interest sets with a Naïve Bayes classifier. A novel boundary density based method for evaluating email classification certainty is proposed to select emails, which are then recommended to users for labeling according to active learning theory. Based on incremental learning theory, the emails that are labeled and classified with the greatest probability are used for retraining. We carried out experiments on two common corpora: Trec2007 and Enron-spam. Compared with six typical active learning based incremental learning methods, the proposed method greatly reduces email classification time while maintaining accuracy.

3. A novel clustering based gender classification method for text authors

Because obtaining training samples for text classification is always difficult and the burden of manual labeling is great, a novel clustering based method for classifying the gender of text authors is proposed. First, the unlabeled sample set is clustered and the cluster centroids are labeled. A clustering certainty factor is proposed, and the samples that are clustered most uncertainly are identified by combining cluster radius information and recommended to experts for labeling. Second, document structural features, document content features and author interest features are used to represent samples. Finally, the sequential minimal optimization algorithm is used to train on the samples and identify the author gender of a new sample. Comparative experiments indicate that the proposed clustering certainty evaluation method handles well the problem that the categories of boundary samples in different clusters are uncertain. Moreover, author interest features combined with hypernyms significantly improve the accuracy of gender classification.

4. A novel query intent identification method based on user interest

After studying the basic theories and applications of the text classification field, we apply text classification technologies to Web text classification. We propose a novel query intent identification method based on a user interest model, which identifies the user’s query intent and realizes personalized and intelligent retrieval by mining user interest. First, the initial category set is defined as the set of user interest categories, which is pre-determined with reference to the Open Directory Project (ODP). Second, based on the classification of the web pages browsed by the user, user interest degrees are calculated and user interest models are constructed. At the same time, the search log corresponding to a query is clustered to obtain all sub-intents. Finally, the optimal query intent is extracted from the sub-intents by using the user interest model. Comparative experiments show that the proposed user interest model identifies user interest accurately and distinguishes the user’s preferences for different interest categories. Moreover, the algorithm can be applied effectively to intent identification and yields more personalized results.
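The final step of the query intent method, choosing one sub-intent with the help of the user interest model, can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the interest degree is taken to be the share of browsed pages in each category, each sub-intent is represented as a weighted set of categories, and the function names `interest_degrees` and `pick_intent` are hypothetical. It is not the dissertation's actual algorithm.

```python
from collections import Counter


def interest_degrees(page_categories):
    """Assumed interest degree: share of browsed pages in each category."""
    counts = Counter(page_categories)
    total = sum(counts.values())
    if total == 0:
        return {}  # no browsing history, so no interest model
    return {cat: n / total for cat, n in counts.items()}


def pick_intent(sub_intents, interest):
    """Return the sub-intent label that best matches the interest model.

    sub_intents maps a label to {category: weight}. The score is an
    assumed weighted overlap between that distribution and the
    user's interest degrees.
    """
    def score(dist):
        return sum(w * interest.get(cat, 0.0) for cat, w in dist.items())

    return max(sub_intents, key=lambda label: score(sub_intents[label]))


# Hypothetical data: categories of pages the user browsed, and
# sub-intents obtained by clustering the search log for one query.
browsed = ["Computers", "Computers", "Sports", "Computers"]
candidates = {
    "java programming": {"Computers": 1.0},
    "java island travel": {"Recreation": 0.8, "Regional": 0.2},
    "java coffee": {"Shopping": 1.0},
}

model = interest_degrees(browsed)
best = pick_intent(candidates, model)
print(best)
```

With this toy data the user's history is dominated by the Computers category, so the programming sub-intent receives the highest score. A real system would need the page classifier and the search log clustering described above to produce these inputs.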
Keywords/Search Tags: Text classification, Feature selection, Spam classification, Gender classification, Query intent identification, Active learning, User interest