Implementation And Application Of An Effective Text Categorization Method Named MDCC

Posted on: 2019-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Qiao
GTID: 2348330542497635
Subject: Computer technology
Abstract/Summary:
Massive information resources exist in the form of text. How to quickly find the content users are interested in within this mass of information is a problem that information processing must solve in the current Internet environment. Text Categorization (TC) is one of the effective ways to analyze large amounts of text. TC uses machine learning so that a computer can classify text automatically: given any input text, the computer assigns it to a category according to the knowledge it has already learned, helping users locate topics of interest more quickly.

Documents are usually represented with the Vector Space Model, in which the words of a document serve as classification features and make up its feature vector. Because a document contains a large number of words and many of them contribute little to classification, retaining all of them causes the "curse of dimensionality". Feature selection is therefore needed to reduce the dimension of the feature vector. TF-IDF, information gain, the chi-square test and mutual information are the commonly used classical feature selection algorithms. These traditional methods have deficiencies in text classification. For example, TF-IDF cannot combine a feature word with category information, while information gain and the chi-square test ignore the semantic information of the feature word in the text. These deficiencies affect classification performance.

This dissertation analyzes and compares the characteristics of many classical text feature selection methods. We propose a text classification method based on Max Difference Category Contribution (MDCC) that combines the category features and semantic features of feature words. MDCC also considers the relationship between feature words and multiple categories. The method calculates a word's weight from its frequency in the text and its maximum difference value across categories. At the same time, MDCC optimizes the feature representation by combining the relationship between feature words and different categories.

The main work of this dissertation is as follows.

First, a text classification method based on maximum difference and category contribution is proposed. The maximum difference is applied to the selection of text feature words, and a category contribution model is established according to the relationship between words and categories. The method selects as key words those with the largest difference value and the strongest semantic features. During text feature representation, a category contribution degree vector is calculated for each feature word according to the word's importance in the different categories. Finally, the feature vectors of the feature words in the text are accumulated to obtain the text feature vector used for classification. Comparative experiments on three publicly available corpora, 20 Newsgroups, Reuters and WebKB, show that this method significantly increases both MicroF1 and MacroF1 values.

Second, this dissertation develops and implements a college topic comment system based on multi-source data, and verifies the effectiveness of the proposed text categorization method with a concrete example. The system mainly provides efficient automatic generation of topic tags, sentiment analysis and topic category decision by combining maximum difference (MD) feature word selection with other text categorization methods. Among them, the automatic generation of university topic labels is realized using the maximum difference together with the TF-IDF algorithm; the category judgment of topic information is implemented directly by the MDCC algorithm; and the comment tendency analysis function uses the MD algorithm to select features and construct the feature vector, which a classifier then uses for the sentiment decision. Based on the MDCC algorithm, the whole system mines university topic information and displays relevant college and university topics effectively, intuitively and in real time.
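The abstract does not give the MDCC formulas, so the following Python sketch is only one possible reading of the maximum difference and category contribution ideas, not the dissertation's actual method. It assumes that a word's "category contribution" is the share of its occurrences falling in each category, that its "maximum difference" is the gap between its two largest shares, and that a document vector is the sum of the contribution vectors of its selected feature words. All function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def category_contributions(docs, labels):
    """Return {word: {category: share}}.

    ASSUMPTION: 'category contribution' is taken to be the fraction of a
    word's total occurrences that fall in each category.
    """
    counts = defaultdict(Counter)  # word -> Counter(category -> occurrences)
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    categories = sorted(set(labels))
    contrib = {}
    for word, per_cat in counts.items():
        total = sum(per_cat.values())
        # Counter returns 0 for categories the word never appears in.
        contrib[word] = {c: per_cat[c] / total for c in categories}
    return contrib


def max_difference(shares):
    """ASSUMPTION: score = largest share minus second-largest share."""
    ranked = sorted(shares.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]


def select_features(contrib, k):
    """Keep the k words with the highest score (ties broken alphabetically)."""
    ranked = sorted(contrib, key=lambda w: (-max_difference(contrib[w]), w))
    return ranked[:k]


def document_vector(tokens, contrib, features, categories):
    """Accumulate the contribution vectors of the selected words in a document.

    Each occurrence adds once, so term frequency is reflected in the sum.
    """
    selected = set(features)
    vec = [0.0] * len(categories)
    for tok in tokens:
        if tok in selected:
            for i, c in enumerate(categories):
                vec[i] += contrib[tok][c]
    return vec


# Hypothetical toy corpus, for illustration only.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "win"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

contrib = category_contributions(docs, labels)
features = select_features(contrib, 3)
categories = sorted(set(labels))
vector = document_vector(["goal", "match", "stock"], contrib, features, categories)
```

Under these assumptions, a word such as "team", which appears in both categories, receives a lower score than a word confined to one category, and the resulting document vector has one dimension per category rather than one per vocabulary word.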
Keywords/Search Tags: text mining, topic classification, emotion classification, tag generation, college topic comment system