
The Research And Application Of Text Categorization Based On Machine Learning

Posted on: 2016-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
GTID: 2308330473954430
Subject: Computer software and theory
Abstract/Summary:
Information technology has developed rapidly in recent years. Network usage in particular has grown dramatically, and the volume of text information keeps increasing. People can no longer rely on manual effort alone to efficiently find the key content in such vast amounts of information. To address this problem, text classification methods based on machine learning have attracted attention and gradually become a popular trend. The main contents of this thesis are as follows:

1. The thesis presents an algorithm that combines concept indexing with principal component analysis (PCA) to effectively reduce the dimension of the feature space. The concept indexing step computes prototype vectors for the texts and the inner product of each original text vector with the prototype vectors, so that the original text vector is projected onto a subspace of much lower dimension than the original space. The PCA step separately computes the covariance matrix of the document vectors, obtains its eigenvalues and eigenvectors, and transfers each vector to the new subspace. Combining the two techniques achieves dimensionality reduction without affecting classification accuracy.

2. The thesis proposes a context-based text learning algorithm. Its core consists of two parts: training-set classification and contextual learning classification. Training-set classification is based mainly on headings and assigns each class a corresponding index. It computes feature weights over all documents of each category and scores the feature words iteratively. Contextual learning classification first extracts feature words with a rule mining algorithm and builds a matrix of contexts and feature words. Each value in the matrix is a reference score that represents the importance of a context for a feature word. For each feature word, the sum of the reference scores over all contexts is calculated. The context with the highest reference score is set as the context of the input text. By combining traditional statistical analysis with context analysis, the algorithm can learn all the categories of a document classification task at once.

3. Building on the algorithm analysis, the thesis gives detailed experimental results and analysis. Five classic datasets are used as experimental subjects, each containing more than a thousand records. On the different datasets, the proposed algorithms are compared in detail with classical efficient algorithms to evaluate their performance. The experimental results show that the two algorithms classify text efficiently and are highly practical.

The two learning algorithms presented in this thesis handle the training set from different angles: the former through dimensionality reduction, the latter through score-based ranking. Both reduce learning cost and improve classification accuracy. Simulations on various types of datasets show this advantage clearly, especially on datasets of high complexity. The algorithms in this thesis perform much better than existing efficient algorithms. Finally, the thesis summarizes the research and implementation of the two proposed algorithms and identifies directions for future improvement.
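
The concept indexing step in item 1 (prototype vectors, then inner products that project each text vector onto a smaller subspace) can be illustrated with a minimal Python sketch. The abstract does not say how the prototype vectors are built, so this sketch assumes one normalized class centroid per category, which is a common form of concept indexing. The function names and the toy term-count data are hypothetical, and the PCA step is not shown.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def prototype_vectors(docs, labels):
    """Build one unit-length prototype per class, in sorted label order.

    Assumption: the prototype is the class centroid. Normalizing the
    per-class sum gives the same direction as normalizing the mean.
    """
    sums = {}
    for vec, label in zip(docs, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
    return [normalize(sums[label]) for label in sorted(sums)]


def project(vec, prototypes):
    """Project a text vector onto the prototypes via inner products."""
    return [sum(a * b for a, b in zip(vec, p)) for p in prototypes]


# Hypothetical toy data: 4 documents over a 5-term vocabulary, 2 classes.
docs = [
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 1, 1, 2, 0],
]
labels = ["sports", "sports", "finance", "finance"]

# Prototypes are ordered as sorted(labels): finance first, then sports.
prototypes = prototype_vectors(docs, labels)

# Each 5-dimensional document becomes a 2-dimensional vector,
# one coordinate per class prototype.
reduced = [project(normalize(d), prototypes) for d in docs]

for label, vec in zip(labels, reduced):
    print(label, [round(x, 3) for x in vec])
```

In this sketch the reduced dimension equals the number of classes, so a vocabulary of many thousands of terms would shrink to a handful of coordinates. Whether the thesis uses exactly this construction is not stated in the abstract.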
Keywords/Search Tags: machine learning, text classification, feature extraction, dimensionality reduction, concept index, principal component analysis, contextual learning algorithm