Research On Text Classification Algorithms Based On Machine Learning

Posted on: 2018-11-29
Degree: Master
Type: Thesis
Country: China
Candidate: P X Deng
GTID: 2348330518496527
Subject: Information and Communication Engineering
Abstract/Summary:
With the arrival of the big data era, the growing volume of data on the Internet has become highly valuable. Unstructured data, represented by text, can serve as datasets for all kinds of data mining tasks, such as user profiling and public opinion detection, while also helping people provide rich content, express feelings, and share experience. Text classification, a foundational task in the field of Natural Language Processing (NLP), not only helps select information automatically and speeds up information processing, but also supports complex tasks such as sentiment analysis, automatic summarization, and human-computer dialogue, which provide users with intelligent and personalized services. According to the amount of labeled data in the training set, text classification can be divided into supervised and semi-supervised text classification, and research on semi-supervised text classification remains scarce. Therefore, further improving the accuracy of text classification and solving text classification problems in complex scenarios are hot topics in NLP.
In supervised text classification, sufficient labeled samples can be used to train complex models and achieve better performance. Compared with shallow learning models, neural network models have a strong capacity for feature extraction and for modeling complex problems. The dimensionality of the features produced by traditional text representation models is not high enough to fully train a deep neural network. In contrast, word embeddings, which carry semantic and syntactic regularities, transform text into two-dimensional grid data that is suitable for convolution. Moreover, convolution is specialized for spatial relations, which makes it possible to extract contextual and structural patterns automatically. Therefore, the Convolutional Neural Network (CNN) greatly improves performance in supervised text classification.
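
As a rough illustration of the convolution over embedded text described above, the following minimal sketch slides one filter over a sentence matrix and applies max-over-time pooling. It is not the thesis's model: the function name `conv_max_pool`, the toy vocabulary, the 3-dimensional random embeddings, and the filter weights are all assumptions made for illustration.

```python
import random


def conv_max_pool(embedded, filt, bias=0.0):
    """Slide one filter over a sentence matrix, then max-pool over time.

    embedded: list of word vectors (sentence_length x embedding_dim)
    filt:     list of weight rows  (window_size x embedding_dim)
    """
    window = len(filt)
    if len(embedded) < window:
        raise ValueError("sentence is shorter than the filter window")
    activations = []
    for start in range(len(embedded) - window + 1):
        total = bias
        for offset in range(window):
            word_vec = embedded[start + offset]
            weights = filt[offset]
            total += sum(w * x for w, x in zip(weights, word_vec))
        activations.append(max(0.0, total))  # ReLU non-linearity
    return max(activations)  # keep the strongest response in the sentence


# Hypothetical toy embeddings; a real system would load trained vectors.
rng = random.Random(0)
vocab = ["the", "film", "was", "not", "good"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(3)] for w in vocab}

sentence = ["the", "film", "was", "not", "good"]
matrix = [embeddings[w] for w in sentence]  # 5 x 3 grid of numbers
bigram_filter = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
feature = conv_max_pool(matrix, bigram_filter)
```

Each filter yields one feature per sentence regardless of sentence length. A classifier would typically combine many such filters with several window sizes.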
In addition, we propose employing neural networks with different structures to capture useful features from texts of different lengths, which further improves classification accuracy.
In semi-supervised text classification tasks, the lack of labeled data often leads to under-fitting or over-fitting in supervised classification models. Co-training, which relies on distinct feature spaces, has achieved good results when combined with supervised classifiers. However, the difficulty in text co-training lies in finding two views of the content that satisfy the conditions of sufficient redundancy and conditional independence. In this thesis, two different feature spaces are constructed from different text representation models, which are built from different perspectives and in different ways. Using them as the global and detail views of co-training removes the dependence on specific scenarios found in existing models. On this basis, an improved co-training algorithm that employs multiple under-sampling for imbalanced datasets is also presented. The experimental results show that the proposed co-training model achieves superior results in semi-supervised text classification.
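
The two-view co-training idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's algorithm: a nearest-centroid classifier stands in for the supervised base classifiers, the function names (`centroid_fit`, `centroid_predict`, `co_train`) and the toy data are hypothetical, and the multiple under-sampling step for imbalanced data is omitted.

```python
def centroid_fit(X, y):
    """Return the mean vector of each class (a stand-in base classifier)."""
    sums, counts = {}, {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def centroid_predict(model, vec):
    """Return (label, margin): margin is the gap between the two nearest centroids."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, centre)), label)
        for label, centre in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(view_a, view_b, labels, rounds=5, per_round=1):
    """labels[i] is a class label, or None for an unlabeled example."""
    labels = list(labels)
    for _ in range(rounds):
        labeled = [i for i, lab in enumerate(labels) if lab is not None]
        unlabeled = [i for i, lab in enumerate(labels) if lab is None]
        if not unlabeled:
            break
        newly = {}
        for view in (view_a, view_b):
            # Each view trains its own classifier on the shared labeled pool.
            model = centroid_fit([view[i] for i in labeled],
                                 [labels[i] for i in labeled])
            scored = sorted(((centroid_predict(model, view[i]), i)
                             for i in unlabeled),
                            key=lambda item: -item[0][1])
            # Its most confident predictions become labels for both views.
            for (label, _margin), i in scored[:per_round]:
                newly.setdefault(i, label)
        for i, label in newly.items():
            labels[i] = label
    return labels


# Toy data: two 1-dimensional views of six documents, two labeled.
view_a = [[0.0], [0.2], [0.1], [1.0], [0.9], [0.8]]
view_b = [[5.0], [4.8], [5.1], [1.0], [1.2], [0.9]]
seed_labels = [0, None, None, 1, None, None]
result = co_train(view_a, view_b, seed_labels)
```

In this sketch each view contributes its most confident pseudo-labels to a shared pool, so a document that is ambiguous in one view can still be labeled by the other.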
Keywords/Search Tags:text classification, Convolutional Neural Network, word embeddings, co-training, Semi-Supervised Learning