Research On The Essential Technology Of Multi-Label Chinese Text Classification

Posted on: 2019-02-08; Degree: Master; Type: Thesis
Country: China; Candidate: L Du; Full Text: PDF
GTID: 2428330548476374; Subject: Computer technology
Abstract/Summary:
With the rapid spread of e-commerce, the volume of product data on the Internet has exploded. User reviews serve as an important reference for other users judging a product's quality before they buy, and they also let businesses understand how their products are received so that they can improve them. How to find the required information quickly, accurately and comprehensively in this large body of review data has become an important research topic in computer science. Because a single product review may cover several aspects of a product, reviews are multi-label data. Multi-label text classification is one of the core technologies of data mining; it supports information classification, retrieval and management, and has important research value.

This thesis studies the main stages of multi-label Chinese text classification: text preprocessing, pre-classification, text representation, feature selection and the multi-label classification algorithm. Within this pipeline it focuses on two key techniques, word representation and multi-label classification, and proposes a new word representation model, OCWE, and an improved multi-label classification algorithm, wML-kNN.

Word representation is a very important step in multi-label text classification and is now basic groundwork in natural language processing. The thesis systematically analyses the principles of existing word representation techniques. Because existing Chinese representation techniques have largely followed approaches designed for English, it proposes, building on the work of other scholars, a new Chinese text representation model, OCWE, tailored to the characteristics of Chinese text. Specifically, the method is based on the CBOW model; it takes word order into account by joining the context vectors, in order, to form the model input. To reflect the characteristics of Chinese text, each input word vector is composed from the vector of the word itself and the vectors of all the characters that form the word. Experiments show that word vectors that take word order and word composition into account improve significantly on word vectors that ignore them.

The multi-label classification algorithm ML-kNN evolved from the well-known kNN algorithm. It combines kNN with Bayesian inference to learn a classifier that handles multi-label data effectively. However, ML-kNN tends to misjudge, or only partly recover, the label set of an unseen instance when the labels are unevenly represented among the training instances or when training instances with different labels are not uniformly distributed in space, and its performance suffers in such cases. The thesis therefore proposes a weighted ML-kNN algorithm (wML-kNN), which assigns a separate weight to each label according to the proportion of that label in the training instances' label sets and the mutual information of the spatial distribution of the unseen instance relative to the training instances. This reduces the probability of misjudging an unseen instance's label set. The results show that wML-kNN outperforms four multi-label learning algorithms, including ML-kNN.
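
To make the kNN-plus-Bayes combination concrete, the following is a minimal sketch of the baseline ML-kNN decision rule only, not of the weighted variant proposed in the thesis. The abstract does not give the weighting formula, so it is not reproduced here. The function names, the toy data, the Euclidean distance and the defaults `k=3` and `s=1.0` are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbour_label_counts(x, X, Y, k, skip=None):
    """For each label, count how many of the k nearest training points carry it."""
    candidates = [i for i in range(len(X)) if i != skip]
    nearest = sorted(candidates, key=lambda i: euclidean(x, X[i]))[:k]
    return [sum(Y[i][l] for i in nearest) for l in range(len(Y[0]))]


def fit_mlknn(X, Y, k=3, s=1.0):
    """Estimate label priors and neighbour-count likelihoods (s is Laplace smoothing)."""
    m, n_labels = len(X), len(Y[0])
    prior = [(s + sum(y[l] for y in Y)) / (2 * s + m) for l in range(n_labels)]
    # has[l][j]: training points WITH label l that have j neighbours with label l.
    # lacks[l][j]: the same count for training points WITHOUT label l.
    has = [[0] * (k + 1) for _ in range(n_labels)]
    lacks = [[0] * (k + 1) for _ in range(n_labels)]
    for i in range(m):
        counts = neighbour_label_counts(X[i], X, Y, k, skip=i)
        for l in range(n_labels):
            table = has if Y[i][l] else lacks
            table[l][counts[l]] += 1

    def smooth(row):
        total = sum(row)
        return [(s + c) / (s * (k + 1) + total) for c in row]

    return {"X": X, "Y": Y, "k": k, "prior": prior,
            "like1": [smooth(row) for row in has],
            "like0": [smooth(row) for row in lacks]}


def predict_mlknn(model, x):
    """Assign label l when the posterior for 'has l' beats the one for 'lacks l'."""
    counts = neighbour_label_counts(x, model["X"], model["Y"], model["k"])
    labels = []
    for l, j in enumerate(counts):
        p_has = model["prior"][l] * model["like1"][l][j]
        p_lacks = (1 - model["prior"][l]) * model["like0"][l][j]
        labels.append(1 if p_has > p_lacks else 0)
    return labels


# Hypothetical toy data: three clusters labelled [1,0], [0,1] and [1,1].
TOY_X = [[0, 0], [0, 1], [1, 0], [1, 1],
         [5, 5], [5, 6], [6, 5], [6, 6],
         [10, 0], [10, 1], [11, 0], [11, 1]]
TOY_Y = [[1, 0]] * 4 + [[0, 1]] * 4 + [[1, 1]] * 4

toy_model = fit_mlknn(TOY_X, TOY_Y, k=3)
print(predict_mlknn(toy_model, [10.5, 0.5]))
```

In this baseline every label is decided by the same unweighted comparison of `p_has` and `p_lacks`, which is where label imbalance or uneven spatial distribution can tip the decision. A per-label weighting scheme such as the one the abstract describes would plausibly act on that comparison, but its exact form would have to be taken from the full text.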
Keywords/Search Tags:Text classification, text representation, word embedding, ML-kNN, multi-label