Research On Semi-supervised Short Text Classification Based On Co-operative Training

Posted on: 2018-09-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Han
GTID: 2348330536473564
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, information is growing exponentially. People can easily obtain large amounts of information through the Internet, and this information plays an important role in guiding their behavior. Short text is a very important information carrier on the Internet. The information contained in short texts is typically obtained directly by manual labeling. However, manual labeling requires a large number of professional and technical personnel and consumes a lot of manpower and material resources, yet only a small amount of text can be labeled this way. Because the number of texts on the Internet is very large, manual labeling is not suitable for classifying large-scale Internet text. Using machine learning to label unlabeled samples has therefore become a trend in Internet text processing, and improving the efficiency of sample labeling has become a hotspot of current research.

This paper studies semi-supervised short text classification based on co-training, and mainly covers the following aspects:

1. This paper analyzes short text classification and gives a semi-supervised short text classification system model based on co-training. The model is divided into three functional modules: a preprocessing module, a training module and a test module. The preprocessing module handles unstructured text: it applies formatting tags to the short text, then performs word segmentation, stop word removal, feature extraction, word frequency statistics and text vectorization to produce a structured data set. The training module, on the one hand, constructs the classifiers according to the principle of difference and uses them to label the unlabeled samples; on the other hand, it uses the training sample set to train the classifiers and obtain the final classifier. The test module evaluates the classifier on the test sample set to verify the feasibility and effectiveness of the co-training method.

2. Combined with semi-supervised co-training, a short text classification method is given, and the feature extraction method and the co-training method are further improved.

(1) Improvement of the feature extraction method. In view of the limited number of words in a short text, an adjacency matrix between the words of the short text is constructed from the perspective of the semantic relations between the words, and an undirected graph is then constructed by computing similarities from the adjacency matrix. A feature score is calculated for each word from its adjacency in the graph, and the words with high scores are extracted as feature words. Because it takes the semantic similarity between words into account, this feature extraction method supports more effective short text classification than the traditional method.

(2) Improvement of the co-training algorithm. To label the unlabeled samples, the classifiers are trained through "mutual aid" among multiple classifiers. For a binary classification problem, if all three classifiers assign the same label to a sample, the label is considered to have high confidence and the sample is placed in the labeled sample set. If the labels differ, two of the classifiers must still agree, and their shared result is used to train the third classifier. The classifiers are trained repeatedly during the labeling process, which ultimately yields a classifier with better performance.

3. The validity of the semi-supervised short text classification method is verified on short texts collected from Internet websites. Using short text posts collected from Sina, Sohu, Netease and other major websites as the data set, this paper compares the improved method with the traditional short text classification method in terms of accuracy, recall and F1 value, in order to verify the feasibility and effectiveness of the method.

In summary, this paper constructs a semi-supervised short text classification model based on co-training and gives the corresponding classification method. The short text feature extraction method and the semi-supervised co-training method are both improved, and the improved method is compared with the traditional method in a contrast experiment. The experimental results show that the proposed method can effectively improve the efficiency of short text classification.
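The three-classifier labeling rule described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the thesis implementation: the `CentroidClassifier` toy model, the `co_train` function, the way the three classifiers are made to differ (each one skips every third labeled sample), and the `max_rounds` parameter are all assumptions introduced here, since the abstract does not specify them.

```python
from collections import Counter


class CentroidClassifier:
    """Toy bag-of-words classifier (hypothetical stand-in for the thesis classifiers)."""

    def fit(self, pairs):
        # Accumulate one word-count "centroid" per class label.
        self.centroids = {}
        for tokens, label in pairs:
            self.centroids.setdefault(label, Counter()).update(tokens)
        return self

    def predict(self, tokens):
        # Score a class by how much of its centroid mass the document's words cover.
        def score(label):
            centroid = self.centroids[label]
            return sum(centroid[t] for t in tokens) / (sum(centroid.values()) or 1)
        return max(sorted(self.centroids), key=score)


def co_train(labeled, unlabeled, max_rounds=10):
    """Label `unlabeled` token tuples with three classifiers that help each other.

    Assumes a binary problem, hashable token tuples, and enough labeled
    samples that every classifier still sees both classes.
    """
    labeled = list(labeled)
    pending = list(unlabeled)
    # extra[i] holds samples that the other two classifiers agreed on
    # while classifier i disagreed; they are used to retrain classifier i.
    extra = [{}, {}, {}]
    classifiers = []
    for _ in range(max_rounds):
        classifiers = []
        for i in range(3):
            # Assumed diversity scheme: classifier i skips every third sample.
            view = [p for j, p in enumerate(labeled) if j % 3 != i]
            view += list(extra[i].items())
            classifiers.append(CentroidClassifier().fit(view))
        still_pending = []
        for tokens in pending:
            votes = [c.predict(tokens) for c in classifiers]
            label, count = Counter(votes).most_common(1)[0]
            if count == 3:
                # Unanimous vote: treat as high confidence and move to the labeled set.
                labeled.append((tokens, label))
            else:
                if count == 2:
                    # Two agree: their label becomes training data for the third.
                    dissenter = next(i for i, v in enumerate(votes) if v != label)
                    extra[dissenter][tokens] = label
                still_pending.append(tokens)
        pending = still_pending
        if not pending:
            break
    return classifiers, labeled


seed_samples = [
    (("goal", "match", "team"), "sports"),
    (("stock", "market", "bank"), "finance"),
    (("team", "score", "coach"), "sports"),
    (("bank", "loan", "rate"), "finance"),
    (("match", "coach", "league"), "sports"),
    (("market", "rate", "fund"), "finance"),
]
new_samples = [("team", "match", "coach"), ("bank", "market", "rate")]
trained, all_labeled = co_train(seed_samples, new_samples)
```

Samples on which the classifiers never reach a unanimous vote stay unlabeled when `max_rounds` is exhausted. A real system would replace the toy classifier with stronger models and add a confidence threshold, both of which are left out of this sketch.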
Keywords/Search Tags: Semi-supervised learning, Co-training, Annotated, Short-text classification, Classifier