Semi-supervised Text Categorization Technology Research Based On The Semantic Analysis

Posted on: 2018-10-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Xu
GTID: 2348330563952360
Subject: Software engineering
Abstract/Summary:
Since the start of the 21st century, the rapid development of the Internet has produced a vast amount of text, and quickly finding the most useful information has become very important. Text classification based on artificial intelligence can assign large numbers of natural-language documents to subject categories according to their semantics, which helps readers grasp the key information. Text classification relies mainly on a trained classifier: by learning from many hand-labeled documents, a machine learning algorithm produces a conventional text classifier that performs Text Categorization (TC). This way of learning has an obvious shortcoming, however, because labeling all the data by hand is labor-intensive and time-consuming. Semi-supervised learning has been proposed to address this, since it requires only part of the documents to be labeled manually. But it does not solve the problem completely, because it remains infeasible when classifying very large amounts of Web data. To eliminate manual labeling altogether, we propose a supervised text categorization approach based on automatic tagging. In the automatic tagging stage, we label all of the original documents using the semantic similarity between the category name and the document content, where the category name is semantically extended with external semantic resources. In the text classification stage, we first preprocess the documents with word segmentation and a stop-word list. Second, we select document features by computing CHI values and then weight the selected feature words. Finally, we feed the resulting numerical training data to a machine learning algorithm for supervised learning and perform text classification. Experiments show that this method, which uses no manually annotated data, can perform large-scale text classification.

However, the supervised text classifier based on automatic tagging must label all documents, which introduces noise into the training data and keeps the results from meeting the accuracy requirement. To address the low accuracy, we put forward a semi-supervised text classification technique based on improved automatic tagging. Our approach improves on two aspects. First, we improve automatic tagging by combining additional external semantic resources with the original documents themselves when extending the category name, and we then refine the initial labeling result with a secondary filtering algorithm to ensure that part of the training documents is labeled accurately. Second, we adopt semi-supervised learning for text classification, so that we only need to combine the accurately labeled documents with a large number of unlabeled documents as training data to obtain a text classifier with high precision. The experimental results show that the improved semi-supervised text categorization technique effectively avoids the noise introduced by automatic tagging, and that its average classification accuracy is higher than that of the other supervised models. This demonstrates the commercial value of semi-supervised algorithms in automatic text categorization.
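
The two core steps named in the abstract, automatic tagging by semantic similarity and feature selection by CHI value, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`cosine`, `auto_tag`, `chi_square`), the bag-of-words cosine similarity, and the toy category term lists are all assumptions made for illustration, and the hand-written term lists stand in for the category-name extension that the thesis derives from external semantic resources. The CHI statistic uses the standard form N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)).

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def auto_tag(doc_tokens, category_terms):
    """Assign the category whose (hypothetical) expanded term list is most
    similar to the document; returns (category, similarity score)."""
    doc_vec = Counter(doc_tokens)
    scores = {
        cat: cosine(doc_vec, Counter(terms))
        for cat, terms in category_terms.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


def chi_square(term, category, labeled_docs):
    """CHI statistic of `term` for `category` over (tokens, label) pairs.

    A: docs in category containing term     B: docs outside containing term
    C: docs in category without term        D: docs outside without term
    """
    a = b = c = d = 0
    for tokens, label in labeled_docs:
        has_term = term in tokens
        if label == category:
            if has_term:
                a += 1
            else:
                c += 1
        else:
            if has_term:
                b += 1
            else:
                d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


# Toy data (assumed): expanded category names and already-segmented documents.
category_terms = {
    "sports": ["football", "match", "team"],
    "finance": ["stock", "market", "bank"],
}
docs = [
    ["football", "team", "win"],
    ["stock", "market", "fall"],
    ["team", "match", "score"],
    ["bank", "stock", "loan"],
]

# Step 1: automatic tagging replaces manual labels.
labeled = [(tokens, auto_tag(tokens, category_terms)[0]) for tokens in docs]

# Step 2: rank candidate feature words for one category by CHI value.
vocabulary = sorted({t for tokens in docs for t in tokens})
ranked = sorted(
    vocabulary,
    key=lambda t: chi_square(t, "sports", labeled),
    reverse=True,
)
```

In this sketch the similarity score returned by `auto_tag` is kept alongside the label, so a later stage could discard low-confidence labels; the abstract does not describe how the thesis's own secondary filtering algorithm works, so no such filter is shown here.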
Keywords/Search Tags: text categorization, semi-supervised, automatic tagging, classifier