
Research on Web Text Mining Based on Multi-Instance Multi-Label Classification

Posted on: 2018-02-17, Degree: Master, Type: Thesis
Country: China, Candidate: L H Wang, Full Text: PDF
GTID: 2348330536957361, Subject: Engineering
Abstract/Summary:
With the rapid development of network technology and the rapid growth of Internet information resources, classifying large amounts of data poses new requirements. How to represent text effectively and retrieve information efficiently has become an urgent research task, and text mining research is becoming increasingly important. In practice, many texts carry multiple labels, which poses new challenges for text classification. Traditional text categorization is single-instance, single-label classification, which cannot accurately handle texts with multiple meanings and multiple categories. This thesis applies multi-instance multi-label learning to classify multi-label text accurately and effectively. The main work covers the following aspects:

(1) The multi-instance multi-label (MIML) learning framework is used to classify Chinese text. Multi-instance learning was proposed to address semantic ambiguity, and multi-label learning to address membership in multiple classes. MIML research has so far focused mainly on image classification and web search, where it has achieved good results. This thesis applies the MIML method to Chinese text classification and adapts the MIML framework to fit that task. Combining the particular structure of Chinese with the MIML method yields a new approach to Chinese text classification.

(2) Text representation is a key step in text classification and strongly influences the performance of the classifiers that follow. In view of the semantic richness of Chinese text, a bag of sentences is used to represent each text. The vector space model (VSM) is currently the mainstream text representation. It treats the word as the segmentation granularity and assumes that features are independent, so the semantic information between words is lost. To address this semantic loss, a multi-instance text representation is introduced: each text is processed as a bag of multiple instances with the sentence as the smallest unit of representation, so the semantic information between words is preserved. Representing text as multi-instance sentence bags in the data representation phase avoids the semantic loss caused by the independence assumption. The sentence bags are further optimized into theme bags, which shortens text processing time.

(3) In the text classification stage, an improved LSTSVM multi-label classifier is used. For text represented by multi-instance theme bags, a clustering-based degeneration strategy converts the multi-instance multi-label data into single-instance multi-label data. The text is then classified with an improved least squares twin support vector machine (LSTSVM) multi-label classifier. LSTSVM transforms one large-scale quadratic programming problem (QPP) into two small-scale problems, which improves computational speed and reduces computational complexity.

(4) Based on the improved algorithm design, a multi-instance multi-label text classification system is built. The improved algorithm is validated and analyzed on the Reuters-21578 news corpus, the Emotion data set, and the Chinese corpus data set of Tongji University. Experimental results show that the improved algorithm outperforms existing multi-label classification algorithms on the evaluation metrics.
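To make the classification step in (3) more concrete, the following is a minimal sketch of a plain binary, linear least squares twin SVM, not the improved multi-label classifier of the thesis. It assumes the standard LSTSVM formulation, in which each of the two small problems reduces to a linear system with a closed-form solution. The function names `fit_lstsvm` and `predict_lstsvm`, the penalty parameters `c1` and `c2`, and the small ridge term `reg` are illustrative assumptions.

```python
import numpy as np


def fit_lstsvm(A, B, c1=1.0, c2=1.0, reg=1e-8):
    """Fit a linear least squares twin SVM (illustrative sketch).

    A: samples of class +1, shape (m1, d)
    B: samples of class -1, shape (m2, d)
    Returns two planes (w, b): the first lies close to class +1,
    the second lies close to class -1.
    """
    # Augment each sample with a constant 1 so the bias is solved jointly.
    E = np.hstack([A, np.ones((A.shape[0], 1))])
    F = np.hstack([B, np.ones((B.shape[0], 1))])
    # Assumption: a tiny ridge term keeps the systems well conditioned.
    ridge = reg * np.eye(E.shape[1])

    # Plane 1: minimize ||E u||^2 / 2 + c1 * ||F u + e||^2 / 2.
    u = -np.linalg.solve(F.T @ F + (E.T @ E) / c1 + ridge,
                         F.T @ np.ones(B.shape[0]))
    # Plane 2: minimize ||F v||^2 / 2 + c2 * ||E v - e||^2 / 2.
    v = np.linalg.solve(E.T @ E + (F.T @ F) / c2 + ridge,
                        E.T @ np.ones(A.shape[0]))
    return (u[:-1], u[-1]), (v[:-1], v[-1])


def predict_lstsvm(X, plane_pos, plane_neg):
    """Assign each row of X to the class whose plane is nearer."""
    (w1, b1), (w2, b2) = plane_pos, plane_neg
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)
```

A multi-label classifier could, for example, train one such binary model per label after the multi-instance bags have been degenerated to single feature vectors. That one-model-per-label arrangement is an assumption for illustration and is not stated in the abstract.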
Keywords/Search Tags: multi-instance learning, least squares twin support vector machine, text classification, multi-label classification