Design And Implementation Of Text Classifier For Enterprise Technology Requirement

Posted on: 2018-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: L P Zhang
GTID: 2348330542951904
Subject: Engineering
Abstract/Summary:
Small and medium-sized enterprises make an important contribution to China's economic development. However, problems such as a low level of technical staff and a lack of innovation shorten the life of these enterprises. At the same time, some advanced research results from universities have not been applied in China, so the corresponding social benefits have not been produced. To address this, the members of our laboratory plan to establish a network platform that promotes cooperation between universities and enterprises, so that enterprises can get technical support from experts in related fields. In this thesis, a text classifier for the technical requirements of enterprises was designed and implemented. The technical requirements are classified into the first-level disciplines of Engineering, and this classification can serve as one basis for the platform's recommendation function.

Text classification, an important technology in the field of Natural Language Processing, has gradually become a focus of research. Many research results on text classification already exist, but most of the research and improvement targets Chinese word segmentation and classification algorithms rather than feature extraction. The feature extraction algorithm is therefore taken as the main research point, and two improved LDA-based feature extraction algorithms are proposed in the thesis, which can reduce the dimension of the vector space and yield better classification results. The research background of this topic is special: existing text classifiers and classification corpora (data sets) do not meet the above requirements, which is a major challenge for this project. The main work is as follows:

(1) Abstracts of dissertations in the Wanfang database were obtained with a web crawler to construct a classification experiment corpus that matches the classification requirements of this topic. A standard classification corpus (the Sogou news corpus) and the self-built corpus are used in comparative experiments to verify that the improved LDA feature extraction algorithms are generally applicable.

(2) Two word segmentation systems, ICTCLAS and JIEBA, were used in experiments on the corpus text. JIEBA was chosen for the classifier's word segmentation according to the size of the word segmentation results, and the performance of the word segmentation was then tested.

(3) To achieve better classification results, the LDA topic model is applied to the feature extraction stage (LDA), and two new feature selection methods based on the LDA topic model (LDA_SD and LDA_WORD) are proposed in this thesis. Two conventional feature selection methods, mutual information (MI) and document frequency (DF), are implemented and compared with the above feature extraction methods.

(4) The classification results of three classification algorithms, k-nearest neighbors (KNN), naive Bayes (NB) and support vector machine (SVM), are compared on the output of the different feature extraction methods, so that the algorithm with the best classification results can be chosen to build the classifier.

The text classifier for the technology requirements of enterprises was designed and implemented. The experimental results show that the classifier achieves excellent classification performance, although its practical application needs further verification. According to the experimental results, the LDA feature extraction method gives the best feature dimension reduction and very high classification efficiency, but relatively poor classification accuracy; the improved LDA-based feature extraction method LDA_WORD has the highest classification accuracy. These two feature extraction methods each have their own advantages and suit different situations.
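
The two baseline feature selection methods named in item (3), DF and MI, can be illustrated with a minimal sketch. The abstract does not give the formulas the thesis used, so this assumes the common textbook definitions: DF counts the documents that contain a term, and MI scores a term by its maximum pointwise mutual information with any class. The function names and the toy corpus are hypothetical, and the documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter, defaultdict


def document_frequency(docs):
    """Return, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df


def mutual_information(docs, labels):
    """Score each term by max over classes of log(P(t, c) / (P(t) * P(c))).

    Probabilities are estimated from document counts (an assumption;
    the thesis may use a different estimator or an averaged variant).
    """
    n = len(docs)
    df = document_frequency(docs)
    class_count = Counter(labels)
    joint = defaultdict(Counter)  # joint[term][label] = docs of label containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            joint[term][label] += 1
    return {
        term: max(
            math.log((n_tc * n) / (df[term] * class_count[label]))
            for label, n_tc in per_class.items()
        )
        for term, per_class in joint.items()
    }


def top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus: four pre-segmented requirement texts, two classes.
docs = [
    ["motor", "control", "system"],
    ["motor", "drive", "system"],
    ["alloy", "casting", "system"],
    ["alloy", "welding"],
]
labels = ["electrical", "electrical", "materials", "materials"]

df_scores = document_frequency(docs)
mi_scores = mutual_information(docs, labels)

print(top_k(df_scores, 3))
print(top_k(mi_scores, 3))
```

On this toy corpus the term "system" has the highest DF because it occurs in three of the four documents, yet it has the lowest MI score because it appears in both classes. This shows why the two criteria can select quite different feature sets.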
Keywords/Search Tags:text categorization, feature extraction, topic model