
Study On Multi-classification Method Of Chinese Agricultural Web Pages

Posted on: 2013-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: S S Wang
GTID: 2248330395965770
Subject: Agricultural mechanization project
Abstract/Summary:
With the rapid development of information technology and the spread of the Internet, the construction and service level of agricultural information systems have improved greatly. The volume, disorder, and complexity of agricultural information on the Internet bring convenience to agricultural workers but also make it harder to obtain useful information. How to classify and manage this information effectively, so that farmers can get the information they need promptly and accurately, has become an important research issue in agricultural informatization. The work completed in this thesis is as follows: 1) The key technologies of text classification were studied in depth: text pre-processing, Chinese word segmentation, feature selection algorithms, feature weighting algorithms, machine learning algorithms, and classification evaluation standards. Using agricultural web pages as the experimental corpus, the thesis examined text multi-classification, feature selection, feature weighting, and machine learning. 2) A classification standard for Chinese agricultural web pages was defined, and a web page corpus of Chinese agricultural sites was built. From each of five categories (navigation pages, news pages, policies and regulations pages, science and technology pages, and market information pages), 1,000 web pages were chosen at random, giving 5,000 pages in total as the training sample set for the classification experiments.
A further 500 web pages per category, 2,500 pages in total, were chosen as the test sample set. 3) The sample web pages were pre-processed with HTMLParser, then segmented with paoding-analysis and stripped of stop words. The chi-square statistic was used for feature selection, with the 300 highest-scoring words in each class chosen as the experimental features. The selected features were weighted with Boolean weighting, word frequency weighting, and TFIDF weighting. Four machine learning methods (multiple linear regression, Naive Bayes, K nearest neighbor, and support vector machine) were used for supervised learning on the three weighted feature vector spaces, yielding 12 multi-classification models for Chinese agricultural web pages. 4) Holding the machine learning method fixed and varying the feature weighting, precision, recall, and F1 measure were compared across the backtesting and forecasting results of the 12 models. The results show that no feature weighting method has an absolute advantage; each has its own strengths and weaknesses across the different machine learning models. Holding the feature weighting fixed and varying the machine learning method, the results show that the K nearest neighbor algorithm has the best learning ability (backtest): combined with word frequency weighting, its precision, recall, and F1 measure all reach 100%. The support vector machine has the best generalization ability (forecast): combined with Boolean weighting, its precision, recall, and F1 measure all reach about 99%. In conclusion, using a training set of 5,000 and a test set of 2,500 Chinese agricultural web pages, this thesis analyzed and compared the feature weighting algorithms and machine learning algorithms experimentally.
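The three feature weightings compared in the thesis can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the toy documents, the function names, and the particular TFIDF variant (raw term count times log(N / df)) are all assumptions made for illustration, and the documents are assumed to be already segmented into words.

```python
import math
from collections import Counter


def boolean_weights(tokens, vocabulary):
    """Boolean weighting: 1.0 if the feature occurs in the document, else 0.0."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]


def frequency_weights(tokens, vocabulary):
    """Word frequency weighting: raw count of each feature in the document."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


def document_frequencies(corpus, vocabulary):
    """Number of documents in the corpus that contain each vocabulary term."""
    vocab = set(vocabulary)
    df = Counter()
    for tokens in corpus:
        df.update(vocab.intersection(tokens))
    return df


def tfidf_weights(tokens, vocabulary, doc_freq, n_docs):
    """TFIDF weighting (assumed variant): count * log(n_docs / document frequency)."""
    counts = Counter(tokens)
    weights = []
    for term in vocabulary:
        df = doc_freq.get(term, 0)
        weights.append(counts[term] * math.log(n_docs / df) if df else 0.0)
    return weights


# Hypothetical, already-segmented toy documents (not from the thesis corpus).
corpus = [
    ["wheat", "price", "market", "price"],
    ["wheat", "policy", "regulation"],
    ["market", "price", "pork"],
]
# Hypothetical selected features; the thesis selects 300 per class by chi-square.
vocabulary = ["wheat", "price", "policy"]

doc_freq = document_frequencies(corpus, vocabulary)
doc = corpus[0]
bool_vec = boolean_weights(doc, vocabulary)
freq_vec = frequency_weights(doc, vocabulary)
tfidf_vec = tfidf_weights(doc, vocabulary, doc_freq, len(corpus))
```

Each function maps one document to a vector over the same selected features, so the resulting vectors can be fed to any of the four learning methods without further changes.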
The results indicate that the classifier based on the support vector machine combined with Boolean weighting achieves high performance in multi-classification of the agricultural web page corpus. When backtested on the training sample set, precision, recall, and F1 measure all reach about 99.9%; when forecasting on the test sample set, precision, recall, and F1 measure all reach about 99%.
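For reference, precision, recall, and F1 measure in a multi-class setting can be computed per class and then averaged. The Python sketch below assumes macro-averaging (an unweighted mean over classes); the abstract does not state which averaging the thesis used, and the labels and predictions here are hypothetical.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that appears."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical labels and predictions, not results from the thesis.
y_true = ["news", "news", "policy", "market", "market", "market"]
y_pred = ["news", "policy", "policy", "market", "market", "news"]

scores = per_class_scores(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_average(scores)
```

Running the same function once on predictions for the training set and once on predictions for the test set corresponds to the backtest and forecast figures reported above.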
Keywords/Search Tags: Chinese agricultural pages, text multi-classification, feature selection, feature weighting, machine learning, support vector machine, F1 measure