Research on Filtering Methods for Garbage Webpages in Agriculture Websites

Posted on: 2012-07-12
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Zhang
GTID: 2178330335986020
Subject: Control theory and control engineering
Abstract/Summary:
Xinjiang lies in the northwest of China. It has a vast territory and abundant resources, but that very size means long distances between rural areas and between town and countryside. This indirectly leaves rural areas poorly informed and hinders the growth of rural productivity. Building up information services in rural Xinjiang is therefore urgent: farmers need modern information channels to obtain real-time information and follow market trends. Among the available channels, agricultural websites are the most popular with farmers, because they provide professional agricultural information in real time, reflect the dynamics of the agricultural market, and offer a large volume of information. Through these websites farmers can browse comprehensive agricultural information, follow national agricultural policies as they are announced, adjust their planting structure, and sell farm produce. However, current agricultural websites share a common problem: they contain many pages of invalid information. These pages fall mainly into three groups: non-agricultural pages, agricultural pages without main content, and navigation pages. We call them "garbage webpages in agricultural websites". Their presence seriously hinders farmers from getting accurate market information in time. To help farmers obtain accurate and useful agricultural information promptly, we select appropriate webpage identification models and improve them to remove garbage webpages from agricultural websites.
In this thesis, after surveying a large body of domestic and international literature, I study the strengths and weaknesses of three webpage identification methods: Multiple Linear Regression, Naive Bayes, and Fisher. I combine each of these methods with document frequency and the chi-square test for feature extraction and with the JE, IK, and Paoding Chinese word segmenters, then analyse and compare the test results. To distinguish pages that belong to the agricultural category but have an empty main content from normal agricultural pages, I apply two pattern recognition methods, Naive Bayes and Fisher, using the same feature extraction model and the same Chinese word segmentation software. Based on the characteristics of these pages, I improve the feature extraction model by selecting phrases rather than single words as page features. This approach distinguishes normal pages from garbage pages more effectively. The work described here belongs to the key technologies of the agricultural search engine in《Rural science and technology information service platform key technology research and application demonstration》, a key scientific research project of the Xinjiang Uygur Autonomous Region.
Keywords: Agriculture websites, garbage webpages, pattern recognition, feature extraction model
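
The abstract names phrase features, document-frequency selection, and Naive Bayes but gives no implementation details. The following is only a minimal sketch of how those pieces might fit together, not the thesis's actual method. Several things are assumptions made for illustration: pages are taken to be already segmented into tokens (English stand-ins replace the output of a Chinese segmenter), a "phrase" is approximated as two adjacent tokens, and all function names, the `min_df` threshold, and the toy pages are hypothetical.

```python
import math
from collections import Counter


def phrase_features(tokens, n=2):
    """Join each run of n adjacent tokens into one phrase feature (assumed definition)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_by_document_frequency(docs, min_df=2):
    """Keep only features that occur in at least min_df documents."""
    df = Counter()
    for feats in docs:
        df.update(set(feats))  # count each feature once per document
    return {f for f, count in df.items() if count >= min_df}


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels, vocab):
        self.vocab = vocab
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(f for d in class_docs for f in d if f in vocab)
            total = sum(counts.values()) + len(vocab)
            self.log_likelihood[c] = {
                f: math.log((counts[f] + 1) / total) for f in vocab
            }
        return self

    def predict(self, feats):
        # Features outside the selected vocabulary are ignored.
        scores = {
            c: self.log_prior[c]
            + sum(self.log_likelihood[c][f] for f in feats if f in self.vocab)
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Hypothetical toy pages, already segmented into tokens.
pages = [
    (["wheat", "price", "rose", "this", "week"], "normal"),
    (["wheat", "price", "fell", "this", "week"], "normal"),
    (["home", "about", "contact", "site", "map"], "garbage"),
    (["home", "about", "login", "site", "map"], "garbage"),
]
docs = [phrase_features(tokens) for tokens, _ in pages]
labels = [label for _, label in pages]
vocab = select_by_document_frequency(docs, min_df=2)
model = NaiveBayes().fit(docs, labels, vocab)

print(model.predict(phrase_features(["wheat", "price", "steady", "this", "week"])))
print(model.predict(phrase_features(["home", "about", "news", "site", "map"])))
```

In this sketch the chi-square test could replace `select_by_document_frequency` as the selection step without changing the classifier; the toy data is far too small to say anything about which choice works better.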