Research On Duplicate Removal And Similarity Evaluation Of Chinese Agricultural Web Pages

Posted on: 2015-04-19  Degree: Master  Type: Thesis
Country: China  Candidate: T Zhao  Full Text: PDF
GTID: 2298330467474237  Subject: Agricultural mechanization project
Abstract/Summary:
With the rapid development of network information technology, the construction and service level of agricultural informatization have improved greatly. The massive and repetitive agricultural information on the internet is convenient for people engaged in agriculture, but it also makes it harder to find useful information quickly and accurately. How to effectively manage duplicate and near-duplicate agricultural web pages has become an important topic in research on agricultural vertical search engines. The main work of this paper covers the following aspects:

1) An in-depth study of the key technologies for removing duplicate text and judging similarity: web page preprocessing, web page text extraction, Chinese word segmentation, feature weighting algorithms, methods for removing duplicate web pages, text similarity calculation methods, and similarity evaluation criteria. Based on an agricultural web corpus, the paper focuses on duplicate web page removal, feature weighting algorithms, and similarity calculation methods.

2) A study of the criteria for defining duplicate and near-duplicate Chinese agricultural web pages, together with the construction of a Chinese agricultural web corpus. A collection of manually identified web pages was built. The collection contains 225 page sets, and each web page has 2 to 14 near-duplicate pages. In total, 1110 web pages form the test set.

3) Web page preprocessing. Exactly identical web pages in the set are removed with the MD5 method. Text is then extracted from the remaining pages and segmented with the Paoding word segmenter, and stop words are removed. Feature words are weighted with three methods: Boolean weighting, word frequency, and inverse document frequency. Finally, three similarity algorithms (the vector space model, HowNet-based semantic similarity, and latent semantic analysis) are applied to the feature vectors produced by the three weighting methods, giving 9 groups of similarity judgment results for Chinese agricultural web pages.

4) The accuracy, recall, and F1 measure of the 9 experiments are analyzed and compared. The results show that no single feature weighting algorithm has an absolute advantage in similarity judgment; each of the three has advantages and disadvantages under different similarity methods. Comparing the similarity judgment methods shows that latent semantic analysis gives the best results.

The MD5 method removed 41 web pages that completely duplicated other pages, and the combinations of weighting and similarity calculation methods were then studied on the remaining 1069 web pages. The experimental results show that latent semantic analysis combined with Boolean weighting gives the best similarity judgment for agricultural web pages, with an F1 comprehensive evaluation index of 90.1% and an accuracy of 93.7%.
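As a rough illustration of the pipeline described in the abstract, the sketch below removes exact duplicates by MD5 digest, builds Boolean or word-frequency weight vectors, and compares two pages with the cosine measure commonly used with the vector space model. It is a minimal sketch under stated assumptions, not the thesis's implementation: all function names are hypothetical, tokens are assumed to be already segmented (the thesis uses Paoding for that step), and the F1 helper assumes the reported "accuracy" plays the role of precision.

```python
import hashlib
import math
from collections import Counter


def md5_digest(text):
    """Return the hex MD5 digest of a page's text (UTF-8 encoded)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages):
    """Keep the first page for each distinct MD5 digest, preserving order."""
    seen = set()
    unique = []
    for page in pages:
        digest = md5_digest(page)
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def weight_vector(tokens, scheme="boolean"):
    """Map each feature word to a weight.

    'boolean' gives 1.0 to every word present; 'tf' gives the raw count.
    """
    counts = Counter(tokens)
    if scheme == "boolean":
        return {term: 1.0 for term in counts}
    if scheme == "tf":
        return {term: float(n) for term, n in counts.items()}
    raise ValueError("unknown weighting scheme: " + scheme)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

In use, pages would first pass through `drop_exact_duplicates`, and each pair of remaining pages would be scored with `cosine_similarity` on their weight vectors. The cutoff above which two pages count as near-duplicates is not given in the abstract, so it is left out here.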
Keywords/Search Tags: Chinese Agricultural Webpage, MD5, Vector Space Model, HowNet, Latent Semantic Analysis