Research On Duplicate Removal And Similarity Evaluation Of Chinese Agricultural Web Pages

Posted on:2015-04-19

Degree:Master

Type:Thesis

Country:China

Candidate:T Zhao

Full Text:PDF

GTID:2298330467474237

Subject:Agricultural Mechanization Engineering

Abstract/Summary:

With the rapid development of network information technology, the construction of agricultural information systems and the level of agricultural information services have improved greatly. The massive and repetitive agricultural information on the Internet is convenient for people engaged in agriculture, but it also makes it harder to find useful information quickly and accurately. How to effectively manage duplicate and near-duplicate agricultural web pages has become an important topic in research on agricultural vertical search engines. The main work of this paper includes the following aspects:

1) An in-depth study of the key technologies for duplicate removal and similarity judgment: web page preprocessing, web page text extraction, Chinese word segmentation, feature weighting algorithms, web page duplicate removal methods, text similarity calculation methods, and similarity evaluation criteria. Based on an agricultural web corpus, this paper focuses on duplicate removal techniques, feature weighting algorithms, and similarity calculation methods.

2) This paper studies the criteria for defining duplicate and near-duplicate Chinese agricultural web pages and builds a Chinese agricultural web corpus. A manually identified collection of web pages was built. The collection contains 225 page sets, each web page has 2 to 14 near-duplicate pages, and a total of 1110 web pages serve as the test set.

3) The web pages were preprocessed: exact duplicates were removed with the MD5 method, then text was extracted from the remaining pages, segmented into words with the Paoding segmenter, and stripped of stop words. Feature words were weighted with three methods: Boolean weighting, word frequency, and inverse document frequency. Finally, three similarity algorithms (the vector space model, HowNet-based semantic similarity, and latent semantic analysis) were applied to the feature vectors produced by each of the three weightings, yielding 9 groups of similarity judgment results for Chinese agricultural web pages.

4) The accuracy, recall, and F1 measure of the 9 experiments were analyzed and compared. The results show that no single feature weighting algorithm has an absolute advantage in similarity judgment; each of the three has advantages and disadvantages under different similarity methods. Comparing the similarity judgment methods shows that latent semantic analysis gives the best results.

The MD5 method removed 41 pages that were exact duplicates of other pages, and the combinations of weighting and similarity methods were studied on the remaining 1069 pages. The experimental results show that latent semantic analysis combined with Boolean weighting gives the best similarity judgment for agricultural web pages, with an F1 measure of 90.1% and an accuracy of 93.7%.
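The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, and every function name here is hypothetical. It makes several assumptions: tokens are produced by whitespace splitting rather than by the Paoding segmenter, the inverse document frequency weighting is read as a smoothed TF-IDF, and only the vector space model (cosine similarity) is shown, not the HowNet or latent semantic analysis methods.

```python
import hashlib
import math
from collections import Counter


def remove_exact_duplicates(texts):
    """Keep the first page for each distinct MD5 digest of its text."""
    seen, unique = set(), []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def weight_vectors(docs, scheme):
    """Turn tokenized documents into sparse {term: weight} vectors.

    scheme is "boolean", "tf", or "tfidf". The smoothed TF-IDF formula
    is an assumption; the abstract does not give the exact formula.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "boolean":
            vec = {t: 1.0 for t in tf}
        elif scheme == "tf":
            vec = {t: float(c) for t, c in tf.items()}
        elif scheme == "tfidf":
            vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                   for t, c in tf.items()}
        else:
            raise ValueError(f"unknown weighting scheme: {scheme}")
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def precision_recall_f1(predicted, actual):
    """Score predicted duplicate pairs against manually labeled pairs."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (invented for illustration, not taken from the thesis corpus).
pages = [
    "rice pest control guide",
    "rice pest control guide",   # exact duplicate, removed by MD5
    "rice pest control manual",  # near duplicate, left for cosine
    "tractor repair tips",
]
unique = remove_exact_duplicates(pages)
vectors = weight_vectors([p.split() for p in unique], "boolean")
similarity = cosine(vectors[0], vectors[1])
```

Deciding that two pages are near duplicates would additionally need a similarity threshold, which the abstract does not state, so none is assumed here.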

Keywords/Search Tags:

Chinese Agricultural Webpage, MD5, Vector Space Model, HowNet, Latent Semantic Analysis

Related items

1	Research On Classification Algorithm For Chinese Webpage
2	Study Of Tonic Traditional Chinese Medicine Classification Methods Based On Near Infrared Spectroscopy
3	Research Of Chinese Text Automatic Summarization Based On Conceptual Vector Space Model
4	Automatic Classification Research On Chinese Web Document Orientation
5	Audio Scene Recognition Based On Probabilistic Latent Semantic Analysis
6	Researching The Application Of Latent Semantic Index To Chinese Document Clustering
7	Research Of Latent Semantic Analysis Based On Paragraph
8	Design And Implementation Of Content-based Webpage Collection And Classification System
9	Improved Vector Space Model And Its Application To Document Classification System
10	Research And Implementation Of IIS Webpage Trojan Detection System Based On DOM Model