
Research And Implementation Of Duplicate Checking System Under Internet Environment

Posted on: 2019-07-09; Degree: Master; Type: Thesis
Country: China; Candidate: G Li; Full Text: PDF
GTID: 2348330542472646; Subject: Engineering
Abstract/Summary:
With the continuous development of information technology, users have become accustomed to accessing information through the Internet. The Internet has brought convenience, but it has also brought many problems that remain to be solved. Addressing the problems of information filtering, retrieval, and content plagiarism on the Internet, this thesis combines information retrieval and text mining techniques to build a network-based duplicate checking system for Chinese text. The system operates in the network environment and collects web data from the Internet as the comparison library for duplicate checking. Building on research into data mining algorithms, the algorithms are optimized using ideas from genetic algorithms and then applied to text mining. The research content of this thesis can be divided into the following aspects:
1) The system architecture is constructed by combining a web information retrieval system with a duplicate checking retrieval system. It pairs the duplicate checking system's lack of any limit on the number of query words with the large-scale, real-time data of the web information retrieval system.
2) A similarity comparison model is proposed that clarifies the process of document similarity comparison. The documents to be compared are segmented, feature items are extracted and represented as vectors in a vector space, and text similarity is calculated as the cosine of the angle between the vectors. The comparison is divided into two steps: preliminary comparison and detailed comparison. Similar documents are first identified in the preliminary comparison and then compared in detail, which allows the system to handle one-to-many similarity cases.
3) Data mining algorithms are applied to the text domain, and text mining is carried out on the document database. After feature extraction and text representation, the text can be processed in mathematical form. Feature extraction, text clustering, and text classification are then used to mine the hidden features of the text, and the results are applied to the storage and querying of the document database to improve system performance.
4) The techniques and algorithms used in text mining are optimized using ideas from genetic algorithms, drawing on their strong search capability over the solution space. In text feature extraction, feature words are selected by a genetic algorithm to reduce the influence of noise. In text clustering, a genetic algorithm is used to optimize the initial cluster centers and improve the clustering result. In text classification, the semantic mining and classification algorithms are each optimized.
5) Based on the research into each part of the system architecture, a Chinese duplicate checking system for the network environment is implemented. The interactive part of the system is user-centered and follows a simple, friendly design concept. Uploading a query document is easy for the user, the returned results are distinguished by color, and the URLs of similar texts are attached.
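
The two-step comparison described in aspect 2) can be illustrated with a minimal sketch. It is not the thesis implementation: character bigrams stand in for Chinese word segmentation and feature extraction, raw term counts stand in for whatever feature weighting the thesis uses, and the function names (`to_vector`, `cosine`, `check_duplicates`) and both thresholds are hypothetical choices made only for this example.

```python
import math
import re
from collections import Counter


def char_bigrams(text):
    """Split text into overlapping character bigrams.

    Assumption: bigrams replace a real Chinese word segmenter here.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def to_vector(text):
    """Represent text as a sparse term-frequency vector (term -> count)."""
    return Counter(char_bigrams(text))


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_sentences(text):
    """Split on Chinese and Western sentence terminators."""
    return [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]


def check_duplicates(query, library, doc_threshold=0.3, sent_threshold=0.8):
    """Two-step comparison of a query document against a library.

    library maps a source URL to its text. Both thresholds are
    illustrative values, not figures taken from the thesis.
    Returns a list of (query_sentence, url, matched_sentence, score).
    """
    # Step 1, preliminary comparison: keep only library documents whose
    # whole-document vector is close enough to the query's vector.
    query_vec = to_vector(query)
    candidates = [
        url for url, text in library.items()
        if cosine(query_vec, to_vector(text)) >= doc_threshold
    ]

    # Step 2, detailed comparison: compare sentence by sentence against
    # every candidate, so one query can match several sources.
    matches = []
    for q_sent in split_sentences(query):
        q_vec = to_vector(q_sent)
        for url in candidates:
            for d_sent in split_sentences(library[url]):
                score = cosine(q_vec, to_vector(d_sent))
                if score >= sent_threshold:
                    matches.append((q_sent, url, d_sent, score))
    return matches
```

Calling `check_duplicates(query_text, {url: page_text, ...})` returns each matched query sentence together with the URL of the similar source, which mirrors the way the system described above attaches the URL of similar text to its results. The preliminary step exists to keep the more expensive sentence-level pass limited to a small candidate set.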
Keywords/Search Tags:Search Engines, Web Crawler, Natural Language Processing, Text Mining, Clustering, Index, Genetic Algorithm, Similarity