Research And Implementation Of Duplicate Checking System Under Internet Environment

Posted on:2019-07-09

Degree:Master

Type:Thesis

Country:China

Candidate:G Li

Full Text:PDF

GTID:2348330542472646

Subject:Engineering

Abstract/Summary:

With the continuous development of information technology, users have become accustomed to accessing information through the Internet. The Internet has brought convenience, but it has also raised many problems that remain to be solved. Addressing the problems of information filtering, retrieval, and content plagiarism on the Internet, this thesis combines information retrieval and text mining techniques to build a network-based Chinese duplicate checking system. The system operates in a network environment and collects web data from the Internet as the comparison library for duplicate checking. Building on research into data mining algorithms, the algorithms are optimized using ideas from genetic algorithms and then applied to text mining. The research content of this thesis can be divided into the following aspects:

1) The system architecture is built by combining a network information retrieval system with a duplicate checking retrieval system. It pairs the duplicate checking system's lack of a limit on the number of query words with the large-scale, real-time data of the network information retrieval system.

2) A similarity comparison model is proposed that clearly defines the document similarity comparison process: the documents to be compared are segmented, feature items are extracted and represented as vectors in a vector space, and text similarity is calculated as the cosine of the angle between the vectors. The comparison process is divided into two steps, preliminary comparison and detailed comparison. Documents are first screened for similarity in the preliminary comparison, and the similar documents are then compared in detail, which makes it possible to handle one-to-many similarity cases.

3) Data mining algorithms are applied to the text domain, and text mining is carried out on the document database. After feature extraction and text representation, text can be processed in mathematical form. Feature extraction, text clustering, and text classification are then used to mine the hidden features of the text, and these features are applied to the storage and querying of the document database to improve system performance.

4) The techniques and algorithms used in text mining are optimized using ideas from genetic algorithms and their strong ability to search the solution space. In text feature extraction, feature words are extracted by a genetic algorithm to reduce the influence of noise. In text clustering, a genetic algorithm is used to optimize the initial cluster centers and improve the clustering result. In text classification, the semantic mining and classification algorithms are each optimized.

5) Based on the research into each part of the system architecture, a Chinese duplicate checking system for the network environment is implemented. The interactive part of the system follows a user-centered, simple, and friendly design. Uploading a query document is easy, the returned results are distinguished by color, and the URLs of similar texts are attached.
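
The similarity comparison described in aspect 2) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes documents are already segmented into tokens, uses raw term frequencies as feature weights, and picks two arbitrary thresholds. The names term_vector, cosine_similarity, find_similar, coarse_threshold and fine_threshold, and the example URLs, are all hypothetical.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Represent a list of tokens as a sparse term-frequency vector."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar(query_segments, library, coarse_threshold=0.3, fine_threshold=0.8):
    """Two-step comparison (assumed thresholds, not taken from the thesis).

    query_segments: list of token lists, one per segment of the query document.
    library: dict mapping a source URL to its list of token lists.
    Returns (query_segment_index, url, library_segment_index, score) tuples.
    """
    query_doc = term_vector([t for seg in query_segments for t in seg])
    matches = []
    for url, segments in library.items():
        doc_vec = term_vector([t for seg in segments for t in seg])
        # Preliminary comparison: skip documents that are dissimilar as a whole.
        if cosine_similarity(query_doc, doc_vec) < coarse_threshold:
            continue
        # Detailed comparison: compare segment by segment, so one query
        # segment may match segments from several library documents.
        for i, q_seg in enumerate(query_segments):
            q_vec = term_vector(q_seg)
            for j, seg in enumerate(segments):
                score = cosine_similarity(q_vec, term_vector(seg))
                if score >= fine_threshold:
                    matches.append((i, url, j, score))
    return matches
```

Each returned tuple carries the source URL, which is the kind of information the system is described as attaching to its results. A real system would likely replace raw term frequencies with weighted feature items and use an index instead of scanning the whole library.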

Keywords/Search Tags:

Search Engines, Web Crawler, Natural Language Processing, Text Mining, Clustering, Index, Genetic Algorithm, Similarity

Related items

1	Research And Application Of Web Text Mining Based On Crawler
2	The Focused Crawler Based On URL And Context
3	Research On Text Clustering Algorithm Based On Word Frequency And Semantic
4	The Research And Implementation Of Topical Web Crawler Based On Improved Shark-Search Algorithm
5	Text Similarity Analysis Technology Based On Deep Learning And Its Application In Auxiliary Decision-making Of HIA
6	Research On Full-featured Text Search In Natural Language Understanding
7	Research And Implementation Of Intelligent QA Enhancement System For Vertical Domain
8	Research On Text Similarity Based On Bert
9	Design And Implementation Of Clickbait News Detect System Based On Natural Language Processing
10	An Improved K-Means Algorithm And Its Application In Bidding Data Analysis