Research And Implementation Of Text Duplication Check With Fuzzy Matching Algorithm In Cloud Computing Environment

Posted on:2019-01-28

Degree:Master

Type:Thesis

Country:China

Candidate:C L Li

GTID:2428330545490151

Subject:Computer technology

Abstract/Summary:

It has now become an indispensable technology in every field. In the project behind this work, the body of documents that the Ministry of Science and Technology needs to compare against is very large. According to statistics, nearly one hundred thousand application documents are submitted each year, totaling tens of millions of words, and the volume grows year by year. A traditional single-machine processing system therefore cannot meet the requirements. The system described here uses the distributed storage of cloud computing to hold the large volume of text data on a cloud platform, and relies on the efficiency of parallel computing in the cloud to check massive amounts of text quickly.

Exact string matching within text is common in both industry and academia, and is also used as a substring-based approximate matching operation. Although this method is simple and intuitive, its results have limitations. For example, character-level matching often fails to flag two texts that are evidently duplicates, mostly because meaningless stop words have been inserted, the subject-verb-object order has been rearranged, or other tricks have been used to evade the duplicate detection system. Traditional methods are therefore seriously challenged in terms of recall. Removing stop words or function words through word segmentation has gradually become the mainstream method. With this method, how to segment the text properly becomes the key to improving both recall and precision. The most popular approach at present segments and checks the text sentence by sentence. Because sentences vary in length, some of the results are poor in practice, yet they can have a large effect on the overall similarity ratio.

Building on word segmentation, this thesis proposes a comparison algorithm that constructs a matrix model and applies a matrix scanning strategy, converting the traditional text comparison operation into scanning and analysis of a matrix. The algorithm is integrated with MapReduce, and its processing capability is optimized and improved by exploiting highly efficient parallel computing. Its advantages over the Fast Exact String Matching Algorithm are also discussed. On this basis, a distributed text checking system based on Hadoop is designed and implemented, and the system has been applied to duplicate checking for science and technology projects. The matrix model used to analyze text fragments, together with the process design and implementation of the distributed system, offers some inspiration and practical value for research and development on problems of this kind.
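The abstract does not give the details of the matrix model, so the following is only a minimal single-machine sketch of one plausible reading: both texts are segmented into words with stop words removed, a boolean match matrix is built over the two word sequences, and the diagonals are scanned for runs of consecutive matches. The stop-word list, the whitespace tokenizer, the `min_run` threshold, and every function name are assumptions made for illustration; none of them come from the thesis, and a Chinese-language system would need a real word segmenter.

```python
from typing import List, Tuple

# Hypothetical stop-word list (assumption; the thesis does not specify one).
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "to", "and"}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens


def build_match_matrix(a: List[str], b: List[str]) -> List[List[bool]]:
    """matrix[i][j] is True when token i of `a` equals token j of `b`."""
    return [[x == y for y in b] for x in a]


def scan_diagonals(matrix: List[List[bool]], min_run: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start_row, start_col, length) for each diagonal run of
    consecutive matches that is at least `min_run` tokens long."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    # Every diagonal starts either in the first column or in the first row.
    starts = [(r, 0) for r in range(rows)] + [(0, c) for c in range(1, cols)]
    runs = []
    for start_row, start_col in starts:
        r, c, length = start_row, start_col, 0
        while r < rows and c < cols:
            if matrix[r][c]:
                length += 1
            else:
                if length >= min_run:
                    runs.append((r - length, c - length, length))
                length = 0
            r += 1
            c += 1
        if length >= min_run:  # run that reaches the edge of the matrix
            runs.append((r - length, c - length, length))
    return runs


def similarity(a_text: str, b_text: str, min_run: int = 3) -> float:
    """Fraction of tokens in `a_text` covered by some matched run."""
    a, b = tokenize(a_text), tokenize(b_text)
    if not a or not b:
        return 0.0
    covered = set()
    for row, _col, length in scan_diagonals(build_match_matrix(a, b), min_run):
        covered.update(range(row, row + length))
    return len(covered) / len(a)
```

Because the matrix is built after stop words are dropped, inserting filler words into a copied passage does not break the diagonal run. In a MapReduce setting, one natural split would be for each map task to compare the submitted text against one stored document and for the reduce step to aggregate the runs, but that division of work is likewise an assumption rather than the design described in the thesis.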

Keywords/Search Tags:

Cloud computing, Similarity, Matrix model, Duplicate checking

Related items

1	Research On Parallel Query And Checking Algorithm Of Formal Methods Checking Based On Cloud Computing Platform
2	Research And System Development Of Content Duplicate Checking In E-business Website Based On Semantics
3	Research And Implementation Of Temporal Logic Model Checking Algorithm Based On Cloud Computing Platform
4	Research On Model Checking Supportive Privacy Modeling Method In Cloud Computing
5	Efficient external-memory graph search for model checking
6	Study On Secure Outsourcing Schemes Of Matrix Computing In Cloud Computing Environment
7	Heavy Title Detection Of Text Classification And Similarity-based Study
8	Breadth First Search Based Model Checking
9	Research On Secure Outsourcing Schemes Of Large Matrix Computing In Cloud Computing
10	The Research On Outsourcing Computing For Large Matrix Multiplications In Cloud Computing