
Research And Implementation Of Text Duplication Check With Fuzzy Matching Algorithm In Cloud Computing Environment

Posted on: 2019-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: C L Li
GTID: 2428330545490151
Subject: Computer technology
Abstract/Summary:
Cloud computing has become an indispensable high-tech tool in every field. In the project behind this work, the Ministry of Science and Technology must compare a very large volume of files. According to statistics, nearly one hundred thousand application materials are submitted each year, totaling tens of millions of words, and the data scale grows year by year. A traditional single-machine processing system therefore cannot meet the requirements. The system described here takes advantage of the distributed storage of cloud computing: it stores a large amount of text data on a cloud platform and then relies on the efficiency of parallel computing in the cloud to check massive amounts of text quickly. Exact string querying in text is a common operation in industry and academia, and it also underlies substring-based approximate matching. Although this method is simple and intuitive, its results are limited. For example, character matching may fail to detect two texts that are evidently duplicates, mostly because meaningless stop words have been inserted, the subject-verb-object (SVO) order has been reversed, or other tricks have been used to evade duplicate detection systems. The traditional methods are therefore seriously challenged in terms of recall. Removing stop words or function words through word segmentation has gradually become the mainstream method. With this method, how to segment the text properly becomes the key to improving recall and precision. The most popular approach at present segments and checks by sentence. Because sentences vary in length, some of the results are poor in practice, yet they have a large effect on the overall similarity ratio.
Building on word segmentation, this thesis proposes a comparison algorithm that constructs a matrix model and applies a matrix scanning strategy, converting the traditional text comparison operation into scanning and analysis of the matrix. The algorithm is implemented with MapReduce, and its processing capability is optimized and improved by exploiting highly efficient parallel computing. The advantages of the algorithm over the Fast Exact String Matching Algorithm are also discussed. On this basis, a distributed text checking system based on Hadoop is designed and implemented, and the system has been applied to duplicate checking of science and technology project texts. The matrix model used to analyze text fragments, together with the process design and implementation of the distributed system, offers some inspiration and practical value for research and development on problems of this kind.
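The abstract does not give the details of the matrix model, so the following is only a minimal single-machine sketch of one plausible reading: the two texts are segmented into words, stop words are dropped, a Boolean match matrix is built, and its diagonals are scanned for runs of consecutive matches. The stop-word list, the whitespace tokenizer, the `min_len` threshold, the similarity definition, and all function names are assumptions made for illustration and are not taken from the thesis.

```python
from typing import List, Tuple

# Hypothetical stop-word list; the thesis does not specify one.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "to", "and"}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOPWORDS]


def match_matrix(a: List[str], b: List[str]) -> List[List[bool]]:
    """Cell [i][j] is True when token i of `a` equals token j of `b`."""
    return [[x == y for y in b] for x in a]


def diagonal_runs(matrix: List[List[bool]], min_len: int = 3) -> List[Tuple[int, int, int]]:
    """Scan every diagonal for runs of consecutive True cells.

    Returns (start_in_a, start_in_b, length) for each run of at least
    `min_len` cells; each run is a shared sequence of tokens.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    runs = []
    for offset in range(-(rows - 1), cols):
        i, j = max(0, -offset), max(0, offset)
        run = 0
        while i < rows and j < cols:
            if matrix[i][j]:
                run += 1
            else:
                if run >= min_len:
                    runs.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # run that reaches the matrix edge
            runs.append((i - run, j - run, run))
    return runs


def similarity(text_a: str, text_b: str, min_len: int = 3) -> float:
    """Fraction of tokens in `text_a` covered by a shared run (assumed metric)."""
    a, b = tokenize(text_a), tokenize(text_b)
    if not a or not b:
        return 0.0
    covered = set()
    for start_a, _, length in diagonal_runs(match_matrix(a, b), min_len):
        covered.update(range(start_a, start_a + length))
    return len(covered) / len(a)
```

Because stop words are removed before the matrix is built, inserting filler words into a copied passage does not break the diagonal run. In a MapReduce setting, each pair of text fragments could plausibly be handled by an independent map task, but that partitioning is likewise an assumption and is not described in the abstract.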
Keywords/Search Tags:Cloud computing, Similarity, Matrix model, Duplicate checking