Font Size: a A A

Research On Matching Method Of Uncertain String

Posted on:2019-03-09Degree:MasterType:Thesis
Country:ChinaCandidate:D WangFull Text:PDF
GTID:2428330566988982Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With a given text T and a pattern P,string matching refers to the procedure designed to find all the substrings identical to P in T.The application of string matching involves various subjects including bioinformatics,text editing,pattern recognition,natural language processing,search engine,etc.The rapid growth of Internet technologies has resulted in diversified sources of statistics,which leads to the generation of uncertain data.This paper focuses on the problem of uncertain string matching in the field of bioinformatics,and the research content is as follows.Firstly,a deep analysis is adopted on the current methodologies,and problems like long-time and large-scaled construction of index are discovered.Secondly,in view of the fact that there are few character types in bioinformatics,a index named USAL with space cost of character type multiplied by string length is proposed in this paper.Concepts of positive and negative character are proposed in USAL.Based on the positive character,an effective string matching algorithm,namely GUSM,is proposed combined with the Greedy algorithm and Range Maximum Query.The probability of each character occurrence in each position is recorded by USAL,therefore,when matching strings,the position of the positive character with the largest probability will be returned in the form of time O(1)by RMQ according to the given probability threshold before being verified.At last,results that meet the threshold will be returned.In order to improve the filtering effect of positive character position,filtering strategies based on minimum probability character and randomly-selected character are proposed.Thirdly,to enhance the process of matching,a small-scaled index named BI is adopted on the foundation of bitmap thought.The application of BI allows the probability of positons in USAL to be bitmapped,which promoted another highly-efficient string matching algorithm named BUSM.When matching strings,BUSM will turn the matching operation into bit manipulation via BI,and then verify the results through probability threshold before returning the final correct answer.To improve the efficiency of BI in the period of verifying,MBI is then introduced to decrease the size of the candidate results,which further upgrades the algorithm.Finally,experiments on real and synthetic datasets with different features are carried out to verify the efficientcy of the proposed algorithms.
Keywords/Search Tags:uncertain string, index construction, filtering strategy, substring matching
PDF Full Text Request
Related items