Font Size: a A A

Identification Of Regular Expressions Based On Continuous Repeated Substring Left Alignment

Posted on:2020-03-15Degree:MasterType:Thesis
Country:ChinaCandidate:G LinFull Text:PDF
GTID:2415330590963047Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The inference of formal language is devoted to the study of how to derive the definition of the language from the limited information.Regular language is a class of formal language widely used.The inference algorithm for regular expressions is widely used in gene sequence identification,XML schema inference,graph database path query learning,information extraction,etc.Therefore,research on the inference algorithm for regular expressions not only has important theoretical significance,but also has great practical application value.Identification in the limit is a classical model of language learning.Studying regular expressions inference algorithms in this model ensures the good characteristics of the algorithm.This paper studies regular expressions learning algorithms based on this model.The contributions are listed as follows.(1)This paper has proposed a learning framework of inference regular expressions based on continuous repeated substring detection.Firstly,the framework seeks for the longest continuous repeated substring.Secondly,the framework blocks based on continuous repeated substring.Thirdly,the framework matches the blocks leftmost.Finally,the framework generalizes the results to regular expressions.Studying regular expressions learning algorithms in this framework can identify a class of regular expressions for which unary operator can act on multiplied characters.This overcomes the limitation of most existing algorithms.(2)This paper has proposed two generalization strategies and implemented two algorithms for standard regular expressions and regular expressions with counting respectively.This paper has analyzed the subclass of standard regular expressions and the subclass of regular expressions with counting that can be identified from the two algorithms.This paper also has provided the construction method of characteristic sample for each algorithm.In order to verify the theoretical analysis results,this paper has developed a set of tools for the generation of expressions and their characteristic sample.The experimental results verified the correctness of the theoretical analysis.This paper has showed the advantages of the algorithms by comparing to the existing algorithm.(3)This paper has discussed the application of the algorithm in graph database query learning and analyzed that for a class of reachability query,the path constraint which is defined by regular expressions is consistent with the subclass of regular expressions with counting as proposed in this paper.This paper has discussed the general steps of reachability query learning and investigated the application of the algorithm in reachability query learning by examples.
Keywords/Search Tags:regular expressions, learning, identification in the limit, continuous repeated substring, reachability query
PDF Full Text Request
Related items