Research And System Realization Of Key Technology Of Information Extraction Optimization

Posted on: 2020-02-08  Degree: Master  Type: Thesis
Country: China  Candidate: B Huang
GTID: 2428330572473591  Subject: Computer technology
Abstract/Summary:
With the rapid development of big data, enterprises produce large volumes of valuable data and information in their daily operations and informatization. How to extract and analyze the truly useful information quickly and accurately from massive, dispersed data is therefore an important research topic in data mining, and text information extraction is one of the field's core problems. In application scenarios where the semantic structure is explicit, rule-based information extraction performs well in both accuracy and recall. For large-scale data, the key to improving the efficiency of an information extraction system is to speed up regular expression matching.

In this context, this thesis makes an in-depth study of information extraction based on regular expression matching. After comparing and analyzing several classical algorithms for accelerating regular expression matching, and targeting the problems of the jump lookup table in the original DFA algorithm, the thesis proposes a compression scheme based on a character-grouping lookup table, which improves the matching speed of regular expressions. An information extraction system based on this optimization is then designed and implemented on the laboratory's FPGA hardware platform.

The thesis first introduces the main tasks, common methods, and evaluation criteria of information extraction systems, and then the common methods and research status of regular expression matching. Next, by analyzing the technical bottleneck of existing regular expression matching techniques, it proposes a regular expression matching optimization algorithm based on character grouping, and tests and analyzes the algorithm's performance. The experimental results show that, compared with the original lookup table structure, space usage is reduced by about 30% and the average time spent matching a single character is shortened by more than 50%.

Based on this optimization algorithm, the information extraction system is designed and implemented. Taking judicial judgment documents, penalty documents of the Ministry of Environmental Protection, and the SFC's penalty documents as examples, the system extracts the main information from the text and stores it in a database. The function and performance of the system are then tested. The experimental results show that the proposed method achieves high accuracy and recall on normal data and improves the extraction performance of this kind of system to a certain extent.
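The abstract does not spell out the character-grouping scheme, so the following is only a minimal Python sketch of one common way to group characters: byte values whose columns in the DFA transition table are identical share a single column in a compact table. The function names, the toy digit-matching DFA, and the 256-symbol alphabet are illustrative assumptions, not the thesis's implementation, and the sketch does not reproduce the reported space or time figures.

```python
from typing import Dict, List, Set, Tuple

ALPHABET_SIZE = 256  # assumption: the DFA consumes one byte per step


def group_characters(table: List[List[int]]) -> Tuple[List[int], List[List[int]]]:
    """Merge byte values that have identical columns in a dense DFA table.

    table[state][byte] is the next state. Returns (char_to_group, compact)
    such that compact[state][char_to_group[byte]] == table[state][byte].
    """
    column_to_group: Dict[Tuple[int, ...], int] = {}
    char_to_group: List[int] = []
    for byte in range(ALPHABET_SIZE):
        # The column of a byte is its next state from every source state.
        column = tuple(row[byte] for row in table)
        group = column_to_group.setdefault(column, len(column_to_group))
        char_to_group.append(group)

    compact = [[0] * len(column_to_group) for _ in table]
    for byte, group in enumerate(char_to_group):
        for state, row in enumerate(table):
            compact[state][group] = row[byte]
    return char_to_group, compact


def build_digit_dfa() -> Tuple[List[List[int]], Set[int]]:
    """Hypothetical toy DFA accepting one or more ASCII digits."""
    dead = 2
    table = [[dead] * ALPHABET_SIZE for _ in range(3)]
    for byte in b"0123456789":
        table[0][byte] = 1  # start state: first digit
        table[1][byte] = 1  # accepting state: further digits
    return table, {1}


def matches(data: bytes, char_to_group: List[int], compact: List[List[int]],
            accepting: Set[int], start: int = 0) -> bool:
    """Run the DFA over data using the grouped (compact) table."""
    state = start
    for byte in data:
        state = compact[state][char_to_group[byte]]
    return state in accepting


dense_table, accepting_states = build_digit_dfa()
char_to_group, compact_table = group_characters(dense_table)
print(len(dense_table[0]), "columns before,", len(compact_table[0]), "after")
print(matches(b"2020", char_to_group, compact_table, accepting_states))
```

In this toy case the 256 columns collapse to 2 (digits and everything else), at the cost of one extra index lookup per input byte. How much a real rule set compresses depends on how many distinct columns its DFA has.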
Keywords/Search Tags: information extraction, regular expression, DFA, character group