
The Application And Research Of Regular Expression In Webpage Extraction

Posted on: 2015-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Zuo
Full Text: PDF
GTID: 2268330428982818
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development and spread of the Internet, people are increasingly accustomed to obtaining information online through a variety of terminals (PCs, tablets, phones, etc.). The Web is a huge repository of valuable information. Web information extraction studies how to accurately extract the information a user needs from Web pages and convert it into structured form, for example database records that are convenient for statistical analysis. Based on regular expression techniques, this thesis presents a solution for automatically extracting information from websites, using paper collection from Google Scholar and lottery analysis on Okooo.com as case studies. Beyond implementing the basic Web page extraction functions with a regular expression NFA engine, the thesis also analyzes and compares two approaches: optimizing the expressions for the NFA engine, and using the NFA engine in conjunction with a DFA engine. The solution has two steps. First, the tool RegexBuddy 3 is used to debug and optimize the regular expressions. Second, on the .NET platform, code using the tested regular expressions reads the Web page source files, matches and extracts the fields, and stores them in an Oracle database. The method can browse the target website automatically and read pages in batches, extracts records and fields with high accuracy, and supports HTML tag filtering and a variety of data collection tasks.
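The extract, filter, and store pipeline described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses Python rather than .NET, an in-memory SQLite database as a stand-in for Oracle, and a made-up HTML layout (`div class="paper"`), pattern, and table name (`papers`) that are assumptions for illustration only. Python's `re` module is a backtracking engine of the kind usually classed as NFA-style, so lazy quantifiers and named groups behave much as they would in the .NET engine.

```python
import re
import sqlite3
from html import unescape

# Hypothetical markup: one <div class="paper"> per record. Real pages differ.
SAMPLE_HTML = """
<div class="paper"><h3><a href="/p/1">Regex <b>Engines</b> &amp; Matching</a></h3><span class="year">2012</span></div>
<div class="paper"><h3><a href="/p/2">Web Data Extraction</a></h3><span class="year">2014</span></div>
"""

# One pattern per record; named groups mark the fields to extract.
# The lazy .*? stops at the first closing </a></h3> instead of the last one.
RECORD_RE = re.compile(
    r'<div class="paper">\s*<h3><a href="(?P<url>[^"]*)">(?P<title>.*?)</a></h3>'
    r'\s*<span class="year">(?P<year>\d{4})</span>',
    re.DOTALL,
)

# Matches any HTML tag so it can be filtered out of an extracted field.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove HTML tags and decode entities such as &amp;."""
    return unescape(TAG_RE.sub("", fragment)).strip()


def extract_records(html):
    """Return (url, title, year) tuples for every record found in the page."""
    return [
        (m["url"], strip_tags(m["title"]), int(m["year"]))
        for m in RECORD_RE.finditer(html)
    ]


def store(records, conn):
    """Insert the extracted records into a (hypothetical) papers table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers (url TEXT, title TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", records)
    conn.commit()


records = extract_records(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")  # stand-in for the Oracle database
store(records, conn)
```

In this sketch the character class `[^"]*` for the URL cannot backtrack past the closing quote, which is one common way to keep a backtracking engine from doing unnecessary work; the thesis's own optimizations are not detailed in the abstract.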
Keywords/Search Tags: Regular Expressions, Web information collection, lottery, Scholar Google, Information Extraction