Mail Address Automatic Extraction System Based On Search Engine Secondary Development

Posted on: 2014-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: R Liu
Full Text: PDF
GTID: 2268330401488378
Subject: Electronic and communication engineering
Abstract/Summary:
Information extraction has become an active research topic, and search engines suffer from the so-called "data rich, information poor" problem: they return a large volume of data but little directly usable information. Combining the two is therefore both interesting and of practical value. This thesis combines search engines with information extraction techniques to develop an email address extraction system based on search engines. The system effectively addresses several problems common to email address extraction software, such as low accuracy of results, limited user control, and the same addresses being extracted again in successive runs. The main work and contributions of this thesis are as follows.

First, URL splicing is used to call the major search engines and retrieve the data they return as the source data. The user submits the search keywords and the results page from which extraction should start; the URL of the first results page is built from the structure of the search engine's result URLs, and the URLs of the following pages are built by extending it. In contrast to previous studies, this thesis extracts the result pages automatically, that is, it obtains the address behind the "next page" link. Furthermore, to give users more choice, the system lets them limit the range of web pages to be processed as needed.

Second, htmlparser is used to parse the HTML pages, and a regular expression is used to extract the email addresses.
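
The URL splicing and regular-expression steps described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the thesis's implementation: it uses Python's standard `html.parser` module in place of the htmlparser library named above, and the endpoint `search.example.com`, the query parameters `q` and `start`, the page size of 10, and all function and class names are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical endpoint and page size; real search engines use different
# hosts and parameter names.
SEARCH_BASE = "https://search.example.com/s"
RESULTS_PER_PAGE = 10


def build_result_urls(keyword, start_page, page_count):
    """Splice the keyword and a result offset into one URL per results page."""
    urls = []
    for page in range(start_page, start_page + page_count):
        query = urlencode({"q": keyword, "start": (page - 1) * RESULTS_PER_PAGE})
        urls.append(f"{SEARCH_BASE}?{query}")
    return urls


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML page, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Simplified pattern: it does not cover every valid address form.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(html):
    """Parse the page, then run the regular expression over its text."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return EMAIL_RE.findall(" ".join(parser.chunks))
```

Generating the page URLs from an offset is one way to get the effect of following the "next page" link without locating that link in the returned HTML; whether the thesis builds or follows the link is not stated here.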
To obtain more comprehensive information, this thesis also uses htmlparser to follow the internal URL links of a web page and extract addresses from the linked pages. The user can select, as needed, the link depth down to which pages are processed.

Third, to further improve user choice, the user can filter the final results by mail server domain name (such as 163.com, 126.com, edu.cn, and so on). In addition, so that addresses already extracted are not extracted again the next time, the results are saved in an Access database. The extraction results can also be saved manually in text file format.

Finally, the system was tested, the problems that arose were fixed, and the results were analyzed and evaluated. The system proved stable when working for 15 hours (from 8:00 to 23:00), which is sufficient to meet actual needs, and both recall and accuracy exceeded 94 percent.
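
The domain filter and the duplicate check described above can be sketched as follows. This is an illustrative Python sketch, not the thesis's code: SQLite (via the standard `sqlite3` module) stands in for the Access database, and the table name `emails` and the function names are hypothetical.

```python
import sqlite3


def filter_by_domain(addresses, allowed_domains):
    """Keep addresses whose domain equals, or is a subdomain of, an allowed one."""
    allowed = [d.lower() for d in allowed_domains]
    kept = []
    for address in addresses:
        domain = address.rsplit("@", 1)[-1].lower()
        if any(domain == d or domain.endswith("." + d) for d in allowed):
            kept.append(address)
    return kept


def open_store(path=":memory:"):
    """Open the result store; SQLite is an assumption standing in for Access."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS emails (address TEXT PRIMARY KEY)")
    return conn


def save_new(conn, addresses):
    """Store addresses not seen before and return only those new ones."""
    new = []
    for address in addresses:
        key = address.lower()  # treat addresses as case-insensitive
        seen = conn.execute(
            "SELECT 1 FROM emails WHERE address = ?", (key,)
        ).fetchone()
        if seen is None:
            conn.execute("INSERT INTO emails (address) VALUES (?)", (key,))
            new.append(key)
    conn.commit()
    return new
```

Because the store persists between runs when a file path is given, a second extraction over the same pages returns only addresses that were not saved earlier.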
Keywords/Search Tags: search engines, email address extraction, HTMLParser, regular expressions