
The Research And Application Of Web Information Extraction Technology

Posted on: 2012-10-13, Degree: Master, Type: Thesis
Country: China, Candidate: H Qian, Full Text: PDF
GTID: 2218330338456215, Subject: Computer application technology
Abstract/Summary:
Today the whole world has become an "information village", and people need more and more information. How to obtain the needed information quickly and accurately has therefore become the focus of information extraction research. The Internet, as an important data source, faces the same problem: how to extract the needed information from a massive number of web pages. According to statistics, about 80% of the content on the Internet resides in the Hidden Web, that is, in online database systems. Existing search engines cannot crawl the data behind these web pages, so a tool is needed that can search and gather data from the Internet and also make the extracted data structured and standardized. Web information extraction technology arose and developed to meet this need.

After studying the existing Web information extraction methods, the author proposes two semi-automated methods: Web Information Extraction Based on Regular Expressions, and Web Information Extraction Based on Time-Frequency Weighted DOM. The first method mainly uses regular-expression operations, such as locate and replace, to match and normalize the format of HTML documents from common news websites. A DOM tree generation algorithm then produces a DOM tree. After users mark the information records, the system derives the selection rules. This method has very good time efficiency.

The second method builds on an existing information extraction method, Web Information Extraction Based on DOM. It transforms the HTML documents awaiting extraction into a DOM tree structure and then weights the DOM tree with time and frequency attributes. The time attribute values are calculated by the extraction-time formulas, and the frequency attribute values are obtained from feedback supplied by the modules that use the data.
This method takes extraction time into account during the extraction process, so that it can accommodate the differing timeliness requirements of multi-level management. It is also well suited to application developers who call the extracted data.
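
The two methods described above can be illustrated with a minimal sketch, which is not the thesis's implementation. It assumes regular expressions are used only to strip comments and script/style blocks, that a selection rule is a plain tag path taken from a node the user marked, and that a node's weight is a linear mix of hit frequency and extraction speed. The abstract does not give the extraction-time formulas, so `weight`, `alpha`, `hits`, and `extract_seconds` are hypothetical names chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Regex cleanup (first method): locate and replace noise before building the tree.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def clean_html(html):
    """Remove comments and script/style blocks with regular expressions."""
    return SCRIPT_STYLE.sub("", COMMENT.sub("", html))


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""
        self.extract_seconds = 0.0  # hypothetical time attribute
        self.hits = 0               # hypothetical frequency attribute

    def path(self):
        """Tag path from the document root, e.g. 'html/body/h1'."""
        parts, node = [], self
        while node.parent is not None:
            parts.append(node.tag)
            node = node.parent
        return "/".join(reversed(parts))


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def find(root, rule_path):
    """Apply a selection rule: return every node whose tag path matches."""
    return [n for n in walk(root) if n.path() == rule_path]


def weight(node, total_hits, alpha=0.5):
    """Assumed weighting: share of hits mixed with a speed term in (0, 1]."""
    frequency = node.hits / total_hits if total_hits else 0.0
    speed = 1.0 / (1.0 + node.extract_seconds)
    return alpha * frequency + (1.0 - alpha) * speed


sample = ("<html><body><!-- ad --><script>var x = 1;</script>"
          "<h1>Title</h1><p>Body</p></body></html>")
tree = build_tree(clean_html(sample))
print([n.text for n in find(tree, "html/body/h1")])
```

In this sketch the user's marked record is represented only by its tag path; a fuller rule would likely also record attributes or sibling positions, and the weights would be updated from real timing measurements and usage feedback.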
Keywords/Search Tags: Information Extraction, Time-frequency Weighted, Regular Expressions, Extracting Rules