The Research And Application Of Web Information Extraction Technology

Posted on:2012-10-13

Degree:Master

Type:Thesis

Country:China

Candidate:H Qian

Full Text:PDF

GTID:2218330338456215

Subject:Computer application technology

Abstract/Summary:

Today the whole world has become an "information village", and people need more and more information. How to obtain the needed information quickly and accurately has therefore become the focus of information extraction research. As an important data source, the Internet faces the same problem: how to extract the needed information from a massive number of web pages. According to statistics, about 80% of the content on the Internet resides in the Hidden Web, that is, in online database systems. Existing search engines cannot crawl the data behind these pages, so a tool is needed that can search for and gather data from the Internet and turn the extracted data into a structured, standardized form. Web information extraction technology emerged and developed to meet this need.

After studying the existing Web information extraction methods, the author proposes two semi-automated methods: Web Information Extraction Based on Regular Expressions, and Web Information Extraction Based on Time-Frequency Weighted DOM. The first method mainly uses regular expression operations, such as locate and replace, to match and normalize the format of HTML documents from common news websites, and then uses a DOM tree generation algorithm to produce a DOM tree. After users mark the information records, the system derives the selection rules. This method has very good time efficiency.

The second method builds on an existing information extraction method, Web Information Extraction Based on DOM. It transforms the HTML documents to be extracted into a DOM tree structure and then weights the DOM tree with time and frequency attributes. The time attribute values are calculated by the extraction-time formulas, and the frequency attribute values are obtained from feedback from the modules that use the extracted data. This method takes extraction time into account during the extraction process, so that it can accommodate the differing timeliness requirements of multi-level management, and it is also well suited to application developers who call the extracted data.
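
The first method can be illustrated with a minimal Python sketch. It is not the thesis implementation: the cleanup patterns in CLEANUP_RULES, the Node and TreeBuilder classes, and the use of a plain tag path as the "selection rule" are all assumptions made for illustration. The sketch normalizes an HTML page with regular-expression replacements, builds a simple DOM-like tree, derives a rule from one record marked by the user, and applies that rule to the rest of the page.

```python
import re
from html.parser import HTMLParser

# Hypothetical cleanup rules: (pattern to locate, replacement text).
CLEANUP_RULES = [
    (re.compile(r"<!--.*?-->", re.S), ""),                              # comments
    (re.compile(r"<(script|style)\b.*?</\1\s*>", re.S | re.I), ""),     # scripts, styles
    (re.compile(r"\s+"), " "),                                          # collapse whitespace
]

def normalize(html):
    """Apply each locate-and-replace rule in order."""
    for pattern, replacement in CLEANUP_RULES:
        html = pattern.sub(replacement, html)
    return html.strip()

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def path(self):
        """Tag path from the document root, e.g. 'html/body/div/h1'."""
        parts, node = [], self
        while node is not None and node.tag != "#root":
            parts.append(node.tag)
            node = node.parent
        return "/".join(reversed(parts))

class TreeBuilder(HTMLParser):
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags without an end tag

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()

def build_tree(html):
    builder = TreeBuilder()
    builder.feed(normalize(html))
    builder.close()
    return builder.root

def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)

def learn_rule(root, marked_text):
    """Derive a selection rule (here: a tag path) from one user-marked record."""
    for node in iter_nodes(root):
        if node.text == marked_text:
            return node.path()
    return None

def extract(root, rule):
    """Return the text of every node whose tag path matches the rule."""
    return [n.text for n in iter_nodes(root)
            if n.tag != "#root" and n.text and n.path() == rule]

page = """<html><body><!-- ad --><script>var x = 1;</script>
<div><h1>First headline</h1><p>Body one</p></div>
<div><h1>Second headline</h1><p>Body two</p></div></body></html>"""

root = build_tree(page)
rule = learn_rule(root, "First headline")   # the user marks one record
print(rule, extract(root, rule))
```

A tag path is the simplest rule that generalizes from one marked record to its siblings; a real system would likely also use attributes or positions to tell apart records that share the same path.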

Keywords/Search Tags:

Information Extraction, Time-frequency Weighted, Regular Expressions, Extracting Rules

Related items

1	The Application And Research Of Regular Expression In Webpage Extraction
2	Research On Temporal Relation Between Time Expressions And Events In Chinese Language
3	The Design And Implementation Of Regular Expression Engines Based On Deterministic Finite Automata
4	The Properties And Regular Expressions Of Two Types Of Fuzzy Finite Tree Automata
5	Research On Regular Expression Matching Of Network Data Flow
6	The Research Of Web Information Extraction Technique And Application Based On NFA Regular Matching
7	Learning Resources. Virtual Learning Community The Automatic Generation System Design And Implementation
8	Research Of Compression Algorithm For Deep Packet Based On Regular Expressions
9	The Research And Implementation Of Web Information Extraction System Based On The Regular Expression
10	Mail Address Automatic Extraction System Based On Search Engine Secondary Development