Semi-supervised BLOG Information Extraction Techniques Based On Document Structure

Posted on:2010-11-15

Degree:Master

Type:Thesis

Country:China

Candidate:B Li

Full Text:PDF

GTID:2178360332957858

Subject:Computer Science and Technology

Abstract/Summary:


As Blog research in areas such as topic detection, community discovery, and vertical search engines advances, the demand for structured Blog data is growing. Traditional Web information extraction techniques, however, do not work effectively on rich and flexible Blog data, so research on Blog-specific information extraction is urgently needed.

This paper first analyzes Blog pages in depth and finds that a page always contains both structural and semantic information. Based on these features, it presents a Blog data format that converts the original Blog data into text values and path patterns, which facilitates information extraction. The paper also finds that every Blog page is template-based, modular, and personalized, which causes the page structure to vary. Information extraction therefore faces the difficulty that the data source has no unified format. To address this, the paper proposes a Blog page blocking algorithm based on the sub-tree similarity of tag nodes (denoted BPS-BSS). The algorithm uses hierarchical clustering to cluster tag nodes, filter out tag nodes, and extract Blog modules. Once the Blog modules have been extracted, information extraction algorithms only need to extract information within each module. Experiments show that the algorithm has high accuracy and low time complexity.

Because Blog data usually carries the complete semantic information of its module, the paper then presents an ontology-based information extraction algorithm that operates on the extracted modules. The algorithm first establishes module concepts and Blog concepts; each Blog concept contains several sub-information concepts, and each sub-information concept contains several data attributes. An inductive learning algorithm obtains the data properties from labeled pages, and these properties are then used to generate information extraction rules. Experiments show that this algorithm improves both the extraction rate and the extraction accuracy, because extraction is performed within a module.

Building on this work, the paper designs and implements an experimental prototype system for Blog information extraction. The system comprises an asynchronous web crawler, Blog page segmentation, extraction rule generation, and the information extraction algorithm, and it can serve as a basic platform for related research and experiments on information extraction.
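To make the two structural ideas in the abstract more concrete, the following is a minimal Python sketch. It is not the BPS-BSS algorithm itself, whose details the abstract does not give. It only illustrates (a) turning a page into (path pattern, text value) pairs and (b) grouping sibling tag nodes whose sub-trees look alike. The similarity measure (Jaccard overlap of relative tag paths), the single-linkage merging, the 0.6 threshold, the sample HTML, and all function names are assumptions made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

# Hypothetical sample page: two post-like blocks and one sidebar-like list.
SAMPLE = ("<div><div><h2>Post A</h2><p>Body A</p></div>"
          "<div><h2>Post B</h2><p>Body B</p></div>"
          "<ul><li>Link</li></ul></div>")


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a minimal tag tree; assumes properly nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:          # void tags never get children
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.current.text = (self.current.text + " " + stripped).strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def path_text_pairs(node, prefix=""):
    """Yield (path pattern, text value) for every node that carries text."""
    path = f"{prefix}/{node.tag}"
    if node.text:
        yield path, node.text
    for child in node.children:
        yield from path_text_pairs(child, path)


def subtree_paths(node, prefix=""):
    """Relative tag paths inside a sub-tree, used as its structural signature."""
    path = f"{prefix}/{node.tag}"
    paths = {path}
    for child in node.children:
        paths |= subtree_paths(child, path)
    return paths


def similarity(a, b):
    """Jaccard overlap of the two sub-trees' tag paths (an assumed measure)."""
    pa, pb = subtree_paths(a), subtree_paths(b)
    return len(pa & pb) / len(pa | pb)


def cluster_siblings(nodes, threshold=0.6):
    """Single-linkage agglomerative clustering: merge until no pair is similar."""
    clusters = [[n] for n in nodes]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(similarity(a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters[j])
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Calling `list(path_text_pairs(parse(SAMPLE)))` lists each text value together with its tag path, and `cluster_siblings(parse(SAMPLE).children[0].children)` groups the two post-like blocks apart from the list. A real page would need a more tolerant parser and a tuned threshold; both are left open here.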

Keywords/Search Tags:

Web page segmentation, module extraction, Blog information extraction, domain ontology


Related items

1	Semi-supervised BLOG Information Extraction Techniques Based On Document Structure
2	An Ontology-based Domain Information Collection And Its Application
3	Adaptive Web Information Extraction Method Research Based On Ontology
4	Ontology-Based Structured Information Extraction From Web Pages
5	Domain Ontology-based Web Information Extraction Technology
6	A Research On Chinese Information Extraction Based On Construction Of Domain Ontology
7	Research On Web Information Extraction Based On Domain Knowledge
8	Construction And Implementation Of Domain Ontology Based On Plain Text
9	The Design And Implementation Of Content Filtration Model Based On Domain Ontology
10	Research On Key Technologies Of Ontology Construction Based On WordNet And Its Application In Security Domain