
Semi-supervised BLOG Information Extraction Techniques Based On Document Structure

Posted on: 2010-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: B Li
GTID: 2178360332957858
Subject: Computer Science and Technology
Abstract/Summary:
As Blog research on topic detection, community discovery, and vertical search engines advances, the demand for structured Blog data grows. Traditional Web information extraction techniques do not work effectively on rich and flexible Blog data, so research on Blog-specific information extraction is urgently needed.

This paper first analyzes Blog pages in depth and finds that a page always contains both structural and semantic information. Based on these features, it presents a Blog data format that converts the original Blog data into text values and path patterns, which facilitates information extraction. It also finds that Blog pages are template-based, modular, and personalized, and that these features cause the page structure to vary. Information extraction therefore faces the difficulty that the data sources do not share a unified format. To address this, the paper proposes a Blog page blocking algorithm based on the sub-tree similarity of tag nodes (denoted BPS-BSS). The algorithm uses hierarchical clustering to cluster tag nodes, filters out tag nodes, and extracts Blog modules. Once the Blog modules have been extracted, information extraction algorithms only need to operate within each module. Experiments show that the algorithm has high accuracy and low time complexity.

After the Blog modules are extracted from a Blog page, and because Blog data usually carries the integrated semantic information of its module, this paper presents an ontology-based information extraction algorithm. The algorithm first establishes module concepts and Blog concepts; each Blog concept contains several sub-information concepts, and each sub-information concept contains several data attributes. An inductive learning algorithm derives the data properties from labeled pages, and these data properties are then used to generate information extraction rules. Experiments show that, because extraction is carried out within a module, this algorithm improves both the extraction rate and the extraction accuracy.

Building on this work, this paper designs and implements an experimental prototype system for Blog information extraction. The system includes an asynchronous web crawler, Blog page segmentation, extraction rule generation, and the information extraction algorithm, and can serve as a basic platform for related research and experiments on information extraction.
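
The abstract does not give the details of BPS-BSS, so the following is only a minimal sketch of the general idea it names: comparing the sub-trees of tag nodes and grouping similar ones by hierarchical clustering. The similarity measure (Jaccard overlap of tag-path sets), the threshold of 0.8, the single-link merge rule, and all function names are assumptions made for illustration and are not taken from the thesis. The sample page is a well-formed toy fragment.

```python
import xml.etree.ElementTree as ET


def tag_paths(node, prefix=""):
    """Return the set of tag paths found in the sub-tree rooted at node."""
    path = f"{prefix}/{node.tag}"
    paths = {path}
    for child in node:
        paths |= tag_paths(child, path)
    return paths


def similarity(a, b):
    """Jaccard similarity of two sub-trees' tag-path sets (assumed measure)."""
    pa, pb = tag_paths(a), tag_paths(b)
    return len(pa & pb) / len(pa | pb)


def cluster_children(parent, threshold=0.8):
    """Single-link agglomerative clustering of the children of parent.

    Two clusters are merged when any pair of their members is at least
    `threshold` similar; merging repeats until no pair qualifies.
    """
    clusters = [[child] for child in parent]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(similarity(x, y) >= threshold
                       for x in clusters[i] for y in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


# Toy page: two post-like blocks sharing a template, plus a link list.
PAGE = """
<div>
  <div><h2>t</h2><p>body</p><span>date</span></div>
  <div><h2>t</h2><p>body</p><span>date</span></div>
  <ul><li><a>link</a></li><li><a>link</a></li></ul>
</div>
"""

root = ET.fromstring(PAGE)
groups = cluster_children(root)
sizes = sorted(len(g) for g in groups)
```

In this toy page the two post-like `div` blocks have identical tag-path sets and fall into one cluster, while the `ul` block shares no paths with them and stays on its own. A cluster of repeated, structurally similar sub-trees is the kind of candidate that a module extraction step could then treat as one Blog module.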
Keywords/Search Tags: Web page segmentation, module extraction, Blog information extraction, domain ontology