Tag Tree Template In The Pages Of Critical Information Extraction And Topic Identification

Posted on:2010-10-28

Degree:Master

Type:Thesis

Country:China

Candidate:X W Ji

GTID:2208360275991842

Subject:Computer application technology

Abstract/Summary:

With the rapid growth of the Internet, people exchange massive amounts of information on the World Wide Web, and collecting that information accurately and efficiently has become an urgent concern. Research on Web topic detection is useful not only for understanding what people around the world are currently thinking, but also for scientists who study society, politics, and linguistics. This thesis introduces an information extraction technique based on tag tree templates, together with topic detection based on the structural characteristics of HTML content.
It first reviews the history of Web topic detection and the related technologies, which build on information extraction and data mining. Extracting information from the Web is both interesting and challenging, and it supports Web search, information retrieval, and Web mining. On many sites, Web pages are generated dynamically as structured records by filling an HTML template from a back-end database. A tag tree model is used to parse the HTML; templates are extracted from the trees by means of tree similarity, and pages are clustered into template classes. The tree template of each class is then used when parsing the HTML tag trees in that class to extract the content exclusive to each document. By finding repeating patterns and applying heuristic rules, the schema of the documents can be established and records can be extracted from that exclusive content. The experimental results show that this is an effective way to extract structured information from Web pages, especially those of news sites and BBS sites.
Features of the extracted text are then selected using the structural characteristics of HTML, such as style, font, location, and links. These features are combined into the Term Frequency-Inverse Document Frequency (TF-IDF) vector of each document and support topic detection for Web documents using hierarchical agglomerative clustering. Finally, a protocol design of a Web Document Topic Detection System is presented. The system has three main components: a Web Information Collector component, a Data Interpretability component, and a Topic Detection component.
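
The template-clustering step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which tree similarity measure the thesis uses, so this sketch assumes a simplified one: each page is reduced to its set of root-to-node tag paths, and two pages are compared by the Jaccard similarity of those sets. The names `tag_paths`, `tree_similarity`, and `cluster_by_template`, the greedy clustering strategy, and the threshold of 0.8 are all hypothetical choices made for illustration, not the author's method.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Parse an HTML string and return its tag-path set."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def tree_similarity(paths_a, paths_b):
    """Jaccard similarity of two tag-path sets (an assumed, simplified measure)."""
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)


def cluster_by_template(pages, threshold=0.8):
    """Greedy clustering: a page joins the first class whose
    representative page is at least `threshold` similar to it."""
    classes = []  # list of (representative_paths, [page indices])
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in classes:
            if tree_similarity(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            classes.append((paths, [index]))
    return [members for _, members in classes]
```

Because the text content is ignored, two pages generated from the same template produce identical path sets even when their text differs, while a page with a different layout shares only the outermost paths. Given two news-style pages followed by a table-based forum page, `cluster_by_template` would group the first two together and place the third in its own class. A production system would likely need a finer measure, such as tree edit distance, to separate templates that share most of their tag paths.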

Keywords/Search Tags:

Topic Detection, Information Extraction, Tag Tree Template, Web Structural Character

Related items

1	A Method For Extracting The Topic Information In Webpages Based On The DIV Tag-Trees
2	Research Of Character Recognition Algorithm Based On Template Matching And Structural Characteristic
3	Research And Application Of Automatic Data Extraction From Template-generated Web Pages
4	Technology Research, The Concept Of Tree-based Web Information Extraction
5	Research On Automatic Web Information Extraction Technique
6	Performance Optimization Of Web Topic Detection System
7	Research Of License Plate Recognition Technology
8	Research On Automatic And Efficient Technologies For Web Information Extraction
9	Research And Application Of Template Extraction And Anomaly Detection Based On Log Information
10	Topic Chain-based Topic Information Extraction From Chinese Food Complaint Documents