
Key Information Extraction and Topic Identification in Web Pages Using Tag Tree Templates

Posted on: 2010-10-28
Degree: Master
Type: Thesis
Country: China
Candidate: X W Ji
GTID: 2208360275991842
Subject: Computer application technology
Abstract/Summary:
With the growth of the Internet, people exchange massive amounts of information on the World Wide Web, and collecting that information accurately and efficiently has become an urgent concern. Research on Web topic detection not only helps in understanding what people around the world are currently thinking, but also helps researchers study society, politics and linguistics. This thesis introduces a technique for information extraction using tag tree templates, together with topic detection based on the structural characteristics of HTML content.

The thesis first reviews the history of Web topic detection and its related technologies, which are based on Information Extraction and Data Mining. Extracting information from the Web is both interesting and challenging, and it supports Web search, Information Retrieval and Web Mining.

Web pages on many sites are generated dynamically as structured records, by filling an HTML template from a back-end database. A Tag Tree model is used to parse the HTML, templates are extracted from the trees using Tree Similarity, and pages are clustered into template classes. The tree template of each class is then used to parse the HTML Tag Trees in that class and extract the content that is exclusive to each document. By finding repeating patterns and applying heuristic rules, the schema of the documents can be established and records can be extracted from that exclusive content. The experimental results show that this is an effective way to extract structured information from Web pages, especially those of news sites and BBS sites.

Features of the extracted text are then selected using the structural characteristics of HTML, such as style, font, location and links. These features are combined into the Term Frequency-Inverse Document Frequency (TF-IDF) vector of each document and support topic detection of Web documents using Hierarchical Agglomerative Clustering.

Finally, the thesis presents a design for a Web Document Topic Detection System. The system has three main components: the Web Information Collector component, the Data Interpretability component and the Topic Detection component.
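The step of grouping pages into template classes by tag tree similarity can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, so the code below swaps in a simple stand-in: Jaccard similarity over the set of root-to-node tag paths. The names `tag_paths`, `similarity` and `cluster_by_template`, the greedy assignment strategy and the `threshold` value of 0.7 are all assumptions made for illustration, not the method of the thesis.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray close tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the tag-path set that stands in for the page's tag tree."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def similarity(paths_a, paths_b):
    """Jaccard similarity of two tag-path sets (assumed stand-in measure)."""
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


def cluster_by_template(pages, threshold=0.7):
    """Greedily assign each page to the first cluster it resembles.

    Returns a list of clusters, each a list of indices into `pages`.
    The first page of a cluster acts as its representative.
    """
    clusters = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if similarity(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]
```

Under these assumptions, two news pages that share a layout but differ in text and in the number of paragraphs fall into one class, while a table-based forum page forms its own class. A tree edit distance or another structure-aware measure could replace `similarity` without changing the clustering loop.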
Keywords/Search Tags: Topic Detection, Information Extraction, Tag Tree Template, Web Structural Character