Effective use of term relationships in Web content mining

Posted on:2008-04-12

Degree:Ph.D

Type:Dissertation

University:Arizona State University

Candidate:Gelgi, Fatih

Full Text:PDF

GTID:1458390005481038

Subject:Artificial Intelligence

Abstract/Summary:

Web mining research currently follows three directions: Web structure mining, Web usage mining, and Web content mining. This work focuses on Web content mining, which discovers useful information from the content of Web pages. We extract a relational graph from a given document collection and claim that this graph can play an important role as a statistical domain model of the collection.

The relational graph is an undirected, node- and edge-weighted graph in which nodes represent terms and edges represent term co-occurrence relationships. We develop algorithms that use the relational graph in two important Web content mining applications: Web data annotation and clustering of Web search results.

Web data annotation. Weakly annotated data is annotated data typically generated by (semi-)automated information extraction systems. Such extracted data suffers from two major problems: incorrect metadata vs. data annotations, and missing labels. We present two efficient and scalable techniques for improving the quality of weakly annotated data by re-annotating it: a spreading activation network and a simple Bayesian classifier. Both techniques are able to correct label annotations by using prior and conditional probabilities that are initialized and inferred through the relational graph.

Furthermore, the partially observable nature of Web data allows us to extend this work by developing a contextual EM model for simple Bayesian models. EM estimates the prior and conditional probabilities of the Bayesian model, and our EM model follows the Baum-Welch methodology. In the expectation step, the simple Bayesian model computes the role probability distributions of all labels in the Web pages. In the maximization step, labels are re-annotated and the classifier probabilities are re-estimated.

Clustering of Web search results. Term ranking is important for feature generation during clustering and for cluster labeling of a Web page collection, in order to create highly precise guides for browsing search results. We follow two approaches to term ranking, neighbor-based measures and a random walk approach named TermRank, both of which use the relational graph. In this problem, the relational graph uncovers statistical information about important neighbors and strong associations between terms within different contexts. We show that using TermRank to identify distinguishing terms and rank them higher leads to better estimates of similarity between documents and higher-quality clustering.
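
As a rough illustration of the relational graph described in the abstract, the following Python sketch builds a node- and edge-weighted term co-occurrence graph and runs a generic random walk over it. Everything here is an assumption made for illustration: the abstract does not define how weights are computed or how TermRank works, so the sketch uses document-level co-occurrence counts as weights and a plain weighted PageRank-style power iteration in place of TermRank. The function names `build_relational_graph` and `random_walk_rank` and the `damping` and `iterations` parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_relational_graph(documents):
    """Build an undirected term co-occurrence graph from tokenized documents.

    Assumed weighting (not taken from the dissertation):
      node_weights[t]      = number of documents containing term t
      edge_weights[(a, b)] = number of documents containing both a and b,
                             with the pair stored in sorted order (a < b)
    """
    node_weights = Counter()
    edge_weights = Counter()
    for tokens in documents:
        terms = sorted(set(tokens))  # count each term once per document
        node_weights.update(terms)
        edge_weights.update(combinations(terms, 2))
    return node_weights, edge_weights


def random_walk_rank(node_weights, edge_weights, damping=0.85, iterations=50):
    """Score terms with a weighted PageRank-style random walk.

    This is a generic stand-in, not the TermRank algorithm itself.
    """
    neighbors = {term: {} for term in node_weights}
    for (a, b), weight in edge_weights.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    n = len(neighbors)
    if n == 0:
        return {}

    rank = {term: 1.0 / n for term in neighbors}
    for _ in range(iterations):
        new_rank = {term: (1.0 - damping) / n for term in neighbors}
        for term, adjacent in neighbors.items():
            total = sum(adjacent.values())
            if total == 0:
                # Isolated term: spread its probability mass uniformly.
                for other in new_rank:
                    new_rank[other] += damping * rank[term] / n
                continue
            # Move to a neighbor with probability proportional to edge weight.
            for other, weight in adjacent.items():
                new_rank[other] += damping * rank[term] * weight / total
        rank = new_rank
    return rank


if __name__ == "__main__":
    docs = [
        ["web", "mining", "graph"],
        ["web", "graph"],
        ["web", "clustering"],
    ]
    nodes, edges = build_relational_graph(docs)
    scores = random_walk_rank(nodes, edges)
    for term in sorted(scores, key=scores.get, reverse=True):
        print(term, round(scores[term], 4))
```

Note that a plain random walk of this kind tends to favor frequent, well-connected terms. The abstract states that TermRank ranks distinguishing terms higher, which this generic sketch does not attempt to reproduce.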

Keywords/Search Tags:

Web, Term, Relational graph, Simple bayesian, Clustering, Data

Related items

1	Research On Algorithm For Relational Data Classification Based On Background Knowledge
2	The Study On Complex Relational Data Visualization Technology
3	Spectral Analysis And Clustering Of Relational Graphs
4	Research On Relation Graph Clustering Algorithm Based On Matrix Volume
5	Research On Big Data Analysis Of Scientific And Technical Data Based On Relational Graph
6	Robust Model Fitting Based On Graph Clustering
7	Research And Application Of Graph Learning Recommendation Algorithm Based On Multivariate Data Analysis
8	Research On Multi-relational Clustering Analysis Approaches
9	Native Graph Support in Relational Data System
10	Research On Multi-instance Learning Algorithm Based On Graph Structure Features