Effective use of term relationships in Web content mining
Posted on: 2008-04-12 | Degree: Ph.D | Type: Dissertation | University: Arizona State University | Candidate: Gelgi, Fatih | GTID: 1458390005481038 | Subject: Artificial Intelligence

Abstract/Summary:

Currently, Web mining research follows three directions: Web structure mining, Web usage mining, and Web content mining. Our focus is on Web content mining algorithms, which discover useful information from the content of Web pages. In this work, we extract a relational graph from a given document collection. We claim that this relational graph can play an important role as a statistical domain model of the document collection.

The relational graph is an undirected, node- and edge-weighted graph in which nodes represent terms and edges represent term co-occurrence relationships. We develop algorithms that use the relational graph in two important Web content mining applications: Web data annotation and clustering of Web search results.

Web data annotation. Weakly annotated data is annotated data typically generated by (semi-)automated information extraction systems. The extracted data suffers from two major problems: incorrect metadata vs. data annotations and missing labels. We present two efficient and scalable techniques for improving the quality of weakly annotated data by re-annotating it: a spreading activation network and a simple Bayesian classifier. Both techniques are able to correct label annotations by using prior and conditional probabilities that are initialized and inferred through the relational graph.

Furthermore, the partially observable nature of Web data allows us to extend this work by developing a contextual EM model for simple Bayesian models. EM estimates the prior and conditional probabilities of the Bayesian model. Our EM model follows the methodology of Baum-Welch.
In the expectation step, the simple Bayesian model computes the role probability distributions of all labels in the Web pages. In the maximization step, labels are re-annotated and the classifier probabilities are re-estimated.

Clustering of Web search results. Term ranking is important for feature generation during clustering and for cluster labeling of a Web page collection, with the goal of creating highly precise guides for browsing search results. We follow two different approaches to term ranking, both using the relational graph: neighbor-based measures and a random walk approach named TermRank. In this problem, the relational graph uncovers statistical information about important neighbors and strong associations between terms within different contexts. We show that identifying distinguishing terms and ranking them higher with TermRank leads to better estimation of similarity between documents and to higher-quality clustering.

Keywords/Search Tags: Web, Term, Relational graph, Simple Bayesian, Clustering, Data
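
To make the relational graph described in the abstract concrete, the following is a minimal Python sketch of a term co-occurrence graph built from a small tokenized collection. The weighting scheme is an assumption for illustration only: node weights count the documents containing a term, and edge weights count the documents in which two terms co-occur. The abstract does not say how the dissertation computes its weights, and the function name `build_relational_graph` and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_relational_graph(documents):
    """Build an undirected, node- and edge-weighted term graph.

    Assumed weighting (not taken from the dissertation):
      node weight = number of documents containing the term
      edge weight = number of documents containing both terms
    """
    node_weights = Counter()
    edge_weights = Counter()
    for tokens in documents:
        # Count each term once per document, in a fixed order.
        terms = sorted(set(tokens))
        node_weights.update(terms)
        # The graph is undirected, so each pair is stored once
        # as a tuple (a, b) with a < b.
        edge_weights.update(combinations(terms, 2))
    return node_weights, edge_weights


# Hypothetical toy collection of already-tokenized documents.
docs = [
    ["camera", "lens", "price"],
    ["camera", "price", "zoom"],
    ["lens", "zoom"],
]
nodes, edges = build_relational_graph(docs)
```

Storing each undirected edge under a sorted pair keeps lookups unambiguous: `edges[("camera", "price")]` gives the co-occurrence count regardless of the order in which the terms appeared in a document.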