
Study On Web Data Processing Technology

Posted on: 2005-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: D Shen
Full Text: PDF
GTID: 2168360152967695
Subject: Computer Science and Technology
Abstract/Summary:
The World Wide Web is becoming one of the main media for people, as well as a huge data warehouse and a source of latent knowledge. Making it easy for people to process and use online data has become an unavoidable challenge. This paper studies online data processing techniques from three aspects: web page classification, web page summarization, and Email clustering.

For web page classification, this paper first analyzes the difficulties in the field and puts forward a new feature selection approach. It then proposes a new way to classify web pages based on the query log. Further analysis of the query log reveals a kind of "Implicit Link" between web pages, and classification based on "Implicit Link" clearly outperforms classification based on "Hyperlink". Web page classification through summarization is also investigated, and the experimental results validate this approach.

For web page summarization, this paper proposes a new automatic algorithm that extracts the main topic of a Web page by page-layout analysis. Inspired by two measurements of summarization, "Diversity" and "Non-Redundancy", it advances a novel approach named "Affinity Rank-Based Summarization". It also uses analysis of the query log to improve several traditional summarization methods.

For Email clustering, this paper puts forward a novel algorithm that combines natural language processing with frequent itemset mining to automatically generate meaningful patterns from documents; these patterns are then used to improve clustering performance. This paper also demonstrates the effect of web page summarization on web page clustering.

In addition, this paper describes a system for classification and summarization developed by the author. It is a sub-system of the 973 project named "The effective algorithm and software system for data integration, data storage and data mining on World Wide Web".
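
The abstract does not say how an "Implicit Link" is derived from the query log. As a rough illustration only, the sketch below assumes one plausible definition: two pages are implicitly linked when users clicked both of them for the same query. The function name `implicit_links`, the `(query, clicked_url)` log format, the `min_support` threshold, and the sample URLs are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def implicit_links(click_log, min_support=1):
    """Return {(url_a, url_b): n} where n is the number of distinct
    queries under which both pages were clicked.

    ASSUMPTION: this co-click definition is illustrative; the thesis
    abstract does not define "Implicit Link".
    """
    # Group clicked pages by normalized query text.
    pages_by_query = defaultdict(set)
    for query, url in click_log:
        pages_by_query[query.strip().lower()].add(url)

    # Every pair of pages sharing a query gets one unit of weight.
    weights = defaultdict(int)
    for urls in pages_by_query.values():
        for a, b in combinations(sorted(urls), 2):
            weights[(a, b)] += 1

    # Drop weak links that appear under fewer than min_support queries.
    return {pair: n for pair, n in weights.items() if n >= min_support}


# Hypothetical click log: (query, clicked URL) pairs.
log = [
    ("python tutorial", "a.example/py"),
    ("python tutorial", "b.example/learn"),
    ("Python Tutorial", "c.example/docs"),
    ("learn python", "a.example/py"),
    ("learn python", "b.example/learn"),
]
links = implicit_links(log)
strong_links = implicit_links(log, min_support=2)
```

Under this assumption, the resulting weighted page pairs could serve as an alternative to hyperlinks when propagating labels or features between neighboring pages in a classifier.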
Keywords/Search Tags: web mining, web page classification, web page summarization, web page clustering, Email clustering, feature selection, query log, implicit link, GSP, Content Body.