Research On Topic-specific Gathering And Classification Of Chinese WebPages

Posted on: 2007-12-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X J Zong
Full Text: PDF
GTID: 1118360242961736
Subject: Systems Engineering
Abstract/Summary:
The network is profoundly changing our lives, and the Internet has become the largest information repository in existence. However, it is difficult for users to find what they need in this expanding repository quickly and precisely, so gathering and processing information from the World Wide Web attracts growing attention.

Because Web pages are enormous in number and disordered, traditional scalable Web crawling technology consumes too many system and network resources. The decentralized and dynamic nature of Web information poses further problems for information gathering. Topic-specific Web search is a new direction in information retrieval. Rather than collecting and indexing all accessible Web documents, a topic-specific Web search system restricts its crawl boundary and follows the links that are likely to be most relevant to the given topic, which makes the precision and recall of search easier to guarantee. This thesis addresses several key problems in topic-specific information gathering and the classification of Chinese Web pages, as follows.

Based on an analysis of Web structure and Web links, some useful rules are summarized for more effective topic-specific information gathering. Web metadata is defined, and several kinds of hyperlink and metadata are discussed. Information extraction is studied, and the kinds of Web metadata appropriate for topic-specific information gathering are identified.

Topic expansion is discussed as a way to obtain the set of topic terms together with its relevant topics; it comprises stop-word filtering, extraction of candidate topics, and filtering by relevance metrics. Applying association mining to the metadata database, the techniques of metadata extraction and topic expansion are combined into a relevant-topics mining algorithm. Experimental results indicate that the algorithm and strategies achieve better performance and provide a sound basis for topic-specific information gathering.
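
The three topic-expansion steps named above (stop-word filtering, candidate extraction, relevance filtering) can be illustrated with a minimal sketch. It is not the thesis's algorithm: the function name `expand_topic`, the stop list, the record layout (one list of terms per page's metadata), and the support and confidence thresholds are all assumptions made for illustration, and simple co-occurrence support and confidence stand in for whatever association-mining procedure the thesis uses.

```python
from collections import Counter

# Illustrative stop list; a real system would use a full, language-specific list.
STOP_WORDS = {"the", "of", "and", "a", "in"}


def expand_topic(records, topic, min_support=0.3, min_confidence=0.6):
    """Return {term: confidence} for terms associated with `topic`.

    records: list of term lists, one per page's metadata (hypothetical layout).
    """
    # Step 1: stop-word filtering; each record becomes a set of distinct terms.
    transactions = [{t.lower() for t in rec} - STOP_WORDS for rec in records]
    n = len(transactions)
    topic = topic.lower()
    topic_count = sum(1 for t in transactions if topic in t)
    if n == 0 or topic_count == 0:
        return {}

    # Step 2: candidate extraction; count terms that co-occur with the topic.
    co_occurrence = Counter()
    for t in transactions:
        if topic in t:
            co_occurrence.update(t - {topic})

    # Step 3: relevance filtering with support and confidence thresholds
    # for the rule "topic -> term".
    expanded = {}
    for term, count in co_occurrence.items():
        support = count / n
        confidence = count / topic_count
        if support >= min_support and confidence >= min_confidence:
            expanded[term] = confidence
    return expanded


if __name__ == "__main__":
    sample = [
        ["football", "league", "the"],
        ["football", "league", "goal"],
        ["football", "goal"],
        ["cooking", "recipe"],
    ]
    print(expand_topic(sample, "football"))
```

In this sketch a term is kept only if it appears with the topic often enough overall (support) and in a large enough share of the topic's records (confidence); the thresholds would need tuning on real metadata.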
Based on Web metadata, a topic-specific information gathering system is designed and its overall process is described. Two classic hyperlink-based algorithms for topic-directed crawling, Hypertext Induced Topic Search (HITS) and PageRank, are discussed and analyzed. A set of algorithms that exploit hyperlink metadata to keep the crawler focused on the topic is presented. The usefulness of hyperlink metadata for improving HITS and PageRank is demonstrated, and improved algorithms such as M-PageRank and M-HITS are proposed. The performance of the various algorithms is compared, and experimental results indicate that the proposed approach achieves better performance and provides a sound basis for topic-specific search.

Text classification is then reviewed. To suit the semi-structured format of Web documents, a document representation method called TFE is adopted. Some classic weighting functions for feature words are revised, and extended anchor texts are introduced for classifying Web pages. Taking both the structure and the content of Web documents into account, an improved naive Bayesian algorithm and an improved Support Vector Machine (SVM) are put forward. Experimental results show that these approaches achieve better performance.

In summary, this thesis proposes several improvements to the techniques for topic-specific information gathering and the classification of Chinese Web pages. The proposed approaches and algorithms achieve better performance, and the conclusions drawn provide a guideline and basis for both theory and practice.
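
As a rough illustration of how a topic signal can bias link analysis, the sketch below runs PageRank by power iteration with a teleport vector weighted by per-page relevance scores. This is the standard personalized (topic-sensitive) PageRank idea, not the M-PageRank algorithm of the thesis, whose definition is not given in this abstract. The function name `biased_pagerank`, the `relevance` scores, and the damping factor of 0.85 are assumptions for illustration.

```python
def biased_pagerank(links, relevance, damping=0.85, iterations=50):
    """Personalized PageRank by power iteration.

    links: dict mapping a page to the list of pages it links to.
    relevance: dict mapping a page to a non-negative topic score
               (hypothetical; e.g. derived from hyperlink metadata).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    if not pages:
        return {}

    # Teleport distribution: proportional to relevance, uniform if all zero.
    total = sum(relevance.get(p, 0.0) for p in pages)
    if total > 0:
        teleport = {p: relevance.get(p, 0.0) / total for p in pages}
    else:
        teleport = {p: 1.0 / len(pages) for p in pages}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1 - damping) * teleport[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among its outlinks.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: redistribute its rank via the teleport vector.
                for q in pages:
                    new_rank[q] += damping * rank[p] * teleport[q]
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(biased_pagerank(graph, {}))          # uniform teleport
    print(biased_pagerank(graph, {"b": 1.0}))  # teleport biased toward "b"
```

Pages with higher relevance receive more of the teleport mass, so they and the pages they link to rank higher than under uniform PageRank; a focused crawler could use such scores to order its frontier.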
Keywords/Search Tags:Web Information Gathering, Topic-specific Information Gathering, Web Metadata, Topic Expansion, Relativity Judging, Classification of WebPages