Preliminary Research On Information Gathering Of Olympic-Oriented Chinese Web Pages

Posted on: 2005-11-02  Degree: Master  Type: Thesis
Country: China  Candidate: X G Sun  Full Text: PDF
GTID: 2168360152467697  Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, a large amount of resources can be obtained from every part of the world. However, the explosion of resources also makes it increasingly difficult to retrieve relevant information, so information gathering technology faces ever higher requirements. Because of the complexity of Web pages, general search engines find it increasingly difficult to meet users' demands, and specialized methods of information acquisition are becoming a promising direction. This thesis focuses on the problem of gathering information from Olympic-oriented Chinese Web pages (Digital Olympic). The main task is to filter Olympic Web pages from other pages and categorize them, so as to provide more accurate information for users. The thesis studies several information gathering techniques and makes some preparation for future vertical search engines. The studies in this thesis include: (1) Olympic Web pages are statistically analyzed in terms of their temporal and spatial distribution, the characteristics of their content, and their term usage, which provides a direct foundation for further research. The idea of constructing a hierarchical term list is put forward to meet the needs of gathering information from Olympic-oriented Web pages. (2) For Olympic Web page filtering, a variety of feature selection methods, classifiers, and metrics for evaluating classification results are covered, and various combinations of feature selection methods and classifiers are tested on data sets. Experiments show that some feature selection methods, such as Information Gain (IG), Cross Entropy (CE), CHI, and Weight of Evidence for Text, perform excellently when combined with Naïve Bayes. (3) Considering the dynamic, sequential, and time-sensitive nature of Olympic Web pages, an Improved Rocchio Algorithm and an Adaptive_Classification method based on incremental learning are put forward.
Without the aid of human intelligence, computers cannot make accurate judgements when faced with complicated Web pages. If improper evaluations are used to adjust the class models, performance risks deteriorating. In this thesis, the idea of setting positive and negative reliable scopes and a strategy of using dynamic coefficients to adjust the class models are put forward, and good results are obtained. (4) A clustering method that combines OPTICS and the K-Nearest Neighbor classifier is put forward as an auxiliary means of mining potential information from Web pages. Because Web pages are unstructured data and their features are sparse, the vector space is much more complex. Viewed as points in this space, Web pages are also distributed in a tangled way: densities differ among clusters and the shapes of clusters are irregular. OPTICS recognizes high-density areas well but performs poorly on sparse points. It is therefore used to construct initial clusters from the dense areas, and the remaining points are assigned by the K-Nearest Neighbor classifier.
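The reliable-scope idea in item (3) can be illustrated with a minimal sketch. The abstract does not give the thesis's Improved Rocchio Algorithm or its coefficient formula, so everything here is an assumption made for illustration: the class name `AdaptiveRocchio`, the thresholds `pos_scope` and `neg_scope`, the cap `max_rate`, and the rule that scales the update coefficient by the size of the score. The sketch only shows the general pattern: a page adjusts a class model when its score falls inside a reliable scope, and uncertain pages leave the models untouched.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class AdaptiveRocchio:
    """Hypothetical two-class Rocchio model with self-adjusting centroids."""

    def __init__(self, pos_centroid, neg_centroid,
                 pos_scope=0.2, neg_scope=-0.2, max_rate=0.3):
        # Copy the centroids so the caller's dicts are never mutated.
        self.pos = dict(pos_centroid)
        self.neg = dict(neg_centroid)
        self.pos_scope = pos_scope   # assumed: scores >= this are reliably positive
        self.neg_scope = neg_scope   # assumed: scores <= this are reliably negative
        self.max_rate = max_rate     # assumed: upper bound on the update coefficient

    def score(self, doc):
        """Similarity to the positive centroid minus similarity to the negative one."""
        return cosine(doc, self.pos) - cosine(doc, self.neg)

    @staticmethod
    def _blend(centroid, doc, rate):
        """Move the centroid toward the document by the given coefficient."""
        for term in set(centroid) | set(doc):
            centroid[term] = ((1.0 - rate) * centroid.get(term, 0.0)
                              + rate * doc.get(term, 0.0))

    def classify_and_adapt(self, doc):
        """Classify a page, then adapt a centroid only if the score is reliable."""
        s = self.score(doc)
        # Assumed dynamic coefficient: more confident pages move the model further.
        rate = self.max_rate * min(1.0, abs(s))
        if s >= self.pos_scope:
            self._blend(self.pos, doc, rate)
        elif s <= self.neg_scope:
            self._blend(self.neg, doc, rate)
        # Scores between the two scopes are treated as unreliable: no update.
        return s > 0, s


if __name__ == "__main__":
    model = AdaptiveRocchio({"olympic": 1.0, "medal": 1.0},
                            {"stock": 1.0, "market": 1.0})
    print(model.classify_and_adapt({"olympic": 1.0, "athlete": 1.0}))
    print(model.classify_and_adapt({"olympic": 1.0, "stock": 1.0}))
```

Skipping updates for scores between the two scopes is what guards against the deterioration described above: a page the model is unsure about cannot drag either class model in the wrong direction.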
Keywords/Search Tags:Digital Olympic, Web Page Filtering, Feature Selection, Adaptive Classification, Clustering based on Density