Preliminary Research On Information Gathering Of Olympic-Oriented Chinese Web Pages

Posted on:2005-11-02

Degree:Master

Type:Thesis

Country:China

Candidate:X G Sun

Full Text:PDF

GTID:2168360152467697

Subject:Computer Science and Technology

Abstract/Summary:

With the growth of the Internet, a vast amount of resources can be obtained from every part of the world. This explosion of resources, however, also makes it increasingly difficult to retrieve relevant information, which places higher demands on information-gathering technology. Because of the complexity of Web pages, general-purpose search engines find it increasingly difficult to meet users' needs, so specialized methods of information acquisition are becoming a promising direction. This thesis focuses on the problem of gathering information from Olympic-oriented Chinese Web pages (Digital Olympic). The main task is to filter Olympic Web pages from other pages and categorize them, so as to provide users with more accurate information. The thesis studies several information-gathering techniques and lays groundwork for future vertical search engines. The work includes:

(1) Olympic Web pages are analyzed statistically, including their temporal and spatial distribution, the characteristics of their content, and their term usage, which provides a direct foundation for further research. The idea of constructing a hierarchical term list is put forward to meet the needs of gathering information from Olympic-oriented Web pages.

(2) For Olympic Web page filtering, a variety of feature selection methods, classifiers, and evaluation metrics for classification results are surveyed, and alternative combinations of feature selection methods and classifiers are tested on data sets. Experiments show that feature selection methods such as Information Gain (IG), Cross Entropy (CE), CHI, and Weight of Evidence for Text perform excellently when combined with Naïve Bayes.

(3) Considering the dynamic, sequential, and time-sensitive nature of Olympic Web pages, an Improved Rocchio Algorithm and an Adaptive_Classification method based on incremental learning are put forward. Without the aid of human intelligence, computers cannot make accurate judgments when faced with complicated Web pages, and if improper evaluations are used to adjust the class models, performance risks deteriorating. The thesis therefore proposes setting positive and negative reliable scopes and using dynamic coefficients to adjust the class models, and good results are obtained.

(4) A clustering method that combines OPTICS and K-Nearest Neighbor is put forward as an auxiliary means of mining potential information from Web pages. Because Web pages are unstructured data with sparse features, the vector space is highly complex. Viewed as single points in a set, the Web pages are also distributed in a tangled way: densities differ among clusters, and cluster shapes are irregular. OPTICS can recognize high-density areas but performs poorly on sparse points, so it is used to construct initial clusters from the dense areas, and the remaining points are then assigned by a K-Nearest Neighbor classifier.
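
Information Gain (IG), which the abstract names as one of the feature selection methods that work well with a Bayesian classifier, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the set-of-terms document representation, and the binary "term present / term absent" split are all assumptions made here for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of a term: H(C) minus the expected entropy of C after splitting
    the documents on whether they contain the term.

    docs   -- list of sets of terms (hypothetical representation)
    labels -- list of class labels, one per document
    """
    classes = sorted(set(labels))
    n = len(docs)
    all_counts = Counter(labels)
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)

    n_with = sum(with_term.values())
    n_without = n - n_with

    h_class = entropy([all_counts[c] for c in classes])
    h_with = entropy([with_term[c] for c in classes])
    h_without = entropy([without_term[c] for c in classes])
    return h_class - (n_with / n) * h_with - (n_without / n) * h_without


def rank_terms(docs, labels):
    """Return all terms sorted by descending information gain."""
    vocabulary = set().union(*docs)
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy data, invented for illustration: 1 = Olympic page, 0 = other page.
sample_docs = [
    {"olympic", "medal"},
    {"olympic", "beijing"},
    {"stock", "market"},
    {"weather", "beijing"},
]
sample_labels = [1, 1, 0, 0]
ranking = rank_terms(sample_docs, sample_labels)
```

In this toy set, a term that appears only in Olympic pages separates the two classes completely and is ranked first, while a term that appears equally often in both classes contributes no gain. A filter would keep the top-ranked terms as features before training the classifier.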

Keywords/Search Tags:

Digital Olympic, Web Page Filtering, Feature Selection, Adaptive Classification, Clustering based on Density

Related items

1	Preliminary Research On Classification And Clustering Of Chinese Web Page Involved In Intelligent Search
2	Research On Web Page Classification Algorithms Of Professional Theme
3	Study On Web Data Processing Technology
4	Chinese Web Page Classification Based On Web Page Features
5	Research On Feature Selection In Web Page Classification
6	Research And Implementation Of Content Oriented Web Page Classification
7	The Research And Implementation Of One Kind Of Web Page Filtering Method Based On Real-Time Network Traffic Data
8	A Research On Statistic-based Classification Of Chinese News Web Page
9	Chinese Text Classification Based On SVM Algorithm Realization
10	An On-line Ceramic Tile Classification System Using Adaptive Feature Selection