The Research On Categorization Of Chinese Web Page In Intelligent Search

Posted on:2012-03-13

Degree:Master

Type:Thesis

Country:China

Candidate:K Deng

Full Text:PDF

GTID:2178330335967021

Subject:Computer application technology

Abstract/Summary:

The volume of information on the network has grown exponentially with the rapid development of Internet technology. To find information quickly without being flooded by large amounts of disordered and variously structured information, intelligent search has become the main way to retrieve information. However, search engines themselves have many defects, which lead to problems such as information silos and topic bias. User needs can largely be met if search results are classified according to the categories they belong to, and this is what gave rise to automatic web page classification. Currently, web page classification technology is widely used in search engine directory navigation services, information filtering, topic search, personalized information retrieval, active information push services, etc. The main research contents include the following aspects:

Firstly, by analyzing the "noise" in web pages and combining the visual information and geometric layout of the content, an improved DOM tree of the web page and visual analysis methods are used to identify content blocks, and statistical methods are used to remove irrelevant content, so that the noise is removed effectively. Experiments show that the method can largely purify the text of a web page while retaining the relevant information.

Secondly, because similarity calculated by cosine distance does not consider the semantic information between the terms of a text, a text similarity method based on optimal assignment is proposed, using the lexical semantic concepts defined by 《HowNet》. In this method, the model aggregates the semantic similarity contribution of each feature in the page in order to produce an accurate similarity value, so that the maximum similarity value between the texts is obtained.

Finally, after studying the general model of automatic web page classification and combining it with the definition of web page categories, a hierarchical classification model based on the support vector machine is constructed. In this model, the support vector machine algorithm is used to identify all of the top-level categories; features are then selected a second time to remove the characteristics shared between sub-level categories, and K-NN is used to identify the sub-categories within each top-level category. Experiments show that this hierarchical classification method achieves good results.
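
The optimal-assignment idea behind the text similarity method in the second point can be illustrated with a minimal sketch. It is not the thesis's implementation: the `TOY_SIM` table is a hypothetical stand-in for a HowNet-based word similarity, the names `word_sim` and `assignment_similarity` are invented for illustration, the normalisation by the longer term list is an assumption, and the brute-force search over pairings is only practical for very short term lists (a real system would use an assignment solver such as the Hungarian algorithm).

```python
from itertools import permutations

# Hypothetical toy table standing in for a HowNet-based word similarity.
TOY_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
    frozenset(("car", "quick")): 0.1,
}


def word_sim(a, b):
    """Similarity of two terms in [0, 1]; identical terms score 1.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def assignment_similarity(terms_a, terms_b, sim=word_sim):
    """Score of the best one-to-one pairing of terms between two texts.

    Every term of the shorter list is paired with a distinct term of the
    longer list; the pairing with the largest total similarity wins.
    Dividing by the longer length (an assumption of this sketch) keeps the
    result in [0, 1] and penalises unmatched terms.
    """
    short, long_ = sorted((list(terms_a), list(terms_b)), key=len)
    if not long_:
        return 0.0
    best = 0.0
    # Brute force: try every ordered choice of len(short) terms from long_.
    for perm in permutations(long_, len(short)):
        total = sum(sim(s, t) for s, t in zip(short, perm))
        best = max(best, total)
    return best / len(long_)


if __name__ == "__main__":
    print(assignment_similarity(["car", "fast"], ["automobile", "quick"]))
```

With the toy table, the two term lists `["car", "fast"]` and `["automobile", "quick"]` share no literal term, so a cosine measure over term vectors would score them 0, whereas the best pairing (car with automobile, fast with quick) gives a similarity of about 0.85.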

Keywords/Search Tags:

Intelligent search, Web page classification, Page purification, Hierarchical classification

Related items

1	Research And Implementation On A Web Page Classification System
2	Research And Implementation Of Chinese Web-page Classification Based On Web Data-mining
3	The Study And Implementation On The Key Problems Of Intelligent Search Engine Technology
4	Preliminary Research On Classification And Clustering Of Chinese Web Page Involved In Intelligent Search
5	Internet Web Page Automatic Classification Techniques
6	Hierarchical Classification For Chinese Web Page Based On Improved SVM-KNN
7	Semi-supervised Web-page Classification And Its Application In Directory-style Search Engines
8	Research Of Web Page Purification And Replicas Detection In Search Engine
9	Chinese Web Page Classification Based On Web Page Features
10	Research And Implement Of Topic Oriented Web Page Classification Technique