The Research On Categorization Of Chinese Web Page In Intelligent Search

Posted on: 2012-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: K Deng
Full Text: PDF
GTID: 2178330335967021
Subject: Computer application technology
Abstract/Summary:
The volume of information on the network has grown exponentially with the rapid development of Internet technology. To find information quickly without being flooded by large, disordered, and structurally diverse data, intelligent search has become the main way to retrieve information. However, search engines themselves have many defects that lead to problems such as information silos and topic bias. Classifying search results according to the categories they belong to can largely meet users' needs, and this is the motivation for automatic web page classification. Web classification technology is now widely used in search engine directory navigation, information filtering, topic search, personalized information retrieval, active information push services, and so on. The main research contents include the following aspects:

Firstly, by analyzing the "noise" in web pages and combining the visual information with the geometric layout of the content, an improved DOM tree and visual analysis methods are used to identify content blocks, and statistical methods are used to remove irrelevant content, so that the noise is removed effectively. Experiments show that the method largely purifies the text of a web page while retaining the relevant information.

Secondly, because similarity computed with cosine distance does not consider the semantic information between the terms of a text, a text similarity method based on optimal assignment is proposed, using the lexical semantic concepts defined by 《How Net》. In this method, the model aggregates the semantic similarity contribution of each feature in a page in order to produce an accurate similarity value, and the maximum similarity value between texts is then obtained.

Finally, after studying the general model of automatic web page classification and taking the definition of web page categories into account, a hierarchical classification model based on the support vector machine is constructed. In this model, the support vector machine algorithm is used to identify the top-level categories; features are then selected a second time to remove the features that sub-level categories have in common, and K-NN is used to identify the sub-categories within each top-level category. Experiments show that this hierarchical classification method achieves good results.
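
The optimal-assignment similarity described in the second contribution can be illustrated with a small sketch. This is not the thesis's implementation: the word-level scores in `TOY_SIM` are invented for illustration (a real system would derive them from the lexical resource the abstract mentions), the function names are hypothetical, normalising by the longer term list is an assumption, and the best one-to-one matching is found by brute force over permutations, which is only practical for very short term lists.

```python
from itertools import permutations

# Hypothetical word-to-word semantic similarities in [0, 1].
# These numbers are made up for the example; they are not taken from any lexicon.
TOY_SIM = {
    frozenset(["car", "automobile"]): 0.9,
    frozenset(["price", "cost"]): 0.8,
    frozenset(["car", "cost"]): 0.1,
    frozenset(["price", "automobile"]): 0.1,
}


def word_sim(a, b):
    """Return the toy semantic similarity of two terms (1.0 if identical)."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset([a, b]), 0.0)


def optimal_assignment_similarity(terms_a, terms_b):
    """Score two term lists by their best one-to-one term matching.

    Every term of the shorter list is paired with a distinct term of the
    longer list; the pairing with the largest total similarity is kept and
    the total is divided by the length of the longer list (an assumed
    normalisation, so unmatched terms lower the score).
    """
    if not terms_a or not terms_b:
        return 0.0
    short, long_ = sorted([terms_a, terms_b], key=len)
    best = 0.0
    # Brute force: try every way of assigning distinct long-list terms
    # to the short-list terms and keep the highest total.
    for candidate in permutations(long_, len(short)):
        total = sum(word_sim(s, t) for s, t in zip(short, candidate))
        best = max(best, total)
    return best / len(long_)
```

With the toy scores above, the lists `["car", "price"]` and `["automobile", "cost"]` share no literal term, so a cosine measure over exact term counts would treat them as unrelated, while the assignment-based score pairs "car" with "automobile" and "price" with "cost" and comes out at roughly 0.85. For realistic page sizes the brute-force search would need to be replaced by a polynomial-time assignment solver such as the Hungarian algorithm.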
Keywords/Search Tags: Intelligent search, Web page classification, Page purification, Hierarchical classification