The Research And Design On URL Classification Algorithm Based On Behavior Recognition

Posted on: 2011-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: J R Liu
GTID: 2178360308462399
Subject: Computer Science and Technology
Abstract/Summary:
With the remarkably rapid development of the Internet, the web now contains a massive amount of dynamic, semi-structured, and unstructured information, more than 80 percent of which appears in text form. URL classification is a base technology for search engines, URL filtering, and web information management, so it is becoming more and more important. The classification algorithm is an important step in the URL classification process and has a direct impact on the classification results. URL classification techniques are based on text classification but differ from plain text classification, because web pages contain "noise" information and are semi-structured.

Text classification consists of text preprocessing, feature thesaurus creation, the text classifier, and testing of the classification results. This thesis focuses on the observation that search engine optimization has a leading effect on website design and, on that basis, carries out a behavior analysis of web pages. It then puts forward a new URL classification algorithm: the URL Classification Algorithm Based on Behavior Recognition. The main research in this thesis is as follows:

A study of the behavior found in web pages shows that web design is influenced by search engine optimization. To improve a site's visibility in search results, website designers use Meta tags to express the site's theme, so Meta tags contribute greatly to reflecting that theme. Within this semi-structured text, the vast majority of web pages contain a title, keywords, a description, subtitles, and so on. The new algorithm exploits this behavioral characteristic.

The behavior-recognition-based URL classification algorithm takes full account of the diversity of languages and text encodings used in web pages, and eliminates as far as possible the interference caused by differences between languages.

The algorithm has been tested and compared with similar products from other countries. Its accuracy rate and recall rate can both reach 90%, which reflects good classification performance.

Finally, the thesis presents the implementation of the algorithm together with the tools used to test it. The program currently handles a classification workload of 40 million URLs in eight languages, including Chinese, English, Russian, German, and French, and shows good performance.
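
The Meta-tag behavior described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which the abstract does not detail. It only assumes that the theme-bearing fields (title, keywords, description, and headings standing in for subtitles) are pulled from a page and matched against a feature thesaurus. The names `ThemeFieldParser`, `extract_theme_fields`, `classify`, `FEATURE_THESAURUS`, and the sample page are all hypothetical, and the word-overlap scoring is a stand-in for whatever classifier the thesis actually uses.

```python
import re
from html.parser import HTMLParser


class ThemeFieldParser(HTMLParser):
    """Collects title, meta keywords/description, and heading text from HTML."""

    HEADING_TAGS = {"h1", "h2", "h3"}  # assumption: headings act as "subtitles"

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "keywords": "", "description": "", "subtitle": []}
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields[name] = (attrs.get("content") or "").strip()
        elif tag == "title" or tag in self.HEADING_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if tag == "title":
            self.fields["title"] = text
        elif text:
            self.fields["subtitle"].append(text)
        self._current = None


def extract_theme_fields(html):
    """Return the theme-bearing fields of one page as a dict."""
    parser = ThemeFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical feature thesaurus: category -> characteristic words.
FEATURE_THESAURUS = {
    "sports": {"football", "league", "scores"},
    "finance": {"stock", "bank", "market"},
}


def classify(fields, thesaurus=FEATURE_THESAURUS):
    """Pick the category whose words overlap most with the extracted fields."""
    text = " ".join(
        [fields["title"], fields["keywords"], fields["description"], *fields["subtitle"]]
    ).lower()
    tokens = set(re.findall(r"\w+", text))
    scores = {category: len(tokens & words) for category, words in thesaurus.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None when nothing matches


SAMPLE_HTML = """<html><head>
<title>Daily Football News</title>
<meta name="keywords" content="football, league, scores">
<meta name="description" content="Latest football scores and league tables.">
</head><body><h1>Premier League Results</h1><p>Body text is ignored.</p></body></html>"""

fields = extract_theme_fields(SAMPLE_HTML)
category = classify(fields)
```

The sketch ignores body text on purpose, mirroring the idea that the fields designers fill in for search engines carry most of the theme. Splitting on `\w+` is only adequate for space-delimited languages; Chinese and similar languages would need a proper word segmenter, and real pages would also need encoding detection before parsing.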
Keywords/Search Tags: Behavior recognition, Text category, URL classification, Web search engine optimization