
Research Towards Web Classification Based On Wikipedia Category Network And URL Pattern Tree

Posted on: 2014-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: L B Lai
GTID: 2248330392961049
Subject: Computer technology
Abstract/Summary:
Classification is a major problem in the field of information retrieval. Web classification, which targets classification problems in web pages, therefore plays a significant role in web mining. Most web applications rely on accurate classification to improve their quality of service. Such applications include the maintenance of web directories, search engines, page crawlers, recommendation systems, user profile systems, and online advertising. Much research effort has been devoted to these areas to solve specific issues and to build highly efficient categorization mechanisms. Among these, the content-based classification method is the most straightforward and also the most thoroughly studied. The content-based method attempts to extract the main text from web pages and reduces web classification to the traditional problem of classifying pure text. It often comes down to the Bag of Words or the TF-IDF model. Because it depends on the main text, its classification accuracy is bounded by the quality of that text: lower-quality main text results in lower prediction accuracy.

With the founding of more and more large-scale online category networks, classification methods based on these third-party databases have recently attracted academic attention. These databases provide state-of-the-art semantic networks and can serve, on the one hand, as auxiliary information to improve the performance of a traditional classification system and, on the other hand, as the core unit of the classification system itself. Given a concrete semantic network, such methods can to some extent compensate for short or low-quality main text. Furthermore, they free categorization from large training sets, which leads to high efficiency.

In this paper, we deal with the whole network context, where the data is mixed with a large amount of noise and interference. The traditional content-based method first loses prediction accuracy when dealing with low-quality data, and second breaks down when facing a very large training set.

Therefore, we propose a classification model based on the Wikipedia network. The Wikipedia network benefits from its extensive category system and the corresponding semantic relationships. Moreover, owing to contributors from all across the globe, Wikipedia is still expanding its contents. As a result, the Wikipedia network can provide wide coverage of topics from across the web. One characteristic that stands out is that this method does not rely on a training set to obtain the prediction model, which is beneficial in two respects. First, as previously mentioned, it handles big data better. Second, the category network is relatively stable in the long run, which supports the long-term validity of the model. In the evaluation section, we provide several comparisons with traditional methods to demonstrate its feasibility.

Additionally, we propose a new method based on a URL pattern tree for host function classification. By analogy with the grammar tree kernel used in natural language processing, we construct the "URL grammar" and the "URL grammar tree". With a slight modification of the original tree kernel, we can use the new kernel to make predictions.
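
As a point of reference for the content-based baseline mentioned in the abstract, the following is a minimal sketch of TF-IDF weighting. It uses one common textbook variant (term frequency normalized by document length, unsmoothed logarithmic inverse document frequency). The function name `tf_idf` and the toy documents are illustrative assumptions and are not taken from the thesis, which may use a different variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = log(N / document frequency), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical "main text" of three pages, already tokenized.
pages = [
    "wikipedia category network".split(),
    "url pattern tree".split(),
    "wikipedia url classification".split(),
]
vectors = tf_idf(pages)
```

In this sketch a term that occurs in fewer pages receives a higher weight than one shared across pages, and a term that occurs in every page receives weight zero. This also illustrates why short or noisy main text hurts the approach: with few tokens per page, the weights rest on very little evidence.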
Keywords/Search Tags: Web Classification, Wikipedia Network, URL Pattern Tree, Big Data