
The Design Of A Search Engine For Textile Based On Web Mining

Posted on: 2009-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: J Cao
Full Text: PDF
GTID: 2178360242472842
Subject: Control theory and control engineering
Abstract/Summary:
Nowadays more and more information is published on the web, and it is difficult for web surfers to find what they need without the help of powerful search engines. General-purpose search engines such as Yahoo, Google, and Baidu often return far more than is wanted when a user who cares only about textile-related information submits a few keywords. Because these engines do not know which field the user cares about, they return every page that contains the keywords, whatever its topic. That is why we design a specialty-oriented search engine for users who want the retrieved pages to be textile related.

In this thesis, we present an architecture for a specialty-oriented search engine for the textile industry, give the design of its essential modules, and study the key technologies underlying them. Our work includes:

(1) Design of the framework of the topic web crawler. To raise crawler efficiency, the system uses a coordination tool to manage the crawlers and avoid the unequal resource distribution that arises from load imbalance. In addition, while scanning page source code and extracting URLs, the system uses a subject link prediction model and a subject link filter model to judge each URL's type, that is, whether the corresponding page is related to textiles. The principle of both models is to gather related links and discard unrelated ones, which reduces the load on the web crawlers.

(2) Modification of the text classification technology used in the topic web crawler. The classical vector space model does not take into account the fact that different distributions of characteristic items (terms) carry different values. The system improves the formula for calculating characteristic item weights so that it reflects the structure of a web page effectively. It is also hard for classic KNN to search for the global optimum when the training set is huge, so the system improves KNN text categorization to accelerate the search. The results show that the system can provide a quick search over a massive data set.

(3) Improvement of the performance of the retrieval system through a new page ranking algorithm. There has been much research on well-known ranking algorithms, including PageRank and HITS. After analyzing these two algorithms, we put forward a ranking algorithm based on Web mining, aiming to strike a balance between supplying important pages on textile-related topics and covering all the branches of the textile field.
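
The text classification step in item (2) can be illustrated with a minimal Python sketch. It is not the formula or the accelerated KNN variant developed in the thesis, which the abstract does not give. It assumes a simple structure-aware weighting in which a term counts more when it appears in the title or a heading than in the body, and it uses plain brute-force KNN with cosine similarity. The field names, the weights in FIELD_WEIGHTS, the function names, and the sample pages are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical structural weights: a term in the title counts more than in the body.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_tf(page):
    """Sum term occurrences, scaling each by the weight of the field it appears in."""
    tf = Counter()
    for field, tokens in page.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for token in tokens:
            tf[token] += weight
    return tf


def idf_table(pages):
    """Smoothed inverse document frequency over a list of pages."""
    n = len(pages)
    df = Counter()
    for page in pages:
        df.update({t for tokens in page.values() for t in tokens})
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def vectorize(page, idf):
    """Sparse vector of structure-weighted TF times IDF; unknown terms are dropped."""
    return {t: f * idf[t] for t, f in weighted_tf(page).items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, training, k=3):
    """Brute-force KNN: similarity-weighted vote among the k most similar pages."""
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0] if votes else None


# Made-up training pages, already tokenized by field.
pages = [
    ({"title": ["cotton", "fabric"], "body": ["weaving", "yarn", "cotton"]}, "textile"),
    ({"title": ["silk", "dyeing"], "body": ["fabric", "yarn", "dye"]}, "textile"),
    ({"title": ["football", "league"], "body": ["match", "goal", "team"]}, "other"),
    ({"title": ["stock", "market"], "body": ["shares", "price", "trading"]}, "other"),
]
idf = idf_table([page for page, _ in pages])
training = [(vectorize(page, idf), label) for page, label in pages]

query_page = {"title": ["cotton", "yarn"], "body": ["fabric", "price"]}
label = knn_classify(vectorize(query_page, idf), training, k=3)
print(label)
```

The brute-force search compares the query against every training vector, which is the cost that grows with a huge training set and that the accelerated KNN described in item (2) is meant to reduce.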
Keywords/Search Tags: web topic crawling, web crawler, vector space model, text classification, K-nearest neighbor algorithm, ranking algorithm