The Design Of A Search Engine For Textile Based On Web Mining

Posted on:2009-08-20

Degree:Master

Type:Thesis

Country:China

Candidate:J Cao

GTID:2178360242472842

Subject:Control theory and control engineering

Abstract/Summary:

Ever more information is being published on the web, and it is difficult for web surfers to find what they need without the help of powerful search engines. Well-known general-purpose search engines such as Yahoo, Google, and Baidu often return far more than is wanted when a user who cares only about textile-related information submits a few keywords. Because these engines do not know which field the user is interested in, they return every page that contains the keywords, whatever its topic. This is why we set out to design a specialty-oriented search engine for users who want the retrieved pages to be textile related.

In this thesis, we present an architecture for a specialty-oriented search engine for the textile industry, give the design of its essential modules, and study the key technologies underlying them. Our work includes:

(1) Design of the framework of the topic web crawler. To raise crawler efficiency, the system uses a coordination tool to manage the web crawlers and avoid the unequal distribution of resources caused by load imbalance. In addition, while scanning page source code and extracting URLs, the system uses a subject link prediction model and a subject link filter model to judge each URL's type, that is, whether the corresponding page is related to textiles. The principle of both models is to collect related links and discard unrelated ones, which reduces the load on the web crawlers.

(2) Modification of the text classification technique used in the topic web crawler. The classical vector space model does not take into account that characteristic items (feature terms) distributed differently carry different value. The system improves the calculation formula for characteristic items so that it effectively reflects the structure of a web page. It is also hard for classic KNN to carry out a globally optimal search when the training set is huge. To accelerate the KNN search, the system improves KNN text categorization. The results show that the system can search a massive data set quickly.

(3) Improvement of the retrieval system's performance through a new page ranking algorithm. Much research exists on well-known ranking algorithms, including PageRank and HITS. After analyzing these two algorithms, we put forward a ranking algorithm based on Web mining that aims to strike a balance between supplying important pages on textile-related topics and covering all branches of the textile field.
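
The classification step described in item (2) can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not give the improved formula or the KNN speed-up, so the sketch assumes a simple location weighting (title terms count more than body terms) and a plain brute-force KNN with cosine similarity. The names FIELD_WEIGHTS, weighted_vector, cosine, knn_classify, and the toy TRAINING set are all hypothetical.

```python
import math
from collections import Counter

# Assumed location weights (not from the thesis): a term in the title
# contributes more to the page vector than the same term in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_vector(page):
    """Build a term -> weight mapping from a page given as {field: text}."""
    vec = Counter()
    for field, text in page.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term in text.lower().split():
            vec[term] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(page, training, k=3):
    """Label a page by a similarity-weighted vote of its k nearest neighbors."""
    query = weighted_vector(page)
    scored = sorted(
        ((cosine(query, weighted_vector(doc)), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for similarity, label in scored:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Toy training set, invented purely for illustration.
TRAINING = [
    ({"title": "cotton yarn spinning", "body": "yarn fabric weaving loom"}, "textile"),
    ({"title": "silk fabric dyeing", "body": "dye fabric fiber color"}, "textile"),
    ({"title": "football match report", "body": "goal team league score"}, "other"),
    ({"title": "stock market news", "body": "shares price index trading"}, "other"),
]

if __name__ == "__main__":
    candidate = {"title": "fabric weaving", "body": "cotton fiber loom"}
    print(knn_classify(candidate, TRAINING, k=3))
```

This brute-force version compares the query with every training page, which is exactly the cost that grows with a huge training set; any acceleration would replace the full scan inside knn_classify.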

Keywords/Search Tags:

web topic crawling, web crawler, vector space model, text classification, K-nearest neighbor algorithm, ranking algorithm

Related items

1	Research On Web News Topic Detection And Tracking
2	Text Classification Algorithm Based On Chinese And English Topic Space
3	Based On Text Classification, Topic Tracking And Application Of A Grammar Model
4	Research On The Topic Crawler Algorithm Based On Vector Space Model
5	Design And Implementation Of The Technical Text Categorization System
6	Research On Topic Tracking System Based On Keywords
7	Research On Improved K Neighbor Support Vector Machine Algorithm Faced Text Classification
8	Web Information Crawling Applied In Fabric Textile Public Service Platform
9	Study Of Text Classification Algorithm Base On Clustering Algorithm And Support Vector Machine Algorithm
10	Improved Word Embedding And K-nearest Neighbor Algorithm For Chinese Text Classification