Web Information Crawling Applied in Fabric Textile Public Service Platform

Posted on: 2008-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: C R Bo
GTID: 2178360215962594
Subject: Detection Technology and Automation
Keywords: web information crawling, web service, crawler, vector space model, information extraction, text classification, KNN

Abstract:

With the expansion of China's fabric exports, China has become a major textile fabric market. It is increasingly important for textile enterprises to obtain information through the internet and to exploit the advantages of online information sources. In fact, 87.5% of all fabric textile enterprises are small and medium-sized, and because of low staff skill levels and limited funds, it is difficult for them to obtain high-quality information through the internet. To solve this problem, a fabric textile public service platform was established. The platform integrates information from across the textile industry chain and provides textile companies, especially small and medium-sized ones, with the latest information on raw materials, manufacturing, and production.

This thesis presents an approach for automatically extracting and classifying textile-related information from the internet, and the approach has been applied in the fabric textile public service platform. The workflow is as follows: first, a web crawler crawls web pages and downloads them to the local server; second, the system scans the source code of each page and analyzes its structural characteristics; third, it extracts the URLs and content related to textile subjects and places them in temporary storage; finally, it retrieves the information from temporary storage and categorizes it according to predefined classifications.

To raise the efficiency of the web crawlers, the system uses a coordination tool to manage them, avoiding the unequal resource distribution that arises from load imbalance. While scanning page source code and collecting URLs, the system applies a subject link prediction model and a subject link filter model to judge each URL's type; the principle of both models is to gather subject-related links and discard unrelated ones, which reduces the crawlers' workload.

The classical vector space model does not account for the fact that different distributions of feature terms carry different weights, so the system improves the feature-term weighting formula to reflect the structure of a web page effectively. Extraction rules based on document structure alone are complicated, while methods based on feature pattern matching alone are easily disturbed by data with similar structure; to increase the accuracy and portability of information extraction, the system combines document structure with feature pattern matching and achieves good performance. Finally, classic KNN struggles to perform a globally optimal search when the training set is huge, so the system improves KNN-based text categorization to accelerate the search. The results show that the system can search quickly over a massive data set. Illustrative sketches of these components follow below.
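The four-step workflow might look like the following minimal sketch, assuming Python and the standard library. All names here (LinkExtractor, crawl, the is_on_topic hook, which corresponds to the subject link filter sketched further below) are illustrative, not the thesis's actual interfaces.

```python
# A minimal sketch of the four-step workflow (crawl, scan, extract, buffer).
import urllib.request
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Step 2: scan the page source and collect absolute anchor URLs."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

def crawl(seeds, is_on_topic, max_pages=50):
    """Steps 1 and 3: download pages, keep subject-related links and content."""
    work, seen, buffered = deque(seeds), set(seeds), []
    while work and len(buffered) < max_pages:
        url = work.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                      # skip unreachable pages
        parser = LinkExtractor()
        parser.feed(html)
        buffered.append((url, html))      # temporary storage before step 4
        for link in parser.links:
            if link not in seen and is_on_topic(link):  # subject link filter
                seen.add(link)
                work.append(link)
    return buffered                       # step 4 classifies these documents
```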
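The abstract does not describe the coordination tool's mechanism. One common way to avoid load imbalance is a shared work queue from which idle crawlers pull the next URL, as in this sketch; the placeholder fetch stands in for a real download.

```python
# A sketch of crawler coordination: a shared queue balances load because an
# idle worker immediately pulls the next URL. The thesis's actual
# coordination tool may work differently; this scheme is an assumption.
import queue
import threading

def worker(work: "queue.Queue[str]", results: list, lock: threading.Lock):
    while True:
        try:
            url = work.get_nowait()
        except queue.Empty:
            return                               # no work left for this crawler
        page = f"<html>fetched {url}</html>"     # placeholder for a real fetch
        with lock:
            results.append(page)
        work.task_done()

def run_crawlers(urls, n_workers=4):
    work, results, lock = queue.Queue(), [], threading.Lock()
    for url in urls:
        work.put(url)
    crawlers = [threading.Thread(target=worker, args=(work, results, lock))
                for _ in range(n_workers)]
    for t in crawlers:
        t.start()
    for t in crawlers:
        t.join()
    return results
```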
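The subject link prediction and filter models are described only by their principle: gather related links, abandon unrelated ones. A simple keyword-scoring realization of that principle might look as follows; the keyword set and threshold are invented for illustration.

```python
# Keyword scoring as one possible realization of the subject link models.
# TOPIC_KEYWORDS and the threshold are illustrative assumptions.
TOPIC_KEYWORDS = {"textile", "fabric", "cotton", "yarn", "weaving"}

def link_score(url: str, anchor_text: str = "") -> float:
    """Predict how subject-related a link is from its URL and anchor text."""
    text = (url + " " + anchor_text).lower()
    hits = sum(1 for kw in TOPIC_KEYWORDS if kw in text)
    return hits / len(TOPIC_KEYWORDS)

def is_on_topic(url: str, anchor_text: str = "", threshold: float = 0.2) -> bool:
    """Gather related links, abandon non-related ones."""
    return link_score(url, anchor_text) >= threshold
```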
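The thesis improves the feature-term weighting formula to reflect page structure but does not state the formula in the abstract. A common structure-aware variant multiplies a term's frequency by a positional weight before applying IDF; the weights below are illustrative, not the thesis's coefficients.

```python
# Structure-aware TF-IDF, a plausible form of the improved weighting:
#   w(t, d) = sum over fields f of s_f * tf_f(t, d) * idf(t)
import math

# Positional weights s_f: terms in the title or headings count more than
# body text. These values are illustrative assumptions.
STRUCTURE_WEIGHT = {"title": 3.0, "heading": 2.0, "body": 1.0}

def term_weight(term, doc_fields, doc_freq, n_docs):
    """doc_fields maps a field name ('title', 'heading', 'body') to its tokens."""
    tf = sum(STRUCTURE_WEIGHT.get(field, 1.0) * tokens.count(term)
             for field, tokens in doc_fields.items())
    idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1
    return tf * idf
```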
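For the combination of document structure with feature pattern matching, one plausible reading is that the structural step narrows extraction to specific page regions and a pattern then validates each candidate, rejecting similarly structured but irrelevant data. A sketch, with an invented price pattern:

```python
# Structure + pattern extraction: collect text only from table cells
# (structural step), then keep cells matching a field pattern (pattern step).
# The <td> focus and the price pattern are illustrative assumptions.
import re
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell, self.cells = False, []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

PRICE_PATTERN = re.compile(r"\d+(?:\.\d+)?\s*(?:yuan|RMB)/(?:ton|meter)")

def extract_prices(html: str):
    collector = CellCollector()
    collector.feed(html)
    return [cell for cell in collector.cells if PRICE_PATTERN.search(cell)]
```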
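The abstract does not say how the KNN search is accelerated. One standard scheme that fits the description (fast search over a huge training set) is cluster pruning: compare the query to cluster centroids first, then run KNN only inside the nearest clusters. This sketch assumes that scheme and sparse dictionary vectors; it is not necessarily the thesis's method.

```python
# Cluster-pruned KNN: an assumed acceleration scheme. Documents and
# centroids are sparse term -> weight dictionaries.
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, clusters, k=5, clusters_to_search=2):
    """Search only the clusters nearest to the query instead of every document."""
    ranked = sorted(clusters, key=lambda c: cosine(query, c["centroid"]),
                    reverse=True)
    candidates = [doc for c in ranked[:clusters_to_search] for doc in c["docs"]]
    neighbors = sorted(candidates, key=lambda d: cosine(query, d["vector"]),
                       reverse=True)[:k]
    votes = defaultdict(float)
    for doc in neighbors:
        votes[doc["label"]] += cosine(query, doc["vector"])  # weighted vote
    return max(votes, key=votes.get) if votes else None
```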