Research On The Independent Learning Technology Of Web Crawler Based On Web3.0

Posted on: 2016-09-03  Degree: Master  Type: Thesis
Country: China  Candidate: S F Piao  Full Text: PDF
GTID: 2298330467997457  Subject: Computer application technology
Abstract/Summary:
With the development of the modern Internet from the Web2.0 to the Web3.0 era, the search engine has become an essential means of collecting information from the Internet. It is therefore urgent to find ways to use search engines more efficiently in order to obtain specific or more valuable information. Against this background, the system described here is aimed at acquiring information about customer resources and their specific characteristics.

The system is a sub-module of a sales team's intelligent management system and exclusively supplies the customer-search module with customer resources. The customer-search module mainly provides users with the company's customer information collected from the Internet, which is one of the largest differences from traditional sales software. Traditional software usually contains only limited useful information, because its resources are obtained either through long-term accumulation or from online yellow pages such as Alibaba.com and HC360.com.

As the search engine has become the most important source of information in people's daily lives, it is necessary to make it more efficient. Many search engines exist today, and each serves as an important online information channel. This dissertation therefore introduces the concept of a meta-search engine, which integrates different search engines in order to cover as many information channels as possible. When a keyword is entered into the meta-search engine, the user obtains the results of all available search engines simultaneously. The system is thus convenient to use while providing as much valuable information as possible. An information-filtering feature has also been embedded to avoid repeated results. Furthermore, this dissertation optimizes the input keywords.
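
The merging and repetition filtering of a meta-search engine could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how repeated results are detected, so the URL normalization rule, the round-robin merge order, and the names normalize_url, merge_results, engine_a and engine_b are all hypothetical, and each engine's results are assumed to be already fetched as a ranked list of URLs.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Build a comparison key: lowercase host without 'www.', no scheme,
    no fragment, no trailing slash. (Assumed rule, not taken from the thesis.)"""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_results(results_by_engine):
    """Interleave ranked result lists from several engines (round robin)
    and keep only the first occurrence of each normalized URL."""
    merged, seen = [], set()
    longest = max((len(r) for r in results_by_engine.values()), default=0)
    for rank in range(longest):
        for results in results_by_engine.values():
            if rank < len(results):
                key = normalize_url(results[rank])
                if key not in seen:
                    seen.add(key)
                    merged.append(results[rank])
    return merged


# Hypothetical ranked results for one keyword from two engines.
results = {
    "engine_a": ["http://www.example.com/", "http://example.org/a"],
    "engine_b": ["https://example.com", "http://example.net/b"],
}
for url in merge_results(results):
    print(url)
```

Interleaving by rank keeps the top hits of every engine near the front of the merged list; a real system might instead weight engines or re-rank the merged results.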
For the keyword optimization, models of the keywords in the search domain were constructed using the ontology of cnki.net. As a representation of domain knowledge, ontology is also an important part of the semantic web, which is seen as the next-generation network, Web3.0. Acquiring the necessary information with Web3.0 has also been explored. In addition, the tool for crawling information with search engines, known as a web crawler or web spider, has been studied. Crawled data can be divided into three main categories: company website information, platform website information, and others. All company data and part of the platform data are exactly what is needed, while the remaining irrelevant data can simply be discarded. To perform this classification, two of the most popular algorithms are adopted: the Naive Bayes algorithm and the k-Nearest Neighbours (KNN) algorithm.

For both algorithms, text pre-processing is the first stage, in which semi-structured data is transformed into structured data. Here, the Chinese word segmentation of the IK Analyzer jar is mainly used. Second, three statistics are derived: the frequency of each Chinese word in a given category, the number of Chinese words in that category, and the total number of Chinese words in the training samples. For the KNN algorithm, the TF*IDF value (Term Frequency multiplied by Inverse Document Frequency) of each document is also needed. Then the pre-processed data, namely the characteristic words, is divided into a training set and a test set. The training set is used for model learning, while the test set is used to assess the performance of a given classifier at the end.

With the classification algorithms implemented on the basis of the theory above, this dissertation has achieved outstanding results in some particular information-seeking areas. On average, the classification accuracy can reach more than 80%.
Hence, it is believed that this study can not only meet most of people's daily demands, but also be applied in other research on crawling valuable Internet information.
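
The two classifiers named in the abstract could be sketched as below. This is a minimal illustration, not the dissertation's implementation. Assumptions: documents are already segmented into words (the thesis uses IK Analyzer for this step; English placeholder tokens stand in here), the Naive Bayes prior is taken as each category's share of all training words, Laplace smoothing is added, KNN uses a smoothed TF*IDF weight with cosine similarity, and all function names and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Collect the three statistics: word frequency per category,
    number of words per category, total number of training words."""
    word_counts = defaultdict(Counter)
    category_totals = Counter()
    for tokens, category in samples:
        word_counts[category].update(tokens)
        category_totals[category] += len(tokens)
    total_words = sum(category_totals.values())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, category_totals, total_words, vocabulary


def classify_naive_bayes(tokens, model):
    """Return the category with the highest log posterior score."""
    word_counts, category_totals, total_words, vocabulary = model
    best_category, best_score = None, -math.inf
    for category, n_words in category_totals.items():
        score = math.log(n_words / total_words)  # assumed prior
        for token in tokens:
            # Laplace smoothing so unseen words do not zero the product.
            score += math.log((word_counts[category][token] + 1)
                              / (n_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_knn(tokens, samples, k=3):
    """Majority vote among the k training documents most similar
    to the query under TF*IDF weighting."""
    n_docs = len(samples)
    doc_freq = Counter(w for doc, _ in samples for w in set(doc))

    def tfidf(doc):
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant) avoids division by zero.
        return {w: (c / len(doc))
                * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
                for w, c in tf.items()}

    query_vec = tfidf(tokens)
    ranked = sorted(samples,
                    key=lambda s: cosine(query_vec, tfidf(s[0])),
                    reverse=True)
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already segmented training samples for the three categories.
samples = [
    (["company", "product", "contact"], "company"),
    (["company", "profile", "product"], "company"),
    (["platform", "supplier", "listing"], "platform"),
    (["platform", "buyer", "listing"], "platform"),
    (["weather", "news"], "other"),
]
query = ["company", "product", "price"]
print(classify_naive_bayes(query, train_naive_bayes(samples)))
print(classify_knn(query, samples, k=3))
```

In practice the labelled data would be split into a training set and a test set as the abstract describes, with accuracy measured on the held-out test set.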
Keywords/Search Tags: Web3.0, Web Crawler, Ontology, KNN, Naive Bayes