Research And Implementation Of Focused Crawler Based On Word2Vec

Posted on: 2019-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Song
Full Text: PDF
GTID: 2428330548961162
Subject: Engineering
Abstract/Summary:
In recent years the Internet has developed rapidly and the number of web pages has grown explosively, and this mass of pages has caused information overload. Classified directory websites and general search engines help people find information among massive numbers of web pages and satisfy broad search requirements, but they suffer from problems such as poor retrieval quality and excessive cost. To address this, researchers have proposed the focused crawler, which uses algorithms to guide the crawling process so that it avoids downloading irrelevant pages and obtains information more efficiently and accurately.

This thesis first describes the research background and significance of the focused crawler and reviews the current state of research. It then explains, through examples, the architecture and crawling strategies of general crawlers and focused crawlers, and points out that the bottlenecks of the focused crawler lie mainly in topic representation, relevance comparison and crawling strategy.

Next, the thesis introduces common topic representation methods, including ontology representation and keyword representation, along with their shortcomings, and puts forward a method that uses Word2Vec to expand keywords so that users can describe a topic more quickly and accurately. It then introduces several keyword extraction methods and relevance comparison methods in detail. For relevance comparison, simple keyword matching in the Vector Space Model may cause some topic-related pages to be judged irrelevant. To solve this problem, the thesis proposes a text relevance comparison model based on Word2Vec (TRCW) and reports extensive comparative experiments on five topics (NBA, military, entertainment, technology and finance), with good results.

The thesis then analyzes the Shark-Search algorithm in detail and points out that the boundaries of the anchor text context in Shark-Search are difficult to determine: anchor text at the edge of one forum is vulnerable to negative effects from link anchor texts in other forums. To solve this problem, the thesis proposes making full use of the semi-structured features of web pages and using the anchors within a group as the context. In addition, regarding the tunneling problem in focused crawling, the thesis points out that, owing to the structural design of websites and other factors, pages in the tunnel are always directory pages, and it improves the Shark-Search algorithm based on this feature. Comparative experiments on the entertainment and technology topics show that the improved method greatly improves accuracy compared with the original Shark-Search method.

Building on the theory of the preceding chapters, the thesis uses Python and PHP to build a focused crawler system with a B/S (browser/server) architecture. Finally, it summarizes the work and discusses prospects for the focused crawler field. The experimental results show that the proposed methods are effective and helpful to focused crawler research, providing new ideas and practical experience for future work in this field.
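The abstract does not give the details of the TRCW model, so the following is only a minimal sketch of the general idea it motivates: exact keyword matching can score a topic-related page as irrelevant, whereas comparing word embeddings can still recognize it. The toy vectors, the function names and the "best match per keyword" scoring rule are all assumptions made for illustration, not the thesis's actual method; a real system would load vectors from a trained Word2Vec model.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
# In practice these would come from a trained Word2Vec model.
TOY_VECTORS = {
    "nba":        [0.80, 0.20, 0.10],
    "basketball": [0.90, 0.10, 0.00],
    "lakers":     [0.85, 0.15, 0.05],
    "stock":      [0.00, 0.10, 0.90],
    "market":     [0.10, 0.00, 0.95],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def exact_match_score(topic_keywords, page_words):
    """Fraction of topic keywords that literally occur in the page."""
    if not topic_keywords:
        return 0.0
    page = set(page_words)
    hits = sum(1 for keyword in topic_keywords if keyword in page)
    return hits / len(topic_keywords)

def soft_match_score(topic_keywords, page_words, vectors):
    """Hypothetical embedding-based score: for each topic keyword, take its
    best cosine similarity to any page word, then average over keywords."""
    best_scores = []
    for keyword in topic_keywords:
        if keyword not in vectors:
            continue  # skip keywords without an embedding
        best = max(
            (cosine(vectors[keyword], vectors[w]) for w in page_words if w in vectors),
            default=0.0,
        )
        best_scores.append(best)
    return sum(best_scores) / len(best_scores) if best_scores else 0.0

topic = ["nba"]
sports_page = ["lakers", "basketball"]   # on topic, but never says "nba"
finance_page = ["stock", "market"]       # off topic

print(exact_match_score(topic, sports_page))              # literal matching misses the page
print(soft_match_score(topic, sports_page, TOY_VECTORS))  # embedding score stays high
print(soft_match_score(topic, finance_page, TOY_VECTORS)) # embedding score stays low
```

With these toy vectors, the sports page gets an exact-match score of zero even though it is on topic, while the embedding-based score separates it clearly from the finance page. The same nearest-neighbor lookup over embeddings is also one plausible way to expand a user's seed keywords, although the abstract does not say how the thesis does this.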
Keywords/Search Tags:focused crawler, Word2Vec, topic representation, relevance comparison, crawling strategy