Relevant Words Based On Word To Vectors And The Application In Topic Crawler System

Posted on: 2018-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z Meng
Full Text: PDF
GTID: 2348330515468003
Subject: Software engineering
Abstract/Summary:
Web crawlers, which automatically retrieve content from the web, are increasingly popular. They are not only an important component of search engines but also one of the main ways to collect corpora for training supervised machine learning models. In certain research areas, however, a general-purpose crawler can no longer meet the need for a domain-specific corpus, so vertical-domain crawlers focused on specific topics are increasingly needed. A topic crawler decides whether to follow a web link by judging the semantic relatedness between the topic and the linked page. This paper uses word vectors for semantic representation, combined with pointwise mutual information (PMI), to evaluate each new web link and decide whether to continue crawling the page or abandon it. The details are as follows.

First, natural language processing, deep learning, and language models are introduced, and two approaches to word vector representation, matrix-based methods and word to vectors, are described in detail. The models are then trained on a Chinese Wikipedia corpus, and experiments are run to select a set of parameters for use in the following chapters.

Second, to address polysemy, pointwise mutual information is introduced so that the meaning of a word is judged from its context. Building on the conclusions of the previous part, experiments are run with PMI added. This part also describes how to handle a corpus that does not fit in the 8 GB of memory of the computer used.

Third, the two parts above are integrated into a vertical-domain crawler system. The crawler uses breadth-first search. When it encounters a new link, it uses the model from the previous parts to determine the degree of relevance between the link word and the topic word. Experiments are run with three topic words: "programmer", "furniture", and "skin care". The experiments crawl Baidu Baike; for each topic, several relevant pages are kept and saved in a database, and all links, including abandoned ones, are recorded in a log. Accuracy, recall, and other metrics are then computed from the log and compared with those of an ordinary crawler that does not use the related-word technique, which makes the evaluation in this paper more objective.

This paper proposes a semantic representation model based on word to vectors, combined with pointwise mutual information, to determine whether words are relevant, in order to find keywords and related web links, and it obtains objective experimental results.
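As a rough illustration of the pointwise mutual information step mentioned in the abstract, the sketch below computes PMI(x, y) = log2(p(x, y) / (p(x) p(y))) from co-occurrence counts. It assumes that co-occurrence is counted at the document level and that the corpus is already tokenized; the abstract does not state the window or estimation method actually used, and the names `pmi_table` and `pmi` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Build a PMI table from tokenized documents.

    Assumption: two words co-occur if they appear in the same document.
    Probabilities are estimated as document frequencies.
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in documents:
        unique = sorted(set(tokens))  # count each word once per document
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))  # pairs in sorted order

    table = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / n_docs
        p_x = word_counts[x] / n_docs
        p_y = word_counts[y] / n_docs
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


def pmi(table, x, y):
    """Look up PMI for an unordered pair; never co-occurring pairs give -inf."""
    key = (x, y) if x <= y else (y, x)
    return table.get(key, float("-inf"))


if __name__ == "__main__":
    docs = [
        ["apple", "fruit"],
        ["apple", "phone"],
        ["fruit", "juice"],
        ["phone", "screen"],
    ]
    table = pmi_table(docs)
    print(pmi(table, "fruit", "juice"))
    print(pmi(table, "apple", "fruit"))
```

In this toy corpus "apple" appears in both a fruit context and a phone context, which is the kind of ambiguity that a context-based score can help separate. A real corpus would normally need smoothing or a minimum-count cutoff, which this sketch omits.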
Keywords/Search Tags:Vertical Topic Crawler, Semantic Representation, Word to Vectors, PMI
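The crawl-or-abandon decision and the log-based evaluation described in the abstract can be sketched as follows. This is a minimal sketch under several assumptions not taken from the source: pages are identified by their link word, the link graph and word vectors are small in-memory dictionaries rather than live pages and a trained model, relevance is plain cosine similarity against a fixed threshold, and the functions `cosine`, `crawl`, and `precision_recall` are hypothetical names.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def crawl(start, links, vectors, topic, threshold=0.5):
    """Breadth-first crawl that follows a link only if its word is close to the topic.

    Returns the pages kept and a log of every link decision,
    including the abandoned links.
    """
    kept, log = [], []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        kept.append(page)
        for word in links.get(page, []):
            if word in seen:
                continue
            seen.add(word)
            # Words without a vector are treated as unrelated (assumption).
            score = cosine(vectors[word], vectors[topic]) if word in vectors else 0.0
            accepted = score >= threshold
            log.append((word, score, accepted))
            if accepted:
                queue.append(word)
    return kept, log


def precision_recall(log, relevant):
    """Precision and recall of the accept decisions, measured over logged links."""
    accepted = {word for word, _, ok in log if ok}
    logged = {word for word, _, _ in log}
    true_positives = len(accepted & relevant)
    reachable_relevant = relevant & logged
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / len(reachable_relevant) if reachable_relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    vectors = {
        "programmer": [1.0, 0.0],
        "code": [0.9, 0.1],
        "bug": [0.8, 0.6],
        "sofa": [0.0, 1.0],
    }
    links = {
        "programmer": ["code", "sofa"],
        "code": ["bug", "programmer"],
        "sofa": ["chair"],
    }
    kept, log = crawl("programmer", links, vectors, "programmer", threshold=0.7)
    print(kept)
    print(precision_recall(log, relevant={"code", "bug"}))
```

Because abandoned links are still written to the log, both metrics can be computed afterwards without re-crawling. Links that sit behind an abandoned page are never seen, so recall here covers only the links the crawler actually encountered.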