
Research On Source Discovery Technology Based On Website Feature Analysis

Posted on: 2020-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: L L Zhang
Full Text: PDF
GTID: 2428330605978913
Subject: Engineering
Abstract/Summary:
With the explosive growth of information on the Internet, dynamically obtaining information of personal interest from this mass of content has become an active research topic. At present, people mainly rely on search engines to find related information. A search engine usually returns a list of related webpages based on keywords, but the sources of those webpages are mixed and hard to trace. This thesis therefore focuses on using retrieved webpages to find professional websites or columns closely related to a topic (referred to here as "sources"). Compared with individual webpages, websites and columns tend to be thematic, information-rich, and dynamically updated, which better matches the research needs of scientific and technical personnel. This thesis proposes a source discovery algorithm based on website feature analysis. The pipeline consists of webpage retrieval, content cleaning, correlation analysis, webpage source analysis, feature extraction for the source site or column, and evaluation and recommendation, which together enable automatic discovery and ranked retrieval of the websites or columns a user requests. Websites and columns are treated as the primary unit of information source throughout. The work concentrates on website feature selection and the relevance calculation algorithm. It proposes a website feature extraction algorithm that combines the structural features and content features of a website, and it combines the BM25 (Okapi Best Match 25) algorithm with cosine distance to calculate the degree of relevance. It also takes the importance of a website and its update frequency into account when scoring the site. Finally, information on newly found high-scoring websites or columns is fed back to the user every day, achieving automatic source discovery. Experiments show that the method makes full use of the structural features and semantic content features of different websites to search for and discover website sources effectively. To keep improving the accuracy of source discovery, the system can dynamically optimize the ranking results using the user's implicit feedback while browsing related websites. To improve the efficiency of source discovery, the implementation uses a distributed file system and a distributed computing architecture.
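The abstract names BM25 and cosine distance as the two relevance signals but does not say how they are combined. The following is a minimal Python sketch of one plausible combination, not the thesis's actual method. The whitespace tokenizer, the parameters `k1`, `b`, and `alpha`, the max-normalization of BM25 scores, and the weighted sum are all assumptions made for illustration. Cosine similarity (one minus cosine distance) is used so that higher values mean more relevant.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that is never negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_similarity(a, b):
    """Cosine similarity between the term-count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def relevance(query_text, site_texts, alpha=0.5):
    """Assumed combination: alpha * normalized BM25 + (1 - alpha) * cosine."""
    query = tokenize(query_text)
    docs = [tokenize(t) for t in site_texts]
    bm25 = bm25_scores(query, docs)
    top = max(bm25, default=0.0) or 1.0  # avoid dividing by zero
    return [
        alpha * s / top + (1 - alpha) * cosine_similarity(query, d)
        for s, d in zip(bm25, docs)
    ]


if __name__ == "__main__":
    # Toy site descriptions, invented for this example only.
    sites = [
        "deep learning research news and deep learning tutorials",
        "cooking recipes and kitchen tips",
        "machine learning research blog",
    ]
    for text, score in zip(sites, relevance("deep learning research", sites)):
        print(f"{score:.3f}  {text}")
```

In a fuller system, the site importance and update frequency mentioned in the abstract would presumably enter as further terms in the final score; they are left out here because the abstract gives no formula for them.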
Keywords/Search Tags:Source Discovery, Website Features, Customer Feedback, Dask Distributed Computing Library