
Research On Topic-related Data Source Identification Technology Based On Web

Posted on: 2020-06-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Wu
Full Text: PDF
GTID: 2428330599451298
Subject: Computer Science and Technology
Abstract/Summary:
The Internet has grown rapidly since its birth, and the volume of data on it has grown explosively. Most Internet data is unstructured text, and data on different topics is scattered across many different nodes, which makes it difficult to use effectively. User demand for data sources is usually topic-oriented, and search engines let users submit queries to obtain data sources for a topic. However, queries are usually submitted as keywords, and a single keyword cannot accurately represent a topic. Searching with a single topic keyword also returns a large number of data sources unrelated to the query topic, so users must spend a lot of time filtering the results. How to quickly identify the relevant data sources among the large number returned by search engines has become a hot research problem.

This paper analyzes existing data source identification methods and finds that they consider only the relevance between the data source content and the query, whereas the relevance between a data source and a query depends on many factors. This paper proposes a method that combines data source document quantity, data source authority, and data source topic to calculate the relevance between a data source and a query, and uses that relevance to identify the data sources related to a topic. The main contributions are as follows:

(1) An integration framework for obtaining topic-related data sources from the web is proposed. A single general search engine gives low coverage and returns a large amount of data for a specific topic query, so this paper integrates multiple search engines for topic data source queries to improve the recall of the search results. Topic-related data sources are obtained by submitting several topic query words and their weights to the integration interface, collecting the data sources returned by the different search engines, merging them, calculating the relevance between each data source and the query, and ranking the topic-related data sources. Querying for topic-related data sources with this framework improves both recall and precision.

(2) A method for constructing a topic-related word set from the web is proposed. To solve the problem of expanding the topic query words in data source identification, this paper analyzes existing keyword extraction work and finds that existing methods mainly extract keywords from one specific document, so they cannot be applied directly to extract a topic-related word set. This paper proposes a method for constructing topic-related word sets based on domain expert knowledge and large-scale web data. The word sets obtained by this method can be applied not only to topic query words but also to query texts.

(3) A method for identifying topic-related data sources on the web is proposed. Existing data source identification methods consider only a single factor. This paper first submits the topic query words to different search engines to obtain data sources, and derives the external impact factor of each data source from its ranking in the different search engines and the weights of the topic query words. It then obtains the topic probability distributions of the data source and of the query and calculates the similarity between them. The external impact factor and the topic distribution similarity are combined to calculate the relevance between the data source and the user query, and the data sources are sorted by relevance. Finally, the data sources with high relevance are returned as the result. The feasibility of this method is verified by experiments.
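
The abstract does not give the formulas behind contribution (3), so the following is only a minimal Python sketch of the general idea, under stated assumptions. The external impact factor is assumed here to be a weight-averaged reciprocal rank across engines, the topic similarity is assumed to be one minus the Jensen-Shannon divergence, and the two are assumed to be mixed linearly with a parameter alpha. The function names, the data layout, and alpha are all hypothetical and are not taken from the thesis. Document quantity and authority, which the thesis also mentions, are not modeled.

```python
import math


def external_impact(ranks, query_weights, engines):
    """Hypothetical external impact factor in [0, 1].

    ranks: {query_word: {engine: 1-based rank of this data source}}
           (an engine is absent if it did not return the source).
    query_weights: {query_word: non-negative weight}.
    engines: list of all engines that were queried.
    Assumption: reciprocal rank, averaged over engines, weighted by word.
    """
    total_weight = sum(query_weights.values())
    if total_weight <= 0 or not engines:
        return 0.0
    score = 0.0
    for word, weight in query_weights.items():
        per_engine = ranks.get(word, {})
        reciprocal = sum(1.0 / per_engine[e] for e in engines if e in per_engine)
        score += weight * reciprocal / len(engines)
    return score / total_weight


def js_similarity(p, q):
    """1 - Jensen-Shannon divergence (log base 2), so the result is in [0, 1].

    p and q are topic probability distributions of equal length.
    Assumption: the thesis may use a different similarity measure.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 1.0 - (0.5 * kl(p, m) + 0.5 * kl(q, m))


def rank_sources(sources, query_topics, query_weights, engines, alpha=0.5):
    """Sort data sources by an assumed linear mix of both signals.

    sources: list of dicts with keys "url", "ranks", "topics".
    Returns a list of (url, relevance) pairs, highest relevance first.
    """
    scored = []
    for src in sources:
        ext = external_impact(src["ranks"], query_weights, engines)
        sim = js_similarity(src["topics"], query_topics)
        scored.append((src["url"], alpha * ext + (1 - alpha) * sim))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch a source that ranks near the top in several engines for heavily weighted query words, and whose topic distribution is close to the query's, is sorted first. The topic distributions themselves would come from a topic model fitted elsewhere; they are plain lists of probabilities here.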
Keywords/Search Tags: Data source identification, Topic model, Relevance, Keywords, Theme-related word set