
Research On Scientific Document Clustering And Topic Evolution Based On Citation Networks

Posted on: 2020-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Xu
Full Text: PDF
GTID: 2417330572966784
Subject: Management statistics
Abstract/Summary:
The number of published scientific documents has been growing exponentially, and researchers, especially novices, can no longer follow the research front by manual effort alone. Finding valuable scientific documents, understanding the current state of research in a field, and tracing topic evolution have therefore become challenging tasks. Bibliometric analysis, which studies scientific documents quantitatively with mathematical and statistical methods, offers a practical way to address this problem. Scientific document clustering groups similar documents according to their citation networks and textual similarities, helping researchers grasp the current research situation, the research front, and topic evolution more quickly and accurately. It is therefore an important research direction in bibliometric analysis.

For two types of databases (full-text databases and summary databases), this thesis proposes two approaches to calculate the similarities between scientific documents, cluster them, and identify the research front and topic evolution. Both build on existing work in bibliometric analysis, citation networks, text mining, and statistical analysis. The main contributions are as follows:

1. For full-text databases such as PMC and PubMed, this thesis proposes a new method, based on documents' citation networks, for calculating document similarities. The method takes into account where references are cited within a document as well as the documents' textual similarities, with the aim of increasing clustering accuracy. It consists of three parts:

a) Assuming that the more similar a document is to one of its references, the more frequently and the more widely across the document that reference is cited, this thesis considers both the number of times and the locations at which a reference is cited. This extends the traditional direct citation network so that it reflects the similarities between documents and their references more accurately.

b) Assuming that the more similar two references are, the closer together they are cited within a document, this thesis treats the proximity of references as a key factor in calculating document similarities. This extends the traditional co-citation network so that it reflects the similarities between the references cited in a document more accurately.

c) To reduce the workload of calculating textual similarities, this thesis uses documents' abstracts rather than their full texts. To increase clustering accuracy, it integrates the direct citation network, the co-citation network, the bibliographic coupling network, and the textual similarities.

Finally, this approach was used to cluster 10,966 scientific documents and their references in the field of oncology. A comparison with traditional methods in terms of precision, recall, and F1-score showed that the proposed method obtains reasonable clustering results.

2. In summary databases such as Web of Science (WOS), it is impossible to extract how many times or at which positions a reference is cited within a document. For this case, this thesis presents a new approach to identify the research front and topic evolution based on documents' citation networks and the PageRank algorithm. The approach has three stages:

a) Dividing the scientific documents into several time windows according to their years of publication, calculating the similarities between them from their citation networks, and clustering them within each time window.

b) Assuming that the more important a document is within a cluster, the more likely it is to be cited by the other documents in the same cluster, ranking the documents in each cluster with the PageRank algorithm, and then using keyword frequency to detect the theme of the cluster.

c) Constructing a cluster network in which nodes represent clusters and edge strengths represent the similarities between clusters, and then detecting the research front and identifying topic evolution from this network.

Finally, this approach was used to cluster 19,005 target scientific documents in the field of data mining, together with the documents that cite them or are cited by them. The experimental results show that the approach obtains reasonable clustering results and is effective for detecting the research front and identifying topic evolution.

In summary, to reflect the similarities between scientific documents more accurately, this thesis designed two citation-network-based approaches to calculate document similarities, one for documents collected from PMC and PubMed and one for documents collected from WOS. It also identified the research front and topic evolution from the clustering results. The proposed approaches can help researchers find valuable papers, understand the current situation and likely future development of a field, and support their future research.
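
Stage b) of the second approach (ranking the documents of a cluster with PageRank, then reading the cluster theme from keyword frequency) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `pagerank` and `cluster_theme`, the damping factor of 0.85, the even redistribution of score from documents that cite nothing, and the toy cluster are all assumptions made for illustration.

```python
from collections import Counter


def pagerank(citations, damping=0.85, iterations=100, tol=1e-10):
    """Rank documents by power-iteration PageRank.

    `citations` maps a document id to the list of ids it cites inside
    the same cluster (hypothetical input format, assumed for this sketch).
    """
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Documents that cite nothing spread their score evenly (assumption).
        dangling = sum(rank[v] for v in nodes if not citations.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, cited in citations.items():
            if cited:
                share = damping * rank[source] / len(cited)
                for target in cited:
                    new_rank[target] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def cluster_theme(keywords_by_doc, top_k=3):
    """Return the top_k most frequent keywords across a cluster's documents."""
    counts = Counter(kw for kws in keywords_by_doc.values() for kw in kws)
    return [kw for kw, _ in counts.most_common(top_k)]


# Toy cluster (made-up data): C is cited by every other document.
cluster = {"A": ["C"], "B": ["C"], "C": [], "D": ["A", "C"]}
keywords = {
    "A": ["clustering", "citation network"],
    "B": ["clustering", "text mining"],
    "C": ["clustering", "citation network"],
    "D": ["pagerank"],
}

scores = pagerank(cluster)
most_important = max(scores, key=scores.get)
theme = cluster_theme(keywords, top_k=2)
```

In this toy cluster the document cited by all others receives the highest score, which matches the assumption stated in stage b). How the thesis weights keywords or treats citations that cross cluster boundaries is not specified in the abstract, so the sketch leaves both out.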
Keywords/Search Tags: bibliometric analysis, citation network, text mining, scientific document clustering, topic evolution