
Research On Key Technology Of Scientific Literature Data Mining

Posted on: 2016-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Li
GTID: 2348330542473911
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid growth in the volume of scientific literature, scientific knowledge is developing and evolving faster and faster, and it is very difficult for researchers to absorb this information quickly. How to find the papers most worth reading within a large body of scientific literature has therefore attracted increasing attention from researchers. The citation count is the total number of citations a paper receives within a specified period of time, and it is an important measure of the influence and quality of scientific literature. However, citation analysis has limitations: for example, a count is only known up to the current point in time. Future citation counts are therefore hard to obtain, which hampers the assessment of a paper's contribution. To identify promising papers quickly and promote the dissemination of new knowledge, a method that predicts citation counts automatically and accurately is needed. This thesis focuses on algorithms for predicting the citation counts of scientific literature. The research is detailed as follows.

First, we present an improved algorithm for the citation count prediction task of KDD Cup, a leading international data mining competition. Unlike the first-place team's algorithm, ours analyzes the topic words of the papers in the dataset. We cluster the papers by their topic words and then fit a regression forecast within each cluster, in order to reduce the impact of differences in academic activity between topics. Experimental analysis shows that the improved algorithm achieves higher prediction accuracy than the original algorithm.

Second, based on the shortcomings we identified in existing algorithms, this thesis proposes a new algorithm for predicting citation count time series and evaluates it on real citation data. The algorithm is based on the similarity of citation patterns and combines time-series regression modeling with similarity clustering. On the one hand, it automatically analyzes the citation count of each paper in the dataset and computes the average citation count for each month. On the other hand, it mines the different citation patterns by similarity clustering, so that citation counts can be predicted from the existing citation count time series. Analytical and simulation results show that the prediction algorithm achieves higher accuracy.
Keywords/Search Tags:citation count prediction, time series, cluster analysis, regression forecast
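
The abstract does not give the details of the pattern-based predictor, so the following is only a minimal sketch of the general idea, under several assumptions that are not taken from the thesis: monthly citation series are normalized to a shape, shapes are grouped with plain k-means (a stand-in for the unspecified similarity clustering), and a new paper's remaining months are forecast by matching its observed prefix to the nearest cluster shape. The function names `normalize`, `kmeans`, and `predict`, and the toy `history` data, are hypothetical.

```python
from math import fsum


def normalize(series):
    """Scale a citation series so it sums to 1, keeping only its shape."""
    total = fsum(series)
    if total == 0:
        return [0.0] * len(series)
    return [x / total for x in series]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return fsum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(patterns, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(patterns[0])]
    while len(centroids) < k:
        far = max(patterns, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in patterns:
            idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            groups[idx].append(p)
        # Mean of each group, column by column; an empty group keeps its centroid.
        updated = [
            [fsum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def predict(prefix, centroids):
    """Forecast the months after `prefix` from the best-matching pattern.

    Assumes len(prefix) is shorter than the length of each centroid.
    """
    n = len(prefix)
    observed_shape = normalize(prefix)
    best = min(centroids, key=lambda c: sq_dist(observed_shape, normalize(c[:n])))
    seen_share = fsum(best[:n])
    if seen_share == 0:
        return [0.0] * (len(best) - n)
    # Estimated total citations implied by the share of the pattern seen so far.
    scale = fsum(prefix) / seen_share
    return [scale * share for share in best[n:]]


# Toy data (invented for illustration): two early-peaking and two late-rising papers.
history = [
    [10, 8, 4, 2, 1, 1],
    [20, 16, 8, 4, 2, 2],
    [1, 2, 4, 8, 10, 10],
    [2, 4, 8, 16, 20, 20],
]
centroids = kmeans([normalize(s) for s in history], k=2)
forecast = predict([5, 4, 2], centroids)
print(forecast)
```

Here the prefix `[5, 4, 2]` has the early-peaking shape, so the forecast for the three remaining months comes out at approximately `[1.0, 0.5, 0.5]`. Normalizing before clustering is a design choice that groups papers by the shape of their citation curve rather than by their absolute popularity; the thesis may define similarity differently.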