
Design And Implementation Of Open Access Literature Mining Platform

Posted on: 2015-12-03
Degree: Master
Type: Thesis
Country: China
Candidate: G Yang
GTID: 2298330467953765
Subject: Computer application technology
Abstract/Summary:
Open Access (OA) has recently attracted considerable attention from researchers, and more and more manuscripts are published in OA journals. However, these open access resources are scattered across various websites, which makes it difficult to retrieve all of them at once. There is therefore an urgent need to gather, manage and process OA resources efficiently in a convenient, uniform way. The rapid development of text mining and machine learning techniques plays an important role in OA research, both in theory and in practice. By mining OA resources with these techniques, scientists can more easily identify research hot spots, organize project proposals and seize theoretical and technical opportunities, and thereby realize substantial research and commercial value.

Text mining is an important area of data mining. However, text mining concentrates on semi-structured or unstructured data (such as txt, doc and HTML files), whereas data mining mainly focuses on structured data in databases. On semi-structured and unstructured data, the performance of traditional data mining algorithms is very limited, while text mining tools can extract more information and discover more knowledge from such data.

This thesis introduces the following related techniques:

(1) Text Mining. Text mining is a technique for extracting useful information from semi-structured or unstructured data. Text categorization, clustering and public opinion analysis are currently among the most popular techniques.

(2) Text Clustering. Text clustering is an unsupervised process that partitions texts into groups called clusters. Texts in the same cluster are more similar to one another, and texts in different clusters are less similar.

(3) Clustering algorithms. The detailed procedures of the two classical clustering algorithms explored in the platform, K-means and Affinity Propagation (AP), are introduced. The K-means algorithm was proposed by MacQueen in 1967. The number of clusters must be specified before clustering. It has since been widely applied, and was recognized as one of the top 10 algorithms in the data mining field in 2008. Compared with K-means, the Affinity Propagation algorithm is more efficient and faster, and it does not require the number of clusters to be specified. Recently, the AP algorithm has been applied in many fields (such as gene discovery, face recognition and codebook design).

These techniques and algorithms provide a solid foundation for the design and implementation of the open access literature mining platform. The analysis and design of the OA text mining platform are also described in full. The platform mainly contains four modules:

(1) Data collection. Its main purpose is to collect and upload OA resources, which are obtained from the FTP service and a web crawler.

(2) Pre-processing. This step filters out noise for the subsequent procedures. It normally includes word segmentation, stop word removal, frequency statistics, vector space construction and so on.

(3) Similarity computation. Given a definition of similarity between texts, the corresponding similarity matrix is constructed; it plays a key role in text clustering. The platform provides two classical measures: Euclidean distance and the cosine coefficient.

(4) Text clustering. To organize and process the OA resources efficiently, two classical clustering algorithms, K-means and Affinity Propagation, are explored in the platform.

In addition, the physical architecture and logical architecture of the platform are described in detail.
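
The pre-processing, similarity computation and K-means steps listed above can be illustrated with a minimal Python sketch. It is not the platform's actual code: the stop-word list, the function names, the regex-based word segmentation (suitable only for English text) and the choice of the first k vectors as initial centroids are all assumptions made for illustration. Affinity Propagation is not covered here.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for"}

def tokenize(text):
    """Word segmentation (simple regex, English only) plus stop-word removal."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_vectors(docs):
    """Frequency statistics and vector space construction over a shared vocabulary."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[t]) for t in vocab] for c in counts], vocab

def euclidean(u, v):
    """Euclidean distance between two term-frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine coefficient; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors, measure=cosine):
    """Pairwise matrix of the chosen measure between all documents."""
    return [[measure(u, v) for v in vectors] for u in vectors]

def kmeans(vectors, k, max_iter=100):
    """K-means with Euclidean distance; k must be chosen in advance.

    Assumption: the first k vectors seed the centroids, which keeps the
    sketch deterministic but is a simplification.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy corpus, invented for this example.
docs = [
    "open access journals publish open research",
    "gene discovery and face recognition",
    "open access research journals",
    "face recognition for gene discovery",
]
vectors, vocab = build_vectors(docs)
sim = similarity_matrix(vectors)   # cosine coefficient by default
labels = kmeans(vectors, k=2)
print(labels)                      # [0, 1, 0, 1]
```

In this toy corpus the first and third documents share most of their vocabulary, as do the second and fourth, so they fall into two clusters. Passing `measure=euclidean` to `similarity_matrix` yields a distance matrix instead, where smaller values mean more similar texts.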
Keywords/Search Tags: Open access, text mining, text clustering, K-means algorithm, Affinity Propagation algorithm