Research On Rough Classification Of Academic Papers Based On Topic And Semantic Fingerprint Fusion

Posted on: 2019-10-21; Degree: Master; Type: Thesis
Country: China; Candidate: T T Cui
GTID: 2428330545958820; Subject: Computer application technology
Abstract/Summary:
The Internet, the Internet of Things, cloud computing, and other information technologies have brought us into the era of networks and big data. However, large-scale resource sharing and real-time data communication are causing the data in cyberspace to grow explosively. This data, huge in scale and varied in form, not only strains the storage capacity of cyberspace but also lowers the value density of the data, producing the awkward situation of being "rich in data but poor in knowledge". How to compress stored data and how to find satisfactory information in such a vast space have become urgent problems. This dissertation takes academic papers as its data object and proposes a text fingerprint extraction method and a rough text classification algorithm in order to compress the data and to organize and manage it effectively.

First, a text fingerprint extraction method based on latent semantic analysis is proposed. The method improves on existing fingerprint extraction methods by overcoming their lack of semantic information. It extracts a semantic fingerprint from the main text of an academic paper: the latent semantic features of the original document are mined by singular value decomposition, and the retained semantic features are then converted into a binary digital fingerprint using the random hyperplane principle. The fingerprint is a low-dimensional representation of the high-dimensional original document.

Second, a rough classification algorithm based on a fusion representation is designed. The algorithm, an improved K-means clustering algorithm, relies on a fused representation of two parts of an academic paper: the paper outline (title, abstract, keywords) and the main text. Each document is represented by a topic vector and a semantic fingerprint. In every clustering iteration, the center of each cluster is assigned to a real document in the dataset, which serves as a prototype of the original document set. The algorithm uses cosine distance and Hamming distance to compute the fuzzy membership degree of each document with respect to each center, and assigns each document to the category with the maximum membership degree, which completes the rough classification of the dataset.

Finally, to provide useful information for subsequent retrieval and other operations, a prototype-based document classification algorithm is designed. Based on the similarity between an external document and the prototypes, the algorithm decides whether to classify the external document and, if so, which category it falls into.

The experimental results show that the proposed text fingerprint extraction method based on latent semantic analysis is more accurate than the common vector space model representation and the Simhash method, and that it reflects the semantic information of the text more effectively. In addition, the rough classification method based on the fusion representation solves the large-cluster problem of the original K-means clustering algorithm: the F value of the documents in each field of the dataset can reach more than 80%, and a better class structure is obtained. The prototype-based external document classification method achieves high accuracy for documents in the same field as a prototype and a high rejection rate for documents from other fields. It identifies external documents correctly, achieves the purpose of rough classification, and benefits the organization and management of the document set.
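
The conversion of a retained semantic feature vector into a binary fingerprint, and the Hamming comparison of two fingerprints, might look roughly like the following. This is a minimal sketch, not the dissertation's implementation: it assumes the singular value decomposition step has already produced a dense feature vector for each document, and the function names (make_hyperplanes, fingerprint, hamming), the 16-bit length, the seed, and the sample vectors are all hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components.

    The same hyperplanes must be reused for every document so that
    fingerprints are comparable (assumption: a fixed seed is acceptable).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def fingerprint(vector, hyperplanes):
    """Pack one bit per hyperplane into an integer.

    A bit is 1 when the vector lies on the non-negative side of the
    corresponding hyperplane (dot product >= 0), otherwise 0.
    """
    bits = 0
    for normal in hyperplanes:
        dot = sum(v * w for v, w in zip(vector, normal))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Hypothetical 4-dimensional semantic vectors for two documents.
    planes = make_hyperplanes(dim=4, n_bits=16)
    doc_a = [0.9, 0.1, -0.3, 0.4]
    doc_b = [0.8, 0.2, -0.2, 0.5]
    print(hamming(fingerprint(doc_a, planes), fingerprint(doc_b, planes)))
```

Because each bit depends only on the sign of a dot product, the fingerprint ignores the length of the vector and keeps only its direction. Vectors separated by a small angle therefore tend to differ in few bits, which is why Hamming distance can stand in for a similarity measure on the original vectors.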
Keywords/Search Tags: text representation, semantic fingerprint, text clustering, latent Dirichlet allocation (LDA), latent semantic analysis (LSA), Simhash algorithm