Research On Similar Research Project Documents Search System Based On Text Feature Selection

Posted on: 2019-04-07  Degree: Master  Type: Thesis
Country: China  Candidate: J Huang
GTID: 2428330563490929  Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of the Internet, huge volumes of data have been generated as many fields have been informatized, and it is important to find ways to make use of these data. Relational databases can organize and manage structured data, but not unstructured data. In the informatization of university scientific research management, a great number of documents have likewise been generated and remain unexploited. To address this problem, this thesis proposes an improved feature selection method that efficiently extracts feature vectors from a document set. In addition, a similar-document search system designed for massive document collections is implemented using nearest neighbor search. The work of this thesis includes:

(1) To address the shortcomings of traditional document feature selection, an improved feature selection method is proposed. Building on synonym fusion, the TF-IDF-ICD (TF-IDF with Inter-Category Distributions) algorithm is proposed, which uses the ICDT (Inter-Category Distributions of Term-frequency) and the ICDD (Inter-Category Distributions of Document-frequency) to measure the degree of association between terms and document categories. DR (Dimensionality Reduction) based on TF-IDF-ICD limits the dimension of the feature vector space by preserving the key features that have high TF-IDF-ICD values. Experiments show that this method reduces the feature space dimension and the storage space of the text feature vectors while maintaining classification accuracy, which makes it well suited to research project documents.

(2) Feature vectors of research project documents are obtained with the feature selection method, and a nearest neighbor index structure over these vectors is built using vector nearest neighbor search. The index structure is a binary tree constructed according to the distances between document feature vectors. The main idea is that the closer two document feature vectors are in the index tree, the more similar the two documents are. The system provides similar-document search for research project documents through this index structure. Moreover, appropriate storage methods are designed to store the document feature vectors and the nearest neighbor index structure in the database, so that similar-document search can be served by multiple nodes at the same time, which improves efficiency under high concurrency. The system lets researchers quickly and accurately locate research documents similar to their own, which greatly improves the practical value of research documents and makes scientific research management more convenient.
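The two stages described above (weight terms, keep only the highest-weighted features, then index the reduced vectors in a distance-based binary tree) can be illustrated with a small sketch. It rests on several assumptions: the abstract does not give the TF-IDF-ICD formula, so plain TF-IDF stands in for it; synonym fusion is omitted; and the thesis does not name its tree type, so a vantage-point tree is used as one example of a binary tree built from distances between vectors. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain TF-IDF per tokenized document (a stand-in, NOT TF-IDF-ICD)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()} for doc in docs]

def select_features(vectors, k):
    """Dimensionality reduction: keep the k terms with the highest peak weight."""
    peak = {}
    for vec in vectors:
        for term, weight in vec.items():
            peak[term] = max(peak.get(term, 0.0), weight)
    vocab = sorted(peak, key=lambda t: (-peak[t], t))[:k]
    return vocab, [[vec.get(t, 0.0) for t in vocab] for vec in vectors]

def dist(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_vp_tree(items):
    """Build a vantage-point tree from a list of (doc_id, vector) pairs."""
    if not items:
        return None
    (vp_id, vp_vec), rest = items[0], items[1:]
    radius = 0.0
    if rest:
        # Split the remaining points at the median distance to the vantage point.
        radius = sorted(dist(vp_vec, v) for _, v in rest)[len(rest) // 2]
    inner = [it for it in rest if dist(vp_vec, it[1]) < radius]
    outer = [it for it in rest if dist(vp_vec, it[1]) >= radius]
    return {"id": vp_id, "vec": vp_vec, "radius": radius,
            "inner": build_vp_tree(inner), "outer": build_vp_tree(outer)}

def nearest(node, query, best=None):
    """Return (distance, doc_id) of the indexed vector closest to query."""
    if node is None:
        return best
    d = dist(query, node["vec"])
    if best is None or d < best[0]:
        best = (d, node["id"])
    near, far = ("inner", "outer") if d < node["radius"] else ("outer", "inner")
    best = nearest(node[near], query, best)
    # The far side can only hold a closer point if the current best
    # distance reaches across the split boundary (triangle inequality).
    if abs(d - node["radius"]) <= best[0]:
        best = nearest(node[far], query, best)
    return best

# Toy corpus: hypothetical project descriptions, already tokenized.
corpus = {
    "a": "deep learning image classification network".split(),
    "b": "deep learning image recognition network".split(),
    "c": "soil erosion river sediment survey".split(),
    "d": "river sediment transport soil model".split(),
}
ids = list(corpus)
vocab, dense = select_features(tfidf_vectors([corpus[i] for i in ids]), k=8)
vec_of = dict(zip(ids, dense))

# Index every document except "a", then look up the document most similar to "a".
tree = build_vp_tree([(i, vec_of[i]) for i in ids if i != "a"])
print(nearest(tree, vec_of["a"]))
```

A vantage-point tree is chosen here only because it needs nothing beyond a distance function; a k-d tree or random-projection tree would match the abstract's description equally well. In a deployment like the one the abstract describes, the tree nodes and vectors would be persisted in the database rather than held in memory.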
Keywords/Search Tags:Synonym Fusion, Feature Selection, Nearest Neighbor Search