Research On Similar Research Project Documents Search System Based On Text Feature Selection

Posted on:2019-04-07

Degree:Master

Type:Thesis

Country:China

Candidate:J Huang

Full Text:PDF

GTID:2428330563490929

Subject:Information and Communication Engineering

Abstract/Summary:

With the rapid development of the Internet, huge volumes of data have been generated as various fields have been informatized, and finding ways to make use of these data is important. Relational databases can organize and manage structured data, but not unstructured data. In the informatization of university scientific research management, a great number of documents have likewise been generated but not put to use. To address this problem, this thesis proposes an improved feature selection method that efficiently selects feature vectors from the document set. In addition, a similar-document search system designed for massive-document scenarios is implemented using nearest neighbor search. The work of this thesis includes:

(1) To address the shortcomings of traditional document feature selection, an improved feature selection method is proposed. Building on synonym fusion, the TF-IDF-ICD (TF-IDF with Inter-Category Distributions) algorithm is proposed, which uses the ICDT (Inter-Category Distributions of Term-frequency) and the ICDD (Inter-Category Distributions of Document-frequency) to measure the degree of association between terms and document categories. DR (Dimensionality Reduction) based on TF-IDF-ICD is then applied to limit the dimension of the feature vector space by preserving the key features with high TF-IDF-ICD values. Experiments show that this method reduces the feature space dimension and the storage required for text feature vectors while maintaining classification accuracy, which makes it well suited to research project documents.

(2) Feature vectors of research project documents are obtained with the feature selection method, and a nearest neighbor index over these vectors is built using vector nearest neighbor search. The index is a binary tree constructed according to the distances between document feature vectors. The main idea is that the smaller the distance between two documents' feature vectors in the index tree, the more similar the two documents are. The system provides similar-document search for research project documents through this index. Moreover, appropriate storage methods are designed to keep the document feature vectors and the nearest neighbor index in the database, so that similar-document search can be served to multiple nodes at the same time, increasing efficiency under high concurrency.

The system lets researchers locate research documents similar to their own quickly and accurately, which greatly increases the practical value of research documents and makes scientific research management more convenient.
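
The abstract names TF-IDF-ICD but does not give its formula, so the following Python sketch is only an illustration of the general idea in point (1): weight a TF-IDF style score by how concentrated a term is across categories, then keep the top-scoring terms to reduce dimensionality. The `concentration` measure (one minus normalized entropy), the way the factors are multiplied, and the names `select_features`, `icdt` and `icdd` are assumptions made here, not the thesis's definitions. Synonym fusion is not covered.

```python
import math
from collections import Counter, defaultdict


def concentration(counts_by_category, n_categories):
    """Assumed stand-in for an inter-category distribution factor.

    Returns about 0 when a term is spread evenly over all categories and
    1 when it occurs in a single category (1 - normalized entropy).
    """
    total = sum(counts_by_category.values())
    if total == 0 or n_categories < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total)
                   for c in counts_by_category.values() if c > 0)
    return 1.0 - entropy / math.log(n_categories)


def select_features(docs, labels, k):
    """docs: list of token lists; labels: one category per document.

    Returns the k terms with the highest illustrative score.
    """
    n_docs = len(docs)
    n_categories = len(set(labels))
    tf_total = Counter()               # corpus-wide term frequency
    df_total = Counter()               # corpus-wide document frequency
    tf_by_cat = defaultdict(Counter)   # term -> category -> term frequency
    df_by_cat = defaultdict(Counter)   # term -> category -> document frequency
    for tokens, label in zip(docs, labels):
        for term, count in Counter(tokens).items():
            tf_total[term] += count
            df_total[term] += 1
            tf_by_cat[term][label] += count
            df_by_cat[term][label] += 1

    scores = {}
    for term in tf_total:
        idf = math.log((1 + n_docs) / (1 + df_total[term])) + 1.0  # smoothed IDF
        icdt = concentration(tf_by_cat[term], n_categories)  # hypothetical ICDT
        icdd = concentration(df_by_cat[term], n_categories)  # hypothetical ICDD
        scores[term] = tf_total[term] * idf * icdt * icdd

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Under these assumptions, a term that appears equally often in every category scores close to zero and is dropped first, while terms confined to one category are kept. Each document can then be represented by its weights on the `k` retained terms only.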
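
Point (2) describes a binary tree index built from distances between document vectors, without naming the exact structure. One well-known structure of that kind is the vantage-point tree, used below as a stand-in; whether the thesis uses this or another tree is not stated. The Euclidean `distance`, the `Node` layout, and the functions `build` and `search` are assumptions for illustration only.

```python
import heapq
import math
import random
from typing import NamedTuple, Optional


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Node(NamedTuple):
    doc_id: int
    vector: list
    radius: float               # median distance from this pivot
    inside: Optional["Node"]    # items closer to the pivot than radius
    outside: Optional["Node"]   # items at distance >= radius


def build(items, rng):
    """Build the tree from a list of (doc_id, vector) pairs."""
    if not items:
        return None
    items = list(items)
    pivot_id, pivot_vec = items.pop(rng.randrange(len(items)))
    if not items:
        return Node(pivot_id, pivot_vec, 0.0, None, None)
    dists = sorted(distance(pivot_vec, v) for _, v in items)
    radius = dists[len(dists) // 2]
    inside = [it for it in items if distance(pivot_vec, it[1]) < radius]
    outside = [it for it in items if distance(pivot_vec, it[1]) >= radius]
    return Node(pivot_id, pivot_vec, radius,
                build(inside, rng), build(outside, rng))


def search(root, query, k):
    """Return up to k (distance, doc_id) pairs, nearest first."""
    if k <= 0:
        return []
    heap = []  # max-heap on distance, stored as (-distance, doc_id)

    def visit(node):
        if node is None:
            return
        d = distance(query, node.vector)
        if len(heap) < k:
            heapq.heappush(heap, (-d, node.doc_id))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, node.doc_id))
        if d < node.radius:
            near, far = node.inside, node.outside
        else:
            near, far = node.outside, node.inside
        visit(near)
        # By the triangle inequality, the far side can only hold a better
        # match if the query lies within tau of the pivot's boundary.
        tau = -heap[0][0] if len(heap) == k else math.inf
        if abs(d - node.radius) <= tau:
            visit(far)

    visit(root)
    return sorted((-neg_d, doc_id) for neg_d, doc_id in heap)
```

A call such as `search(build(items, random.Random(0)), query_vector, 5)` would return the five stored documents nearest to `query_vector`. Because whole subtrees are skipped when they cannot contain a closer vector, a query typically touches far fewer documents than a linear scan, although the benefit shrinks as the vector dimension grows.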

Keywords/Search Tags:

Synonym Fusion, Feature Selection, Nearest Neighbor Search

Related items

1	Research And Implementation Of Feature Fusion Tracking Algorithm Based On Nearest Neighbor Decision
2	Approximate Nearest Neighbor Search Algorithms And Their Application
3	Research On Hashing Accelerated Approximate Nearest-Neighbor Search
4	The Text Categorization Algorithm Based On Nearest Subspace Search
5	Improving Crow Search Algorithm To Optimize KNN Parameters And Feature Selection For Vulnerability Classification
6	Research Of Nearest Neighbor Classification Algorithm Based On Sample Selection
7	Approximate Nearest Neighbor Search For High-Dimensional Based On Nearest Neighbor Graph
8	Study On Generalized Nearest Neighbor Pattern Classification
9	Multiple Hash Tables Indexing And Optimization For Approximate Nearest Neighbor Search
10	Research On The Nearest Neighbor Discrimination Method For Adversarial Sample Detection