Research On File Preprocessing Technology In Full-text Retrieval System

Posted on:2018-06-09

Degree:Master

Type:Thesis

Country:China

Candidate:M D Yan

Full Text:PDF

GTID:2348330515497272

Subject:Control Science and Engineering

Abstract/Summary:

With the continuous development of computer and network technology, the amount of data in human society is growing explosively. The main goal of information retrieval research is to find useful information quickly and accurately. Information on the internet comes in many types, and semi-structured and unstructured information makes up a large part of it. Structured data can be retrieved with database technology, but comparable tools for searching unstructured data are lacking, which is why full-text retrieval technology came into being.

This thesis is set in the context of a distributed full-text retrieval system composed of text preprocessing, index building, index management, and a web search platform. It mainly studies the technologies behind the file preprocessing module: real-time file monitoring, file type identification, and file text extraction. The module uses the inotify mechanism to monitor the data source in real time, submits the path of each monitored file to a message queue based on the AMQP protocol, identifies the file type, and extracts the text content through a different interface for each file type. A large number of files were prepared to test the function and performance of the preprocessing module. The experimental results show that the module has high recognition accuracy and good text extraction completeness, and that it meets the basic requirements.

The thesis also studies a content-based file type recognition algorithm. The content of a file is split into byte values, and a vector space model of the file is built from the byte values and their frequencies. Recognition uses the k-nearest neighbor algorithm as the classifier. To reduce the complexity of classification, principal component analysis and a clustering algorithm are introduced. The experimental results show that the improved algorithm reduces classification time and achieves high classification efficiency and recognition accuracy.

Finally, the information gain feature selection algorithm and the TF-IDF weight calculation algorithm are applied to file classification. Classification performance declines when the sample set is unevenly distributed across classes. To address this, inter-class and intra-class distribution information is introduced into the traditional algorithm. With a support vector machine as the classifier, the results show that classification accuracy on the unbalanced file set is improved.
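
The content-based file type recognition idea described in the abstract (byte-value frequencies as a feature vector, k-nearest neighbor as the classifier) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the Euclidean distance, the value of k, and the tiny training set are all assumptions made for illustration, and the principal component analysis and clustering steps used in the thesis to reduce classification cost are omitted.

```python
import math
from collections import Counter

# Hypothetical, hand-made training set: (label, raw file content).
TRAINING = [
    ("text", b"hello world, this is plain text"),
    ("text", b"the quick brown fox jumps over the lazy dog"),
    ("text", b"full-text retrieval of documents"),
    ("binary", bytes(range(256))),
    ("binary", bytes(range(0, 256, 2)) * 2),
    ("binary", bytes(range(255, -1, -1)) * 3),
]


def byte_histogram(data: bytes) -> list:
    """Map file content to a 256-dimensional vector of relative byte frequencies."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields integer byte values
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


def euclidean(a, b) -> float:
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(sample: bytes, training=TRAINING, k: int = 3) -> str:
    """Return the majority label among the k training files nearest to sample."""
    query = byte_histogram(sample)
    neighbours = sorted(
        (euclidean(query, byte_histogram(content)), label)
        for label, content in training
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Calling `knn_classify(content)` on the raw bytes of a file returns the predicted label. In a real system the histograms of the training files would be computed once and cached, since recomputing them for every query is what makes plain k-nearest neighbor slow on large sample sets.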

Keywords/Search Tags:

full-text retrieval, message queue, file classification, feature reduction, weight calculation, k-nearest neighbor, support vector machine

Related items

1	Text Sentiment Analysis Based On Text Classification
2	The Design And Implementation Of Text Classification System Based On SVM-KNN
3	Research On Improved K Neighbor Support Vector Machine Algorithm Faced Text Classification
4	Study On Text Classification Based On Rough Set And Support Vector Machine
5	Research On Chinese Text Categorization Based On The Integrated Support Vector Machine Method
6	Designed And Implementation Of Chinese Text Categorization System Based On Support Vector Machine
7	Research Of Chinese Text Classification Based On Mixed Feature
8	Research And Implementation Of Spark-based Text Classification
9	Research On Text Classification Based-on Support Vector Machine
10	Automatic Classification Research On Chinese Web Document Orientation