Research On File Preprocessing Technology In Full-text Retrieval System

Posted on: 2018-06-09
Degree: Master
Type: Thesis
Country: China
Candidate: M D Yan
Full Text: PDF
GTID: 2348330515497272
Subject: Control Science and Engineering
Abstract/Summary:
With the continuing development of computer and network technology, the amount of data produced by human society is growing explosively. The main goal of information retrieval research is to find useful information quickly and accurately. Information on the Internet takes many forms, and semi-structured and unstructured information makes up a large share of it. Structured data can be retrieved with database technology, but unstructured data lacks comparable tools, which is why full-text retrieval technology came into being.

This thesis is set in the context of a distributed full-text retrieval system composed of text preprocessing, index building, index management, and a web search platform. It mainly studies the technologies behind the file preprocessing module: real-time file monitoring, file type identification, and file text extraction. The module uses the Inotify mechanism to monitor the data source in real time, submits the path of each monitored file to a message queue based on the AMQP protocol, identifies the file type, and extracts the text content through a different interface for each file type. A large number of files were prepared to test the function and performance of the preprocessing module. The experimental results show that the module achieves high recognition accuracy and good text extraction completeness, and meets the basic requirements.

This thesis also studies a content-based file type recognition algorithm. The content of a file is split into bytes, and a vector space model of the file is built from the byte values and their frequencies. Recognition uses the k-nearest neighbor algorithm as the classifier. To reduce the complexity of the classification process, principal component analysis and a clustering algorithm are introduced. The experimental results show that the improved algorithm reduces classification time and achieves high classification efficiency and recognition accuracy.

Finally, the information gain feature selection algorithm and the TF-IDF weight calculation algorithm are applied to file classification. Classification performance declines when the sample set is unevenly distributed across classes. To address this, distribution information among classes and distribution information within a class are introduced on top of the traditional algorithm. With a support vector machine as the classifier, the results show that classification accuracy on the unbalanced file set is improved.
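
The content-based file type recognition idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the Euclidean distance, the toy training samples, and the value of k are all assumptions made for illustration, and the principal component analysis and clustering steps mentioned in the abstract are left out.

```python
import math
from collections import Counter


def byte_frequency_vector(data: bytes) -> list[float]:
    """Map file content to a 256-dimensional vector of relative byte-value frequencies."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(sample: bytes, training: list[tuple[bytes, str]], k: int = 3) -> str:
    """Label a sample by majority vote among its k nearest training vectors."""
    query = byte_frequency_vector(sample)
    scored = sorted(
        (euclidean(query, byte_frequency_vector(content)), label)
        for content, label in training
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: three text-like and three binary-like samples.
TRAINING = [
    (b"the quick brown fox jumps over the lazy dog", "text"),
    (b"full-text retrieval indexes unstructured documents", "text"),
    (b"message queues decouple producers and consumers", "text"),
    (bytes(range(256)), "binary"),
    (bytes(range(0, 256, 3)) * 2, "binary"),
    (bytes([0, 255, 1, 254, 2, 253] * 20), "binary"),
]

if __name__ == "__main__":
    print(knn_classify(b"plain english words again", TRAINING))
    print(knn_classify(bytes(range(128, 256)), TRAINING))
```

In this sketch every training file is re-vectorized on each query, which is the cost that dimensionality reduction and clustering of the training set are meant to cut down in a real system.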
Keywords/Search Tags: full-text retrieval, message queue, file classification, feature reduction, weight calculation, k-nearest neighbor, support vector machine