
Research And Implementation Of Feature Extraction-Based Indexing System On Enterprise Text Data

Posted on: 2016-10-07  Degree: Master  Type: Thesis
Country: China  Candidate: J Liu  Full Text: PDF
GTID: 2308330473465497  Subject: Computer applications
Abstract/Summary:
The Internet is becoming the main carrier and exchange platform for human information, and it gathers information of many kinds. With the advent of the big data era, 80% of data is stored on the network in the form of text, so mining massive enterprise text data has become increasingly important to enterprises and users. Text data mining is currently a hot research topic, but research that targets enterprise text is rare. Two problems arise when processing enterprise text: 1) if all the segments produced by word segmentation are used to represent a text, the representation suffers from the curse of dimensionality; 2) searching enterprise text with a general-purpose search engine returns overly broad results with a low degree of information filtering.

To let users access enterprise text information easily, this thesis studies these two problems and proposes solutions. To address the curse of dimensionality, it proposes a feature extraction method that combines rule-based and statistics-based techniques with part-of-speech tagging, built on a summary of the characteristics of enterprise text. The method first segments the enterprise text into words and tags each segment with its part of speech. It then marks information by combining a rule-generated trigger word directory with the part-of-speech tags, producing an observation sequence that a statistical model decodes to yield the required information. Experiments show that the method achieves higher recall, accuracy, and dimension reduction rate. To address the low degree of information filtering in general-purpose search engines, the thesis proposes building an enterprise-oriented search engine platform.
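The feature extraction steps described above (tagged segments, trigger-word marking, an observation sequence, statistical decoding) can be illustrated with a minimal sketch. The abstract does not name the statistical model, so this sketch assumes a hidden Markov model decoded with the Viterbi algorithm. The trigger words, tag names, state labels (`O`, `INFO`), and all probabilities are invented for illustration and are not taken from the thesis.

```python
import math

# Hypothetical rule-generated trigger word directory (entries are invented).
TRIGGER_WORDS = {"founded": "TRIG_DATE", "located": "TRIG_LOC"}

def to_observations(tagged_tokens):
    """Map (word, pos_tag) pairs to observation symbols.

    A trigger word is replaced by its trigger label; any other word
    is represented by its part-of-speech tag.
    """
    return [TRIGGER_WORDS.get(word, pos) for word, pos in tagged_tokens]

def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-9):
    """Return the most likely state sequence for obs under an HMM."""
    if not obs:
        return []

    def lp(p):
        # Log-probability with a small floor so unseen symbols do not give log(0).
        return math.log(max(p, floor))

    # best[t][s] = (log-prob of the best path ending in state s at time t, previous state)
    best = [{s: (lp(start_p[s]) + lp(emit_p[s].get(obs[0], 0.0)), None) for s in states}]
    for t in range(1, len(obs)):
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] + lp(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            row[s] = (score + lp(emit_p[s].get(obs[t], 0.0)), prev)
        best.append(row)

    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    path.reverse()
    return path

# Toy model: "INFO" marks a segment to keep as a feature, "O" marks everything else.
STATES = ["O", "INFO"]
START_P = {"O": 0.9, "INFO": 0.1}
TRANS_P = {"O": {"O": 0.7, "INFO": 0.3}, "INFO": {"O": 0.4, "INFO": 0.6}}
EMIT_P = {
    "O": {"nz": 0.4, "v": 0.3, "TRIG_LOC": 0.25, "ns": 0.05},
    "INFO": {"nz": 0.1, "ns": 0.8, "TRIG_LOC": 0.1},
}

tokens = [("Acme", "nz"), ("located", "v"), ("Beijing", "ns")]
obs = to_observations(tokens)
path = viterbi(obs, STATES, START_P, TRANS_P, EMIT_P)
features = [word for (word, _), state in zip(tokens, path) if state == "INFO"]
```

With these toy parameters, only the segment following the location trigger is labeled `INFO`, so the text is represented by one extracted feature rather than by all three segments. In practice the probabilities would be estimated from labeled enterprise text.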
To overcome the shortcomings of the text sorting algorithms used by traditional search engines, this thesis proposes an improved text sorting algorithm that combines PageRank values, classification, and TF-IDF values. The algorithm pre-classifies the keywords of a user's query to predict the class they most likely belong to. Based on this prediction, texts of the matching class are retrieved from the enterprise text database first, so that texts relevant to the query's subject appear at the front of the results. Experiments show that the sorting algorithm achieves faster query response time and higher precision. Finally, a prototype enterprise text search engine that uses the text feature extraction method and the improved text sorting algorithm is designed and implemented, further demonstrating their feasibility and effectiveness.
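As a rough illustration of the improved sorting idea, the sketch below pre-classifies the query, places texts of the predicted class first, and orders texts within each group by a weighted sum of TF-IDF relevance and PageRank. The abstract does not give the actual combination formula, so the weights, the keyword-to-class lookup (standing in for a trained classifier), and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter

# Toy enterprise text database: tokens, class label, and a precomputed
# PageRank value per document (all values invented).
DOCS = {
    "d1": {"tokens": ["annual", "revenue", "report"], "cls": "finance", "pagerank": 0.2},
    "d2": {"tokens": ["revenue", "growth", "revenue"], "cls": "finance", "pagerank": 0.5},
    "d3": {"tokens": ["hiring", "engineers", "report"], "cls": "recruiting", "pagerank": 0.9},
}

# Hypothetical keyword-to-class lookup standing in for a trained classifier.
KEYWORD_CLASS = {"revenue": "finance", "growth": "finance", "hiring": "recruiting"}

def predict_class(query_terms):
    """Predict the most likely class of the query by majority vote over its keywords."""
    votes = Counter(KEYWORD_CLASS[t] for t in query_terms if t in KEYWORD_CLASS)
    return votes.most_common(1)[0][0] if votes else None

def tf_idf(term, doc_id):
    """TF-IDF of a term in one document, computed over the toy corpus."""
    tokens = DOCS[doc_id]["tokens"]
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for doc in DOCS.values() if term in doc["tokens"])
    idf = math.log(len(DOCS) / df) if df else 0.0
    return tf * idf

def rank(query_terms, w_tfidf=0.6, w_pagerank=0.4):
    """Return doc ids ordered by (class match, weighted TF-IDF + PageRank score)."""
    predicted = predict_class(query_terms)
    scored = []
    for doc_id, doc in DOCS.items():
        relevance = sum(tf_idf(t, doc_id) for t in query_terms)
        if relevance == 0.0:
            continue  # skip documents that contain none of the query terms
        score = w_tfidf * relevance + w_pagerank * doc["pagerank"]
        scored.append((doc["cls"] == predicted, score, doc_id))
    # Documents of the predicted class come first; ties are broken by score.
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [doc_id for _, _, doc_id in scored]

results = rank(["revenue", "report"])
```

Here `d3` has the highest PageRank but belongs to a different class than the query, so it is listed after the two finance documents. Sorting on the class match before the numeric score is one simple way to put subject-relevant texts at the front; a soft class bonus added to the score would be an alternative.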
Keywords/Search Tags: Enterprise Text, Feature Extraction, Text Search Engine, Sorting Algorithms, Classification