Research And Implementation Of Feature Extraction-Based Indexing System On Enterprise Text Data

Posted on:2016-10-07

Degree:Master

Type:Thesis

Country:China

Candidate:J Liu

GTID:2308330473465497

Subject:Computer applications

Abstract/Summary:

The Internet is becoming the main carrier and exchange platform for human information, and it gathers information of many kinds. With the advent of the big data era, 80% of the data stored on the network is in text form, so mining massive enterprise text data has become increasingly important to enterprises and users. Text data mining is currently a hot research topic, but research focused on enterprises is rare. Two problems arise when processing enterprise text: 1) after word segmentation, using all the segments to represent a text leads to the curse of dimensionality; 2) searching enterprise text with a general-purpose search engine returns overly broad results with a low degree of information filtering. To let users access enterprise text information easily, this thesis studies these two problems and proposes solutions.
To address the curse of dimensionality, this thesis proposes a feature extraction method that combines rule-based and statistics-based approaches with part-of-speech tagging, based on a summary of the characteristics of enterprise text. The method first applies word segmentation and part-of-speech tagging to the enterprise text. It then marks information by combining a rule-generated trigger word directory with the part-of-speech tags to produce an observation sequence, which is decoded by a statistical model to generate the required information. Experiments show that the method achieves higher recall, accuracy, and dimension reduction rate.
To address the low degree of information filtering in general-purpose search engines, this thesis proposes building an enterprise-oriented search engine platform.
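The feature extraction pipeline described above (part-of-speech tags plus trigger words forming an observation sequence that a statistical model decodes) can be illustrated with a minimal sketch. The abstract does not name the statistical model; the sketch below assumes a hidden Markov model decoded with the Viterbi algorithm, which is one common choice. The label set, the trigger words, the tag names (`n`, `v`, `m`), and every probability are hypothetical values invented for illustration, not taken from the thesis.

```python
import math

# Hypothetical label set: "O" = ordinary token, "K" = key-information token.
STATES = ["O", "K"]

# Hypothetical trigger-word directory (the thesis generates its directory by rule).
TRIGGERS = {"founded", "headquartered", "revenue"}

# Toy HMM parameters, invented for illustration only.
START = {"O": 0.6, "K": 0.4}
TRANS = {"O": {"O": 0.7, "K": 0.3}, "K": {"O": 0.4, "K": 0.6}}
EMIT = {
    "O": {"TRIG": 0.3, "n": 0.2, "v": 0.4, "m": 0.1},
    "K": {"TRIG": 0.1, "n": 0.5, "v": 0.1, "m": 0.3},
}
FLOOR = 1e-6  # probability assigned to observation symbols missing from EMIT


def to_observations(tagged_tokens):
    """Map (word, pos) pairs to symbols: TRIG for trigger words, else the POS tag."""
    return ["TRIG" if word.lower() in TRIGGERS else pos for word, pos in tagged_tokens]


def viterbi(observations):
    """Return the most probable state sequence for a non-empty observation list."""
    def emit(state, symbol):
        return math.log(EMIT[state].get(symbol, FLOOR))

    # Log-probabilities of the best path ending in each state, per position.
    scores = [{s: math.log(START[s]) + emit(s, observations[0]) for s in STATES}]
    backpointers = []
    for symbol in observations[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + emit(s, symbol)
            ptr[s] = best
        scores.append(cur)
        backpointers.append(ptr)

    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract_features(tagged_tokens):
    """Keep only the words labelled "K", reducing the dimension of the text."""
    if not tagged_tokens:
        return []
    labels = viterbi(to_observations(tagged_tokens))
    return [word for (word, _), label in zip(tagged_tokens, labels) if label == "K"]
```

Calling `extract_features` on a list of `(word, pos)` pairs returns the subset of words labelled as key information. In a real system the parameters would be estimated from annotated enterprise text rather than written by hand.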
To address the shortcomings of the text sorting algorithms used in traditional search engines, this thesis proposes an improved text sorting algorithm that combines PageRank values, classification, and TF-IDF values. The algorithm pre-classifies the keywords of a user's query to predict the class they most likely belong to. Based on this prediction, texts of the same class are retrieved first from the enterprise text database, so that texts relevant to the subject are displayed at the front. Experiments show that the sorting algorithm achieves faster query response time and higher precision. Finally, a prototype enterprise text search engine that uses the text feature extraction method and the improved text sorting algorithm is designed and implemented, further demonstrating their feasibility and effectiveness.
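The improved sorting idea (pre-classify the query, prioritize texts of the predicted class, then order by PageRank and TF-IDF) can be sketched as follows. The abstract does not give the scoring formula, so the linear blend with weight `alpha`, the keyword-overlap classifier, the smoothed IDF, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical enterprise text database with precomputed class and PageRank value.
DOCS = [
    {"id": "d1", "cls": "finance", "pagerank": 0.30,
     "tokens": ["annual", "revenue", "report", "revenue"]},
    {"id": "d2", "cls": "finance", "pagerank": 0.50,
     "tokens": ["stock", "price", "report"]},
    {"id": "d3", "cls": "hiring", "pagerank": 0.90,
     "tokens": ["job", "opening", "engineer", "report"]},
]

# Hypothetical per-class vocabularies standing in for a trained classifier.
CLASS_VOCAB = {
    "finance": {"revenue", "stock", "price"},
    "hiring": {"job", "engineer", "opening"},
}


def classify_query(terms):
    """Predict the class whose vocabulary overlaps the query most (first class on ties)."""
    return max(CLASS_VOCAB, key=lambda c: len(CLASS_VOCAB[c] & set(terms)))


def tfidf(terms, doc, docs):
    """Sum of TF-IDF weights of the query terms in one document (smoothed IDF)."""
    tf = Counter(doc["tokens"])
    n = len(docs)
    score = 0.0
    for term in terms:
        df = sum(1 for d in docs if term in d["tokens"])
        idf = math.log((1 + n) / (1 + df)) + 1
        score += (tf[term] / len(doc["tokens"])) * idf
    return score


def rank(query, alpha=0.7):
    """Order documents: predicted class first, then by blended TF-IDF and PageRank."""
    terms = query.lower().split()
    predicted = classify_query(terms)

    def key(doc):
        blended = alpha * tfidf(terms, doc, DOCS) + (1 - alpha) * doc["pagerank"]
        return (doc["cls"] == predicted, blended)

    return [doc["id"] for doc in sorted(DOCS, key=key, reverse=True)]
```

With this toy data, `rank("revenue report")` places the two finance documents ahead of `d3` even though `d3` has the highest PageRank value, which mirrors the prioritization of same-class texts described in the abstract.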

Keywords/Search Tags:

Enterprise Text, Feature Extraction, Text Search Engine, Sorting Algorithms, Classification

Related items

1	Research On The Topical Search Engine Based On Semantic
2	Research And Application Of Short Text Classification In Search Engine
3	Web Text Mining Research Based On Subject-oriented Search Engine
4	Research On Enterprise Competitive Intelligence Collection System Based On Web Text Mining
5	The Research And Application Of Segmentation And Sorting In Vertical Search Engine
6	The Research And Implementation Of Enterprise Search Engine Based On Lucene
7	The Intelligent Full Text Retrieval System Based On Topic Sorting And Recommendation
8	Research On Agricultural Information Search Engine Classifier
9	The Study Of Key Technologies For Chinese Domain-Oriented Search Engine
10	Design And Implementation Of Text Classification Model Based On The Improved TF-IDF Feature Extraction