Design Of Scientific And Technical Information Collection System And Research On Its Fast Text Clustering Algorithm

Posted on:2015-08-17

Degree:Master

Type:Thesis

Country:China

Candidate:C Song

Full Text:PDF

GTID:2298330452453249

Subject:Computer technology

Abstract/Summary:

Scientific and technological information collection is an important part of research in scientific and technical information and is the foundation for analyzing that information. With the explosive growth of online information, science and technology intelligence researchers have to spend too much time and energy on data collection and statistical analysis, and manual work can no longer meet the needs of information retrieval and analysis. To obtain more professional, accurate, comprehensive, and timely information that effectively helps the relevant departments make scientific decisions and give timely guidance, it is necessary to study how to gather and analyze data more effectively. This paper therefore completes the following two pieces of work:

1) This paper designs and implements an unsupervised technology intelligence gathering system. First, the system uses a meta-search model and a vertical search model to retrieve data and papers from the web. Through a URL scheduler, memory management, data storage, a source code parser, and a multi-threaded control module, it achieves automatic data collection that works without human intervention and responds automatically to emergency situations. Then a paper data analysis module automatically analyzes the collected papers, providing information and guidance for in-depth analysis and research. Finally, we test the system in the aircraft manufacturing domain; the experimental results show that it collects data and papers from the web effectively and can complete some systematic intelligence analysis.

2) Web data is large in volume and contains many duplicates, which makes it difficult to process manually. This paper therefore proposes a text clustering algorithm based on quick sort to remove duplicate data and compress the data. First, we convert the text clustering problem into a numeric sorting problem based on a similarity measure between texts and use the quick sort algorithm to form clusters. Then we use a value randomization strategy and recursive operations to further improve efficiency and achieve near-linear time complexity. Finally, we perform extensive experiments on real and artificial data, comparing our algorithm with the classic CURE, BIRCH, and K-means algorithms. The results show that our algorithm not only guarantees clustering accuracy but also runs faster, especially on large-scale web data.
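The abstract does not give the details of the quick-sort-based clustering, so the following is only a minimal sketch of one way such an idea could look, not the thesis's actual algorithm. It assumes a word-set Jaccard similarity, a fixed similarity threshold, and a random pivot text per pass (loosely mirroring quick sort's random pivot and partition step); the names `tokens`, `jaccard`, `pivot_cluster`, and the default `threshold=0.6` are all hypothetical.

```python
import random
import re


def tokens(text):
    """Lowercased word set used as a crude text representation (assumption)."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pivot_cluster(texts, threshold=0.6, seed=0):
    """Group near-duplicate texts by repeated random-pivot partitioning.

    Each pass picks a random pivot text, splits the remaining texts into
    those similar to the pivot (one cluster) and the rest, then repeats on
    the rest. This is the iterative form of recursing on one partition.
    Returns a list of clusters, each a sorted list of indices into `texts`.
    """
    rng = random.Random(seed)
    sets = [tokens(t) for t in texts]
    clusters = []
    remaining = list(range(len(texts)))
    while remaining:
        pivot = remaining[rng.randrange(len(remaining))]
        near, far = [], []
        for i in remaining:
            # The pivot always lands in `near`, so every pass makes progress.
            if jaccard(sets[pivot], sets[i]) >= threshold:
                near.append(i)
            else:
                far.append(i)
        clusters.append(sorted(near))
        remaining = far
    return clusters
```

Keeping one representative per cluster, for example `[texts[c[0]] for c in pivot_cluster(texts)]`, removes duplicates in the sense the abstract describes. Each pass of this sketch is linear in the number of remaining texts, so the total cost grows with the number of clusters; it makes no claim to reproduce the near-linear bound reported in the abstract.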

Keywords/Search Tags:

Technology intelligence collection, unsupervised system, intelligence analysis technology, fast text clustering, quick sort

Related items

1	Research Of Text Clustering Technology Based On Colony Intelligence
2	The Study And Application Of Web Text Data Mining Technology Based On The Approximate Pages Clustering Algorithm
3	Text Classification Method Based On Unsupervised Clustering And Naive Bayesian Classifier
4	Research On Key Implementing Technologies Of Quick Response Information System For Textile And Costume
5	Research And Realization Of Dictionary Resource Process System And ECU Dict
6	Research And Application On Technologies Of Text Clustering Oriented To Enterprise Competitive Intelligence
7	The Research And Implementation Of Modern Enterprise Intelligence Information System Based On Text Mining
8	Analysis And Design Of Competitive Intelligence System Of Enterprises Based On Text Mining
9	A text mining framework linking technical intelligence from publication databases to strategic technology decisions
10	Design Of Open-source Intelligence Collection System Based On Big Data