
Design Of Scientific And Technical Information Collection System And Research On Its Fast Text Clustering Algorithm

Posted on: 2015-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: C Song
GTID: 2298330452453249
Subject: Computer technology
Abstract/Summary:
Scientific and technological information collection is an important part of research in science and technology information, and is also the foundation for analyzing such information. With the explosive growth of online information, science and technology intelligence researchers have to spend too much time and energy on data collection and statistical analysis, so manual work can no longer meet the needs of information retrieval and analysis. To obtain more professional, more accurate, more comprehensive, and more timely information that effectively helps the relevant departments make scientific decisions and give timely guidance, it is necessary to study how to gather and analyze data more effectively. This thesis therefore completes the following two pieces of work:

1) This thesis designs and implements an unsupervised technological intelligence gathering system. First, the system uses a meta-search model and a vertical search model to retrieve data and papers from the web. A URL scheduler, memory management, data storage, a source code parser, and a multi-threaded control module together form an automatic data collection system that works without human intervention and responds automatically to emergency situations. Then a data analysis module automatically analyzes the data from the collected papers, providing information and guidance for in-depth analysis and research. Finally, the system is tested in the aircraft manufacturing domain; the experimental results show that it can collect data and papers from the web effectively and can complete some systematic intelligence analysis.

2) Web data is large in volume and contains many duplicates, which makes it difficult to process manually. This thesis therefore proposes a text clustering algorithm based on quick sort to remove duplicate data and compress the data. First, the text clustering problem is converted into a numeric sorting problem based on the similarity measure between texts, and the quick sort algorithm is used to form the clusters. Then a value randomization strategy and recursive operations further improve efficiency, achieving near-linear time complexity. Finally, extensive experiments on real and artificial data compare the algorithm with the classic CURE, BIRCH, and K-means algorithms. The results show that the algorithm not only guarantees clustering accuracy but also runs faster, especially when dealing with large-scale web data.
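
The abstract does not give the details of the clustering algorithm, so the following is only a minimal sketch of one possible reading of "clustering as a quick-sort-style partition". It assumes word-shingle Jaccard similarity as the similarity measure and a randomly chosen pivot text; the names `shingles`, `jaccard`, `pivot_cluster`, and the `threshold` parameter are all hypothetical and do not come from the thesis. The recursion on the leftover texts is written as a loop.

```python
import random


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pivot_cluster(texts, threshold=0.5, seed=0):
    """Group near-duplicate texts with a quick-sort-style partition.

    Assumption (not from the thesis): each round picks a random pivot,
    reduces every remaining text to one number (its similarity to the
    pivot), and splits the texts into "near" and "far" on that number.
    The "near" side becomes a cluster; the "far" side is processed again.
    """
    rng = random.Random(seed)
    remaining = [(i, shingles(t)) for i, t in enumerate(texts)]
    clusters = []
    while remaining:
        _, pivot_set = remaining[rng.randrange(len(remaining))]
        near, far = [], []
        for i, s in remaining:
            side = near if jaccard(pivot_set, s) >= threshold else far
            side.append((i, s))
        # The pivot has similarity 1.0 to itself, so "near" is never empty.
        clusters.append(sorted(i for i, _ in near))
        remaining = far
    return clusters


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "aircraft manufacturing uses composite materials extensively today",
        "the quick brown fox jumps over the lazy cat",
    ]
    for members in pivot_cluster(docs):
        print(members)
```

Keeping one representative per cluster would then give the deduplicated, compressed data set. This sketch makes one pass over the remaining texts per cluster, so its cost depends on the number of clusters; it is not claimed to reach the near-linear complexity reported for the thesis's algorithm.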
Keywords/Search Tags: technology intelligence collection, unsupervised system, intelligence analysis technology, fast text clustering, quick sort