Research and Implementation of Semi-supervised Multi-label Methods for Automatically Linking Literature

Posted on: 2015-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: M C Zhang
GTID: 2268330428977018
Subject: Computer application technology
Abstract/Summary:
As interdisciplinary collaboration deepens, many documents have become interdisciplinary, and the literature held in databases grows by millions of items every year. Automatic extraction of a deep faceted classification tree for a field is now possible. The open problem is how to automatically link millions of documents to that faceted classification tree so that the literature can be retrieved quickly. After analysis, the dissertation recasts the automatic linking of literature as a multi-label classification problem. However, we can neither label a large amount of data to train the classifiers, nor completely ignore the guidance that a small amount of labeled data gives to classification. In the era of big data, both the accuracy of an algorithm and its execution time must be considered. The dissertation therefore studies a semi-supervised multi-label learning algorithm based on a distributed framework. The detailed work is as follows:

1) The dissertation analyzes the existing strategies for multi-label algorithms and, weighing their respective advantages and disadvantages, adopts the first-order strategy as the research plan.

2) Considering the shortcomings of the label propagation algorithm, the effect of labeled and unlabeled data on propagation, and the clustering hypothesis, the dissertation proposes a multi-level label propagation algorithm based on data reconstruction. Experiments show that the improvement is effective: it raises the accuracy of the algorithm without changing its time complexity.

3) The dissertation analyzes the two core components of Hadoop, the distributed file system and MapReduce. Using laboratory equipment, the author sets up a three-node distributed computing environment.

4) The author improves the matrix multiplication method under the distributed framework and turns the multi-level label propagation algorithm based on data reconstruction into a semi-supervised label learning algorithm. To improve classification accuracy, the author extracts label-specific features before classification, which reduces the data dimensionality. Compared with existing multi-label algorithms, the resulting multi-label learning algorithm performs better and can handle massive data. The larger the data size and the more hosts are used, the greater the speedup.

5) Using Lucene, we develop a prototype system for the automatic linking of literature that supports keyword retrieval and faceted retrieval. It first uses the DFT-Extractor system to obtain the required faceted classification tree. The semi-supervised label learning algorithm is embedded in the construction of the faceted index. Tests on ACM data further show that the method is effective. It can also lower hardware requirements and reduce costs, which gives it greater practical value.
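
For orientation, the sketch below shows plain label propagation used in a first-order multi-label fashion, where each label is propagated independently of the others. It is not the dissertation's multi-level, data-reconstruction variant, and it runs on a single machine. The function names, the RBF similarity, the `sigma` bandwidth and the 0.5 decision threshold are all assumptions made for illustration. The repeated product of the transition matrix with the label matrix is the kind of matrix multiplication that item 4) moves onto a distributed framework.

```python
import math


def rbf_affinity(points, sigma=1.0):
    """Dense RBF similarity matrix with a zero diagonal (assumed kernel)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def propagate_labels(points, seed_labels, n_labels,
                     sigma=1.0, iters=200, threshold=0.5):
    """Baseline label propagation, one score column per label.

    points      : list of feature tuples
    seed_labels : dict mapping a labeled point's index to its set of label ids
    Returns a list with the predicted label set of every point.
    """
    n = len(points)
    w = rbf_affinity(points, sigma)

    # Row-normalise the similarities into a transition matrix T.
    t = []
    for row in w:
        s = sum(row)
        t.append([v / s if s > 0 else 0.0 for v in row])

    def clamp(i):
        # Known labels of a seed point as a 0/1 score row.
        return [1.0 if k in seed_labels[i] else 0.0 for k in range(n_labels)]

    # Score matrix F: seed rows carry their labels, all other rows start at 0.
    f = [clamp(i) if i in seed_labels else [0.0] * n_labels for i in range(n)]

    for _ in range(iters):
        # F <- T * F (the matrix product that dominates the cost).
        f = [[sum(t[i][j] * f[j][k] for j in range(n))
              for k in range(n_labels)] for i in range(n)]
        # Reset the labeled points to their known labels after each step.
        for i in seed_labels:
            f[i] = clamp(i)

    # First-order decision: threshold every label independently.
    return [{k for k in range(n_labels) if f[i][k] >= threshold}
            for i in range(n)]


if __name__ == "__main__":
    # Two well-separated clusters, one labeled point in each (toy data).
    docs = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (5.0, 5.0), (5.2, 5.1), (5.1, 4.8)]
    seeds = {0: {0, 2}, 3: {1}}
    print(propagate_labels(docs, seeds, n_labels=3))
```

On this toy input each unlabeled point inherits the label set of the seed in its own cluster, which is the clustering hypothesis mentioned in item 2) at work.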
Keywords/Search Tags: faceted retrieval, semi-supervised learning, multi-label learning, label-specific features, Hadoop