
High Dimensional Aggregation Model Of Digital Literature Resource

Posted on: 2015-02-13 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: F G Niu | Full Text: PDF
GTID: 1228330428975350 | Subject: Information Science
Abstract/Summary:
Scientific experimental data, statistical data, metadata: we live surrounded by "big data". Natural information, social information, new information, aging information: the volume of information around us is so large that it is like being at sea, yet we still struggle to find the knowledge we want. Retrieval systems provide an efficient way to find and retrieve information, the Internet offers a larger platform for doing so, and data fusion and information resource integration provide far more content to retrieve. State promotion of information resource construction, the popularity of Internet applications, and improvements in retrieval technology have contributed greatly to knowledge acquisition and knowledge services, but they still struggle to cope with the rapid growth of information. As information resource construction continues to improve, resource aggregation will become an extension of the concepts of data integration and resource integration, and will continue to play a full role in knowledge discovery. Aggregating more comprehensive information resources on the basis of integration, and then finding ways to mine, discover, and recommend knowledge, may become a paradigm for future information resource aggregation.
From the perspective of digital resources, this paper aims to reduce the reliance of information resource aggregation on background knowledge and to make it easy to apply. It proposes CLSVSM (Co-occurrence Latent Semantic Vector Space Model), which clusters documents using the co-occurrence information of the literature collection itself or of related fields. For the problem of clustering academic literature or text, there are two lines of attack in academic circles: one is to improve the model used to represent the literature, and the other is to improve the clustering algorithm.
Traditional algorithms have shortcomings when clustering high-dimensional sparse vectors, and some newer algorithms are not perfect either. The main reason is that the effect of a clustering algorithm is closely tied to the characteristics of the data itself and to how information is extracted and represented. When information is limited, the advantages of a clustering algorithm cannot be brought fully into play; mining and extracting information and choosing the document vector representation become particularly important instead. This paper works from limited metadata or keywords, so the vector representation of the literature is especially sparse compared with general text representation. In this situation the clustering algorithm alone cannot solve the problem, just as a "housewife" cannot cook a meal without "rice". The key breakthrough of this paper is therefore to extract and use the semantic information of documents and to put forward a new vector representation of the literature; the CLSVSM model is proposed for this purpose. Experiments on literature data confirm that clustering with CLSVSM performs better than clustering with VSM and GVSM.
This paper studies both theory and a practical, application-oriented approach. The text consists of 7 chapters. Apart from two chapters, the introduction and the conclusion, the chapters are summarized as follows:
In Chapter 1, the paper clarifies concepts, sorts out the basic research theory, and proposes a starting point. In elaborating concepts, it first defines the object of study, namely the scope of digital literature resources. Second, it focuses on the connotation and denotation of resource aggregation; on this basis, the author proposes a three-level explanation of the concept of aggregation, running from data fusion to resource integration and then to knowledge discovery, and pays particular attention to clustering-based knowledge discovery.
Finally, it summarizes the forms and content of literature resource aggregation. In sorting out the basic research theory, it uses literature analysis to examine: text mining theory, which helps extract the main features of literature information; co-occurrence analysis, which helps obtain semantic information from language use; latent semantic analysis, which helps compute latent information; the theory of literature feature aggregation, which helps explain clustering results; information entropy, used for feature extraction and clustering evaluation; and long tail theory, which guides the choice of feature words. On this basis, it proposes the basic idea of building metadata-based (mainly keyword-based) feature vectors and clustering on them in a practical, application-oriented way.
In Chapter 2, the author explores high-dimensional feature vector representations of literature and measures of literature similarity, in order to find innovative ideas for the model. First, the attribute features of literature are diverse and high-dimensional. Since this paper mainly aims to cluster literature by theme, it chooses the features that reflect the theme to represent the literature and chooses the most practical algebraic representation. Second, it conducts a comparative study running from the Vector Space Model (VSM) to the Generalized Vector Space Model (GVSM) and then to the Semantic Vector Space Model (SVSM), and describes related improvements to the representative models.
Finally, it chooses to supplement the Boolean indicator vector with the latent semantics of the existing literature, with the aim of forming a new vector representation model.
In Chapter 3, the paper presents the co-occurrence latent semantic vector space model (CLSVSM) and the literature clustering steps based on it, further clarifying the concept of co-occurrence latent semantics and illustrating how it is extracted and used. Taking the supplementing of vector semantic information as the main breakthrough, it uses co-occurrence analysis to extract latent semantic information and then superimposes that information on the basic feature indicator vectors of the literature to form a new representation model. The new model covers not only the themes of a document itself but also the relationships implied by the co-occurrence of a set of feature words, so it embodies the thematic information of the literature more fully. The new model is defined as the co-occurrence latent semantic vector space model, abbreviated CLSVSM. On the basis of CLSVSM, the clustering procedure selects the cosine similarity measure, an appropriate algorithm, and a criterion function. Finally, CLSVSM is compared in theory with other models, especially models used for clustering Chinese literature.
In Chapter 4, experiments test the clustering effect of CLSVSM, comparing it primarily with the VSM and GVSM models. Clustering quality is evaluated by comparison with the original classification. This paper uses two experimental data sources. One is the reprinted G9 "Library and Information Science", with its columns taken as categories; the test results on it are poor. The main reason is that the columns have no clear standard as categories, so they cannot be used to evaluate clustering quality.
The other data set is drawn from CNKI: sample literature sets from three disciplines, "Publish Science", "Library and Information Science and the Digital Library", and "Files and Museum". The experiments show that CLSVSM-based clustering performs very well: measured by entropy, purity, and BF value, the clustering results are at least 24% better than VSM-based clustering and at least 13% better than GVSM-based clustering. CLSVSM is therefore considered more successful for clustering literature by topic. These experiments were carried out on the gCLUTO platform.
In Chapter 5, the CLSVSM model is applied in practice to test the ability of deep aggregation to highlight themes. The first empirical study samples literature from the discipline of probability theory and mathematical statistics, an applied science discipline whose themes are relatively dispersed. The second uses a set of retrieved documents whose themes are relatively concentrated, mainly to test the clustering of search results. The empirical studies are again run on the gCLUTO platform; three methods are used to determine the number of clusters, and the results with fewer clusters are compared with those with more. The two studies indicate that CLSVSM-based clustering divides the literature set well, and that the more clusters there are, the more prominent the small-scale themes of the literature set become. A larger number of clusters implies a finer division of thematic relations and deeper digging into the subject, which is what is meant by deep aggregation of literature topics.
To sum up, facing the huge task of resource aggregation, this paper focuses on clustering within literature aggregation, proposes the CLSVSM model, and demonstrates its effectiveness in experiments.
The CLSVSM model not only provides a feasible approach to document clustering when information is limited, but also provides a reference for research and practice in similarity measurement for literature aggregation, document retrieval, and document classification.
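The abstract does not give the CLSVSM formulas, so the following is only a minimal sketch of how the idea described for Chapter 3 might look: Boolean keyword vectors are supplemented with co-occurrence information and then compared by cosine similarity. The Jaccard-style co-occurrence strength, the rule of filling an absent keyword with its strongest co-occurrence link to the document's own keywords, the toy keyword sets, and all function names are assumptions made for illustration, not the author's definitions.

```python
from math import sqrt

def boolean_vectors(docs, vocab):
    """Plain Boolean indicator vectors: 1.0 if the keyword is in the document."""
    return [[1.0 if k in d else 0.0 for k in vocab] for d in docs]

def cooccurrence_strength(docs, vocab):
    """Jaccard-style co-occurrence strength between keywords (assumed measure)."""
    df = {k: sum(1 for d in docs if k in d) for k in vocab}
    strength = {}
    for a in vocab:
        for b in vocab:
            both = sum(1 for d in docs if a in d and b in d)
            union = df[a] + df[b] - both
            strength[(a, b)] = both / union if union else 0.0
    return strength

def enriched_vectors(docs, vocab):
    """Boolean vectors supplemented with co-occurrence information.

    Assumed rule: a keyword the document contains keeps weight 1.0; a keyword
    it lacks gets the strongest co-occurrence link to any keyword it contains.
    """
    strength = cooccurrence_strength(docs, vocab)
    vectors = []
    for d in docs:
        vec = []
        for k in vocab:
            if k in d:
                vec.append(1.0)
            else:
                vec.append(max((strength[(k, own)] for own in d), default=0.0))
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy keyword sets (hypothetical data, not taken from the dissertation).
docs = [{"clustering", "vsm"}, {"vsm", "retrieval"}, {"clustering", "entropy"}]
vocab = sorted(set().union(*docs))

boolean = boolean_vectors(docs, vocab)
enriched = enriched_vectors(docs, vocab)

# Documents 1 and 2 share no keyword, so their Boolean cosine is 0.0.
print(cosine(boolean[1], boolean[2]))
# Both co-occur with document 0's keywords, so the enriched cosine is about 0.316.
print(cosine(enriched[1], enriched[2]))
```

In this toy case the two documents look unrelated under the sparse Boolean representation, while the supplemented vectors give them a nonzero similarity. That is the kind of effect the abstract attributes to adding co-occurrence latent semantics when only a few keywords per document are available.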
Keywords/Search Tags: digital literature resources aggregation, high dimensions, clustering, knowledge discovery, co-occurrence analysis, co-occurrence latent semantic vector space model (CLSVSM)