
Research On Key Techniques In Cross-document Fusion

Posted on: 2017-04-15; Degree: Doctor; Type: Dissertation
Country: China; Candidate: L Yue; Full Text: PDF
GTID: 1108330482494776; Subject: Computer application technology
Abstract/Summary:
Because the online corpus is enormous, Web search engines often return far more results than a user actually needs. Going through all of these results to find the target information is time-consuming and often infeasible. Automatic document summarization has therefore been proposed for salient information retrieval and efficient knowledge acquisition; it aims to produce the shortest description that contains the most important information in all of the documents.

A related line of research, document fusion, builds on this foundation. The key difference between the two areas is that multi-document summarization aims to produce the shortest description containing the most relevant information, whereas document fusion aims to produce the shortest description containing all of the information found in the document set, without repetition. Loosely speaking, multi-document summarization resembles the intersection of multiple documents, while document fusion resembles their union.

To date, numerous approaches for identifying important content for automatic text summarization have been developed. Topic representation approaches, for example, first derive an intermediate representation of the text that captures the topics discussed in the input, and then score the sentences of the input document for importance based on these topic representations.

Within the scope of this work, we focus on three key techniques in cross-document fusion: (1) the problem of object merging in information fusion; (2) sentence ordering, which constructs a coherent structure from sentences extracted from multiple documents and helps ensure the fluency and readability of the results; and (3) document clustering based on a domain-specific ontology.

1. OMFM: A Framework of Object Merging based on Fuzzy Multisets

Information fusion is the process of merging information from multiple sources into a new set of information. Existing work on information fusion is applicable in various scenarios, such as multi-agent systems, group decision making, and multi-document summarization. We develop an effective framework, based on fuzzy multisets, for solving the object merging problem. The objects defined in this part are data segments in the document fusion task, that is, concepts together with their semantically related terms, in which different semantic relations are embedded. The fundamental operation is a merge function that maps the data segments in a fuzzy multiset onto a single object, called a solution. Under this framework, we define two quality measures, purity and entropy, to quantify the quality of any merge function by balancing the accuracy and completeness of a given solution. A merge function that yields the best solution is called a VI-optimal merge function, and we study a series of its theoretical properties. Finally, we investigate the proposed framework in a specific application scenario, document fusion, and show how the framework works with illustrative examples.

2. Sentence Ordering based on Continuous Hopfield Neural Network for Document Fusion

Sentence ordering is the task of composing a coherent structure from sentences extracted from multiple documents, which helps ensure the fluency and readability of automatic document summarization and document fusion. A correct order of these sentences helps human readers understand the input articles. Moreover, the problem of information ordering is not limited to the areas mentioned above; it concerns all natural language generation (NLG) applications, such as discourse planning and sentence aggregation.
In addition, a brief, well-organized, fluent answer at a specified level of granularity is also valuable in real-world question answering systems, a classical application in social search. In this part, the sentence ordering problem is treated as a combinatorial optimization problem and solved with a continuous Hopfield neural network (CHNN), which transforms the objective function of the optimization problem into the energy function of the neural network and maps the variables onto the state of the network. Specifically, we propose using a CHNN to improve ordering results: the method examines the most frequent orders in the original documents and considers the topical relevance between local themes during the overall ordering process, in which the ordering algorithm traverses every local theme once and searches for a shortest path as the final sentence ordering.

Assessing the quality of a sentence ordering generated by an algorithm is a non-trivial task. Three semi-automatic evaluation measures that have been used in previous work are employed here; each compares the sentence ordering produced by an algorithm against the ordering produced by a human annotator. Two of them are rank correlation coefficients, Spearman’s rank correlation and Kendall’s rank correlation, and the third, Average Continuity, assesses the continuity of sentence pairs. The experimental results suggest that our method is effective compared with random ordering (RO), chronological ordering (CO), majority ordering (MO), and precedence relation ordering (PRO). The distribution of the subjective grades, however, indicates that a considerable amount of work remains to be done to raise poor orderings to an acceptable or perfect level.

3. A Fuzzy Document Clustering Approach based on Domain-specified Ontology

Applications of document clustering include automatic document organization, topic extraction, and fast information retrieval or filtering.
Numerous methods have been developed for document clustering. Despite these advances, document clustering still presents certain challenges, such as optimizing feature selection for low-dimensional document representation and incorporating the mutual information between documents into a clustering algorithm. This part mainly addresses these two challenges. First, we construct a domain-specific ontology that provides a controlled vocabulary describing the hazards related to dairy products. Synonyms of the controlled vocabulary that occur in the document set are considered relatively prevalent and fundamentally important for feature selection. Second, in combination with the vector space model (VSM), we perform singular value decomposition (SVD) to translate all of the term-document vectors into a concept space. We then obtain the mutual information between documents by calculating the similarity of every pair of document vectors in the orthogonal matrix of right singular vectors. As the mutual information matrix is also a fuzzy compatible relation, a fuzzy equivalence relation can be derived by calculating its max-min transitive closure. Finally, based on the fuzzy equivalence relation, all of the data sequences are allocated into clusters under the guidance of a cluster validation index. Our method both reduces the dimensionality of the original data and considers the correlation between terms. The experimental results show that encoding the ontologies in the aggregation process can provide better clustering results. Moreover, the proposed work has been applied to food safety supervision, which benefits both government and society.
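
The clustering steps described above (SVD into a concept space, pairwise document similarity, max-min transitive closure, and a cut of the resulting fuzzy equivalence relation) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The function names, the use of cosine similarity clipped to [0, 1], the number of retained concepts `k`, and the fixed threshold `lam` are all assumptions made for illustration; in particular, the abstract selects the partition with a cluster validation index, which is not reproduced here.

```python
import numpy as np


def document_similarity(term_doc, k):
    """Fuzzy compatible relation between documents (assumed: clipped cosine).

    term_doc is a terms x documents matrix; k is the number of concepts kept.
    """
    _, _, vt = np.linalg.svd(term_doc, full_matrices=False)
    docs = vt[:k].T  # one row per document, taken from the right singular vectors
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    docs = docs / np.where(norms == 0, 1.0, norms)
    sim = np.clip(docs @ docs.T, 0.0, 1.0)  # keep membership degrees in [0, 1]
    np.fill_diagonal(sim, 1.0)              # reflexive by construction
    return sim


def max_min_compose(r, s):
    """Max-min composition: result[i, j] = max over k of min(r[i, k], s[k, j])."""
    return np.max(np.minimum(r[:, :, None], s[None, :, :]), axis=1)


def transitive_closure(relation):
    """Iterate max-min composition until the relation stops changing."""
    closure = relation.copy()
    while True:
        nxt = np.maximum(closure, max_min_compose(closure, closure))
        if np.allclose(nxt, closure):
            return nxt
        closure = nxt


def lambda_cut_clusters(equivalence, lam):
    """Cut a fuzzy equivalence relation at level lam and label its classes."""
    n = equivalence.shape[0]
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] == -1:
            for j in range(n):
                if equivalence[i, j] >= lam:
                    labels[j] = next_label
            next_label += 1
    return labels
```

For example, the relation with degrees 0.8 between items 0 and 1, 0.4 between items 1 and 2, and 0 between items 0 and 2 is compatible but not transitive. Its closure raises the degree between items 0 and 2 to 0.4, so a cut at 0.5 separates item 2 from the other two, while a cut at 0.4 places all three items in one cluster.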
Keywords/Search Tags: Document fusion, Fuzzy multisets, Object merging, Semantic relations, Sentence ordering, Continuous Hopfield neural network, Domain-specified ontology, Document clustering