A methodology of machine learning in automated entity summarization

Posted on:2017-11-25

Degree:Ph.D

Type:Thesis

University:The Pennsylvania State University

Candidate:Chonde, Seifu

Full Text:PDF

GTID:2458390008961843

Subject:Industrial Engineering

Abstract/Summary:

Conducting background research is a time-consuming yet important part of every research endeavor. It includes compiling relevant sources, reading those sources, and comprehending the information. We find that the volume of this information scales rapidly in the current information age. Automated text summarization, among other techniques (e.g., search engines), helps improve the efficiency of exploring data by distilling the large amounts of information that are becoming prevalent.

This dissertation presents a methodology of automatic entity summarization for summarizing entity and topic interaction in large information stores. The methodology is broken into three steps: Reading, Assembly, and Interpretation. In the Reading step, the appropriate information sources are determined and, subsequently, the interrelated entities are extracted within each source. Four inputs are necessary in this step: a topic extraction algorithm, a named entity recognition algorithm, information sources, and property information for the entities. In the Assembly step, the relationships between entities across sources are represented through knowledge networks. A trimodal weighted co-occurrence hypergraph is presented and then projected into unimodal and bimodal graphs. Finally, in the Interpretation step, graph analytics are presented to summarize the graphs. A novel diversity heuristic is derived based on information entropy to compare information diversity in different streams of literature over time.

To test the methodology, three experiments were conducted. Data from the PubMed Central Open Access Subset, which consisted of 740,418 journal citations in 4,404 journals, was downloaded on July 14, 2014. The first experiment addressed the relationship between the size of the information network and the number of files input into the methodology. It was found that a power law relationship exists, as shown in linguistic theory. The second experiment addressed the validity of the methodology in extracting meaningful connections and predicting the top chemicals using two gold standards. Results indicate that the methodology can be used to determine the top chemicals and that meaningful connections are those with the highest weight in the network. Finally, the diversity heuristic was used in the third experiment to empirically compare the diversity of information in a stream of articles relating to honeybee research with the diversity of information in a stream of articles relating to diabetes research. The existing heuristic was found to give quite noisy results when applied to information networks, whereas the new heuristic has better asymptotic properties. This research is among the first efforts towards building improved literature-based discovery algorithms that are capable of automating the hypothesis generation process in large literature sets.
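
The Assembly and Interpretation steps described above can be illustrated with a small sketch. The abstract does not name the three node modes, the weighting scheme, or the form of the novel diversity heuristic, so everything here is an assumption made for illustration: the mode names (`topic`, `entity`, `property`), the toy records, the helper functions, and the use of plain co-occurrence counts as weights are all hypothetical. The entropy function is the standard Shannon entropy of the edge-weight distribution, used as a generic stand-in and not as the heuristic derived in the dissertation.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

# Hypothetical toy data: one record per information source. Each record acts
# as a hyperedge joining nodes of three modes. Mode names and values are
# illustrative assumptions, not taken from the dissertation.
records = [
    {"topic": {"metabolism"}, "entity": {"glucose", "insulin"}, "property": {"hormone"}},
    {"topic": {"metabolism"}, "entity": {"glucose"}, "property": {"sugar"}},
    {"topic": {"pollination"}, "entity": {"glucose", "fructose"}, "property": {"sugar"}},
]


def project_unimodal(records, mode):
    """Project onto one mode: weight = number of sources where two nodes co-occur."""
    weights = Counter()
    for rec in records:
        # Sorting gives each undirected pair a single canonical key.
        for a, b in combinations(sorted(rec[mode]), 2):
            weights[(a, b)] += 1
    return weights


def project_bimodal(records, mode_a, mode_b):
    """Project onto two modes: weight = number of sources linking a node of each mode."""
    weights = Counter()
    for rec in records:
        for a, b in product(sorted(rec[mode_a]), sorted(rec[mode_b])):
            weights[(a, b)] += 1
    return weights


def shannon_entropy(weights):
    """Shannon entropy (in bits) of the normalized edge-weight distribution."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return -sum((w / total) * log2(w / total) for w in weights.values())


entity_graph = project_unimodal(records, "entity")
topic_entity_graph = project_bimodal(records, "topic", "entity")

# The heaviest edges are the candidate "meaningful connections".
print(topic_entity_graph.most_common(1))
print(shannon_entropy(entity_graph), shannon_entropy(topic_entity_graph))
```

In this sketch a higher entropy means that co-occurrence weight is spread more evenly across edges, which is one simple way to read "information diversity" for a stream of articles; comparing the value across time windows would require rebuilding the projections per window.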

Keywords/Search Tags:

Methodology, Information, Entity, Sources

Related items

1	A quantitative methodology for vetting 'dark network' intelligence sources for social network analysis
2	An Entity Data Model Based Methodology For The Development Of Information System
3	The SigmaIQ methodology: An information quality perspective on oil data
4	A sociometric analysis of information-seeking behavior, information sources, and information networks in boards, committees and commissions in a small rural Iowa community
5	Research On Joint Extraction Of Entity Relations By Fusing Entity Local Information
6	Research On Entity Linking Method And Implementation Of Entity Linking System In Information Security Domain
7	Research On Entity-level Search Crawler And Information Extraction
8	The use of information sources by small Pennsylvania manufacturers
9	Combining Information from Multiple Sources in Bayesian Modeling
10	Research Of Entity Knowledge Base System Based On Information Extraction