Analysing The Academic Impact Of Software: A Bootstrapped Learning Of Software Entities In Full-text Papers

Posted on:2017-10-12

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X L Pan

GTID:1368330485468093

Subject:Information Science

Abstract/Summary:

Software plays an important role in the advancement of science. It is used in many aspects of scientific research, such as process control, data processing, results analysis, and knowledge diffusion. Although there is a consensus that software is useful to the scientific community, software has long been considered a supporting service rather than a formal research product. Under the current scientific reward system, a scientist’s worth depends mainly on publications, and the scientific value of software has long been underestimated or, at worst, ignored. In fact, scientists spend considerable time developing software and sharing it with others. Sharing software lowers the barriers to software use and benefits the scientific community. Developers therefore have a deep interest in the use and impact of their software. Funding agencies also want to know the impact of the software they have supported, because they use it to evaluate their investments and to decide whether to provide additional funding for a software project. A study of the impact of software is thus imperative: it will satisfy the needs of scientific software developers and funding agencies, and it will help build an open, transparent, and inclusive scientific reward system. However, little is known about the impact of software. In this study, we focus on the academic impact of software. We also analyze how scholars mention and cite software in the scientific literature and try to identify the important factors that influence users’ citation behavior. Finally, we provide recommendations to improve the practice of software use and software citation.

In this dissertation, we first introduce the background and significance of analyzing the impact of software, then present the research questions and the research procedures.

Second, we provide a comprehensive review of the research on entity extraction, on scientific data sharing and citation, and on the development, sharing, use, and impact assessment of scientific software. For entity extraction, we introduce the concept and a short history of information extraction, define entity extraction and named entity recognition, review three kinds of entity extraction approaches (rule-based, machine learning-based, and hybrid), summarize the pros and cons of machine learning and rule-based approaches, and finally give the reasons for choosing bootstrapping methods, which are rule-based approaches. We also review scientific data sharing and citation, because a parallel can be drawn between software and data in the scientific literature, and studies of data sharing and citation are a valuable reference for studying the academic impact of software. In addition, we investigate the development, sharing, use, and impact of software that is free for academic use, based on a simple process model framework of software in science presented in a previous study.

We then propose an improved bootstrapping method to extract software entities from full-text papers, analyze the academic impact of software using the number of software mentions in full text and the number of software citations, and investigate the software use and citation behavior of scientists in library and information science. The three parts are related to each other. Both the analysis of academic impact and the investigation of software use and citation behavior take the software entities extracted by the proposed automatic extraction method as their research objects. The study of software use and citation behavior in library and information science verifies some assumptions made in the automatic extraction method and confirms some findings of the academic impact study. We also provide recommendations to improve the practice of software use and citation, based on the observed patterns and influencing factors of scientists’ software use and citation behavior.

The third section proposes an improved bootstrapping method to extract software entities from full-text scientific papers. The input to this method is a text corpus and a set of software seed terms. The method combines multiple features, including the presence of uppercase letters, version numbers, left context triggers, and right context triggers, to estimate the probability that an unlabeled entity is software. We then design, construct, and implement a software extraction system based on the proposed algorithm. Finally, we measure the performance of the system. The test set consists of 386 papers published in PLOS ONE between January 5 and January 12, 2014. Two coders annotated the test set separately, and the manually labeled software entities, 470 in total, are used as the gold standard for evaluation. Three entity extraction systems, Basilisk, NOMEN, and SPIED, serve as baselines, and precision, recall, and F1 score are used as the evaluation metrics. Our system achieves considerably higher precision than the other systems, and it also has the highest recall and F1 score. Overall, our entity extraction system outperforms Basilisk, NOMEN, and SPIED at extracting software entities from the data set. Moreover, we find that using a pattern accuracy metric and entity features to filter unlabeled entities improves extraction performance.

The fourth section investigates the use and impact of software by examining how software packages are mentioned and cited in 9548 articles published in PLOS ONE in 12 defined disciplines. The bootstrapping method proposed in the third section is used to identify software entities in the 9548 papers. Before counting the mentions and citations of a software entity, we manually consolidate its variants. We then manually examine how software is mentioned and cited, and write a program to automatically count the numbers of software mentions and citations at the paper level and at the discipline level. Finally, we analyze the statistical results. The main findings are as follows: software is widely used in the scientific community; the practice of software citation varies from discipline to discipline (for example, software citation is more consistently practiced in fields such as environmental sciences, computer and information sciences, and earth sciences, whereas more than 55% of the mentioned software received no citation in chemistry, mathematics, or engineering); and a noticeable share of software goes uncited. These findings suggest that the number of citations alone is not an adequate metric for assessing the value of software. The number of software mentions in full texts should also be taken into account when assessing the impact of software on science. It is worth noting that the results are limited to papers published in PLOS ONE. Before the findings of this study can be generalized, more effort and more data sources are needed to examine software attribution and citation.

In the fifth section, we examine software use and citation through content analysis of journal articles from library and information science, because the automatic software identification and measurement method cannot capture further characteristics of the software mentioned. This section also verifies some assumptions made in the automatic software entity extraction method and confirms some findings of the study of the academic impact of software. We choose a set of 11 journals in library and information science using the 2014 Institute for Scientific Information Web of Science. A random number generator is used to choose research papers from the journals. We choose nine papers from each journal, for a total of 99 library and information science research articles. We develop our coding scheme according to the research objective, and then manually identify software entities in the selected papers and code the contents. This study verifies that a citation identifier usually occurs in the substring that runs from the software entity to the end of the sentence. It also finds that most scholars mention software because they used it in their research, and that the practices of mentioning and citing software are diverse. Scholars in library and information science do not provide additional information about software, such as versions and websites, when they mention it in their articles. A noticeable share of software also goes uncited in library and information science. Moreover, the study finds that a software package receives more citations if its creators give advice on how to cite the software or list publications about the software on its website.

This study is part of a larger effort to examine software attribution and citation. It will help survey the current status of software use in science and lay a solid foundation for subsequent research in this vein. In addition, it will complement the current publication-driven science of science research and help build an open, transparent, and inclusive scientific reward system.

This study makes the following contributions. First, the proposed bootstrapping method outperforms the baselines Basilisk, NOMEN, and SPIED at extracting software entities from a full-text corpus of PLOS ONE publications, and this automatic method enables us to study the disciplinary characteristics of software use and citation, whereas similar studies are limited to one or two disciplines. Second, the number of mentions in full text is used to measure the academic impact of software; this measure will help policy makers better understand the value of software and thus help build an open, transparent, and inclusive scientific reward system. Third, using software use in full text to assess the academic impact of software moves informetric studies from the publication level to the entity level.
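
The fifth section's observation, that a citation identifier usually occurs in the substring from the software entity to the end of the sentence, suggests one simple way to separate cited mentions from uncited ones. The Python sketch below illustrates that idea only; it is not the dissertation's program. The function names, the two citation-marker patterns (numeric brackets and author-year parentheses), and the plain substring match for software names are all assumptions made for illustration.

```python
import re

# Assumed citation-marker shapes (not taken from the dissertation):
# numeric brackets such as [12] or [3, 4], and author-year
# parentheses such as (Smith et al., 2014).
CITATION_MARKER = re.compile(
    r"\[\d+(?:\s*[,\-]\s*\d+)*\]"
    r"|\([A-Z][^()]*?,?\s(?:19|20)\d{2}[a-z]?\)"
)


def is_cited(sentence: str, software: str) -> bool:
    """Return True if a citation marker follows the software name.

    Only the text between the software name and the end of the
    sentence is searched, mirroring the window described above.
    """
    start = sentence.find(software)
    if start == -1:
        return False
    tail = sentence[start + len(software):]
    return CITATION_MARKER.search(tail) is not None


def count_mentions_and_citations(sentences, software):
    """Count sentences that mention the software and those that cite it."""
    mentions = sum(1 for s in sentences if software in s)
    cited = sum(1 for s in sentences if is_cited(s, software))
    return mentions, cited


sentences = [
    "Data were analysed with SPSS [12].",
    "We used SPSS version 20 for the regression.",
    "As reported in [3], the models were fitted in R.",
]
mentions, cited = count_mentions_and_citations(sentences, "SPSS")
print(mentions, cited)
```

A plain substring match would over-count short names such as R, so a practical version would need token boundaries and the manually consolidated name variants that the fourth section describes.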

Keywords/Search Tags:

entity extraction, bootstrapping methods, scientific software, impact assessment, the use of software, the citation of software, entity citation analysis

Related items

1	Distribution And Characteristics Analysis Of Multiply References Citation In Scientific Papers
2	Research On Encyclopedic Knowledge Bases Oriented Entity-document Relevance Classification
3	Research On The Non-literature Citation In Bioinformatics Articles
4	Development Of Citation Network Generation Software
5	Scientists' Citation Misbehavior Reflected By The Wrong Citation Records
6	Web Citation Analysis And Bibliography Citation Analysis
7	Research On Academic Influence Evaluation Based On Contribution Weighted Citation
8	The citation process in scientific communication: An analysis of citer motivation and citation characteristics of Chinese physicists
9	Citation Behavior Analysis Of Chinese Documents
10	Research Of Named Entity Relation Extraction Method Based On Bootstrapping