Research on Automatic Denoising of Domain Scientific and Technical Literature Collections

Posted on: 2023-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2558307070953959
Subject: Library and Information Science
Abstract/Summary:
The quality of a domain literature set determines the reliability of any analysis built on that domain's scientific and technical literature. Because completeness and accuracy inherently trade off when a domain literature set is constructed, conventional construction methods inevitably bring in a certain amount of literature unrelated to the domain. Previous studies on the quality control of literature sets, however, have mainly focused on data cleaning, chiefly the handling of missing and duplicate values. Removing non-domain literature from a domain literature set usually requires domain knowledge and relies on expert experience or costly manual annotation. Can a low-cost, generalizable automatic noise reduction scheme for domain literature sets be constructed under current technical conditions?

This study introduces PU-learning, a weakly supervised learning technique, to achieve noise reduction of domain literature sets without manual annotation, and combines it with publicly available topic word lists to build a complete automatic noise reduction scheme. Comparative experiments explore how the number of subject terms, the choice of classifier, and the ratio of positive to negative samples affect the noise reduction effect. The two-stage comparison experiments show that the scheme performs best when LGBMClassifier is used as the first-stage classifier and SciBERT as the second-stage classifier. The appropriate number of subject terms is closely tied to the quality of those terms (i.e., their domain relevance); for the artificial intelligence domain, the scheme works best when a similarity of 0.4 between subject terms and domain terms is used as the reference threshold. The ratio of positive to negative samples has little influence on the upper limit of the noise reduction effect; for the sake of stability, a ratio between 1 and 6 is recommended, chosen according to the specific situation.

To evaluate the effectiveness and generalization of the proposed scheme in a standard, quantitative way, the initial literature set before noise reduction, the literature set after noise reduction, and the ideal literature set are compared under multi-task scenarios. The results show that in most scenarios the literature set after noise reduction performs significantly better than the initial literature set, which demonstrates the effectiveness of the scheme.
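The two-stage PU-learning idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract reports LGBMClassifier and SciBERT as the best classifiers, whereas this sketch swaps in a simple Rocchio-style nearest-centroid rule so that it runs on the standard library alone. The function name `denoise`, the parameter `neg_per_pos`, the single-word subject-term matching, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (lowercased, whitespace-tokenised)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Mean of the length-normalised count vectors."""
    total = Counter()
    for vec in vectors:
        norm = math.sqrt(sum(c * c for c in vec.values())) or 1.0
        for term, count in vec.items():
            total[term] += count / norm
    return {term: value / len(vectors) for term, value in total.items()}


def cosine(vec, proto):
    """Cosine similarity between a count vector and a prototype."""
    dot = sum(count * proto.get(term, 0.0) for term, count in vec.items())
    norm_v = math.sqrt(sum(c * c for c in vec.values()))
    norm_p = math.sqrt(sum(p * p for p in proto.values()))
    if norm_v == 0.0 or norm_p == 0.0:
        return 0.0
    return dot / (norm_v * norm_p)


def denoise(docs, subject_terms, neg_per_pos=1):
    """Return indices of documents kept as domain-relevant.

    Assumption: documents containing a subject term are treated as
    positives; everything else is unlabeled (the "PU" setting).
    """
    vectors = [vectorize(d) for d in docs]
    terms = {t.lower() for t in subject_terms}
    pos_idx = [i for i, v in enumerate(vectors) if any(t in v for t in terms)]
    pos_set = set(pos_idx)
    unl_idx = [i for i in range(len(docs)) if i not in pos_set]
    if not pos_idx or not unl_idx:
        return list(range(len(docs)))  # nothing to separate

    pos_proto = centroid([vectors[i] for i in pos_idx])

    # Stage 1: the unlabeled documents least similar to the positive
    # centroid are taken as reliable negatives.
    ranked = sorted(unl_idx, key=lambda i: cosine(vectors[i], pos_proto))
    n_neg = max(1, min(len(ranked), neg_per_pos * len(pos_idx)))
    neg_proto = centroid([vectors[i] for i in ranked[:n_neg]])

    # Stage 2: keep seed positives, plus any document at least as close
    # to the positive prototype as to the negative one.
    return [
        i for i, v in enumerate(vectors)
        if i in pos_set or cosine(v, pos_proto) >= cosine(v, neg_proto)
    ]


docs = [
    "deep learning neural network training",
    "reinforcement learning agent policy",
    "network training with gradient descent",
    "soil erosion in river basins",
    "river basins sediment transport",
]
print(denoise(docs, ["neural", "reinforcement"]))  # prints [0, 1, 2]
```

In this toy run the third document contains no subject term but is kept because it shares vocabulary with the seed positives, while the two hydrology documents are dropped. Raising `neg_per_pos` pulls more unlabeled documents into the negative set, which is one way to see why the sample ratio can affect the stability of such a scheme.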
Keywords/Search Tags: domain literature collection, literature collection noise reduction, PU-learning, subject headings, classifier