Design And Implementation Of Biomedical Literature Analysis System

Posted on: 2020-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Wang
Full Text: PDF
GTID: 2438330605962999
Subject: Engineering
Abstract/Summary:
Biomedical literature is an effective means of recording, accumulating, disseminating, and passing on biomedical knowledge, and it is the most basic and important way for biomedical researchers to acquire and exchange knowledge in the field. With the rapid development of biomedical science and technology, the volume of biomedical literature has grown exponentially. How to extract latent rules and knowledge from this massive body of literature has become one of the hot issues in bioinformatics. This thesis designed and implemented a biomedical literature analysis system based on the MedLine database, the PubMed search engine, web crawler technology, and data mining algorithms. The system covers the acquisition of biomedical literature data, preprocessing of that data, multidimensional statistical analysis, cluster analysis, and visualization of results. It uses few resources, is lightweight and convenient, and can help users explore the underlying patterns in biomedical literature in depth. The system provides users with the hot words, research teams, mainstream journals, regional research intensity, research trends, and literature collections of a given field, helping them quickly understand current research trends and make well-informed research decisions. These capabilities demonstrate the practical value of the system. The main research content of this thesis comprised four parts:

(1) Web crawler technology. The biomedical literature analysis system connected to the MedLine database through the PubMed search engine. Based on the presentation form and storage structure of the documents on the page, the system used XPath expressions to locate pages and information, and used a depth-first strategy to crawl the biomedical literature data associated with the search term.

(2) Data preprocessing and analysis. This part of the system consisted of four modules: the preprocessing module, the statistical module, the model building module, and the clustering module. Based on the grammatical features and word characteristics of English biomedical literature, the preprocessing module cleaned the literature data by removing HTML tags, segmenting words, removing stop words, correcting spelling, and so on. The statistical module was responsible for computing statistics on the titles, keywords, abstracts, authors, journals, and countries and regions of the biomedical literature. The statistical results revealed the hot words, research teams, mainstream journals, regional research intensity, and research trends of the field. The model building module calculated the weights of the candidate feature words with the optimized TF-IDF algorithm and selected representative feature words to form a word frequency matrix. The clustering module calculated the cosine of the angle between document vectors from the word frequency matrix, constructed a similarity matrix, and then applied Ward's method to cluster the literature.

(3) Optimization of the TF-IDF algorithm. The traditional TF-IDF algorithm ignores both the position of a word in the document and the distribution of words between classes, and its results suffer accordingly. This thesis optimized both the TF factor and the IDF factor to improve the performance of the algorithm. The system integrated a position contribution degree and a part-of-speech contribution degree into the TF factor, so that the part of speech of a feature word is considered together with the importance of its position in the document. The IDF factor took the probability of a feature word occurring in different classes as its starting point, introduced a dimension factor, and used the probabilities of the feature word in the current class and in the other classes as the basis for calculation. To some extent, this method overcame the weakness of the traditional TF-IDF algorithm, which focuses on word frequency but neglects the distribution between classes, and it improved the stability and accuracy of the algorithm on high-dimensional data.

(4) Visualization of the results. The system took scientific soundness, aesthetics, and simplicity as its development principles and used Python libraries such as tkinter, Pyecharts, and Matplotlib to develop the interface and the result display module. Results were presented as word clouds, histograms, pie charts, tree diagrams, and similar forms, which helps users grasp the results intuitively.
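
The model building and clustering steps described in (2) can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the optimized algorithm of the thesis: the abstract does not give the formulas for the position, part-of-speech, or between-class factors, so the sketch uses a standard smoothed TF-IDF, and the `title_boost` parameter is a hypothetical stand-in for a position contribution (words in the title count more than words in the body). The function names and the sample documents are likewise invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs, title_boost=2.0):
    """Build TF-IDF vectors from (title_tokens, body_tokens) pairs.

    title_boost is a hypothetical position weight: each title occurrence
    of a word counts title_boost times, each body occurrence counts once.
    """
    counts = []
    for title, body in docs:
        weighted = Counter()
        for word in title:
            weighted[word] += title_boost
        for word in body:
            weighted[word] += 1.0
        counts.append(weighted)

    vocab = sorted({word for c in counts for word in c})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    doc_freq = {w: sum(1 for c in counts if w in c) for w in vocab}
    # Smoothed IDF (standard form, not the thesis's class-based IDF).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in vocab}

    vectors = []
    for c in counts:
        total = sum(c.values()) or 1.0
        vectors.append([c[w] / total * idf[w] for w in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarity between all document vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Invented sample documents: (title tokens, abstract tokens).
docs = [
    (["gene", "expression"], ["gene", "expression", "tumor", "cells"]),
    (["tumor", "gene"], ["gene", "mutation", "tumor", "growth"]),
    (["protein", "folding"], ["protein", "structure", "folding", "energy"]),
]
vocab, vectors = tfidf_vectors(docs)
sim = similarity_matrix(vectors)
```

In this sample the first two documents share the words "gene" and "tumor", so their similarity is higher than that between the first and third documents, which share no words. The sketch stops at the similarity matrix; the hierarchical clustering with Ward's method that the abstract mentions is not shown.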
Keywords/Search Tags: Biomedical literature, Literature analysis, Web crawler, Literature clustering