Healthcare is an industry that serves the entire population. As medical data become increasingly abundant, the medical industry must keep pace with the times by making full use of medical text data, extracting valuable information from it, and applying that information in real life. This thesis uses a topic model to analyze disease text data in depth and builds a disease knowledge base to support disease question answering. The work helps patients understand diseases according to their own symptoms, assists doctors in making clinical decisions, and provides technical support for analyzing disease development trends and for self-diagnosis. The specific research contents are as follows.

(1) To address the problem that different parts of speech differ in importance in disease text data, contribution weights are set according to part of speech. First, a word segmentation dictionary of professional medical vocabulary is constructed. Then, the disease text data is filtered, segmented into Chinese words, tagged with parts of speech, and stripped of stop words. Finally, each word vector obtained from Global Vectors for Word Representation (GloVe) modeling is annotated with the contribution weight of its part of speech, and the disease text vector is calculated.

(2) To address the problem that the K-Medoids clustering algorithm has low accuracy in calculating similarity, the LG&K-Medoids algorithm is proposed. It combines Latent Dirichlet Allocation (LDA) similarity and GloVe similarity through an improved distance function to obtain the topic clusters of departments. First, LDA is used to model the disease text, and the Jensen–Shannon distance is used to calculate text similarity. Second, GloVe modeling is used to obtain word vectors, the word vectors are weighted according to the contribution of each part of speech in the disease text, and cosine distance is used to calculate the weighted GloVe-based text similarity. Finally, K-Medoids clustering is optimized using the combined similarity and the improved distance formula.

(3) To address the problem that existing disease analysis systems rely on a single model, a disease analysis system based on the LDA topic model is built. First, the requirements analysis and framework design of the disease analysis system are carried out. Second, a disease knowledge base is built that captures the relationships between entities such as diseases, symptoms, departments, drugs, and examination methods. Then, visual interfaces are set up for disease symptom analysis, department disease analysis, and disease question answering. Finally, the symptom text is extracted from the MySQL database and the answers are retrieved from the Neo4j graph database for analysis and display, realizing the functions of disease analysis and disease question answering.

In summary, the disease text clustering algorithm based on the LDA topic model proposed in this thesis achieves higher clustering accuracy on the disease text data set. The disease analysis system built on the LDA topic model helps patients obtain guidance according to their own symptoms at any time, lays a foundation for the application of topic models in medical analysis, and provides new ideas for autonomous disease diagnosis.
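
The similarity computation described in (1) and (2) can be illustrated with a minimal sketch. The abstract does not give the improved distance formula or the actual part-of-speech weights, so the code below rests on stated assumptions: the text vector is taken to be a weighted average of word vectors, the two distances are blended linearly with a hypothetical parameter `alpha`, and the tag names, weights, and toy vectors are illustrative only. It is not the thesis's implementation.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def weighted_text_vector(tagged_words, word_vectors, pos_weights, default_weight=1.0):
    """Average the word vectors, weighting each word by its part of speech.

    Assumption: the text vector is a weighted mean; the thesis may combine
    the weighted word vectors differently.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, pos in tagged_words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # skip out-of-vocabulary words
        w = pos_weights.get(pos, default_weight)
        total = [t + w * v for t, v in zip(total, vec)]
        weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


def cosine_distance(u, v):
    """One minus the cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def combined_distance(theta_a, theta_b, vec_a, vec_b, alpha=0.5):
    """Hypothetical linear blend of the LDA-based and GloVe-based distances."""
    lda_part = js_distance(theta_a, theta_b)
    glove_part = cosine_distance(vec_a, vec_b)
    return alpha * lda_part + (1 - alpha) * glove_part


if __name__ == "__main__":
    # Toy data: made-up 2-d "word vectors" and part-of-speech weights.
    vectors = {"fever": [1.0, 0.0], "severe": [0.0, 1.0]}
    weights = {"n": 3.0, "a": 1.0}  # nouns count more than adjectives
    doc_a = weighted_text_vector([("fever", "n"), ("severe", "a")], vectors, weights)
    doc_b = weighted_text_vector([("severe", "a")], vectors, weights)
    # Topic distributions would come from an LDA model; these are placeholders.
    print(combined_distance([0.9, 0.1], [0.2, 0.8], doc_a, doc_b, alpha=0.5))
```

Evaluating `combined_distance` for every pair of disease texts yields a pairwise distance matrix, which is the form of input a K-Medoids procedure over precomputed distances would consume.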