As researchers conduct in-depth research in various fields, a large amount of scientific and technological big data is generated, including technical results, technical reports, scientific literature, and research projects. Scientific and technological big data is an essential production material for economic development and scientific and technological production. Governments and enterprises use it to obtain scientific knowledge and predict the direction of technological development, while universities and research institutions mine it for research content and for solutions to technical problems. Compared with common industry big data, scientific and technological big data has cross-domain, cross-media, and cross-temporal characteristics. Processing it faces challenges including high dimensionality, loose organizational structure, heterogeneity, multiple sources, high uncertainty, and high computational complexity. Organizing and processing this large volume of data is therefore of great practical significance. Information retrieval, information analysis, and knowledge mining are important for realizing the efficient management, intelligent analysis, and comprehensive utilization of scientific and technological big data.

This thesis focuses on the theories and technologies of efficient retrieval and intelligent analysis of scientific and technological big data, studying the problems of semantic representation learning, knowledge mining, and accurate retrieval. The main contributions of the thesis are as follows:

1. A Multi-level Self-adaptive Prototypical Networks (MSPN) model is proposed to address the lack of labeled data and the data sparsity encountered when learning the semantics of scientific and technological big data in small-sample scenarios. Graph meta-learning on the attributed network is
constructed using a node representation learning module, an instance-level adaptive learning module, and a feature-level adaptive learning module, which converts the semantic learning of scientific and technological big data into few-shot node representation learning on the graph. The MSPN model constructs an embedded semantic space for scientific and technological big data, learns an adaptive metric function in that space, and performs node classification. To reduce dependency on labeled data, multi-level adaptive learning is conducted based on the idea of the prototypical network. This learning corrects the class prototypes and makes maximal use of unlabeled sample nodes. Additionally, feature dimension filtering is conducted at the feature level so that the learned metric functions are adaptive for each class. To address the problem that the information in the support set samples cannot be guaranteed to be maximally exploited in small-sample scenarios, an InfoMax Classification-Enhanced Learning Network (ICELN) for semantic representation of scientific and technological big data under small samples is proposed. The ICELN model uses the principle of mutual information maximization to increase the amount of information shared between query nodes and class representations, which enhances the graph representation. The meta-learning framework ensures that the attribute knowledge and structural knowledge of the graph learned from the meta-training tasks transfer effectively to new data in the meta-testing tasks, resulting in improved node classification accuracy. Experiments on several available scientific and technological big data sets demonstrate the effectiveness of MSPN and ICELN on semantic representation learning tasks.

2. A Cluster-Aware Multiplex InfoMax model (CAMI) is proposed to address the problem that existing methods do not take into account the rich substructure embedded in scientific and technological big data. The CAMI model is based on
an unsupervised learning framework that uses a network to encode the multi-level substructures in the data. The CAMI model first maps the knowledge graph into a low-dimensional embedding space that retains the semantic and topological information, and then performs analysis on the learned representations. To perform unsupervised learning, an adaptive view generation scheme is proposed to ensure that the generated views maintain the semantic and structural consistency of the data. Within the generated views, the semantic information in the learned features is enriched by extracting global and local features of the data through multi-level mutual information maximization learning from both views. Experiments on different analysis tasks demonstrate that CAMI effectively captures the multi-level structural information in the data and improves performance on different knowledge mining tasks.

3. A Contrastive Multi-View Relevance Matching Model (CMRMM) is proposed to address the underutilization of rich interaction signals between queries and documents in the accurate retrieval of scientific and technological big data. The interaction graph between a query and a document is constructed by using the words in the document as nodes, the word co-occurrence relationships in the document as edges, and the interactions between the query words and the document words as node features. Based on the semantic features of the interaction graph, a valid negative graph is constructed. Graph neural networks are used to extract features from the interaction graph, and the positive graph is generated from an intermediate layer of the network's iterative process. The extracted interaction signal is then constrained against the positive and negative examples using contrastive learning. The Contrastive Multi-View Relevance Matching Model maximizes the use of existing interaction signals, produces contrastive views without increasing the complexity of
the model, and reduces the model's reliance on labeled data. Retrieval experiments on scientific and technological data confirm that the proposed model effectively captures the deep interaction signals between queries and documents, returns highly matching results, and improves the accuracy of scientific and technological big data retrieval.

4. Based on the proposed Multi-level Self-adaptive Prototypical Networks (MSPN), InfoMax Classification-Enhanced Learning Network (ICELN), Cluster-Aware Multiplex InfoMax model (CAMI), and Contrastive Multi-View Relevance Matching Model (CMRMM), an efficient system for the retrieval and intelligent analysis of scientific and technological big data has been developed. The system comprises three modules: the semantic representation module, the knowledge mining module, and the precise retrieval module, which together offer users intelligent analysis and retrieval capabilities for scientific and technological big data.
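
To make the prototypical-network idea behind MSPN more concrete, the following is a minimal sketch of prototype-based classification, not the thesis's implementation. It assumes node embeddings are already available as plain lists of floats. The function names and the optional per-class `feature_weights` (a simple stand-in for feature-level filtering) are hypothetical and are not taken from the thesis.

```python
# Illustrative sketch only: names and the weighting scheme are assumptions.

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def class_prototypes(support):
    """Map each class label to the mean embedding of its support nodes."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def squared_distance(a, b, weights=None):
    """Squared Euclidean distance, optionally weighted per feature dimension."""
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def classify(query, prototypes, feature_weights=None):
    """Assign the query embedding to the class with the nearest prototype.

    feature_weights (hypothetical) maps a label to per-dimension weights;
    a weight of 0 filters that dimension out for that class.
    """
    feature_weights = feature_weights or {}
    return min(
        prototypes,
        key=lambda label: squared_distance(
            query, prototypes[label], feature_weights.get(label)
        ),
    )


# Toy two-class support set with two-dimensional embeddings.
support = {
    "ml": [[1.0, 0.0], [0.8, 0.2]],
    "bio": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = class_prototypes(support)
predicted = classify([0.7, 0.3], prototypes)
```

In this sketch the metric becomes class-specific only through the weights passed in; a learned model would produce such weights from data rather than take them as input.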
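
The query-document interaction graph described for CMRMM can be sketched in a similar way. The example below is a hedged illustration under several assumptions that the abstract does not specify: co-occurrence is taken to mean that two words appear within a sliding window, query-document interaction is measured by cosine similarity between word embeddings, and the toy `embed` table and the function `build_interaction_graph` are hypothetical.

```python
import math

# Illustrative sketch only: the window rule, the cosine feature, and all
# names here are assumptions, not the thesis's definitions.


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_interaction_graph(query_tokens, doc_tokens, embed, window=2):
    """Build (nodes, edges, features) for one query-document pair.

    nodes:    unique document words, sorted for a stable ordering
    edges:    undirected pairs of node indices whose words occur at most
              `window` positions apart in the document
    features: one row per node, holding its similarity to each query word
    Every token must have an entry in `embed`.
    """
    nodes = sorted(set(doc_tokens))
    index = {word: i for i, word in enumerate(nodes)}

    edges = set()
    for i, word in enumerate(doc_tokens):
        for j in range(i + 1, min(i + window + 1, len(doc_tokens))):
            other = doc_tokens[j]
            if other != word:
                edges.add(tuple(sorted((index[word], index[other]))))

    features = [
        [cosine(embed[word], embed[q]) for q in query_tokens] for word in nodes
    ]
    return nodes, sorted(edges), features


# Toy embeddings, chosen only so the similarities are easy to check.
embed = {
    "graph": [1.0, 0.0],
    "neural": [0.8, 0.6],
    "network": [0.6, 0.8],
    "retrieval": [0.0, 1.0],
}
nodes, edges, features = build_interaction_graph(
    ["graph", "retrieval"], ["graph", "neural", "network", "graph"], embed, window=1
)
```

The resulting node list, edge list, and feature matrix are the usual inputs to a graph neural network layer; constructing the negative and positive graphs and the contrastive objective is beyond the scope of this sketch.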