| In recent years, with the rapid growth of Internet data, the number and variety of scientific and technological resources have also expanded rapidly. However, this growth in volume and categories also increases the cost of information acquisition. For science and technology enterprises and users, policies related to science and technology or to the development of their industry should also count as science and technology resources, alongside papers, patents, and similar content. The sources of such policy resources are complex and diverse, which increases the cost and difficulty of obtaining them. Extracting valuable science and technology policy resources from large volumes of mixed data and providing accurate, rapid retrieval helps reduce the cost of information acquisition and has significant social value. The main work of this thesis includes the following aspects: (1) To address the wide range of sources and the complex content and structure of policy data in multi-domain, multi-disciplinary scenarios, this thesis studies methods for acquiring policy resource data from multiple sources, designs a general acquisition and information extraction method suitable for different data sources, and, by integrating various features of policy page data, extracts text information independently of page structure, solving the problem of obtaining and processing policy resource data across multiple fields and disciplines. (2) For the mined multi-domain, multi-disciplinary science and technology policy resources, the thesis implements retrieval, query, and ranking services. The deep language model BERT is introduced and injected with policy domain knowledge through domain pre-training. BERT's input length limitation is addressed by computing relevance segment by segment and aggregating the paragraph scores. Finally, the ranked retrieval results are produced by integrating statistical relevance and semantic relevance. |
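
The segment-wise scoring and score fusion described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The window size, the stride, the use of `max` as the aggregation function, the linear weight `alpha`, and all function names are assumptions made for illustration. The `overlap_score` function is only a placeholder for a BERT relevance model, and `statistical_relevance` is a simple term-frequency stand-in for whatever statistical measure the thesis uses.

```python
from typing import Callable, Iterable, List


def split_into_segments(tokens: List[str], max_len: int = 8, stride: int = 4) -> List[List[str]]:
    """Split a token list into overlapping windows of at most max_len tokens."""
    if len(tokens) <= max_len:
        return [tokens]
    segments = []
    for start in range(0, len(tokens), stride):
        segments.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the document
    return segments


def overlap_score(query: List[str], segment: List[str]) -> float:
    """Placeholder for a BERT relevance score (assumption): share of query terms found in the segment."""
    query_terms = set(query)
    if not query_terms:
        return 0.0
    segment_terms = set(segment)
    return sum(1 for term in query_terms if term in segment_terms) / len(query_terms)


def semantic_relevance(
    query: List[str],
    doc: List[str],
    scorer: Callable[[List[str], List[str]], float] = overlap_score,
    aggregate: Callable[[Iterable[float]], float] = max,
    max_len: int = 8,
    stride: int = 4,
) -> float:
    """Score each segment separately, then aggregate the segment scores into one document score."""
    segments = split_into_segments(doc, max_len, stride)
    return aggregate(scorer(query, segment) for segment in segments)


def statistical_relevance(query: List[str], doc: List[str]) -> float:
    """Stand-in statistical score (assumption): share of document tokens that are query terms."""
    if not doc:
        return 0.0
    query_terms = set(query)
    return sum(1 for token in doc if token in query_terms) / len(doc)


def fused_score(query: List[str], doc: List[str], alpha: float = 0.5) -> float:
    """Linear fusion of statistical and semantic relevance; alpha is a hypothetical weight."""
    return alpha * statistical_relevance(query, doc) + (1 - alpha) * semantic_relevance(query, doc)


if __name__ == "__main__":
    document = "a b c d e f g h i j".split()
    query = ["a", "j"]
    print(split_into_segments(document))
    print(fused_score(query, document))
```

In practice the placeholder scorer would be replaced by a domain pre-trained BERT model applied to each query and segment pair, and the aggregation function and fusion weight would be chosen empirically.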