Design And Implementation Of Real-time Multidimensional Retrieval System For Text Data

Posted on: 2020-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: J Chai
Full Text: PDF
GTID: 2428330596981808
Subject: Computer technology
Abstract/Summary:
With the continued spread of networks and information technology, global big data has entered a period of rapid growth, with the total volume of data increasing by 50% annually. Text data from social platforms (WeChat, Weibo, etc.) accounts for the vast majority of it. Because this text is both voluminous and information-dense, conventional text retrieval methods usually do not achieve the desired results. How to retrieve massive amounts of text data effectively, so that the potential value of the data can be explored further, remains to be resolved. The industry has explored various approaches to this problem. Among them, Microsoft Concept Graph has been studied in depth and differs from traditional text-data solutions: it maps entities in text to semantic concept categories, each with an associated probability.

This thesis combines existing data crawling, text processing, and related technologies to crawl real-time text data on e-commerce poverty alleviation. Using Microsoft Concept Graph, it proposes a new way to extract dimensions from real-time text data, and uses the extracted dimension information to build a multidimensional retrieval system for text data. The work covers text data acquisition, dimension extraction, and construction of the multidimensional retrieval system:

1) A storage format specification for the text data sources is designed, and the data acquisition module is designed according to that specification and the requirements of the multidimensional retrieval system. After studying the anti-crawling measures on the Weibo and WeChat platforms, the thesis combines existing crawler technology with the Redis in-memory database, the Scrapy framework, a cloud CAPTCHA-solving platform, and other technologies, and refines the crawl time slices. The result is a robust, high-performance crawler that supports user-defined topic keywords, handles data volumes in the millions, and crawls Weibo and WeChat data in real time.

2) Using the K-means algorithm, Microsoft Concept Graph, and other technologies, the dimension information in the text data set is extracted, and a multidimensional retrieval module is built within the retrieval system. Users can combine "dimension", "time", "region", and other information to retrieve matching data sets, and can export the text data set for subsequent fine-grained, customized analysis.

To overcome the difficulty of crawling Weibo and WeChat data in real time, the thesis uses the Flask framework together with the Redis in-memory database to maintain a cookie pool, which strengthens the crawler against the platforms' anti-crawling measures, and uses Scrapy to improve crawling efficiency. Because Weibo and WeChat restrict access to their own historical data, fine-grained crawl time slices are used to achieve robust, high-volume, high-performance crawling of both platforms. To address the difficulty of extracting dimensions from ordinary text data, text clustering is combined with Microsoft Concept Graph: the topic keyword clusters that K-means produces from the text data set are fed into Microsoft Concept Graph to obtain a dimension score for each keyword cluster. The dimension information of the data set is then computed with the dimension calculation formula and used to build the multidimensional retrieval system. The method is practical and scalable, offers a new approach to multidimensional retrieval of text data, and improves the efficiency of text data retrieval.
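
The abstract mentions refining crawl time slices but does not say how. The Python sketch below is one possible reading, not the thesis's actual implementation: it assumes the platform search returns at most a fixed number of results per time window, so any window that reaches that limit is halved until every window fits. The names `refine_slices`, `count_results`, `cap`, and `min_width` are hypothetical and do not come from the thesis.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterator, Tuple

Slice = Tuple[datetime, datetime]


def refine_slices(
    start: datetime,
    end: datetime,
    count_results: Callable[[datetime, datetime], int],
    cap: int,
    min_width: timedelta = timedelta(minutes=1),
) -> Iterator[Slice]:
    """Yield half-open [lo, hi) windows in chronological order.

    Assumption: `count_results(lo, hi)` reports how many posts a search
    over the window would return, and `cap` is the largest number of
    results the platform exposes for a single query. A window is split
    in half while its count reaches the cap, unless it is already no
    wider than `min_width`.
    """
    stack = [(start, end)]
    while stack:
        lo, hi = stack.pop()
        if count_results(lo, hi) < cap or hi - lo <= min_width:
            yield lo, hi
            continue
        mid = lo + (hi - lo) / 2
        # Push the later half first so the earlier half is popped next.
        stack.append((mid, hi))
        stack.append((lo, mid))
```

Each yielded window could then be issued as a separate search request, so that dense periods are covered by many narrow windows and quiet periods by a few wide ones.
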
Keywords/Search Tags: Web Crawler, Text analysis, Multidimensional retrieval, Microsoft Concept Graph, E-commerce poverty alleviation