Domain-specific Information Directional Collection And Multidimensional Search System

Posted on: 2018-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: F Y Bai
GTID: 2348330515959773
Subject: Computer Science and Technology
Abstract/Summary:
Decision-making has traditionally relied on experience, intuition, and logic, but in the era of big data, and especially in specific domains, it increasingly relies on data. In many domains, a plethora of textual information is available on the web as news reports, blog posts, community portals, etc. Multidimensional search is a search technology based on sorting and categorization, but systematically applying these techniques to web input requires highly complex systems, ranging from crawlers, through quality assurance methods that cope with HTML input, to long pipelines of natural language processing and multidimensional search technology. In this paper, a web-oriented data collection and multidimensional search service system is designed for specific domains by combining distributed crawling, data cleaning, text analysis, and multidimensional search. The focus of this paper is to design, based on a real-life use case, an easy-to-use and scalable system for domain-specific text analysis at web scale.

The main parts of this paper are as follows. First, based on the data collection requirements of a specific domain, this paper builds a precise and general distributed crawler. Unlike a full-text crawler, this crawler must rapidly collect deep, precise, and structured data from the web. Second, a pipeline processes the crawled data, including HTML repair, duplicate checking, text segmentation, and entity recognition. Third, to make the system more interactive, it provides an information retrieval service over the collected data. In addition, multidimensional search is provided to assist the full-text retrieval service, organized according to the domain concept system.
Keywords/Search Tags:distributed crawler, information extraction, multidimensional search
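
The way multidimensional search can assist full-text retrieval may be easier to see in a small sketch. The Python below is only an illustration under stated assumptions, not the system described in the thesis: the `Document` class, the `search` function, the dimension names (`source`, `entity_type`), and the sample records are all hypothetical. It assumes each collected document already carries dimension values produced by the processing pipeline, and it stands in for full-text retrieval with a naive whole-word match.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: int
    text: str
    # Hypothetical dimension values, e.g. assigned during text analysis.
    facets: dict = field(default_factory=dict)


def matches(doc, query, selected):
    """True if every query term occurs in the text and every selected
    dimension value equals the document's value for that dimension."""
    words = doc.text.lower().split()
    if any(term.lower() not in words for term in query.split()):
        return False
    return all(doc.facets.get(dim) == value for dim, value in selected.items())


def search(docs, query, selected=None):
    """Return the matching documents plus per-dimension value counts,
    which a user interface could offer as further filters."""
    selected = selected or {}
    hits = [d for d in docs if matches(d, query, selected)]
    counts = {}
    for d in hits:
        for dim, value in d.facets.items():
            counts.setdefault(dim, Counter())[value] += 1
    return hits, counts


# Hypothetical sample data, for illustration only.
DOCS = [
    Document(1, "bank raises interest rates",
             {"source": "news", "entity_type": "organization"}),
    Document(2, "blog post about interest rates",
             {"source": "blog", "entity_type": "topic"}),
    Document(3, "community portal discusses crawler design",
             {"source": "forum", "entity_type": "topic"}),
]

if __name__ == "__main__":
    hits, counts = search(DOCS, "interest rates")
    print([d.doc_id for d in hits], counts)
    narrowed, _ = search(DOCS, "interest rates", {"source": "blog"})
    print([d.doc_id for d in narrowed])
```

The counts returned alongside the hits are what let a user narrow a keyword query one dimension at a time. A production system would more plausibly delegate both the text matching and the counting to a search engine's index rather than scanning documents in memory.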