Research and Implementation of a Web Text Mining System Based on MapReduce

Posted on: 2014-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: S H Li
Full Text: PDF
GTID: 2248330398472116
Subject: Electronic and communication engineering
Abstract/Summary:
As Internet media has matured, more and more media information is published and transmitted through this fast and inexpensive channel. As Internet applications deepen, the already very large volume of information on the network is growing explosively. Search engines help users retrieve more accurate information from web pages, but the information obtained is preliminary and broad. It does not identify the relationships among pieces of information or the entity models behind them, so further analysis and processing are still needed. One option is to draw on general network analysis, applying relationship mining and model analysis to heterogeneous Web information in order to discover potentially valuable knowledge.

This thesis studies the MongoDB distributed database and the Hadoop distributed computing framework, and designs an efficient Web news entity analysis program based on MongoDB data modeling and the Hadoop MapReduce computation framework. The main work is as follows:

1. Using XML parsing, the news data from Sogou Lab is analyzed as semi-structured data to extract the relevant information. Word segmentation of the text content is carried out on the MapReduce framework, the TF-IDF algorithm is used to compute keyword weights, and a feature representation of each text is extracted.
2. Based on the MongoDB data model and parallel processing, and combined with relational network analysis algorithms, a degree centrality algorithm is used to analyze how central a single entity node is within the entity-relationship network, in order to mine core news topics. Cohesive subgroup analysis is used to mine the close links within small groups and to build a block model between entities.
3. Using the document-oriented non-relational database MongoDB and its modeling capabilities, a data model is designed to describe text features. Combined with the Hadoop MapReduce parallel computing framework and the J2EE architecture, this completes the design and construction of a platform for the distributed storage, analysis, and computation of Web news. The results are displayed using JUNG.
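The two computations named in the main work, TF-IDF keyword weighting in a map/reduce style and degree centrality over an entity network, can be illustrated with a minimal single-process Python sketch. This is not the thesis's Hadoop code. The function names, the TF-IDF variant (term frequency normalized by document length, natural-log IDF), and the rule that two entities are linked when they co-occur in one news item are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def map_term_counts(doc_id, tokens):
    """Map phase (hypothetical): emit ((doc_id, term), 1) for every token."""
    for term in tokens:
        yield (doc_id, term), 1


def reduce_sum(pairs):
    """Reduce phase (hypothetical): sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tf_idf(docs):
    """docs: {doc_id: [tokens]} -> {(doc_id, term): weight}.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    counts = reduce_sum(
        pair
        for doc_id, tokens in docs.items()
        for pair in map_term_counts(doc_id, tokens)
    )
    # Each (doc_id, term) key is unique, so this counts documents per term.
    doc_freq = Counter(term for (_, term) in counts)
    n_docs = len(docs)
    return {
        (doc_id, term): (count / len(docs[doc_id]))
        * math.log(n_docs / doc_freq[term])
        for (doc_id, term), count in counts.items()
    }


def degree_centrality(doc_entities):
    """doc_entities: {doc_id: set of entity names} -> {entity: centrality}.

    Assumption: two entities are linked if they appear in the same document.
    Centrality is the number of neighbours divided by (n - 1).
    """
    neighbours = defaultdict(set)
    for entities in doc_entities.values():
        for entity in entities:
            neighbours[entity]  # register entities that have no links
        for a, b in combinations(sorted(entities), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {entity: 0.0 for entity in neighbours}
    return {entity: len(nb) / (n - 1) for entity, nb in neighbours.items()}


if __name__ == "__main__":
    weights = tf_idf({"d1": ["a", "b", "a"], "d2": ["b", "c"]})
    print(weights[("d1", "a")])
    print(degree_centrality({"n1": {"A", "B"}, "n2": {"B", "C"}}))
```

In a Hadoop job the map and reduce functions would run as separate distributed phases, and the resulting weights and graph could be stored as MongoDB documents. Here they are chained in memory only to show the data flow.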
Keywords/Search Tags: Web Mining, MapReduce, MongoDB, Social Network Analysis, Named Entity