
Research Of Data Acquisition On Cross-Language Public Opinion Analysis

Posted on: 2016-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: S Y S M Y L Ya
GTID: 2308330476950405
Subject: Software engineering
Abstract/Summary:
The massive number of Internet users has made online public opinion an important part of public opinion in society. Most community data on the Internet is concentrated in social networks such as microblogs, forums, and news websites, and in instant messaging tools such as QQ and WeChat. Cross-language public opinion analysis is a hot topic in intelligent information processing in China. Given the demand in China's minority areas and the surrounding countries, and the cross-language nature of how public opinion spreads, there is an urgent need to study the basic theory and key technologies of cross-language public opinion analysis. Uyghur is one of the major minority languages in China, so research on public opinion analysis for Uyghur social networks is particularly important for building a good cross-language public opinion analysis system. Obtaining online public opinion data efficiently and correctly is the most important groundwork for public opinion analysis.

This thesis covers four areas: selection of public opinion data sources, design of a focused Web crawler, design of the public opinion data collection scheme, and design of the public opinion data extraction scheme. On this basis, a public opinion data acquisition platform oriented to Uyghur social networks was designed and implemented.

Research on public opinion data acquisition from Uyghur social networks is still at an early stage. The biggest difficulty is that Uyghur microblog developers do not provide an open API, which makes acquiring Uyghur microblog data harder. Another difficulty is that the encoding and structure of Uyghur websites differ greatly from those of Chinese or English websites, so currently popular Web crawlers are not suited to collecting Uyghur social network data.

The 10 most typical Uyghur microblog, BBS, and news sites were selected as the experimental sources of public opinion data. Because different sites have different structures, and to ensure the accuracy and completeness of the final data, a user-relation-based Web crawler was used, with a dedicated crawler designed for each site. To obtain a large amount of source data, historical data also had to be collected, for which depth-first and breadth-first search were used. For data collection, an incremental collection method was used to detect data updates accurately, and a collection method based on user personalization was used to obtain specific target data from a specific site. Because of the encoding and page layout characteristics of Uyghur websites, data extraction relied on manual methods. To improve data access speed and keep the Web crawlers independent, the overall structure of the platform uses a substation-based data acquisition method. To address the lack of an API on microblog sites, this thesis proposes a user-relation-based Uyghur microblog data acquisition method. Because the three types of sites studied have clearly similar page layouts, a data acquisition method based on page layout similarity was also used.

Through this research, an efficient public opinion data acquisition platform for Uyghur social networks was implemented. From the 10 sites studied, more than 400,000 items of high-quality, high-precision data were obtained, providing both a data-capturing technique for Uyghur social network public opinion and a rich data resource for research on cross-language public opinion analysis.
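
The abstract names a user-relation-based crawl combined with breadth-first search and incremental collection, but it gives no implementation details. The following is only a minimal sketch of how those three ideas could fit together, not the thesis's actual code. The function name `crawl_by_user_relation`, the callbacks `get_followees` and `get_posts`, and the in-memory demo data are all hypothetical stand-ins for site-specific page fetching and parsing.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple

Post = Tuple[str, str]  # (post_id, text); a hypothetical record shape


def crawl_by_user_relation(
    seeds: Iterable[str],
    get_followees: Callable[[str], List[str]],  # hypothetical: users linked from a profile page
    get_posts: Callable[[str], List[Post]],     # hypothetical: posts parsed from a user's pages
    seen_post_ids: Set[str],                    # persisted between runs for incremental collection
    max_users: int = 1000,
) -> List[Post]:
    """Breadth-first walk over user relations, returning only posts not seen before."""
    visited: Set[str] = set()
    queue: deque = deque()
    for seed in seeds:
        if seed not in visited:
            visited.add(seed)
            queue.append(seed)

    new_posts: List[Post] = []
    processed = 0
    while queue and processed < max_users:
        user = queue.popleft()  # FIFO order gives breadth-first traversal
        processed += 1
        for post_id, text in get_posts(user):
            if post_id not in seen_post_ids:  # incremental step: skip already collected posts
                seen_post_ids.add(post_id)
                new_posts.append((post_id, text))
        for other in get_followees(user):
            if other not in visited:
                visited.add(other)
                queue.append(other)
    return new_posts


# Toy in-memory "site" standing in for real pages (assumed data, for illustration only).
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
POSTS = {"a": [("p1", "hello")], "b": [("p2", "news")], "c": [("p3", "forum")]}

seen: Set[str] = set()
first_run = crawl_by_user_relation(
    ["a"], lambda u: GRAPH.get(u, []), lambda u: POSTS.get(u, []), seen
)
POSTS["b"].append(("p4", "update"))  # simulate a new post appearing between runs
second_run = crawl_by_user_relation(
    ["a"], lambda u: GRAPH.get(u, []), lambda u: POSTS.get(u, []), seen
)
```

In this sketch the first run collects all three posts, while the second run returns only the post added in between, because `seen` carries the collected IDs across runs. Replacing `popleft()` with `pop()` would turn the same loop into a depth-first traversal.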
Keywords/Search Tags:Cross-language, Network Public Opinion, Web Crawler, Data acquisition, Data extraction, User Relationship, Social network