
An Improved Random Forest Parallel Classification Method and Its Application to Big Data of Telecom Operators

Posted on: 2016-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2308330473952262
Subject: Computer technology
Abstract/Summary:
Telecom operators provide telecommunication services to consumers and thereby accumulate abundant data resources. To extract valuable information from this data, a customer classification system for second-hand property intermediaries is designed and implemented on the big data of telecom operators. The system uses the improved Random Forest algorithm, the MapReduce parallel computing framework, cluster analysis and other big data processing techniques, together with mathematical statistics, complex network analysis, and web crawler technology. With this system it is feasible to identify the potential customers of real estate intermediaries among all the records and to divide them into five categories, namely tenants, landlords, property buyers, sellers, and others, which supports precision marketing.

The classification algorithm is the core of the whole system, and this paper proposes an improved random forest classification algorithm with three improvements:
(1) Mathematical and experimental evidence shows that balancing the data and increasing the sample size of the repeated sampling improve the accuracy rate.
(2) Replacing the original repeated sampling with an equivalent simple random sampling method reduces the running time of the algorithm and improves the efficiency of the system.
(3) Regression analysis is used to obtain the quantitative relationship, which is Y e.., between the degree of imbalance and the size of the repeated sampling. With this equation and the degree of imbalance in the telecom operators' big data, the best size of the repeated sampling can be found.

The system consists of four core subsystems, namely the data acquisition subsystem, the data preprocessing subsystem, the data analysis subsystem, and the feedback adjustment subsystem. The data acquisition subsystem mainly collects real estate agent information.
The data preprocessing subsystem uses parallel processing to filter out irrelevant call records and retain those of real estate brokers, and then extracts the potential customers and all their behavior information. The data analysis subsystem classifies the potential customers with the improved Random Forest algorithm. In the cold start phase, when the system has no training samples, three aids help to obtain training samples and to sort out the combinations of feature dimensions: dimension plots built with the R statistical language, interaction networks visualized with the Cytoscape software, and cluster analysis of the initial sample set. The feedback adjustment subsystem adds the labeled samples that are obtained during subsequent operation of the system, and that meet the conditions, to the training sample library. In this way the data analysis subsystem is continually adjusted, which makes the system more accurate.

When the improved Random Forest classification algorithm is applied in the customer classification system built on the big data of second-hand property intermediary operations, the generalization error rate of the system is about 21.1379%, which is 0.3895 percentage points lower than the rate without the improvements (21.5274%). Overall, the classification system based on the improved Random Forest algorithm reaches an accuracy rate of about 79%, which is valuable to real estate sales.
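The abstract does not give the sampling procedure itself, so the following is only a minimal sketch of the general idea behind improvements (1) and (2): drawing a class-balanced sample by simple random sampling without replacement, as one might do before growing each tree. The function name `balanced_subsample`, the parameter `per_class`, and the toy labels are all hypothetical and are not taken from the thesis.

```python
import random
from collections import defaultdict


def balanced_subsample(labels, per_class, rng):
    """Pick up to `per_class` indices from every class, without replacement.

    Hypothetical helper for illustration; not the thesis's algorithm.
    """
    # Group record indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = min(per_class, len(indices))
        # rng.sample returns k distinct items, i.e. simple random
        # sampling without replacement (no index is drawn twice).
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy imbalanced label set (made-up data, not from the thesis).
labels = ["other"] * 90 + ["tenant"] * 6 + ["buyer"] * 4
rng = random.Random(0)  # fixed seed so the draw is repeatable
sample = balanced_subsample(labels, per_class=4, rng=rng)

# Count how many sampled records fall into each class.
counts = {}
for i in sample:
    counts[labels[i]] = counts.get(labels[i], 0) + 1
```

In this sketch every class contributes the same number of records and no record appears twice, which is the property the keywords describe as balanced data with non-repeated sampling. How the thesis chooses the per-tree sample size from its regression equation is not recoverable from the abstract.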
Keywords/Search Tags: random forest, balanced data, non-repeated sampling, telecom operators, UML