
Distributed Computing For Large-scale Data With Group Dantzig Selector

Posted on: 2015-01-14  Degree: Master  Type: Thesis
Country: China  Candidate: L Li  Full Text: PDF
GTID: 2268330431950081  Subject: Control theory and control engineering
Abstract/Summary:
With the development of network technology and rising living standards, the exchange of information has become part of daily life. It brings convenience, but it also generates huge amounts of data, and the burden this places on centralized computing is increasingly apparent. Data are becoming the most important asset, and the ability to analyze them is becoming a core competitive advantage; work on mining massive data is driving a new wave of innovation and productivity.

Informatization is an important direction of social development, and its core is data. Whatever the external packaging may be, what flows inside is always data, and the role data play in daily life is quietly changing. Large-scale data mainly come from cloud computing, the Internet of Things, the mobile Internet, and similar sources. As society and technology develop, data processing faces bottlenecks that must be broken through, so distributed data storage and distributed computing become more and more important. This thesis studies distributed computing with the aim of improving operating efficiency.

Distributed computing is a boon for large-scale computation. It breaks the original computing task into smaller subtasks that run in parallel on separate nodes; this not only balances the computing load but also improves efficiency, meeting the demands for high speed and high throughput in the information age. The goal of distributed computing is to solve large, complex computational problems quickly. In contrast to centralized computing, it studies how to carry out computing tasks across sub-nodes.
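The split-and-combine pattern described above can be sketched in a few lines. The following is a minimal illustration using Python's standard library thread pool as a stand-in for the worker nodes (in the thesis this role is played by Spark executors); the function name `parallel_sum` and the choice of summation as the task are hypothetical examples, not part of the thesis.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, n_workers=4):
    """Split `data` into n_workers chunks, reduce each chunk on a
    separate worker, then combine the partial results."""
    chunk = (len(data) + n_workers - 1) // n_workers
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, parts))  # one subtask per chunk
    return sum(partials)                       # combine step
```

The combine step here is trivial because summation is associative; for the optimization problems treated later, the combine step is instead a consensus update across nodes.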
Distributed computing lets two or more software components share information with each other; they may run on the same computer or on multiple computers connected through a network. As the scale of computing tasks in many fields keeps growing, the performance demanded of computers grows with it: huge computational tasks often cannot be completed on a single machine, and high-performance computers are difficult to popularize because of their price. How to use distributed computing to greatly improve system performance in parallel has therefore become a hot topic in computer science.

An efficient and reasonable distributed framework should consider both task requirements and processor operation, assigning each task to a suitable processor and avoiding unnecessary processor waiting time. In a distributed system, a computing task is usually divided into several subtasks assigned to different processors in parallel, so as to shorten the task's running cycle and improve system throughput. However, subtasks often obey precedence constraints: a task cannot run until its predecessor has finished. How to assign tasks reasonably and reduce processor waiting time is therefore the key to improving system efficiency.

This thesis uses the distributed framework Spark to solve the group Dantzig selector with the Alternating Direction Method of Multipliers (ADMM), improving computing efficiency through distributed computation. Compared with traditional centralized computing, the distributed approach saves operating cost and eliminates redundant computation waiting time. The main contents of this thesis are as follows:

(1) Exploiting the piecewise-linear structure of the Dantzig selector's solution path, we derive a variant of DASSO to solve the Dantzig selector. We compare it with the linearized Alternating Direction Method of Multipliers (LADMM) and find that our method performs better.

(2) We apply ADMM and LADMM to solve the group Dantzig selector, overcoming the difficulties posed by its constraints by introducing intermediate variables that simplify the computation.

(3) We build a distributed computing platform on a server, create virtual machines, and implement the ADMM algorithm on Spark to solve the group Dantzig selector. Comparing distributed with centralized computing, we find that the distributed version performs better.

The group Dantzig selector performs well for linear regression models with group-sparse structure. A typical example is electroencephalography (EEG): EEG records the brain's spontaneous electrical activity by measuring voltage fluctuations over 64 electrodes placed on the scalp at a sampling frequency of 256 Hz. After recording features from sample subjects, we can build a linear regression model and estimate its parameters from the measured data to support later prediction. Distributed computing is of significant importance for data of ever-growing scale, which is an important starting point for this research.
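The abstract does not reproduce the formulas, but a common formulation of the group Dantzig selector is to minimize the group norm \(\sum_g \|\beta_g\|_2\) subject to \(\|X^\top(y - X\beta)\|_\infty \le \lambda\). Whatever the exact ADMM/LADMM splitting used in the thesis, two primitives recur in such updates: the proximal operator of the group norm (block soft-thresholding) and Euclidean projection onto the \(\ell_\infty\) ball enforcing the constraint. The sketch below shows these two building blocks under those assumptions; the group partition and threshold values are illustrative, not taken from the thesis.

```python
import numpy as np

def group_soft_threshold(v, groups, t):
    """Prox of t * sum_g ||v_g||_2: shrink each group of coordinates
    toward zero, and zero it out entirely if its norm is below t."""
    out = np.zeros_like(v)
    for g in groups:                      # g is a list of indices forming one group
        norm = np.linalg.norm(v[g])
        if norm > t:
            out[g] = (1.0 - t / norm) * v[g]
    return out

def project_linf_ball(z, lam):
    """Euclidean projection onto {z : ||z||_inf <= lam},
    i.e. clip every coordinate to [-lam, lam]."""
    return np.clip(z, -lam, lam)
```

In an ADMM iteration, the first operator handles the group-norm term in the primal update, while the second enforces the Dantzig constraint on the auxiliary variable; distributing the iteration then amounts to computing matrix-vector products like \(X^\top X \beta\) across Spark partitions and aggregating the results.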
Keywords/Search Tags: group Dantzig selector, large-scale data, distributed computing, alternating direction method of multipliers, linearized alternating direction method of multipliers, Spark