In recent years, ultra-dense wireless networks and the massive deployment of small base stations have been viewed as among the most important solutions to the exponential growth of wireless data traffic. To guarantee network performance while controlling expenditure, Self-Organized Network (SON) technology combined with machine learning techniques has attracted much attention. In particular, SON with reinforcement learning is especially appealing, as it reduces human involvement and interacts with the unknown environment automatically. In this work, we focus on small cell resource allocation problems based on Multi-armed Bandit (MAB) theory.

First, we study the problem of judiciously setting a small base station's (SBS's) transmit power to match its unknown deployment environment. The defining characteristic of this problem is the trade-off between sufficient indoor coverage and limited outdoor interference. We address it with stochastic bandit learning over a continuous set of arms, which avoids both the constant performance loss and the heavy initialization workload caused by too coarse or too fine a sampling of the candidate power interval. To maximize total spectral efficiency, we exploit the unimodality of the performance-indication function, which substantially accelerates the search for the globally optimal power value. Simulations are performed for single- and multiple-SBS scenarios; compared with state-of-the-art solutions, our proposed algorithms show clear performance gains.

Next, we further incorporate the inner structure of the wireless communication model and jointly select the SBS location and transmit power based on stochastic group-informative MAB theory. Using the path-loss model, the performance of the SBS can be expressed as a known function of its location and transmit power, so the performance of different power settings at the same location is governed by the same parameter. The problem can therefore be formulated as a stochastic group-informative bandit model. We propose a suitable algorithm for this model and rigorously prove its order-optimality. Simulations show that our algorithms outperform existing solutions in both static and dynamic scenarios.
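To make the unimodal-bandit idea concrete, the following is a minimal sketch, not the thesis's actual algorithm: a hill-climbing UCB search over a discretized transmit-power grid that, at each round, restricts exploration to the empirical leader and its immediate neighbours, which is how unimodality of the reward function can accelerate the search. The reward function, power grid, noise model, and all parameter values here are illustrative assumptions.

```python
import math
import random


def unimodal_bandit_power_search(reward_fn, powers, rounds=2000, seed=0):
    """Hill-climbing UCB over a discretized power grid (illustrative sketch).

    Exploits unimodality of the expected reward in the transmit power:
    each round, only the current empirical leader and its two grid
    neighbours are candidates, so the search climbs toward the peak
    instead of exploring the whole interval uniformly.
    """
    rng = random.Random(seed)
    n = len(powers)
    counts = [0] * n
    sums = [0.0] * n

    # Initialization: pull every arm once to seed the empirical means.
    for i in range(n):
        sums[i] += reward_fn(powers[i], rng)
        counts[i] = 1

    for t in range(n, rounds):
        # Empirical leader (arm with highest mean observed reward).
        leader = max(range(n), key=lambda i: sums[i] / counts[i])
        neigh = [j for j in (leader - 1, leader, leader + 1) if 0 <= j < n]

        # UCB index, restricted to the leader's neighbourhood.
        def ucb(j):
            return sums[j] / counts[j] + math.sqrt(2 * math.log(t + 1) / counts[j])

        arm = max(neigh, key=ucb)
        sums[arm] += reward_fn(powers[arm], rng)
        counts[arm] += 1

    return powers[max(range(n), key=lambda i: sums[i] / counts[i])]


# Hypothetical usage: a noisy unimodal spectral-efficiency proxy whose
# expected value peaks at 13 dBm on a 0..20 dBm grid.
grid = list(range(0, 21))
best = unimodal_bandit_power_search(
    lambda p, rng: -(p - 13) ** 2 / 10 + rng.gauss(0, 0.1), grid
)
```

Because each round touches at most three arms, the per-round work is constant in the grid size, and the unimodal structure keeps the leader's neighbourhood drifting toward the global peak rather than a local plateau.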