Yazar "Brown, Martin" seçeneğine göre listele
Item: An analysis of value function learning with piecewise linear control (Taylor & Francis Ltd, 2016) Tutsoy, Önder; Brown, Martin

Reinforcement learning (RL) algorithms attempt to learn optimal control actions by iteratively estimating a long-term measure of system performance, the so-called value function. For example, RL algorithms have been applied to walking robots to examine the connection between robot motion and the brain, which is known as embodied cognition. In this paper, RL algorithms are analysed using an exemplar test problem. A closed-form solution for the value function is calculated and represented in terms of a set of basis functions and parameters, which is used to investigate parameter convergence. The value function expression is shown to have a polynomial form, where the polynomial terms depend on the plant's parameters and the value function's discount factor. It is shown that the temporal difference error introduces a null space for the differenced higher-order basis associated with the effects of controller switching (saturated to linear control, or terminating an experiment) apart from the time of the switch. This leads to slow convergence in the relevant subspace. It is also shown that badly conditioned learning problems can occur, as a function of the value function's discount factor and the controller switching points. Finally, a comparison is performed between the residual gradient and TD(0) learning algorithms, and it is shown that the former has a faster rate of convergence for this test problem.
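A minimal sketch of the TD(0) versus residual-gradient comparison described in the abstract above, assuming a scalar unstable plant, a fixed linear control gain, a quadratic cost and a polynomial basis; all of these specifics (and the constants below) are illustrative assumptions rather than the paper's exact test problem:

import numpy as np

# Illustrative TD(0) vs. residual-gradient comparison; the plant, cost, basis and
# all constants are assumptions for the sketch, not the paper's exact setup.

rng = np.random.default_rng(0)

a, b = 1.05, 1.0      # assumed open-loop unstable scalar plant: x' = a*x + b*u
K = 0.3               # assumed fixed linear control gain: u = -K*x
gamma = 0.9           # value function discount factor
alpha = 0.01          # learning rate

def phi(x):
    """Polynomial basis, since the closed-form value function is polynomial in x."""
    return np.array([x**2, x**4])

def step(x):
    u = -K * x
    return a * x + b * u, x**2 + u**2   # next state and instantaneous cost

def learn(update, episodes=2000, horizon=30):
    w = np.zeros(2)                     # V(x) is approximated by w @ phi(x)
    for _ in range(episodes):
        x = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            x_next, cost = step(x)
            delta = cost + gamma * (w @ phi(x_next)) - w @ phi(x)   # TD error
            w += alpha * update(delta, x, x_next)
            x = x_next
    return w

# TD(0) follows the gradient of the current prediction only; the residual-gradient
# rule follows the true gradient of the squared TD error.
td0 = lambda delta, x, x_next: delta * phi(x)
residual_gradient = lambda delta, x, x_next: delta * (phi(x) - gamma * phi(x_next))

print("TD(0) parameters:            ", learn(td0))
print("residual-gradient parameters:", learn(residual_gradient))

Under these assumptions the closed-form value function is quadratic, so the x**4 weight should converge towards zero; the point of the comparison is the rate at which each rule gets there.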
Item: Chaotic dynamics and convergence analysis of temporal difference algorithms with bang-bang control (Wiley, 2016) Tutsoy, Önder; Brown, Martin

Reinforcement learning is a powerful tool used to obtain optimal control solutions for complex and difficult sequential decision making problems where only a minimal amount of a priori knowledge exists about the system dynamics. As such, it has also been used as a model of cognitive learning in humans and applied to systems, such as humanoid robots, to study embodied cognition. In this paper, a different approach is taken, where a simple test problem is used to investigate issues associated with the value function's representation and parametric convergence. In particular, the terminal convergence problem is analyzed with a known optimal control policy, where the aim is to accurately learn the value function. For certain initial conditions, the value function is explicitly calculated and shown to have a polynomial form. It is parameterized by terms that are functions of the unknown plant's parameters and the value function's discount factor, and their convergence properties are analyzed. It is shown that the temporal difference error introduces a null space associated with the finite-horizon basis function during the experiment. The learning problem is only non-singular when the experiment termination is handled correctly, and a number of (equivalent) solutions are described. Finally, it is demonstrated that, in general, the test problem's dynamics are chaotic for random initial states, and this causes digital offset in the value function learning. The offset is calculated, and a dead zone is defined to switch off learning in the chaotic region. Copyright (C) 2015 John Wiley & Sons, Ltd.
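A minimal sketch of the dead-zone idea described in the abstract above: value function updates are simply switched off inside an assumed region around the origin where the bang-bang-controlled state chatters. The scalar plant, the control amplitude and the dead-zone width are illustrative assumptions; the paper derives the dead zone from the calculated offset rather than picking it by hand:

import numpy as np

# Illustrative dead zone that switches off value function learning near the origin,
# where bang-bang control makes this sketch plant chatter. Plant, basis, amplitude
# and dead-zone width are assumptions, not the paper's derived quantities.

rng = np.random.default_rng(1)

a, b = 1.05, 1.0      # assumed scalar plant: x' = a*x + b*u
u_max = 0.5           # bang-bang control amplitude: u = -u_max * sign(x)
gamma, alpha = 0.9, 0.01
dead_zone = 0.05      # assumed width of the no-learning region around the origin

def phi(x):
    return np.array([abs(x), x**2])

w = np.zeros(2)
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    for _ in range(30):
        u = -u_max * np.sign(x)
        x_next = a * x + b * u
        if abs(x) > dead_zone:        # learn only outside the dead zone
            delta = abs(x) + gamma * (w @ phi(x_next)) - w @ phi(x)
            w += alpha * delta * phi(x)
        x = x_next

print("learned parameters:", w)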
Item: Reinforcement learning analysis for a minimum time balance problem (Sage Publications Ltd, 2016) Tutsoy, Önder; Brown, Martin

Reinforcement learning was developed to solve complex learning control problems where only a minimal amount of a priori knowledge exists about the system dynamics. It has also been used as a model of cognitive learning in humans and applied to systems, such as pole balancing and humanoid robots, to study embodied cognition. However, closed-form analysis of value function learning based on higher-order, unstable test problem dynamics has rarely been considered. In this paper, firstly, a second-order, unstable balance test problem is used to investigate issues associated with value function parameter convergence and the rate of convergence. In particular, the convergence of the minimum time value function is analysed, where the minimum time optimal control policy is assumed known. It is shown that the temporal difference error introduces a null space associated with the experiment termination basis function during the simulation. As this effect occurs due to termination, or any kind of switching in the control signal, this null space appears in the temporal difference (TD) error for more general higher-order systems. Secondly, the rate of parameter convergence is analysed, and it is shown that the residual gradient algorithm converges faster than TD(0) for this particular test problem. Thirdly, the impact of the finite horizon on both the value function and control policy learning is analysed in the case of an unknown control policy and added random exploration noise.
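A minimal sketch of the termination null space described in the abstracts above: with a finite-horizon basis term gamma**(H - k), temporal differencing cancels that term on every in-experiment transition, so it is only excited by the terminating step, and discarding that step leaves the least-squares problem singular. The second-order plant, feedback gain and basis are illustrative assumptions, not the paper's exact balance problem:

import numpy as np

# Illustrative termination null space: the finite-horizon basis entry of the
# differenced features cancels on every in-experiment transition, so only the
# terminating transition excites it. Plant, gain and basis are assumptions.

gamma, H = 0.95, 40
A = np.array([[1.0, 0.05],
              [0.5, 1.0]])           # assumed discretised unstable second-order plant
B = np.array([0.0, 0.05])
K = np.array([12.0, 4.0])            # assumed stabilising state-feedback gain

def phi(x, k):
    # quadratic state terms plus a finite-horizon basis function
    return np.array([x[0]**2, x[0]*x[1], x[1]**2, gamma**(H - k)])

def excitation(handle_termination):
    """Accumulate outer products of differenced features over many episodes."""
    rng = np.random.default_rng(2)
    M = np.zeros((4, 4))
    for _ in range(100):
        x = rng.uniform(-1.0, 1.0, size=2)
        for k in range(H):
            x_next = A @ x + B * (-K @ x)
            if k < H - 1:
                d = phi(x, k) - gamma * phi(x_next, k + 1)  # finite-horizon entry cancels (up to rounding)
            elif handle_termination:
                d = phi(x, k)                               # terminal successor value taken as zero
            else:
                break                                       # terminating step discarded
            M += np.outer(d, d)
            x = x_next
    return M

# With the terminating step discarded, the finite-horizon direction is never
# excited and M is (numerically) singular; handling termination restores full rank.
print("smallest eigenvalue, termination discarded:", np.linalg.eigvalsh(excitation(False))[0])
print("smallest eigenvalue, termination handled:  ", np.linalg.eigvalsh(excitation(True))[0])

In this sketch the "correct handling" is simply to include the terminating transition with a zero successor value; the papers describe a number of equivalent remedies.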