1992 FeudalReinforcementLearning


Subject Headings: FeUdal Network (FuN), Feudal Reinforcement Learning (FRL) System, Q-Learning, Hierarchical Reinforcement Learning System, Maze Task.

Notes

Cited By

Quotes

Abstract

One way to speed up reinforcement learning is to enable learning to happen simultaneously at multiple resolutions in space and time. This paper shows how to create a Q-Learning managerial hierarchy in which high level managers learn how to set tasks to their sub-managers who, in turn, learn how to satisfy them. Sub-managers need not initially understand their managers' commands. They simply learn to maximise their reinforcement in the context of the current command.

We illustrate the system using a simple maze task. As the system learns how to get around, satisfying commands at the multiple levels, it explores more efficiently than standard, flat, Q-learning and builds a more comprehensive map.

1 Introduction

Straightforward reinforcement learning has been quite successful at some relatively complex tasks like playing backgammon (Tesauro, 1992). However, the learning time does not scale well with the number of parameters. For agents solving rewarded Markovian decision tasks by learning dynamic programming value functions, some of the main bottlenecks (Singh, 1992b) are temporal resolution - expanding the unit of learning from the smallest possible step in the task, division-and-conquest - finding smaller subtasks that are easier to solve, exploration, and structural generalisation - generalisation of the value function between different locations. These are obviously related - for instance, altering the temporal resolution can have a dramatic effect on exploration.
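For reference, the flat Q-learning update (Watkins, 1989) on which the feudal scheme builds can be written as follows, with learning rate \alpha and discount factor \gamma (standard notation, not necessarily the paper's):

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]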

Consider a control hierarchy in which managers have sub-managers, who work for them, and super-managers, for whom they work. If the hierarchy is strict in the sense that managers control exactly the sub-managers at the level below them and only the very lowest level managers can actually act in the world, then intermediate level managers have essentially two instruments of control over their sub-managers at any time - they can choose amongst them and they can set them sub-tasks. These sub-tasks can be incorporated into the state of the sub-managers so that they in turn can choose their own sub-sub-tasks and sub-sub-managers to execute them based on the task selection at the higher level.
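One way to make this concrete: each manager below the top level keeps a value function indexed by its local state together with the command it has received from above, and its "actions" are the sub-tasks it can set. The Python sketch below is illustrative only; the names obs, command, and sub_tasks are assumptions, not taken from the paper.

from collections import defaultdict
import random

# Q-values for one (non-top) manager: indexed by the pair (local observation,
# command received from the level above); the manager's "actions" are the
# sub-tasks it can hand down to the level below.
q_table = defaultdict(float)   # ((obs, command), sub_task) -> estimated value

def choose_sub_task(obs, command, sub_tasks, epsilon=0.1):
    # Epsilon-greedy selection of a sub-task, conditioned on the command from above.
    if random.random() < epsilon:
        return random.choice(sub_tasks)
    return max(sub_tasks, key=lambda t: q_table[((obs, command), t)])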

An appropriate hierarchy can address the first three bottlenecks. Higher level managers should sustain a larger grain of temporal resolution, since they leave the sub-sub-managers to do the actual work. Exploration for actions leading to rewards can be more efficient since it can be done non-uniformly - high level managers can decide that reward is best found in some other region of the state space and send the agent there directly, without forcing it to explore in detail on the way.

Singh (1992a) has studied the case in which a manager picks one of its sub-managers rather than setting tasks. He used the degree of accuracy of the Q-values of sub-managerial Q-Learners (Watkins, 1989) to train a gating system (Jacobs, Jordan, Nowlan & Hinton, 1991) to choose the one that matches best in each state. Here we study the converse case, in which there is only one possible sub-manager active at any level, and so the only choice a manager has is over the tasks it sets. Such systems have been previously considered (Hinton, 1987; Watkins, 1989).

The next section considers how such a strict hierarchical scheme can learn to choose appropriate tasks at each level, section 3 describes a maze learning example for which the hierarchy emerges naturally as a multi-grid division of the space in which the agent moves, and section 4 draws some conclusions.

2 Feudal Control

We sought to build a system that mirrored the hierarchical aspects of a feudal fiefdom, since this is one extreme for models of control. Managers are given absolute power over their sub-managers - they can set them tasks and reward and punish them entirely as they see fit. However managers ultimately have to satisfy their own super-managers, or face punishment themselves - and so there is recursive reinforcement and selection until the whole system satisfies the goal of the highest level manager. This can all be made to happen without the sub-managers initially "understanding" the sub-tasks they are set. Every component just acts to maximise its expected reinforcement, so after learning, the meaning it attaches to a specification of a sub-task consists of the way in which that specification influences its choice of sub-sub-managers and sub-sub-tasks. Two principles are key:

Reward Hiding: Managers must reward sub-managers for doing their bidding whether or not this satisfies the commands of the super-managers. Sub-managers should just learn to obey their managers and leave it up to them to determine what it is best to do at the next level up. So if a sub-manager fails to achieve the sub-goal set by its manager it is not rewarded, even if its actions result in the satisfaction of the manager's own goal. Conversely, if a sub-manager achieves the sub-goal it is given it is rewarded, even if this does not lead to satisfaction of the manager's own goal. This allows the sub-manager to learn to achieve sub-goals even when the manager was mistaken in setting these sub-goals. So in the early stages of learning, low-level managers can become quite competent at achieving low-level goals even if the highest level goal has never been satisfied. (A minimal code sketch of this reward rule follows this list.)

Information Hiding: Managers only need to know the state of the system at the granularity of their own choices of tasks. Indeed, allowing some decision making to take place at a coarser grain is one of the main goals of the hierarchical decomposition. Information is hidden both downwards - sub-managers do not know the task the super-manager has set the manager - and upwards - a super-manager does not know what choices its manager has made to satisfy its command. However managers do need to know the satisfaction conditions for the tasks they set and some measure of the actual cost to the system for achieving them using the sub-managers and tasks they picked on any particular occasion.
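As an editorial illustration of reward hiding (not code from the paper): the sub-manager's reinforcement depends only on whether it achieved the sub-goal it was set, and deliberately ignores whether the manager's own goal was satisfied. Names are hypothetical.

def sub_manager_reward(sub_goal_achieved, manager_goal_achieved):
    # Reward hiding: judge the sub-manager solely on the sub-goal it was set.
    # Whether the manager's own goal happened to be satisfied is handled one
    # level up, so it is deliberately ignored here.
    del manager_goal_achieved
    return 1.0 if sub_goal_achieved else 0.0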

For the special case to be considered here, in which managers are given no choice of which sub-manager to use in a given state, their choice of a task is very similar to that of an action for a standard Q-Learning system. If the task is completed successfully, the cost is determined by the super-manager according to how well (e.g. how quickly, or indeed whether) the manager satisfied its super-tasks. Depending on how its own task is accomplished, the manager rewards or punishes the sub-manager responsible. When a manager chooses an action, control is passed to the sub-manager and is only returned when the state changes at the managerial level.
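Read as pseudocode, one plausible rendering of this paragraph is the loop below: the manager picks a task (its "action") for its single sub-manager, control stays at the lower level until the state changes at the manager's granularity, and the manager then applies an ordinary Q-learning update to the cost it incurred. All names here (coarse_state, run_sub_manager, and the constants) are illustrative assumptions, not the paper's implementation.

from collections import defaultdict
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
q = defaultdict(float)                 # (coarse_state, task) -> estimated value

def manager_step(coarse_state, tasks, run_sub_manager):
    # One managerial decision. run_sub_manager(coarse_state, task) is assumed to
    # drive the level below until the state changes at this manager's granularity,
    # returning the new coarse state and the cost accumulated on the way.
    if random.random() < EPSILON:                      # explore over tasks
        task = random.choice(tasks)
    else:                                              # exploit current Q-values
        task = max(tasks, key=lambda t: q[(coarse_state, t)])

    next_state, cost = run_sub_manager(coarse_state, task)

    # Standard Q-learning update, with the chosen task playing the role of an
    # action and cost treated as negative reward.
    target = -cost + GAMMA * max(q[(next_state, t)] for t in tasks)
    q[(coarse_state, task)] += ALPHA * (target - q[(coarse_state, task)])
    return next_state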

3 The Maze Task

4 Discussion

Acknowledgements

References

BibTeX

@inproceedings{1992_FeudalReinforcementLearning,
  author    = {Peter Dayan and
               Geoffrey E. Hinton},
  editor    = {Stephen Jose Hanson and
               Jack D. Cowan and
               C. Lee Giles},
  title     = {Feudal Reinforcement Learning},
  booktitle = {Proceedings of Advances in Neural Information Processing Systems 5 (NIPS 1992)},
  address   = {Denver, Colorado, USA},
  date      = {November 30 -- December 3, 1992},
  pages     = {271--278},
  publisher = {Morgan Kaufmann},
  year      = {1992},
  url       = {http://papers.nips.cc/paper/714-feudal-reinforcement-learning},
}


Author: Geoffrey E. Hinton, Peter Dayan
Title: Feudal Reinforcement Learning
Year: 1992