Inverse Reinforcement Learning Task

From GM-RKB
(Redirected from plan recognition)

An Inverse Reinforcement Learning Task is an apprenticeship learning task that ...



References

2019

  • (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Apprenticeship_learning#Via_inverse_reinforcement_learning Retrieved:2019-2-14.
    • Inverse reinforcement learning (IRL) is the process of deriving a reward function from observed behavior. While ordinary "reinforcement learning" involves using rewards and punishments to learn behavior, in IRL the direction is reversed, and a robot observes a person's behavior to figure out what goal that behavior seems to be trying to achieve. The IRL problem can be defined as: [1]

      Given 1) measurements of an agent's behaviour over time, in a variety of circumstances; 2) measurements of the sensory inputs to that agent; 3) a model of the physical environment (including the agent's body): Determine the reward function that the agent is optimizing.

      IRL researcher Stuart J. Russell proposes that IRL might be used to observe humans and attempt to codify their complex "ethical values", in an effort to create "ethical robots" that might someday know "not to cook your cat" without needing to be explicitly told. The scenario can be modeled as a "cooperative inverse reinforcement learning game", where a "person" player and a "robot" player cooperate to secure the person's implicit goals, despite these goals not being explicitly known by either the person or the robot. [2] In 2017, OpenAI and DeepMind applied deep learning to cooperative inverse reinforcement learning in simple domains such as Atari games and straightforward robot tasks such as backflips. The human role was limited to answering queries from the robot as to which of two different actions was preferred. The researchers found evidence that the techniques may be economically scalable to modern systems. [3]

      Apprenticeship via inverse reinforcement learning (AIRP) was developed in 2004 by Pieter Abbeel, Professor in Berkeley's EECS department, and Andrew Ng, Associate Professor in Stanford University's Computer Science Department. AIRP deals with a "Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform". AIRP has been used to model reward functions of highly dynamic scenarios where no reward function is intuitively obvious. Take the task of driving, for example: many different objectives operate simultaneously, such as maintaining a safe following distance, keeping a good speed, and not changing lanes too often. This task may seem easy at first glance, but a trivial reward function may not converge to the desired policy.

      One domain where AIRP has been used extensively is helicopter control. While simple trajectories can be intuitively derived, the approach has also succeeded on complicated tasks such as aerobatics for shows. These include aerobatic maneuvers such as in-place flips, in-place rolls, loops, hurricanes, and even auto-rotation landings. This work was developed by Pieter Abbeel, Adam Coates, and Andrew Ng in "Autonomous Helicopter Aerobatics through Apprenticeship Learning". [4]
  1. Parr, R., & Russell, S. J. (1998). Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems (pp. 1043-1049).
  2. Hadfield-Menell, D., Russell, S. J., Abbeel, P., & Dragan, A. (2016). Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems (pp. 3909-3917).
  3. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (pp. 4302-4310).
  4. Abbeel, P., Coates, A., & Ng, A. (2010). Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 29(13).
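
The IRL problem statement quoted above (observe behaviour, then determine the reward being optimized) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the three-state chain, the one-hot `features` function, the discount factor, and the hand-written trajectories are hypothetical and do not come from any of the cited papers. The sketch performs only a single feature-matching step, in the spirit of apprenticeship learning via IRL: it compares discounted feature expectations of an expert and a learner, and uses their normalized difference as linear reward weights. It is not a full IRL algorithm.

```python
from math import sqrt

GAMMA = 0.9      # discount factor (assumed for this toy example)
N_STATES = 3     # hypothetical three-state chain: states 0, 1, 2


def features(state):
    """One-hot feature vector phi(s) for the hypothetical chain."""
    return [1.0 if state == i else 0.0 for i in range(N_STATES)]


def feature_expectations(trajectories, gamma=GAMMA):
    """Empirical discounted feature expectations, averaged over trajectories."""
    mu = [0.0] * N_STATES
    for traj in trajectories:
        for t, state in enumerate(traj):
            phi = features(state)
            for i in range(N_STATES):
                mu[i] += (gamma ** t) * phi[i]
    return [m / len(trajectories) for m in mu]


def reward_weights(mu_expert, mu_learner):
    """Unit-norm weights pointing from the learner's statistics toward the expert's."""
    diff = [e - l for e, l in zip(mu_expert, mu_learner)]
    norm = sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff] if norm > 0 else diff


# Hypothetical demonstrations: the expert walks to state 2 and stays there,
# while the current learner never leaves state 0.
expert_trajs = [[0, 1, 2, 2], [0, 1, 2, 2]]
learner_trajs = [[0, 0, 0, 0]]

mu_e = feature_expectations(expert_trajs)
mu_l = feature_expectations(learner_trajs)
w = reward_weights(mu_e, mu_l)


def reward(state):
    """Estimated linear reward R(s) = w . phi(s)."""
    return sum(wi * fi for wi, fi in zip(w, features(state)))


for s in range(N_STATES):
    print(s, round(reward(s), 3))
```

Under these assumptions the estimated reward ranks state 2 above state 1 above state 0, which matches what the expert's behaviour suggests. A complete method would alternate this step with re-solving for a policy under the new reward and repeating until the learner's feature expectations are close to the expert's.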

2011