SE367 HW 2
Group 2
Hemangini
Individual report
Instruct a robot to write with a pencil

One theory of how animals acquire complex behavior is that they learn to obtain rewards and to avoid punishments. Among the available computational models, reinforcement learning aims to capture this kind of behavior.

A robot that can learn and improve its performance over repeated training episodes is preferable to one programmed with a fixed algorithm, because it is difficult for a human to describe the task precisely enough for an adequate algorithm to be written out by hand.

Reinforcement learning follows the framework of Markov decision processes (MDPs). In this framework, the agent and environment interact in a sequence of discrete time steps, t = 0, 1, 2, 3, ... On each step, the agent perceives the environment to be in a state, s_t, and selects an action, a_t. In response, the environment makes a stochastic transition to a new state, s_{t+1}, and stochastically emits a numerical reward, r_{t+1} (see figure below). The agent seeks to maximize the reward it receives in the long run. For example, the most common objective is to choose each action a_t so as to maximize the expected discounted return:

                                                          E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … }

where γ is a discount-rate parameter, 0 ≤ γ ≤ 1, akin to an interest rate in economics.

           Fig: The reinforcement learning framework
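
To make this loop concrete, the following Python sketch runs the agent-environment interaction and computes the discounted return defined above. The env/agent interfaces (reset, step, select_action, update) and the discount rate of 0.9 are illustrative assumptions, not something specified in this report.

def discounted_return(rewards, gamma=0.9):
    # One-sample estimate of E{r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def run_episode(env, agent, steps=100, gamma=0.9):
    # Agent-environment interaction over discrete time steps t = 0, 1, 2, ...
    state = env.reset()
    rewards = []
    for _ in range(steps):
        action = agent.select_action(state)   # agent picks a_t from s_t
        state, reward = env.step(action)      # environment emits s_{t+1}, r_{t+1}
        agent.update(state, reward)           # learning update (e.g. a TD rule)
        rewards.append(reward)
    return discounted_return(rewards, gamma)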

As an example of a Markov decision process, consider a hypothetical experiment in which a rat presses levers to obtain food in a cage with a light. Suppose that if the light is off, pressing lever A turns the light on with a certain probability, and pressing lever B has no effect. When the light is on, pressing lever A has no effect, but pressing lever B delivers food with a certain probability and turns the light off again. In this simple environment there are only two relevant states: light on and light off. Lever A may cause a transition from light off to light on; in light on, lever B may yield a reward. The only information the rat needs to decide what to do is whether the light is on or off. The optimal policy is simple: in light off, press lever A; in light on, press lever B.
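
The rat-and-lever example can be written down as a tiny two-state MDP and solved with tabular Q-learning, as in the Python sketch below. The transition probabilities, learning rate, discount rate, and exploration rate are assumed values chosen only for illustration (the text says just "a certain probability"); after enough simulated presses the greedy policy recovers the optimal one stated above.

import random

P_LIGHT_ON = 0.8   # assumed: pressing lever A in "light off" turns the light on
P_FOOD = 0.7       # assumed: pressing lever B in "light on" delivers food

def step(state, action):
    # Return (next_state, reward) for states 'off'/'on' and actions 'A'/'B'.
    if state == "off" and action == "A" and random.random() < P_LIGHT_ON:
        return "on", 0.0
    if state == "on" and action == "B" and random.random() < P_FOOD:
        return "off", 1.0      # food delivered; the light turns off again
    return state, 0.0          # every other press has no effect

Q = {(s, a): 0.0 for s in ("off", "on") for a in ("A", "B")}
alpha, gamma, epsilon = 0.1, 0.9, 0.1
state = "off"
for t in range(10000):
    if random.random() < epsilon:
        action = random.choice(("A", "B"))                      # explore
    else:
        action = max(("A", "B"), key=lambda a: Q[(state, a)])   # exploit
    next_state, reward = step(state, action)
    best_next = max(Q[(next_state, a)] for a in ("A", "B"))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

# Greedy policy per state; expected output: {'off': 'A', 'on': 'B'}
print({s: max(("A", "B"), key=lambda a: Q[(s, a)]) for s in ("off", "on")})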

We could design a reinforcement learning algorithm for the robot to make it capable of learning and improving its efficiency. Consider the subtask of picking up the pencil. To make the robot learn movements that are fast and smooth, it will have to control the motors directly and have low-latency information about the current positions and velocities of the mechanical linkages. The actions in this case might be the voltages applied to each motor at each joint, and the states might be the latest readings of joint angles and velocities. The reward might be +1 every time the pencil is successfully picked up. To encourage smooth movements, a small negative reward can be given on each time step as a function of the moment-to-moment 'jerkiness' of the motion. We draw the agent-environment interface in such a way that the task-specific parts of the body are considered to be outside the agent: these parts may be internal to the robot but are external to the learning agent. It is therefore better to place the boundary of the learning agent not at the limit of its physical body but at the limit of its control.
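
As a rough illustration of this reward design, the Python sketch below computes the per-step reward from a success flag and the change in joint velocities. The penalty weight, the time step, and the use of velocity change as a crude proxy for 'jerkiness' are assumptions made for this example only.

import numpy as np

JERK_PENALTY = 0.01   # assumed weight on the smoothness penalty

def reward(picked_up, prev_velocities, velocities, dt=0.02):
    # +1 when the pencil is successfully picked up, plus a small negative
    # reward that grows with abrupt changes in the joint velocities.
    jerkiness = np.sum(np.abs(velocities - prev_velocities)) / dt
    return (1.0 if picked_up else 0.0) - JERK_PENALTY * jerkiness

# Example: no pickup this step, slightly jerky motion -> small negative reward
r = reward(False, np.array([0.10, 0.00, 0.20]), np.array([0.30, 0.05, 0.20]))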

Since writing with a pencil involves explicit and verbal reasoning to a large extent, a symbolic model would be a better choice than a connectionist model, which is instead used to capture cognition at the microstructural level.

Symbolic Model: The computer has a mechanism that can exhibit indefinitely flexible, complex, responsive, and task-oriented behavior; we observe such flexible and adaptive behavior in humans in seemingly limitless abundance and variety. The architecture that evolves out of this observation is called a symbolic architecture. As depicted in the algorithm (refer to the group report), the robot could be trained by feeding it an initial set of symbols such as the I affector, M affector, Hook function, and Move function, and it could then process this information to generate new representations and new operations on those representations. Once the robot has a basic knowledge of these symbols, it could also break down the task of writing into functions using this basic skeleton of symbols, as sketched below.
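
The following Python sketch gives a very loose picture of this idea: primitive symbols such as Hook and Move are wrapped as functions, and the writing task is decomposed on top of them. The function names, arguments, and the stroke representation are hypothetical and only loosely echo the symbols named in the group report.

def hook(obj):
    # Hook function: grasp the object (here, the pencil) with the end affector.
    print("hooking", obj)

def move(affector, target):
    # Move function: move the named affector to a target (x, y) position.
    print("moving", affector, "to", target)

def write_stroke(stroke):
    # A new representation built from the primitives: one pen stroke.
    for point in stroke:
        move("M affector", point)

def write_letter(letter, strokes):
    # Decompose writing a letter into grasping the pencil and drawing strokes.
    hook("pencil")
    for stroke in strokes:
        write_stroke(stroke)

# Example: a crude letter 'L' as two strokes of (x, y) way-points.
write_letter("L", [[(0, 2), (0, 0)], [(0, 0), (1, 0)]])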

References:

•	MIT CogNet Library: http://cognet.mit.edu/library/erefs/mitecs/sutton.html
•	MIT CogNet Library: http://cognet.mit.edu/library/books/mitpress/0262161125/cache/chap3.pdf
•	http://www.gatsby.ucl.ac.uk/~dayan/papers/dw01.pdf