Learning To Play Table Tennis - A Reinforcement Learning Approach
Course project for ME768
Instructor: Dr. Amitabh Mukherjee
Students: Sudipta N. Sinha, Anshuman Rai
Contents
  1. Introduction - aim & objectives
  2. Past Work
  3. Motivation
  4. Methodology
  5. Expected Results
  6. References
  7. Online Resources
Introduction

Aim of the project -
 

In this course project we propose to build a simulator for the table-tennis game, with the main focus on designing various approaches for returning the ball to the other side of the table, using reinforcement learning to make the players learn the shots in the virtual table-tennis environment that we will simulate. The simulator will provide an alternative to experimenting with real robots by modelling virtual robot players controlled by neural networks.

Back to Contents

Why Reinforcement Learning ?
 
 
Reinforcement Learning has been used for control problems like elevator dispatching and dynamic channel allocation, and for strategy games like backgammon and checkers, with very large state spaces of the order of 10^20. An alternative form of learning, supervised learning, learns from examples provided by an external agent, but by itself it is not enough for learning from interaction. In interactive problems such as ours it is impractical to obtain examples of desired behaviour that are both accurate and representative of all the situations in which the agent has to act. In uncharted territory, where one would expect learning to be most useful, an agent must be able to learn from its own experience. A game like table tennis (or a similar racquet sport) provides an interesting mix of control problems and strategy problems, making the task of developing good players quite challenging.
Past Work -

 A Table-Tennis Simulator for Neural Network Players -
 

Some illustrative work has been done by D. d'Aulignac, A. Moschovinos and S. Lucas in building a 2D virtual table-tennis simulator at Vase Labs, Univ. of Essex, with the aim of holding a table-tennis tournament in which different players (robot controllers) could play against each other. They chose a 2D simulation, designed neural network players, and trained them using training sets generated by another program (the algorithmic controller). The simulator they initially built used a multilayer perceptron (MLP) architecture, and comparisons were later made with a radial basis function (RBF) architecture. A second approach, which they suggested but did not implement, is the use of modular neural networks: the task is decomposed into smaller sub-tasks, each handled by a specialist network.
  The Acrobot
 
Reinforcement Learning has been applied to the task of simulating a gymnast, modelled as a two-link robot arm (the Acrobot) that learns to swing up on a high bar. The system has been widely studied by control engineers (e.g. Spong, 1994) and machine learning researchers (e.g. DeJong and Spong, 1994; Boone, 1997). The learning algorithm used was Sarsa(lambda) with linear function approximation, tile coding, and replacing traces, with a separate set of tilings for each action. Although this project has nothing to do with table tennis, the task is modelled as a Markov Decision Problem (MDP), and the reinforcement learning method relies on linear function approximation and tile coding. This introduces the key issue of generalisation: how experience with a limited subset of the state space is generalised to produce a good approximation over a much larger subset. This issue is also important for our table-tennis agent.
 

Back to Contents

Motivation
 
 
Perhaps the greatest motivation for this project is that little work has been done in this area, although similar problems have been tackled. The simulation can also be a precursor to solutions for other aspects of the problem, namely vision and robotics; by integrating those solutions we might eventually get to see an actual robot playing table tennis. By building the simulator we create a framework in which we can test appropriate algorithms for controlling a robot without going into the physical aspects. Moreover, racquet sports like tennis, badminton and squash, and even games like baseball, are essentially the same in principle as far as reaching the ball is concerned; the differences lie in the performance measure of a shot and other physical aspects. With some changes in parameters, and without modifying the basic framework, the simulator can therefore model other sports. By incorporating human-like physical constraints on racket motion, the simulation can also provide insights into the game of table tennis itself.
 

Back to Contents

Methodology
 
The physics of the real game are simulated in a simplified form, adapted to the requirements of a real-time virtual environment.
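
As a rough illustration of the kind of real-time physics involved, the sketch below integrates the ball's flight under gravity with a simple bounce off the table; the constants, the choice of y as the vertical axis, and the function name are illustrative assumptions, not values taken from the actual simulator.

    import numpy as np

    # Minimal sketch of one simulation timestep for the ball, assuming simple
    # projectile motion plus an inelastic bounce off the table surface.  All
    # constants here are placeholders, not the simulator's actual values.

    G = np.array([0.0, -9.8, 0.0])   # gravity along -y (assumed vertical axis)
    TABLE_HEIGHT = 0.0               # table surface at y = 0
    RESTITUTION = 0.9                # fraction of vertical speed kept on a bounce
    DT = 0.02                        # timestep in seconds

    def step_ball(pos, vel):
        """Advance the ball's position and velocity by one timestep (Euler integration)."""
        vel = vel + G * DT
        pos = pos + vel * DT
        # bounce off the table if the ball has crossed the surface while moving down
        if pos[1] < TABLE_HEIGHT and vel[1] < 0.0:
            pos[1] = TABLE_HEIGHT
            vel[1] = -RESTITUTION * vel[1]
        return pos, vel

    # example: advance one frame of ball flight (values are arbitrary)
    pos, vel = np.array([6.5, 0.3, 1.9]), np.array([-5.0, -1.0, -0.5])
    pos, vel = step_ball(pos, vel)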
 
Approach I
 

Underlying Assumptions: We assume that the velocity vector and the coordinates of the bat are independent of each other; under this assumption we can decompose the problem of determining the coordinates and the velocity of the bat into two separate and unrelated parts.

The two modules are as follows:

where,

S is the problem state
A is the action set for the state S
X, Y, Z are coordinates
V is velocity
The superscripts b and r refer to the ball and racket respectively
The subscripts denote the components
Terms superscripted with ' are determined by the simulator
Terms superscripted with * are determined by the agent's action
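
To make the notation concrete, the sketch below represents such a problem state as a small data structure; the field names, and the assignment of the primed (simulator) and starred (agent) marks to the ball and racket respectively, are illustrative guesses rather than the project's exact definitions.

    from dataclasses import dataclass
    import numpy as np

    # Illustrative container for the problem state S in the notation above.
    # Superscript b quantities describe the ball, superscript r the racket; here
    # the ball terms are taken as the primed (simulator-determined) quantities
    # and the racket terms as the starred (agent-determined) ones.

    @dataclass
    class State:
        ball_pos: np.ndarray     # X^b = (X_x, X_y, X_z), from the simulator (')
        ball_vel: np.ndarray     # V^b, from the simulator (')
        racket_pos: np.ndarray   # X^r, set through the agent's action (*)
        racket_vel: np.ndarray   # V^r, set through the agent's action (*)

    # example state (values taken from Frame No. 1 in Expected Results)
    s = State(ball_pos=np.array([6.50, 8.25, 1.90]),
              ball_vel=np.array([-50.00, -2.98, -1.00]),
              racket_pos=np.zeros(3),
              racket_vel=np.zeros(3))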


 
 
  • Determining the coordinates of the bat at the point of interception: this module uses a Q-learning network; from the problem state S the agent selects actions from the set A that move the bat so that it intercepts the incoming ball (this intercept-finding phase is illustrated by the frames in the Expected Results section).
  • Determining the velocity/force of the bat at the point of interception: this phase again uses a neural network and a training set (described below, with a sketch after this list). The input to the neural network is the interception point, and the output is the velocity of the bat that will return the ball to the other side of the table. The input/output pairs (the training set) are generated by a separate program that scores shots against a particular performance measure.
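
The following is a minimal sketch of how the second module could be trained, assuming a small MLP regressor that maps an interception point to a returning bat velocity; the network size, learning rate and the random placeholder training pairs are illustrative only, since the real pairs would come from the algorithmic controller and its performance measure.

    import numpy as np

    # Sketch of module two: an MLP mapping the interception point (x, y, z) to a
    # bat velocity (vx, vy, vz) that returns the ball.  The training pairs below
    # are random placeholders; in the project they would be generated by the
    # separate algorithmic controller using a particular performance measure.

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(500, 3))   # interception points (placeholder)
    Y = rng.uniform(-1.0, 1.0, size=(500, 3))   # target bat velocities (placeholder)

    n_hidden, lr = 16, 0.05
    W1 = rng.normal(0.0, 0.1, (3, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, 3)); b2 = np.zeros(3)

    for epoch in range(500):                    # plain batch gradient descent
        H = np.tanh(X @ W1 + b1)                # hidden layer
        pred = H @ W2 + b2                      # linear output layer
        err = pred - Y                          # prediction error
        # backpropagate the mean squared error
        dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
        dH = (err @ W2.T) * (1.0 - H ** 2)
        dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

    def bat_velocity(intercept_point):
        """Predict the bat velocity for a given interception point."""
        h = np.tanh(intercept_point @ W1 + b1)
        return h @ W2 + b2

    print(bat_velocity(np.array([0.5, 0.2, -0.1])))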
Approach II

In this case, at each timestep the action set comprises increments in the velocity components of the bat; together with the coordinates of the bat in the previous state, these determine the next state. Each state is characterised by the ball position, ball velocity, bat position and bat velocity. Here also we use a Q-learning network, with a different reward function R:

    R(s) = large negative value,  if the ball is missed altogether
    R(s) = r(x, z),               if the ball is returned, where (x, z) is the point where the ball lands on the table (see fig.)

Here the trajectory of the bat will be continuous, because the increments in the position coordinates depend on the increments in the velocity components.
     


     
where,

S is the problem state
A is the action set for the state S
X, Y, Z are coordinates
V is velocity
The superscripts b and r refer to the ball and racket respectively
The subscripts denote the components
Terms superscripted with ' are determined by the simulator
Terms superscripted with * are determined by the agent's action
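
As a concrete illustration of the update just described, the sketch below applies a standard Q-learning backup with the reward shape above; the coarse state discretisation, the action set of velocity increments, the table dimensions and the particular form of r(x, z) are all illustrative assumptions rather than the project's actual choices.

    import numpy as np

    # Sketch of the Approach II learning rule.  Actions are small increments in
    # the bat's velocity components; the reward is a large negative value for
    # missing the ball and r(x, z) for a returned ball, where (x, z) is the
    # landing point on the table.  All quantities below are placeholders.

    MISS_PENALTY = -100.0
    ACTIONS = [(axis, dv) for axis in (0, 1, 2) for dv in (-0.5, +0.5)]  # velocity increments

    def r_landing(x, z, table_length=9.0, table_width=5.0):
        """Placeholder r(x, z): favours balls landing deep on the opponent's half."""
        depth = float(np.clip(x / (table_length / 2.0), 0.0, 1.0))
        centred = 1.0 - min(abs(z) / (table_width / 2.0), 1.0)
        return depth + centred

    def reward(missed, landing_xz=None):
        return MISS_PENALTY if missed else r_landing(*landing_xz)

    N_STATES = 1000                      # coarse discretisation of the (ball, bat) state
    Q = np.zeros((N_STATES, len(ACTIONS)))

    def q_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
        """One Q-learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

    # example backup for a returned ball that lands at (x, z) = (3.2, 0.4)
    q_update(s=42, a=1, r=reward(False, (3.2, 0.4)), s_next=57)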

    Back to Contents


    Expected Results

     
The results of Approach I should be more predictable, especially those of the second module, because it does not involve Q-learning.

    Approach I

The following images show frames captured, with the corresponding state representation, during the phase of finding the intercept point between the ball and the bat. The motion of the bat is shown in dotted lines (our guesses).
Frame No. 1:  ball position = (6.50, 8.25, 1.90)    ball velocity = (-50.00, -2.98, -1.00)

Frame No. 2:  ball position = (2.86, 7.98, 1.83)    ball velocity = (-22.69, -2.33, -0.45)

Frame No. 3:  ball position = (-0.88, 6.46, 1.96)   ball velocity = (-13.53, -5.92, -0.71)

Frame No. 4:  ball position = (-6.26, 5.85, 1.59)   ball velocity = (-5.75, 1.93, -0.14)
     
    Approach II

In this case the motion of the bat will be continuous. We also expect some creative solutions here, for instance the one illustrated below: the approach will favour human-like solutions (shots) with a back-swing, i.e., a better prepared player will play a better shot.

     

     
    Back to Contents
    References
     
  1. C. W. Anderson and Zhaohui Hong, Reinforcement Learning with Modular Neural Networks for Control, Colorado State University, Fort Collins, CO 80523, anderson@cs.colostate.edu.
  2. D. d'Aulignac, A. Moschovinos, and S. Lucas, Virtual table tennis and the design of neural network players, Dept. of Electronic Systems Engineering, Univ. of Essex, Colchester, CO4 3SQ, UK, sml@essex.ac.uk.
  3. C. W. Anderson, Q-Learning with Hidden-Unit Restarting, Dept. of Computer Science, Colorado State University, Fort Collins, CO 80523, anderson@cs.colostate.edu.
    Back to Contents
    Links to Resources:
     
     
  •  Back to top
  •  Back to index
