Learning to act using real-time dynamic programming
Artificial Intelligence
Andrew G. Barto
Steven J. Bradtke
A stochastic approximation method with max-norm projections and its applications to the Q-learning algorithm