Temporal Difference Methods in Machine Learning

This is a summary of the paper Sutton (1988), "Learning to predict by the methods of temporal differences." The paper provides a complete discussion of temporal difference methods for the learning-to-predict task, which takes observations and tries to predict outcomes from those observations, much like a classification problem. This summary borrows many ideas from Tasha's presentation and centers around the comparison with the supervised learning method.

First, a clarification of what temporal difference methods are and how they differ from supervised learning. One main difference, which is also the benefit of temporal difference, is that supervised learning cannot update the prediction at each time step; it must wait until the very end, when the actual outcome is known. A temporal difference method, by contrast, can update the prediction incrementally as soon as it reaches the next step, so it is beneficial in terms of both the amount of computation and the storage space. Note that despite the differences mentioned above, we can show that the results of temporal difference methods and supervised learning methods are generally the same under specific constructions.

Here are some necessary notations and formalizations. Say we have multi-step prediction problems with an observation-outcome sequence $x_1, x_2, \ldots, x_m, z$ where $x_t$ is a vector of observations available at time $t$ and the scalar $z$ is the outcome. The learner produces a sequence of predictions estimating $z$: $P_1, P_2, \ldots, P_m$ where $P_t \stackrel{\text{def}}{=} P(x_t, w)$ and $w$ is a vector of modifiable weights. The goal of learning is to correctly update $w$ by determining $\Delta w_t$, an increment to $w$ from each observation:

$$w \leftarrow w + \sum_{t=1}^{m} \Delta w_t \tag{1}$$

Generally speaking, in supervised learning the update is:

$$\Delta w_t = \alpha \left(z - P_t\right) \nabla_w P_t \tag{2}$$

where $\nabla_w P_t$ is the vector of partial derivatives of $P_t$ with respect to each component of $w$.

And if we concentrate on the special case where $P_t$ is a linear function of $x_t$ and $w$ (the Widrow-Hoff procedure):

$$\Delta w_t = \alpha \left(z - w^T x_t\right) x_t$$

Note that for the supervised learning method, every increment depends on the final outcome $z$, so all the $\Delta w_t$ must be held until the end of the sequence.
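To make this concrete, here is a minimal Python sketch of the batch supervised rule in the linear (Widrow-Hoff) case. The function name `widrow_hoff_update` and the NumPy representation are illustrative choices, not from the paper:

```python
import numpy as np

def widrow_hoff_update(w, xs, z, alpha):
    """One pass of the supervised (Widrow-Hoff) rule over a sequence.

    Every increment depends on the final outcome z, so the summed
    deltas can only be applied after the sequence ends (equation (1)).
    """
    delta = np.zeros_like(w)
    for x_t in xs:                        # xs holds x_1, ..., x_m
        P_t = w @ x_t                     # linear prediction P_t = w^T x_t
        delta += alpha * (z - P_t) * x_t  # linear case of equation (2)
    return w + delta
```

Note that the deltas are accumulated against a fixed $w$ and applied once at the end, following equation (1); all $m$ observation vectors must be kept around until $z$ arrives.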

But the total update over a sequence is the same for the two methods. The key is to represent the error $z - P_t$ as a sum of changes in successive predictions:

$$z - P_t = \sum_{k=t}^{m} \left(P_{k+1} - P_k\right) \quad \text{where} \quad P_{m+1} \stackrel{\text{def}}{=} z$$
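This identity is just a telescoping sum: every intermediate prediction cancels. A quick numeric sanity check, with arbitrary made-up values:

```python
import numpy as np

P = np.array([0.3, 0.5, 0.2, 0.8])  # arbitrary predictions P_1 ... P_m
z = 1.0                             # arbitrary outcome
P_ext = np.append(P, z)             # define P_{m+1} := z

for t in range(len(P)):
    lhs = z - P[t]                             # overall error z - P_t
    rhs = (P_ext[t + 1:] - P_ext[t:-1]).sum()  # sum_{k=t}^{m} (P_{k+1} - P_k)
    assert np.isclose(lhs, rhs)                # the intermediate P's cancel
```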

Using this, equations (1) and (2) can be combined as follows; the second step swaps the order of summation, and the third relabels the indices $t$ and $k$:

$$\begin{aligned} w \leftarrow w+\sum_{t=1}^{m} \alpha\left(z-P_{t}\right) \nabla_{w} P_{t} &=w+\sum_{t=1}^{m} \alpha \sum_{k=t}^{m}\left(P_{k+1}-P_{k}\right) \nabla_{w} P_{t} \\ &=w+\sum_{k=1}^{m} \alpha \sum_{t=1}^{k}\left(P_{k+1}-P_{k}\right) \nabla_{w} P_{t} \\ &=w+\sum_{t=1}^{m} \alpha\left(P_{t+1}-P_{t}\right) \sum_{k=1}^{t} \nabla_{w} P_{k}. \end{aligned}$$

So:

$$\Delta w_t = \alpha \left(P_{t+1} - P_t\right) \sum_{k=1}^{t} \nabla_w P_k \tag{3}$$

we have an update that is independent of $z$: each $\Delta w_t$ can be computed from the pair of successive predictions $P_t$ and $P_{t+1}$ and a running sum of past gradients.
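Here is the same linear case written incrementally against equation (3); the name `td1_update` and the precomputed `preds` list are my own conveniences (in a truly online setting, $P_{t+1}$ would simply be computed at the next step):

```python
import numpy as np

def td1_update(w, xs, z, alpha):
    """Equation (3) for a linear predictor, computed step by step.

    Each step needs only the running gradient sum e and the current
    pair of successive predictions; z enters only as P_{m+1}.
    """
    preds = [w @ x for x in xs] + [z]  # P_1, ..., P_m, with P_{m+1} := z
    e = np.zeros_like(w)               # running sum of gradients so far
    delta = np.zeros_like(w)
    for t, x_t in enumerate(xs):
        e += x_t                       # gradient of P_t w.r.t. w is x_t
        delta += alpha * (preds[t + 1] - preds[t]) * e
    return w + delta
```

By the derivation above, this returns the same weights as `widrow_hoff_update` (up to floating-point rounding), but it only needs the running sum `e` rather than the full observation history.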

The hallmark of temporal-difference methods is their sensitivity to changes in successive predictions rather than to the overall error between predictions and the final outcome, so we modify equation (3) by exponentially decaying the weight on older gradients with a factor $\lambda$:

$$\Delta w_t = \alpha \left(P_{t+1} - P_t\right) \sum_{k=1}^{t} \lambda^{t-k} \nabla_w P_k$$

This is called the TD($\lambda$) method. If we set $\lambda = 1$, the temporal difference method is exactly the same as supervised learning, as shown above; at the other extreme, $\lambda = 0$ gives TD(0), whose update uses only the most recent gradient.
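A minimal sketch of the general rule, assuming the same linear setup; `td_lambda_update` and `lam` are illustrative names. The decayed sum of gradients can still be maintained incrementally as an eligibility trace, $e_t = \lambda e_{t-1} + x_t$:

```python
import numpy as np

def td_lambda_update(w, xs, z, alpha, lam):
    """TD(lambda) increments for a linear predictor.

    The decayed gradient sum sum_{k<=t} lam^{t-k} x_k is kept as an
    eligibility trace updated by e = lam * e + x_t.
    """
    preds = [w @ x for x in xs] + [z]  # P_{m+1} := z
    e = np.zeros_like(w)               # eligibility trace
    delta = np.zeros_like(w)
    for t, x_t in enumerate(xs):
        e = lam * e + x_t              # decay old gradients, add the new one
        delta += alpha * (preds[t + 1] - preds[t]) * e
    return w + delta
```

With `lam=1.0` this reproduces `td1_update` (and hence the supervised result), while `lam=0.0` keeps only the newest gradient, i.e. TD(0).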

