The first part of this answer is a little background that might bolster your intuition for what's going on. The second part is the more practical and direct answer to your question.

The gradient is just the generalization of the derivative to multivariable functions. The gradient of a function at a certain point is a vector that points in the direction of the steepest increase of that function.

Usually, we take a derivative/gradient of some loss function $\mathcal{L}$ because we want to minimize that loss. So we update our parameters in the direction *opposite* the direction of the gradient.

$$\theta_{t+1} = \theta_{t} - \alpha\nabla_{\theta_t} \mathcal{L} \tag{1}$$

In policy gradient methods, we're not trying to minimize a loss function. Instead, we're trying to *maximize* some measure $J$ of the performance of our agent, so now we want to update the parameters in the *same* direction as the gradient.

$$\theta_{t+1} = \theta_{t} + \alpha\nabla_{\theta_t} J \tag{2}$$
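To make the sign convention in $(2)$ concrete, here is a tiny sketch of gradient ascent stepping *with* the gradient. The objective $J(\theta) = -(\theta - 3)^2$ and all names are invented purely for illustration:

```python
# Toy example: maximize J(theta) = -(theta - 3)^2 by gradient ascent.
# grad J = -2 * (theta - 3); the objective is made up for illustration.
theta = 0.0
alpha = 0.1
for _ in range(100):
    grad = -2.0 * (theta - 3.0)
    theta = theta + alpha * grad  # ascent: step WITH the gradient, as in (2)
print(round(theta, 4))  # converges toward the maximizer, theta = 3
```

Flipping the `+` to a `-` would give the descent update $(1)$ instead, which is the entire difference between the two rules.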

In the episodic case, $J$ is the value of the starting state. In the continuing case, $J$ is the average reward. It just so happens that a nice theorem called the Policy Gradient Theorem applies to both cases. This theorem states that

$$\begin{align}
\nabla_{\theta_t}J(\theta_t) &\propto \sum_s \mu(s)\sum_a q_\pi (s,a) \nabla_{\theta_t} \pi (a|s,\theta_t)\\
&=\mathbb{E}_\mu \left[ \sum_a q_\pi (s,a) \nabla_{\theta_t} \pi (a|s,\theta_t)\right].
\end{align}\tag{3}
$$

The rest of the derivation is in your question, so let's skip to the end.

$$\begin{align}
\theta_{t+1} &= \theta_{t} + \alpha G_t \frac{\nabla_{\theta_t}\pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}\\
&= \theta_{t} + \alpha G_t \nabla_{\theta_t} \ln \pi(A_t|S_t,\theta_t)
\end{align}\tag{4}$$

Remember, $(4)$ says exactly the same thing as $(2)$, so REINFORCE just updates parameters in the direction that will most increase $J$. (Because we sample from an expectation in the derivation, the parameter step in REINFORCE is actually an unbiased estimate of the maximizing step.)
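As an entirely illustrative sketch of update $(4)$: for a tabular softmax policy, $\nabla_{\theta} \ln \pi(a)$ has the closed form $\mathrm{onehot}(a) - \pi$, so a single REINFORCE step fits in a few lines of NumPy. The 3-action, single-state setup and the sampled return are invented for the example:

```python
import numpy as np

# One REINFORCE update, eq. (4), for a tabular softmax policy over
# 3 actions in a single state. All numbers here are illustrative.
rng = np.random.default_rng(0)
theta = np.zeros(3)                       # action preferences
alpha, G = 0.1, 5.0                       # step size and a sampled return

pi = np.exp(theta) / np.exp(theta).sum()  # softmax policy
a = rng.choice(3, p=pi)                   # sample A_t ~ pi

# For a softmax policy, grad_theta ln pi(a) = onehot(a) - pi.
grad_log_pi = np.eye(3)[a] - pi
theta = theta + alpha * G * grad_log_pi   # step in the direction of (4)

new_pi = np.exp(theta) / np.exp(theta).sum()
print(new_pi[a] > pi[a])                  # positive return -> a became more likely
```

Note the qualitative behavior: a positive return pushes probability toward the sampled action, a negative return pushes it away, which is the intuition behind the name REINFORCE.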

Alright, but how do we actually compute this gradient? You use the chain rule of derivatives (backpropagation). Practically, though, both TensorFlow and PyTorch can take all the derivatives for you.

TensorFlow, for example, has a `minimize()` method in its `Optimizer` class that takes a loss function as input. Given a function of the parameters of the network, it will do the calculus for you to determine which way to update the parameters in order to minimize that function. But we don't want to minimize; we want to maximize! So just include a negative sign.

In our case, the function we want to minimize is
$$-G_t\ln \pi(A_t|S_t,\theta_t).$$

This corresponds to stochastic gradient descent ($G_t$ is not a function of $\theta_t$, so it passes through the differentiation as a constant).
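One way to convince yourself the sign trick works: a gradient *descent* step on the surrogate loss $-G_t \ln \pi(A_t|S_t,\theta_t)$ is exactly the ascent step in $(4)$. The sketch below checks this numerically for a softmax policy; every value in it is illustrative:

```python
import numpy as np

# Check: minimizing -G * ln pi(a) by gradient descent reproduces update (4).
# Softmax policy in one state; all numbers are illustrative.
theta = np.array([0.2, -0.1, 0.0])
alpha, G, a = 0.1, 2.0, 1

def log_pi(theta, a):
    return theta[a] - np.log(np.exp(theta).sum())

# Numerical gradient of the "loss" L(theta) = -G * log_pi(theta, a),
# via central differences.
eps = 1e-6
grad_loss = np.array([
    (-G * log_pi(theta + eps * np.eye(3)[i], a)
     + G * log_pi(theta - eps * np.eye(3)[i], a)) / (2 * eps)
    for i in range(3)
])

pi = np.exp(theta) / np.exp(theta).sum()
analytic = G * (np.eye(3)[a] - pi)            # G * grad ln pi(a), as in (4)
print(np.allclose(theta - alpha * grad_loss,  # descent on the loss ...
                  theta + alpha * analytic,   # ... equals ascent in (4)
                  atol=1e-5))                 # prints True
```

This is precisely what handing `-G * log_prob` to an optimizer's minimize step accomplishes in an autodiff framework.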

You might want to do minibatch gradient descent on each episode of experience in order to get a better (lower variance) estimate of $\nabla_{\theta_t} J$. If so, you would instead minimize
$$-\sum_t G_t\ln \pi(A_t|S_t,\theta_t),$$
where $\theta_t$ would be constant for different values of $t$ within the same episode. Technically, minibatch gradient descent updates parameters in the average estimated maximizing direction, but the scaling factor $1/N$ can be absorbed into the learning rate.
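A minimal sketch of assembling that episode-level objective, assuming a hypothetical 4-step episode with discounting. The rewards and stored log-probabilities are made up; in a real implementation the log-probabilities would come out of the framework's autograd graph rather than a plain array:

```python
import numpy as np

# Episode-level objective: minimize -sum_t G_t * ln pi(A_t|S_t).
# Hypothetical rewards from one 4-step episode, with discount gamma.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9

# Returns G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards.
G, returns = 0.0, []
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns = np.array(returns[::-1])

# Stored ln pi(A_t|S_t) for the actions taken (illustrative values).
log_probs = np.log(np.array([0.5, 0.3, 0.6, 0.4]))
loss = -(returns * log_probs).sum()  # the scalar handed to the optimizer
print(returns.round(3))              # discounted returns, G_0 first
```

Taking the gradient of this single scalar with respect to $\theta$ then yields the whole-episode update in one optimizer step.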

Are you asking if the gradient is with respect to an action? If yes, then, no, the gradient is not with respect to an action, but with respect to the parameters (of e.g. the neural network representing your policy), $\theta_t$. Conceptually, you don't need any "loss" to compute the gradient. You just need a multivariable differentiable function. In this case, the multivariable function is your parametrised policy $\pi(A_t \mid S_t, \theta_t)$. – nbro – 2019-04-21T20:23:09.020

@nbro Conceptually I understand that, but the action is the output of the policy. So what I don't understand is how conceptually we are determining the gradient of the policy with respect to the action it outputs? Further, if the action is making a choice, that wouldn't be differentiable. I looked at the implementation that used an action probability instead. But I'm still unsure what the gradient of the policy with respect to the output of the policy is. It doesn't make sense to me yet. – Hanzy – 2019-04-21T20:31:12.423

@nbro I guess I see that it’s technically the gradient of the probability of selecting an action At. But still not sure how to implement that. We want to know how the probability would change with respect to changing the parameters. Maybe I’m starting to get a little insight. – Hanzy – 2019-04-21T20:37:54.810


I think this is just a notation or terminology issue. To train your neural network representing $\pi$, you will need a loss function (that assesses the quality of the output action), yes, but this is not explicit in Barto and Sutton's book (at least in those equations), which just states that the gradient is with respect to the parameters (variables) of the function. Barto and Sutton just present the general idea. Have a look at this reference implementation: https://github.com/pytorch/examples/blob/master/reinforcement_learning/reinforce.py. – nbro – 2019-04-21T20:48:32.870

@nbro thanks for the link; I actually think I'm almost there. In this implementation we take the (negative) log probability and multiply it by the return. We moved the return inside the gradient since it's not dependent on $\theta$. Then we sum over the list of returns and use the summing function as the function over which we compute the gradient. So the gradient is the SUM of all future rewards w.r.t. theta (which, in turn, affects the choice of action). Is this correct? (I'm going back to read it again now). – Hanzy – 2019-04-22T01:29:45.473

@nbro sorry I meant to say the sum of all future rewards times the (negative) log probability of selecting the action taken that led to that reward. (I couldn’t edit my previous comment for clarity). – Hanzy – 2019-04-22T01:43:02.403

@nbro I’m assuming torch tracks the gradients by tracking entries in policy.saved_log_probs since they were drawn from torch.Categorical() distribution. Otherwise these log probabilities would simply be entries in a list w/o any way to track how they were computed or their gradients. – Hanzy – 2019-04-22T02:39:59.677

@nbro I’m now just trying to figure out why we want to sum over all the products of (negative) log probabilities of each action and their associated return instead of doing an update on each time step. The Sutton / Barto algorithm has a different update for every time step. I suppose if we are already tracking gradients of every log probability then instead of doing one update at a time we can do an aggregate update to a linear combination of them? This makes sense to me but I haven’t written it out to prove it. – Hanzy – 2019-04-22T02:53:10.420