# End-to-end computation graphs for RL

This blogpost introduces version 0.3 of the TensorForce reinforcement learning (RL) library, and motivates major design changes. Development was guided by the aim to provide a better interface and implementation for optimization modules, and later execution. More generally, with version 0.3, we are taking a big step towards RL models as pure TensorFlow objects, including all control-flow.

## Motivation

The key architecture problem in the first release of TensorForce were optimizers. The implementation of natural gradients in the TRPO model was an awkward mix of TensorFlow and NumPy intertwined with RL control flow, and evolutionary methods (which were not available then) would have faced similar problems. Such entangled modules are annoying because they interfere with things such as a generic distributed execution mode.

### How typical RL implementations work

The basic operations in RL are act(state) and observe(reward) — an agent encounters a state, reacts with an action and observes a subsequent reward. Every general-purpose RL framework will provide this abstract interface for interaction with the available model implementations. In the case of deep RL, the agent usually first collects a batch of experiences before infrequently updating its model parameters, consequently requiring a third fundamental API operation update(states, actions, rewards). These updates can be quite complex for some models and thus implementations tend to get messy.

Take the example of a (deep deterministic) policy gradient implementation in a popular deep RL library like OpenAI baselines, or TensorForce until recently: The act equivalent is usually a straightforward call to the underlying deep neural network feeding the given state representation to retrieve the corresponding action, while the observe function just adds reward information to the internal memory. The update operation, however, involves various steps, for instance:

1. Call to the TensorFlow baseline model to retrieve and optimize estimates for the state values.
2. Calculation of the discounted cumulative reward in Python/NumPy.
3. (Multiple) call(s) to the TensorFlow policy gradient model for optimization, e.g. in the case of multiple line search steps in TRPO.

Note that the TensorFlow computation graph created here is an entangled fusion of operations performed for the different steps. This means, for instance, that act fetches values from the same (partially) shared computation graph as update, which illustrates a fundamental lack of modularity.

### Issues with such entangled implementations

This mix of Python and TensorFlow within an update step as well as the entangled computation graphs can lead to undesirable properties, and might cause problems with respect to the following properties:

• Modularity. We think that, while such mixed implementations are preferable for research, they make it harder to design a modular RL framework, since functions are fundamentally limited to either domain, i.e. creating computation operations in TensorFlow, or implementing procedures in Python which themselves potentially interact with the TensorFlow computation graph. The way components are implemented might vary between models, for example, DQN is typically implemented in pure TensorFlow via the native TensorFlow Adam optimizer, while TRPO’s update (trust-region natural gradient optimization) is typically Python-based with various TensorFlow calls. This leads to a tendency to aggregate the entire model implementation in one large script, which is hard to understand, inflexible, and difficult to adapt to other use cases than the one it was written for (often OpenAI Gym or other simulations).
In contrast, modular implementations of act and update as pure TensorFlow functions enable a cleaner design, which is what we try to provide with TensorForce — version 0.3 provides interchangeable optimization modules (e.g. DQN can be used with an evolutionary optimizer) instead of optimization code baked into model implementations. The fact that (parts of) a model can be reused in other algorithms makes it easy to combine things such as integrating the DQN logic as baseline for a policy gradient model, and create a single combined computation graph for the update function of this new model.

• Distributed execution. Apart from being conceptually undesirable, an obvious problem with a computation graph shared between act and update is that configuring distributed execution is harder. More generally, being able to specify the allocation and execution of operations can be desirable. Defining the entire update logic as modularly structured TensorFlow operations gives full control over the computation graph definition and execution. Hence it is possible, for instance, to specify multiple worker model instances which locally only implement the act logic and collect experience into a global memory, while the update logic is implemented in a global model instance which performs updates and periodically updates the parameters of its worker instances.

• Exporting models and deployment. For TensorForce to become a truly useful applied RL library, models should be as independent from the Python script wrapper as possible, so that the TensorFlow computation graph can be imported with other language bindings or deployed with TensorFlow serving. Unlike supervised deep learning models, popular RL model implementations are usually not purely defined in terms of TensorFlow operations, and consequently cannot be exported and used as abstract RL computation graph.

• Performance We expect that many calls to the TensorFlow graph might slow down the optimization procedure, since it involves switching between Python-based and TensorFlow’s C-based execution contexts many times. Implementing the entire model logic in TensorFlow avoids this and gives TensorFlow the possibility to optimize the graph execution internally.

### The role of optimizers for the computation graph

The update component of a model can be divided into the definition of a loss term based on the batch of states, actions and rewards, and the specification of how this loss expression is used to update model parameters. For the latter part, a simple gradient-descent-style optimizer calculates the gradients of the loss function and updates parameters based on it. The PPO model works similarly, but involves multiple updates per optimization step. The update procedure of the TRPO model with its natural-gradient-based optimizer is more complex and comprises many different steps. An explanation of TRPO’s update mechanism and its implementation in TensorForce 0.3 is outlined below.

The optimization method specifies where and how the loss term is integrated into the update procedure, and hence defines how the operations in the computation graph are assembled. As we illustrate in the next section, it makes sense to frame the model as providing the callable function of calculating the loss given an input, while the optimizer module uses it to create the actual TensorFlow graph. We have not seen this interpretation of optimizers realized in a library design before, and we hope it leads to much cleaner and more modular implementations of e.g. natural gradient models.

## TensorForce 0.3 design details

In the following we are going to have a detailed look at the new implementation. Note that not all features have been fully pushed into the graph yet (e.g. replay memories), but the principal breaking changes for this have been integrated.

TensorForce version 0.3 makes heavy use of TensorFlow’s tf.make_template, which can be understood as turning a Python function of TensorFlow operations into what we call a ‘TensorFlow function’. It guarantees that every variable introduced as part of the function is unique and repeatedly calling the function uses these same variables, which are created on the first call. By modularly defining the RL models not as a computation graph, but as a callable graph creator function which is passed to the optimizer module, the actual computation graph can be instantiated by the respective optimizer according to the underlying algorithm.

A simple example: As mentioned earlier, the optimization procedure in PPO repeatedly updates its weights based on the current loss. The optimizer step can be implemented as follows (simplified and incomplete version of tf_step of the new MultiStep optimizer):

def tf_step(self, variables, fn_loss, **kwargs):
deltas = ()
for _ in xrange(self.num_steps):
with tf.control_dependencies(control_inputs=deltas):
deltas = self.optimizer.step(
variables=variables,
fn_loss=fn_loss,
**kwargs
)

This snippet repeatedly calls the optimization function optimizer.step of an internal optimizer, passing the loss callable fn_loss as argument, which the internal optimizer can use to create the computation graph necessary for calculating the loss in one iteration after application of the previous optimization step (ensured via tf.control_dependencies). The loss is a fairly common required optimizer argument, but it is possible to use others via the additional arguments in **kwargs — the NaturalGradient optimizer, for instance, uses a KL-divergence callable fn_kl_divergence in addition, while the Synchronization optimizer (see next paragraph) uses neither of these, but instead receives a list of source variables to synchronize the weights with.

MultiStep is an instance of the new MetaOptimizer base class, which takes an optimizer and modifies its update in some way, in this case applies it multiple times. Another example is the OptimizedStep optimizer which implements a line search along the update step proposed by its internal optimizer, to find a potentially more optimal fraction of this step (used by the TRPO model in combination with a natural gradient optimizer). Another less common optimizer example is the aforementioned Synchronization optimizer. It defines an infrequent, ‘lagging’ update step towards the weights of another optimized model which, for instance, corresponds to the infrequent parameter updates commonly found in Q-models for the target network.

The function prefix tf_ is ubiquitous in version 0.3 and marks a function as a template which is turned into a proper TensorFlow function via tf.make_template. The Distribution class, for instance, implements functions like Distribution.tf_sample, Distribution.tf_entropy or Distribution.tf_kl_divergence, which are all turned into their TensorFlow function equivalents Distribution.sample, Distribution.entropy and Distribution.kl_divergence. The same happens with the tf_step function above in the Optimizer class, so that even optimization functions can be reused, as we saw is the case in the MultiStep optimizer:

self.step = tf.make_template(
name_='step',
func_=self.tf_step,
custom_getter=custom_getter
)

We pushed this modularization and ‘TensorFlow-ification’ to the point that our entire Model class basically only interacts via Model.act and Model.observe for step-to-step action/reward retrieval, and Model.update for periodic parameter updates based on batches of experience (plus helper function Model.reset). As outlined before, this is different from many other implementations of RL models, where TensorFlow and Python code are very intertwined and at least updating the model involves a mix of both.

The following diagram illustrates the computation flow for the loss function of a policy gradient model with its various calls to model-internal TensorFlow functions:

Note that we have not yet moved the functionality of the Agent class to pure TensorFlow. This includes preprocessing, exploration, internal state handling and experience collection, either based on a batch of preceding timesteps or a replay memory of past experience.

## Optimization modules

The diagram below illustrates all TensorForce optimizers implemented as of yet. To explain the new mechanics, we focus on the optimizer behind TRPO, which combines the NaturalGradient and MultiStep optimizer.

### The optimization algorithm behind TRPO

At its core, the TRPO model is just a variant of a policy gradient model, in which reward signals (or advantage values) are weighted by the likelihood ratio of the updated stochastic policy with respect to the previous policy, instead of by the log likelihood of the updated policy, that is,

$\mathrm{L}(w + \Delta w) = \frac{1}{n} \sum_t \frac{ P_{w + \Delta w}(s_t) }{ P_w(s_t) } \cdot r_t.$

The same loss expression is, for instance, used by the PPO model. The major difference between these two models lies in the way they are optimized. In the following we outline how the natural-gradient-based optimizer for TRPO works (skip math).

### Background: Derivation of the natural gradient

TRPO’s optimization mechanism is comparatively complex and involves a sequence of steps. Starting with the following problem formulation,

${\arg \max}_{\Delta w} \left( -\mathrm{L}(w + \Delta w) \right) \quad\mathrm{such\ that}\quad \mathrm{KL}(P_w || P_{w + \Delta w}) = c,$

the loss is approximated by its first-order Taylor approximation around $$\Delta w = 0$$,

$f(\Delta w) := -\mathrm{L}(w) \:- \nabla \mathrm{L}(w) \cdot \Delta w,$

and the KL-divergence by its second-order Taylor approximation which, all else being zero, corresponds to the second derivative of the KL-divergence,

$g(\Delta w) := \mathrm{KL}(P_w || P_{w + \Delta w}) = \frac{1}{2} \Delta w^T \cdot \nabla^2 \mathrm{KL}(P_w || P_{w + \Delta w}) \cdot \Delta w = \frac{1}{2} \Delta w^T \cdot F(w) \cdot \Delta w.$

$$F(w) := \nabla^2 \mathrm{KL}(P_w || P_{w + \Delta w})$$ is the Fisher information matrix. This yields the following approximated optimization problem:

${\arg \max}_{\Delta w} f(\Delta w) \quad\mathrm{such\ that}\quad g(\Delta w) \:- c = 0,$

for which the corresponding Lagrangian looks as follows:

$\mathcal{L}(\Delta w, \lambda) = f(\Delta w) \:- \lambda \cdot (g(\Delta w) \:- c).$

The stationary points are characterized by the gradient of the Lagrangian for $$\Delta w$$ and $$\lambda$$ being zero. The former gives us

$\nabla f(\Delta w) \:- \lambda \cdot \nabla g(\Delta w) = -\nabla \mathrm{L}(w) \:- \lambda \cdot F(w) \cdot \Delta w = 0,$

which can be rewritten as a system of linear equations

$F(w) \cdot (\lambda \cdot \Delta w) = -\nabla \mathrm{L}(w).$

We can calculate $$F(w)$$ and $$\nabla \mathrm{L}(w)$$, hence this system can be solved for $$\Delta \tilde{w} := \lambda \cdot \Delta w$$, for instance using the iterative conjugate gradients method. Finally, the latter gradient condition given by

$g(\Delta w) \:- c = 0.5 \cdot \Delta w^T \cdot F(w) \cdot \Delta w \:- c = 0$

allows us to calculate $$\lambda$$ via

$\lambda^2 = \frac{1}{2 c} \cdot \Delta \tilde{w}^T \cdot F(w) \cdot \Delta \tilde{w}.$

We obtain the optimal step size $$\Delta w$$ and the estimated improvement $$\Delta \mathrm{L} := \nabla \mathrm{L}(w) \cdot \Delta w$$.

Due to the Taylor approximations, this estimate does not in general correspond to the actual improvement. Since the above calculation is computationally expensive, while just computing the loss is comparatively cheap, TRPO adds a line search step to determine a more optimal step size before performing the update. As part of the line search, the loss function $$\mathrm{L}(w + \Delta w)$$ is repeatedly evaluated with exponentially decreasing step sizes $$m \cdot \Delta w$$\$ (at least in John Schulman’s implementation) with $$m \in \{1.0, 0.5, 0.25, \dots\}$$ until the first step size is found with an actual improvement of at least the acceptance ratio $$\alpha$$ ($$0.1$$ in Schulman’s implementation):

$\frac{ \mathrm{L}(w) \:- \mathrm{L}(w + m \cdot \Delta w) }{ m \cdot \Delta \mathrm{L} } \geq \alpha.$

### Natural gradients and line search in TensorForce

Previous versions of TensorForce, and other implementations, define most of the calculations above as TensorFlow operations. However, the two iterative methods, i.e. the conjugate gradient algorithm and the line search, are implemented as Python loops using NumPy and repeated calls to the TensorFlow computation graph.

Our pure TensorFlow implementation of the TRPO optimizer is a combination of the NaturalGradient and OptimizedStep optimizer in the tensorforce.core.optimizers sub-package. The code snippets below are simplified and incomplete versions taken from the respective class files. We start with the NaturalGradient optimizer, which comprises essentially all of the math above apart from the last paragraph.

def tf_step(self, variables, fn_loss, fn_kl_divergence, **kwargs):

# Product to prevent having to actually calculate
# the entire matrix explicitly.
def fisher_matrix_product(x):
x = [tf.stop_gradient(input=delta) for delta in x]
])

deltas = self.solver.solve(
fn_x=fisher_matrix_product,
x_init=None,
)

delta_fisher_matrix_product = fisher_matrix_product(x=deltas)

tf.reduce_sum(input_tensor=(delta_F * delta))
for delta_F, delta in zip(delta_fisher_matrix_product, deltas)
])

lagrange_multiplier = tf.sqrt(x=(constant / self.learning_rate))

estimated_deltas = [delta / lagrange_multiplier for delta in deltas]

])

return estimated_deltas, estimated_improvement

The first difficulty is to solve the resulting system of linear equations $$F(w) \cdot (\lambda \cdot \Delta w) = -\nabla \mathrm{L}(w)$$ with the iterative conjugate gradient method, which happens as part of self.solver.solve(fn_x=..., b=...) in the code. The NaturalGradient class here uses our new ConjugateGradient solver from the sub-package tensorforce.core.optimizers.solvers, which itself is an instance of the iterative solver super-class Iterative(Solver). The Solver class specifies a template for a pure TensorFlow solver for any equation/optimization for $$x$$ involving an expression $$f(x)$$:

class Solver(object):

def __init__(self):
self.solve = tf.make_template(name_='solver', func_=self.tf_solve)

def tf_solve(self, fn_x, *args):
raise NotImplementedError

The Iterative solver class defines a still problem-agnostic template of iterative solver algorithms with initialization step, iteration loop and termination condition (default is a maximum number of iterations):

class Iterative(Solver):

def __init__(self, max_iterations):
super(Iterative, self).__init__()
self.max_iterations = max_iterations
self.initialize = tf.make_template(name_='initialize', func_=self.tf_initialize)
self.step = tf.make_template(name_='step', func_=self.tf_step)
self.next_step = tf.make_template(name_='next-step', func_=self.tf_next_step)

def tf_solve(self, fn_x, x_init, *args):
self.fn_x = fn_x
initial_args = self.initialize(x_init, *args)
final_args = tf.while_loop(
cond=self.next_step,
body=self.step,
loop_vars=initial_args
)
return final_args[0]  # First argument contains solution

def tf_initialize(self, x_init, *args):
return x_init, 0

def tf_step(self, x, iteration, *args):
iteration += 1
return (x, iteration) + args

def tf_next_step(self, x, iteration, *args):
return iteration < self.max_iterations

The ConjugateGradient class, finally, implements the conjugate gradient algorithm, solving a system of the form $$f(x) = A \cdot x \overset{!}{=} b$$. For this, it defines the solver signature as tf_solve(fn_x, x_init, b) with fn_x corresponding to $$A(x)$$, b corresponding to $$b$$ and x_init corresponding to the initial solution guess $$x_0$$ (zero vector if not specified):

deltas = self.solver.solve(
fn_x=fisher_matrix_product,
x_init=None,
)

Not much is gained from showing the entire implementation here, except for the fact that the initialization tf_initialize as well as the body of the loop tf_step call the fn_x function once each, in self.fn_x(x=x_init) and self.fn_x(x=conjugate) respectively, which correspond to $$A x_0$$ and $$A p_k$$ appearing twice in the Wikipedia pseudo-code for the algorithm.

Next, the OptimizedStep optimizer takes the estimated optimal step size and improvement of the NaturalGradient optimizer and applies the iterative line search solver, which can also be found in the tensorforce.core.optimizers.solvers sub-package. The basic principle of an iterative solver is the same and we skip over the concrete implementation code here.

### Optimizers as interchangeable configuration entries

Finally, the optimizer can be specified as part of the JSON agent configuration as follows, and in the case of the TRPO agent class is predefined as such:

"optimizer": {
"type": "optimized_step",
"optimizer": {
"learning_rate": 1e-3,
"cg_max_iterations": 20,
"cg_damping": 1e-3
},
"ls_max_iterations": 10,
"ls_accept_ratio": 0.9
}

To give another example, and illustrate how the new optimizer module can be changed in a few lines, here the corresponding JSON specification for the evolutionary optimizer:

"optimizer": {
"type": "evolutionary",
"learning_rate": 1e-2,
"num_samples": 5
}

## What’s next

The short-term architectural goal for TensorForce from here on is clear: push all the remaining pieces (memories, preprocessing) into the TensorFlow graph. The logical next step is to think about executing this graph more optimally according to the available hardware. We originally implemented A3C semantics via distributed TensorFlow and a shared-model multi-threaded runner.

Aside from the observation (e.g. noted by OpenAI in the ACKTR/A2C post) that it is unclear whether noise from asynchronous execution really improves training, version 0.4 will center on execution modes for GPU batching (inspired by GA3C/PAAC).