Completed
Pull Request — master (#1079)
by David
04:54
created

GradientDescent.__init__()   D

Complexity

Conditions 10

Size

Total Lines 53

Duplication

Lines 0
Ratio 0 %
Metric Value
cc 10
dl 0
loc 53
rs 4.8

How to fix   Long Method    Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.
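As a minimal sketch of Extract Method applied to a constructor shaped like GradientDescent.__init__(), each branch of the conditional can move into a named helper. The class `GradientSetup`, the helper names, and the `grad_fn` callable are hypothetical stand-ins (they are not part of Blocks) so that the example runs without Theano:

```python
class GradientSetup(object):
    """Hypothetical stand-in for a constructor with two validation branches."""

    def __init__(self, cost=None, parameters=None, gradients=None,
                 grad_fn=None):
        if gradients:
            self.gradients = dict(gradients)
            self.parameters = self._parameters_from(self.gradients)
        else:
            self.parameters = parameters
            self.gradients = self._infer_gradients(cost, parameters, grad_fn)

    @staticmethod
    def _infer_gradients(cost, parameters, grad_fn):
        # Formerly the "no gradients given" branch of the constructor.
        if cost is None:
            raise ValueError("can't infer gradients; no cost specified")
        if not parameters:
            raise ValueError("can't infer gradients; no parameters specified")
        return dict(zip(parameters, grad_fn(cost, parameters)))

    @staticmethod
    def _parameters_from(gradients):
        # Formerly the "gradients given" branch of the constructor.
        return list(gradients.keys())


# grad_fn is an assumed callable that returns one gradient per parameter.
setup = GradientSetup(cost="c", parameters=["w", "b"],
                      grad_fn=lambda cost, params: [1.0, 2.0])
```

Each helper now has a single reason to raise, which lowers the number of conditions the constructor itself has to carry.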

Complexity

Complex methods like GradientDescent.__init__() often indicate a class that does a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

"""Training algorithms."""

Bug: There seem to be cyclic imports along the following chains:

blocks.bricks.base -> blocks.graph -> blocks.graph.bn -> blocks.filter
blocks.bricks -> blocks.bricks.bn -> blocks.bricks.sequences -> blocks.bricks.simple -> blocks.bricks.wrappers -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn
blocks.bricks -> blocks.bricks.bn -> blocks.graph -> blocks.graph.bn
blocks.bricks -> blocks.bricks.wrappers -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn
blocks.bricks -> blocks.bricks.bn -> blocks.bricks.sequences -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn
blocks.bricks -> blocks.bricks.bn -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn
blocks.bricks -> blocks.bricks.bn -> blocks.bricks.sequences -> blocks.bricks.simple -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn
blocks.bricks -> blocks.bricks.bn -> blocks.bricks.sequences -> blocks.bricks.simple -> blocks.bricks.interfaces -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn

Cyclic imports may cause partly loaded modules to be returned. This might lead to unexpected runtime behavior which is hard to debug.

import logging
import itertools
from abc import ABCMeta, abstractmethod
from collections import OrderedDict
from six.moves import reduce

from picklable_itertools.extras import equizip

import theano
from six import add_metaclass
from theano import tensor

from blocks.graph import ComputationGraph
from blocks.roles import add_role, ALGORITHM_HYPERPARAMETER, ALGORITHM_BUFFER
from blocks.theano_expressions import l2_norm
from blocks.utils import (dict_subset, pack, shared_floatx,
                          shared_floatx_zeros_matching)

logger = logging.getLogger(__name__)


def _create_algorithm_buffer_for(param, *args, **kwargs):
    buf = shared_floatx_zeros_matching(param, *args, **kwargs)
    buf.tag.for_parameter = param
    add_role(buf, ALGORITHM_BUFFER)
    return buf


@add_metaclass(ABCMeta)
class TrainingAlgorithm(object):
    """Base class for training algorithms.

    A training algorithm object has a simple life-cycle.
    First it is initialized by calling its :meth:`initialize` method.
    At this stage, for instance, Theano functions can be compiled.
    After that the :meth:`process_batch` method is repeatedly
    called with a batch of training data as a parameter.

    """
    @abstractmethod
    def initialize(self, **kwargs):
        """Initialize the training algorithm."""
        pass

    @abstractmethod
    def process_batch(self, batch):
        """Process a batch of training data.

        Parameters
        ----------
        batch : dict
            A dictionary of (source name, data) pairs.

        """
        pass


variable_mismatch_error = """

Blocks tried to match the sources ({sources}) of the training dataset to \
the names of the Theano variables ({variables}), but failed to do so. \
If you want to train on a subset of the sources that your dataset provides, \
pass the `sources` keyword argument to its constructor. Or pass \
on_unused_sources='warn' or on_unused_sources='ignore' to \
the GradientDescent algorithm."""

source_missing_error = """

Blocks didn't find all the sources ({sources}) of the training dataset \
that match the names of the Theano variables ({variables})."""


class GradientDescent(TrainingAlgorithm):
    """A base class for all gradient descent algorithms.

    By "gradient descent" we mean a training algorithm of the following
    form:

    .. code-block::  python

        for batch in data:
            steps = step_rule.compute_steps(parameters,
                                            gradients_wr_parameters)
            for parameter in parameters:
                parameter -= steps[parameter]

    Note that the step is *subtracted, not added*! This is done in order
    to make step rule chaining possible.

    Parameters
    ----------
    cost : :class:`~tensor.TensorVariable`, optional
        The objective to be minimized.
    parameters : list of :class:`~tensor.TensorSharedVariable`, optional
        The parameters to be tuned. If not provided, inferred from the
        keys of `gradients`.
    step_rule : instance of :class:`StepRule`, optional
        An object encapsulating most of the algorithm's logic. Its
        `compute_steps` method is called to get Theano expressions for
        the steps. Note that the step rule might have a state, e.g. to
        remember a weighted sum of gradients from previous steps, as is
        done in gradient descent with momentum. If ``None``, an instance
        of :class:`Scale` is created.
    gradients : dict, optional
        A dictionary mapping a parameter to an expression for the cost's
        gradient with respect to the parameter. If ``None``, the gradients
        are taken automatically using :func:`theano.gradient.grad`.
    known_grads : dict, optional
        A passthrough to `theano.tensor.grad`'s `known_grads` argument.
        Useful when you know the [approximate] gradients of some
        sub-expressions and would like Theano to use that information
        to compute parameter gradients. Only makes sense when `gradients`
        is `None`.
    consider_constant : list, optional
        A passthrough to `theano.tensor.grad`'s `consider_constant`
        argument.  A list of expressions through which gradients will not
        be backpropagated. Only makes sense when `gradients` is `None`.
    on_unused_sources : str, one of 'raise' (default), 'ignore', 'warn'
        Controls behavior when not all sources are used.
    theano_func_kwargs : dict, optional
        A passthrough to `theano.function` for additional arguments.
        Useful for passing `profile` or `mode` arguments to the theano
        function that will be compiled for the algorithm.

    Attributes
    ----------
    gradients : dict
        The gradient dictionary.
    step_rule : instance of :class:`StepRule`
        The step rule.
    updates : list of :class:`~tensor.TensorSharedVariable` updates
        Updates to be done for every batch. It is required that the
        updates are done using the old values of optimized parameters.

    Notes
    -----
    Changing the `updates` attribute or calling `add_updates` after
    the `initialize` method is called will have no effect.

    .. todo::

       Some shared variables are not parameters (e.g. those created by
       random streams).

    .. todo::

       Due to a rather premature status of the :class:`ComputationGraph`
       class the parameters used only inside scans are not fetched
       currently.

    """
    def __init__(self, cost=None, parameters=None, step_rule=None,
                 gradients=None, known_grads=None, consider_constant=None,
                 on_unused_sources='raise', theano_func_kwargs=None, **kwargs):
        super(GradientDescent, self).__init__(**kwargs)
        # Set initial values for cost, parameters, gradients.
        self.cost = cost
        self._updates = []
        self.parameters = parameters
        self.gradients = gradients

        # If we don't have gradients, we'll need to infer them from the
        # cost and the parameters, both of which must not be None.
        if not self.gradients:
            if self.cost is None:
                raise ValueError("can't infer gradients; no cost specified")
            elif self.parameters is None or len(self.parameters) == 0:
                raise ValueError("can't infer gradients; "
                                 "no parameters specified")
            self.inputs = ComputationGraph(cost).inputs
            logger.info("Taking the cost gradient")
            self.gradients = dict(
                equizip(self.parameters, tensor.grad(
                    self.cost, self.parameters,
                    known_grads=known_grads,
                    consider_constant=consider_constant)))
            logger.info("The cost gradient computation graph is built")
        else:
            # If we have gradients, we get parameters from that.
            # If you're specifying both then something is screwy.
            if self.parameters is not None:
                # The '{}' placeholder is filled with the class name.
                logger.warning('{} received both gradients and parameters '
                               'arguments; using parameters deduced from '
                               'gradients'.format(self.__class__.__name__))
            gradients_dict = dict(gradients)
            self.parameters = list(gradients_dict.keys())
            self.inputs = ComputationGraph(gradients_dict.values()).inputs
            if known_grads:
                raise ValueError("known_grads has no effect when gradients "
                                 "are passed in")
            if consider_constant is not None:
                raise ValueError("consider_constant has no effect when "
                                 "gradients are passed in")
        self.step_rule = step_rule if step_rule else Scale()

        self.total_gradient_norm = l2_norm(
            self.gradients.values()).copy(name="total_gradient_norm")
        self.steps, self.step_rule_updates = (
            self.step_rule.compute_steps(self.gradients))
        self.total_step_norm = l2_norm(
            self.steps.values()).copy(name="total_step_norm")
        self.on_unused_sources = on_unused_sources
        self.theano_func_kwargs = (theano_func_kwargs if theano_func_kwargs
                                   is not None else dict())

    def initialize(self):
        logger.info("Initializing the training algorithm")
        # Note: the gradients are computed in the same order in which
        # the parameters were given. Keep it like that to ensure
        # reproducibility.
        for parameter in self.parameters:
            self.updates.append((parameter, parameter - self.steps[parameter]))
        self.updates += self.step_rule_updates
        self._function = theano.function(
            self.inputs, [], updates=self.updates, **self.theano_func_kwargs)
        logger.info("The training algorithm is initialized")

    def _validate_source_names(self, batch):
        in_names = [v.name for v in self.inputs]

        if not set(in_names).issubset(set(batch.keys())):
            raise ValueError("Didn't find all sources: " +
                             source_missing_error.format(
                                 sources=batch.keys(),
                                 variables=in_names))
        if not set(batch.keys()).issubset(set(in_names)):
            if self.on_unused_sources == 'ignore':
                pass
            elif self.on_unused_sources == 'warn':
                if not hasattr(self, '_unused_source_warned'):
                    # logger.warn is a deprecated alias of logger.warning.
                    logger.warning(variable_mismatch_error.format(
                        sources=batch.keys(),
                        variables=in_names))
                self._unused_source_warned = True
            elif self.on_unused_sources == 'raise':
                raise ValueError(
                    "mismatch of variable names and data sources" +
                    variable_mismatch_error.format(
                        sources=batch.keys(),
                        variables=in_names))
            else:
                raise ValueError("Wrong value of on_unused_sources: {}."
                                 .format(self.on_unused_sources))

    def process_batch(self, batch):
        self._validate_source_names(batch)
        ordered_batch = [batch[v.name] for v in self.inputs]
        self._function(*ordered_batch)

    @property
    def updates(self):
        return self._updates

    @updates.setter
    def updates(self, value):
        self._updates = value

    def add_updates(self, updates):
        """Add updates to the training process.

        The updates will be done _before_ the parameters are changed.

        Parameters
        ----------
        updates : list of tuples or :class:`~collections.OrderedDict`
            The updates to add.

        """
        if isinstance(updates, OrderedDict):
            updates = list(updates.items())
        if not isinstance(updates, list):
            raise ValueError("updates must be a list of tuples or an "
                             "OrderedDict")
        self.updates.extend(updates)
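
The update rule from the GradientDescent docstring (steps are subtracted from the parameters) can be illustrated without Theano. The sketch below is a toy under stated assumptions: plain floats stand in for shared variables, `scale_steps` is a hypothetical stand-in for a Scale-like step rule, and the cost is f(w) = (w - 3)^2.

```python
def scale_steps(gradients, learning_rate):
    """Hypothetical Scale-like rule: step = learning_rate * gradient."""
    return {name: learning_rate * g for name, g in gradients.items()}


def train(parameters, grad_fn, learning_rate, n_batches):
    """Run n_batches plain gradient descent updates in place."""
    for _ in range(n_batches):
        gradients = grad_fn(parameters)
        steps = scale_steps(gradients, learning_rate)
        for name in parameters:
            # The step is subtracted, not added.
            parameters[name] -= steps[name]
    return parameters


# The gradient of (w - 3)^2 with respect to w is 2 * (w - 3).
result = train({"w": 0.0}, lambda p: {"w": 2.0 * (p["w"] - 3.0)},
               learning_rate=0.1, n_batches=100)
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 per update, so `result["w"]` ends up very close to 3.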


@add_metaclass(ABCMeta)
class StepRule(object):
    """A rule to compute steps for a gradient descent algorithm."""
    def compute_step(self, parameter, previous_step):
        """Build a Theano expression for the step for a parameter.

        This method is called by the default implementation of
        :meth:`compute_steps`, so subclasses do not have to write the
        loop over parameters themselves.

        Parameters
        ----------
        parameter : :class:`~tensor.TensorSharedVariable`
            The parameter.
        previous_step : :class:`~tensor.TensorVariable`
            Some quantity related to the gradient of the cost with respect
            to the parameter, either the gradient itself or a step in a
            related direction.

        Returns
        -------
        step : :class:`~theano.Variable`
            Theano variable for the step to take.
        updates : list
            A list of tuples representing updates to be performed. This
            is useful for stateful rules such as :class:`Momentum` which
            need to update shared variables after iterations.

        """
        raise NotImplementedError

    def compute_steps(self, previous_steps):
        """Build a Theano expression for steps for all parameters.

        Override this method if you want to process the steps
        with respect to all parameters as a whole, not parameter-wise.

        Parameters
        ----------
        previous_steps : OrderedDict
            An :class:`~OrderedDict` of
            (:class:`~tensor.TensorSharedVariable`,
            :class:`~tensor.TensorVariable`) pairs. The keys are the
            parameters being trained, the values are the expressions for
            quantities related to gradients of the cost with respect to
            the parameters, either the gradients themselves or steps in
            related directions.

        Returns
        -------
        steps : OrderedDict
            A dictionary of the proposed steps in the same form as
            `previous_steps`.
        updates : list
            A list of tuples representing updates to be performed.

        """
        parameter_wise = [self.compute_step(parameter,
                                            previous_steps[parameter])
                          for parameter in previous_steps]
        steps, updates = equizip(*parameter_wise)
        steps = OrderedDict((parameter, step) for parameter, step
                            in equizip(previous_steps.keys(), steps))
        updates = list(itertools.chain(*updates))
        return steps, updates


class CompositeRule(StepRule):
    """Chains several step rules.

    Parameters
    ----------
    components : list of :class:`StepRule`
        The learning rules to be chained. The rules will be applied in
        the order given.

    """
    def __init__(self, components):
        self.components = components

    def compute_steps(self, previous_steps):
        steps = previous_steps
        updates = []
        for rule in self.components:
            steps, more_updates = rule.compute_steps(steps)
            updates += more_updates
        return steps, updates
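
To illustrate why the order of chained rules matters, here is a Theano-free sketch. `ToyScale`, `ToyClip` and `chain` are hypothetical toys that only mirror the `compute_steps(previous_steps) -> (steps, updates)` contract; floats stand in for Theano expressions, and `ToyClip` clips element-wise rather than by norm.

```python
from collections import OrderedDict


class ToyScale(object):
    """Multiplies every step by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def compute_steps(self, previous_steps):
        steps = OrderedDict((p, self.factor * s)
                            for p, s in previous_steps.items())
        return steps, []


class ToyClip(object):
    """Clips every step to the interval [-limit, limit]."""

    def __init__(self, limit):
        self.limit = limit

    def compute_steps(self, previous_steps):
        steps = OrderedDict((p, max(-self.limit, min(self.limit, s)))
                            for p, s in previous_steps.items())
        return steps, []


def chain(rules, previous_steps):
    """Feed the output steps of each rule into the next one."""
    steps, updates = previous_steps, []
    for rule in rules:
        steps, more_updates = rule.compute_steps(steps)
        updates += more_updates
    return steps, updates


gradients = OrderedDict([("w", 4.0), ("b", -0.5)])
clip_then_scale, _ = chain([ToyClip(1.0), ToyScale(0.1)], gradients)
scale_then_clip, _ = chain([ToyScale(0.1), ToyClip(1.0)], gradients)
```

Clipping first limits the step for `w` to 1 before scaling, while scaling first leaves nothing to clip, so the two orders produce different steps for `w`.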


class Scale(StepRule):
    """A step in the direction proportional to the previous step.

    If used in :class:`GradientDescent` alone, this step rule implements
    steepest descent.

    Parameters
    ----------
    learning_rate : float
        The learning rate by which the previous step is multiplied to
        produce the step.

    Attributes
    ----------
    learning_rate : :class:`~tensor.TensorSharedVariable`
        The shared variable storing the learning rate used.

    """
    def __init__(self, learning_rate=1.0):
        self.learning_rate = shared_floatx(learning_rate, "learning_rate")
        add_role(self.learning_rate, ALGORITHM_HYPERPARAMETER)

    def compute_step(self, parameter, previous_step):
        return self.learning_rate * previous_step, []


class BasicMomentum(StepRule):
    """Accumulates step with exponential discount.

    Parameters
    ----------
    momentum : float, optional
        The momentum coefficient. Defaults to 0.

    Notes
    -----
    This step rule is intended to be used in conjunction with another
    step rule, _e.g._ :class:`Scale`. For an all-batteries-included
    experience, look at :class:`Momentum`.

    """
    def __init__(self, momentum=0.):
        self.momentum = shared_floatx(momentum, "momentum")
        add_role(self.momentum, ALGORITHM_HYPERPARAMETER)

    def compute_step(self, parameter, previous_step):
        velocity = _create_algorithm_buffer_for(parameter, "velocity")
        step = self.momentum * velocity + previous_step
        updates = [(velocity, step)]
        return step, updates


class Momentum(CompositeRule):
    """Accumulates step with exponential discount.

    Combines :class:`BasicMomentum` and :class:`Scale` to form the
    usual momentum step rule.

    Parameters
    ----------
    learning_rate : float, optional
        The learning rate by which the previous step is scaled.
        Defaults to 1.
    momentum : float, optional
        The momentum coefficient. Defaults to 0.

    Attributes
    ----------
    learning_rate : :class:`~tensor.SharedVariable`
        A variable for learning rate.
    momentum : :class:`~tensor.SharedVariable`
        A variable for momentum.

    See Also
    --------
    :class:`SharedVariableModifier`

    """
    def __init__(self, learning_rate=1.0, momentum=0.):
        scale = Scale(learning_rate=learning_rate)
        basic_momentum = BasicMomentum(momentum=momentum)
        self.learning_rate = scale.learning_rate
        self.momentum = basic_momentum.momentum
        self.components = [scale, basic_momentum]
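
Because Momentum lists its components as [scale, basic_momentum], the gradient is scaled first and then accumulated, which gives v_t = momentum * v_{t-1} + learning_rate * g_t, with v_t subtracted from the parameter. A scalar sketch of that recurrence follows; plain floats replace shared variables and the function name `momentum_steps` is hypothetical.

```python
def momentum_steps(gradients, learning_rate=1.0, momentum=0.0):
    """Return the step taken for each gradient in a sequence."""
    velocity = 0.0  # plays the role of the zero-initialised buffer
    steps = []
    for g in gradients:
        scaled = learning_rate * g          # Scale component
        step = momentum * velocity + scaled  # BasicMomentum component
        velocity = step                     # buffer update for next batch
        steps.append(step)
    return steps


steps = momentum_steps([1.0, 1.0, 1.0], learning_rate=0.1, momentum=0.5)
```

For a constant gradient the steps grow from 0.1 towards learning_rate / (1 - momentum) = 0.2, which is the usual acceleration effect of momentum.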


class AdaDelta(StepRule):
    """Adapts the step size over time using only first order information.

    Parameters
    ----------
    decay_rate : float, optional
        Decay rate in [0, 1]. Defaults to 0.95.
    epsilon : float, optional
        Stabilizing constant for RMS. Defaults to 1e-6.

    Notes
    -----
    For more information, see [ADADELTA]_.

    .. [ADADELTA] Matthew D. Zeiler, *ADADELTA: An Adaptive Learning
       Rate Method*, arXiv:1212.5701.

    """
    def __init__(self, decay_rate=0.95, epsilon=1e-6):
        if not 0.0 <= decay_rate <= 1.0:
            raise ValueError("decay rate needs to be in [0, 1]")
        self.decay_rate = shared_floatx(decay_rate, "decay_rate")
        add_role(self.decay_rate, ALGORITHM_HYPERPARAMETER)
        self.epsilon = shared_floatx(epsilon, "epsilon")
        add_role(self.epsilon, ALGORITHM_HYPERPARAMETER)

    def compute_step(self, parameter, previous_step):
        mean_square_step_tm1 = _create_algorithm_buffer_for(
            parameter, "mean_square_step_tm1")
        mean_square_delta_x_tm1 = _create_algorithm_buffer_for(
            parameter, "mean_square_delta_x_tm1")

        mean_square_step_t = (
            self.decay_rate * mean_square_step_tm1 +
            (1 - self.decay_rate) * tensor.sqr(previous_step)
        )

        rms_delta_x_tm1 = tensor.sqrt(mean_square_delta_x_tm1 + self.epsilon)
        rms_step_t = tensor.sqrt(mean_square_step_t + self.epsilon)
        delta_x_t = rms_delta_x_tm1 / rms_step_t * previous_step

        mean_square_delta_x_t = (
            self.decay_rate * mean_square_delta_x_tm1 +
            (1 - self.decay_rate) * tensor.sqr(delta_x_t)
        )

        step = delta_x_t
        updates = [(mean_square_step_tm1, mean_square_step_t),
                   (mean_square_delta_x_tm1, mean_square_delta_x_t)]
        return step, updates
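
The AdaDelta recurrences can be traced on scalars. This is a sketch only: floats replace Theano buffers, both running averages are assumed to start at zero, and the function name `adadelta_steps` is hypothetical.

```python
import math


def adadelta_steps(gradients, decay_rate=0.95, epsilon=1e-6):
    """Return the AdaDelta step for each gradient in a sequence."""
    mean_square_grad = 0.0   # running average of squared gradients
    mean_square_delta = 0.0  # running average of squared steps
    steps = []
    for g in gradients:
        mean_square_grad = (decay_rate * mean_square_grad +
                            (1 - decay_rate) * g ** 2)
        rms_delta = math.sqrt(mean_square_delta + epsilon)
        rms_grad = math.sqrt(mean_square_grad + epsilon)
        delta = rms_delta / rms_grad * g
        mean_square_delta = (decay_rate * mean_square_delta +
                             (1 - decay_rate) * delta ** 2)
        steps.append(delta)
    return steps


steps = adadelta_steps([1.0, 1.0, 1.0])
```

No learning rate appears: the ratio of the two RMS terms sets the step size, and the step always keeps the sign of the gradient.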


Duplication: This code seems to be duplicated in your project (flagged inside the BasicRMSProp docstring).

class BasicRMSProp(StepRule):
    """Scales the step size by a running average of the recent step norms.

    Parameters
    ----------
    decay_rate : float, optional
        How fast the running average decays, value in [0, 1]
        (lower is faster).  Defaults to 0.9.
    max_scaling : float, optional
        Maximum scaling of the step size, in case the running average is
        really small. Needs to be greater than 0. Defaults to 1e5.

    Notes
    -----
    This step rule is intended to be used in conjunction with another
    step rule, _e.g._ :class:`Scale`. For an all-batteries-included
    experience, look at :class:`RMSProp`.

    In general, this step rule should be used _before_ other step rules,
    because it has normalization properties that may undo their work.
    For instance, it should be applied first when used in conjunction
    with :class:`Scale`.

    For more information, see [Hint2014]_.

    """
    def __init__(self, decay_rate=0.9, max_scaling=1e5):
        if not 0.0 <= decay_rate <= 1.0:
            raise ValueError("decay rate needs to be in [0, 1]")
        if max_scaling <= 0:
            raise ValueError("max. scaling needs to be greater than 0")
        self.decay_rate = shared_floatx(decay_rate, "decay_rate")
        add_role(self.decay_rate, ALGORITHM_HYPERPARAMETER)
        self.epsilon = 1. / max_scaling

    def compute_step(self, parameter, previous_step):
        mean_square_step_tm1 = _create_algorithm_buffer_for(
            parameter, "mean_square_step_tm1")
        mean_square_step_t = (
            self.decay_rate * mean_square_step_tm1 +
            (1 - self.decay_rate) * tensor.sqr(previous_step))
        rms_step_t = tensor.maximum(
            tensor.sqrt(mean_square_step_t), self.epsilon)
        step = previous_step / rms_step_t
        updates = [(mean_square_step_tm1, mean_square_step_t)]
        return step, updates


class RMSProp(CompositeRule):
    """Scales the step size by a running average of the recent step norms.

    Combines :class:`BasicRMSProp` and :class:`Scale` to form the step rule
    described in [Hint2014]_.

    .. [Hint2014] Geoff Hinton, *Neural Networks for Machine Learning*,
       lecture 6a,
       http://cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

    Parameters
    ----------
    learning_rate : float, optional
        The learning rate by which the previous step is scaled.
        Defaults to 1.
    decay_rate : float, optional
        How fast the running average decays (lower is faster).
        Defaults to 0.9.
    max_scaling : float, optional
        Maximum scaling of the step size, in case the running average is
        really small. Defaults to 1e5.

    Attributes
    ----------
    learning_rate : :class:`~tensor.SharedVariable`
        A variable for learning rate.
    decay_rate : :class:`~tensor.SharedVariable`
        A variable for decay rate.

    See Also
    --------
    :class:`SharedVariableModifier`

    """
    def __init__(self, learning_rate=1.0, decay_rate=0.9, max_scaling=1e5):
        basic_rms_prop = BasicRMSProp(decay_rate=decay_rate,
                                      max_scaling=max_scaling)
        scale = Scale(learning_rate=learning_rate)
        self.learning_rate = scale.learning_rate
        self.decay_rate = basic_rms_prop.decay_rate
        self.components = [basic_rms_prop, scale]
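
A scalar trace of the combined rule, normalising first and scaling second as in the component order [basic_rms_prop, scale], might look like the sketch below. Floats replace Theano buffers, the running average is assumed to start at zero, and the name `rmsprop_steps` is hypothetical.

```python
import math


def rmsprop_steps(gradients, learning_rate=1.0, decay_rate=0.9,
                  max_scaling=1e5):
    """Return the RMSProp step for each gradient in a sequence."""
    epsilon = 1.0 / max_scaling  # lower bound on the RMS divisor
    mean_square = 0.0
    steps = []
    for g in gradients:
        mean_square = decay_rate * mean_square + (1 - decay_rate) * g ** 2
        rms = max(math.sqrt(mean_square), epsilon)
        steps.append(learning_rate * (g / rms))
    return steps


steps = rmsprop_steps([2.0, 2.0], learning_rate=0.01)
```

Dividing by the running RMS makes the first step independent of the gradient's magnitude, which is why a subsequent Scale is needed to set the actual step size.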


class StepClipping(StepRule):
    """Rescales an entire step if its L2 norm exceeds a threshold.

    When the previous steps are the gradients, this step rule performs
    gradient clipping.

    Parameters
    ----------
    threshold : float, optional
        The maximum permitted L2 norm for the step. The step
        will be rescaled to be not higher than this quantity.
        If ``None``, no rescaling will be applied.

    Attributes
    ----------
    threshold : :class:`.tensor.TensorSharedVariable`
        The shared variable storing the clipping threshold used.

    """
    def __init__(self, threshold=None):
        if threshold:
            self.threshold = shared_floatx(threshold, "threshold")
            add_role(self.threshold, ALGORITHM_HYPERPARAMETER)

    def compute_steps(self, previous_steps):
        if not hasattr(self, 'threshold'):
            # Keep the (steps, updates) return contract of StepRule so
            # that CompositeRule can unpack the result.
            return previous_steps, []
        norm = l2_norm(previous_steps.values())
        multiplier = tensor.switch(norm < self.threshold,
                                   1, self.threshold / norm)
        steps = OrderedDict(
            (parameter, step * multiplier)
            for parameter, step in previous_steps.items())
        return steps, []
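
The rescaling uses one joint L2 norm over all parameters, so the direction of the overall step is preserved. A plain-Python sketch, where scalar steps stand in for tensors, the threshold is assumed to be positive, and the function name `clip_steps` is hypothetical:

```python
import math


def clip_steps(steps, threshold=None):
    """Rescale all steps jointly if their L2 norm exceeds threshold."""
    if threshold is None:
        return dict(steps)
    norm = math.sqrt(sum(s ** 2 for s in steps.values()))
    multiplier = 1.0 if norm < threshold else threshold / norm
    return {name: s * multiplier for name, s in steps.items()}


clipped = clip_steps({"w": 3.0, "b": 4.0}, threshold=1.0)
```

Here the joint norm is 5, so both steps are multiplied by 0.2 and the clipped step has norm 1.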
626
627
628
class VariableClipping(StepRule):
629
    """Clip the maximum norm of individual variables along certain axes.
630
631
    This :class:`StepRule` can be used to implement L2 norm constraints on
632
    e.g. the weight vectors of individual hidden units, convolutional
633
    filters or entire weight tensors. Combine with :class:`Restrict`
634
    (and possibly :class:`CompositeRule`), to apply such constraints only
635
    to certain variables and/or apply different norm constraints to
636
    different variables.
637
638
    Parameters
639
    ----------
640
    threshold : float
641
        Maximum norm for a given (portion of a) tensor.
642
    axis : int or iterable, optional
643
        An integer single axis, or an iterable collection of integer
644
        axes over which to sum in order to calculate the L2 norm. If
645
        `None` (the default), the norm is computed over all elements
646
        of the tensor.
647
648
    Notes
649
    -----
650
    Because of the way the :class:`StepRule` API works, this particular
651
    rule implements norm clipping of the value *after* update in the
652
    following way: it computes ``parameter - previous_step``, scales it
653
    to have (possibly axes-wise) norm(s) of at most `threshold`,
654
    then subtracts *that* value from `parameter` to yield an 'equivalent
655
    step' that respects the desired norm constraints. This procedure
656
    implicitly assumes one is doing simple (stochastic) gradient descent,
657
    and so steps computed by this step rule may not make sense for use
658
    in other contexts.
659
660
    Investigations into max-norm regularization date from [Srebro2005]_.
661
    The first appearance of this technique as a regularization method
662
    for the weight vectors of individual hidden units in feed-forward
663
    neural networks may be [Hinton2012]_.
664
665
    .. [Srebro2005] Nathan Srebro and Adi Shraibman.
666
       "Rank, Trace-Norm and Max-Norm". *18th Annual Conference
667
       on Learning Theory (COLT)*, June 2005.
668
669
    .. [Hinton2012] Geoffrey E. Hinton, Nitish Srivastava,
670
       Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov.
671
       "Improving neural networks by preventing co-adaptation of
672
       feature detectors". arXiv:1207.0580.
673
674
    """
    def __init__(self, threshold, axis=None):
        axis = pack(axis) if axis is not None else ()
        self.axis = set(axis)
        self.threshold = shared_floatx(threshold, "threshold")
        add_role(self.threshold, ALGORITHM_HYPERPARAMETER)
        if len(axis) != len(self.axis):
            raise ValueError("axis must be unique")

    def compute_step(self, parameter, previous_step):
        if any(ax >= previous_step.ndim for ax in self.axis):
            raise ValueError("Invalid axis {} for {}, ndim={}".format(
                self.axis, parameter, previous_step.ndim))
        if len(self.axis) == 0:
            norms = l2_norm([parameter - previous_step])
        else:
            squares = tensor.sqr(parameter - previous_step)
            norms = tensor.sqrt(
                reduce(lambda t, a: t.sum(axis=a, keepdims=True),
                       sorted(self.axis), squares))
        # We want a step s* that is the same as scaling
        # (parameter - previous_step) by threshold / norm
        # when threshold < norm.
        shrinking_step = (parameter -
                          (self.threshold / norms) *
                          (parameter - previous_step))
        return tensor.switch(norms > self.threshold,
                             shrinking_step,
                             previous_step), ()
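
The 'equivalent step' computed above may be easier to follow on plain numbers. The following is a minimal sketch in pure Python, not the Theano implementation: the helper name `clipped_step` is hypothetical, and it covers only the default case where the norm is taken over all elements.

```python
import math


def clipped_step(parameter, previous_step, threshold):
    """Hypothetical helper: return a step s such that
    parameter - s has L2 norm of at most threshold."""
    # Value the parameter would take after the unclipped update.
    updated = [p - s for p, s in zip(parameter, previous_step)]
    norm = math.sqrt(sum(u * u for u in updated))
    if norm <= threshold:
        # Constraint already satisfied: keep the step unchanged.
        return list(previous_step)
    # Scale the updated value down to the threshold, then express the
    # result as a step relative to the current parameter.
    scale = threshold / norm
    return [p - scale * u for p, u in zip(parameter, updated)]


parameter = [3.0, 4.0]
previous_step = [-3.0, -4.0]  # unclipped update would give [6.0, 8.0]
step = clipped_step(parameter, previous_step, threshold=5.0)
new_parameter = [p - s for p, s in zip(parameter, step)]
new_norm = math.sqrt(sum(v * v for v in new_parameter))
```

Here the unclipped update would have norm 10, so the rescaled target has norm equal to the threshold and the returned step is the difference between the current parameter and that target.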


class AdaGrad(StepRule):
    """Implements the AdaGrad learning rule.

    Parameters
    ----------
    learning_rate : float, optional
        Step size.
        Default value is set to 0.002.
    epsilon : float, optional
        Stabilizing constant for one over root of sum of squares.
        Defaults to 1e-6.

    Notes
    -----
    For more information, see [ADAGRAD]_.

    .. [ADAGRAD] Duchi J, Hazan E, Singer Y.,
       *Adaptive subgradient methods for online learning and
       stochastic optimization*,
       http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

    """
    def __init__(self, learning_rate=0.002, epsilon=1e-6):
        self.learning_rate = shared_floatx(learning_rate, "learning_rate")
        self.epsilon = shared_floatx(epsilon, "epsilon")
        add_role(self.learning_rate, ALGORITHM_HYPERPARAMETER)
        add_role(self.epsilon, ALGORITHM_HYPERPARAMETER)

    def compute_step(self, parameter, previous_step):
        name = 'adagrad_sqs'
        if parameter.name:
            name += '_' + parameter.name
        ssq = _create_algorithm_buffer_for(parameter, name=name)

        ssq_t = (tensor.sqr(previous_step) + ssq)
        step = (self.learning_rate * previous_step /
                (tensor.sqrt(ssq_t) + self.epsilon))

        updates = [(ssq, ssq_t)]

        return step, updates
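
To illustrate how the accumulated sum of squares shrinks successive steps, here is a small pure-Python sketch of the same arithmetic. The function `adagrad_step` and its list-based state are assumptions made for illustration; the class above keeps the accumulator in a Theano shared buffer instead.

```python
import math


def adagrad_step(gradient, ssq, learning_rate=0.002, epsilon=1e-6):
    """Hypothetical helper mirroring the arithmetic of compute_step."""
    # Accumulate the squared gradient element-wise.
    new_ssq = [g * g + s for g, s in zip(gradient, ssq)]
    # Divide each component by the root of its accumulated squares.
    step = [learning_rate * g / (math.sqrt(s) + epsilon)
            for g, s in zip(gradient, new_ssq)]
    return step, new_ssq


ssq = [0.0, 0.0]
step1, ssq = adagrad_step([1.0, -2.0], ssq)
step2, ssq = adagrad_step([1.0, -2.0], ssq)
```

With the same gradient applied twice, the second step is smaller in magnitude than the first because the accumulator has grown.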


class Adam(StepRule):
    """Adam optimizer as described in [King2014]_.

    .. [King2014] Diederik Kingma, Jimmy Ba,
       *Adam: A Method for Stochastic Optimization*,
       http://arxiv.org/abs/1412.6980

    Parameters
    ----------
    learning_rate : float, optional
        Step size.
        Default value is set to 0.002.
    beta1 : float, optional
        Exponential decay rate for the first moment estimates.
        Default value is set to 0.1.
    beta2 : float, optional
        Exponential decay rate for the second moment estimates.
        Default value is set to 0.001.
    epsilon : float, optional
        Stabilizing constant added to the denominator of the step.
        Default value is set to 1e-8.
    decay_factor : float, optional
        Decay applied to ``1 - beta1`` at each time step.
        Default value is set to 1 - 1e-8.

    """
    def __init__(self, learning_rate=0.002,
                 beta1=0.1, beta2=0.001, epsilon=1e-8,
                 decay_factor=(1 - 1e-8)):
        self.learning_rate = shared_floatx(learning_rate, "learning_rate")
        self.beta1 = shared_floatx(beta1, "beta1")
        self.beta2 = shared_floatx(beta2, "beta2")
        self.epsilon = shared_floatx(epsilon, "epsilon")
        self.decay_factor = shared_floatx(decay_factor, "decay_factor")
        for param in [self.learning_rate, self.beta1, self.beta2, self.epsilon,
                      self.decay_factor]:
            add_role(param, ALGORITHM_HYPERPARAMETER)

    def compute_step(self, parameter, previous_step):
        mean = _create_algorithm_buffer_for(parameter, 'mean')
        variance = _create_algorithm_buffer_for(parameter, 'variance')
        time = shared_floatx(0., 'time')
        add_role(time, ALGORITHM_BUFFER)

        t1 = time + 1
        learning_rate = (self.learning_rate *
                         tensor.sqrt((1. - (1. - self.beta2)**t1)) /
                         (1. - (1. - self.beta1)**t1))
        beta_1t = 1 - (1 - self.beta1) * self.decay_factor ** (t1 - 1)
        mean_t = beta_1t * previous_step + (1. - beta_1t) * mean
        variance_t = (self.beta2 * tensor.sqr(previous_step) +
                      (1. - self.beta2) * variance)
        step = (learning_rate * mean_t /
                (tensor.sqrt(variance_t) + self.epsilon))

        updates = [(mean, mean_t),
                   (variance, variance_t),
                   (time, t1)]

        return step, updates
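
Note that in this code `beta1` and `beta2` weight the *new* gradient term, while `1 - beta1` and `1 - beta2` weight the running estimates. A scalar sketch of the same update, assuming a hypothetical helper `adam_step` that passes state explicitly instead of using shared buffers, makes the bias-corrected first step visible:

```python
import math


def adam_step(gradient, mean, variance, time, learning_rate=0.002,
              beta1=0.1, beta2=0.001, epsilon=1e-8,
              decay_factor=1 - 1e-8):
    """Hypothetical scalar version of the update in compute_step."""
    t1 = time + 1
    # Bias-corrected learning rate for time step t1.
    lr_t = (learning_rate * math.sqrt(1. - (1. - beta2) ** t1) /
            (1. - (1. - beta1) ** t1))
    # Weight of the new gradient in the first moment estimate.
    beta_1t = 1 - (1 - beta1) * decay_factor ** (t1 - 1)
    mean_t = beta_1t * gradient + (1. - beta_1t) * mean
    variance_t = beta2 * gradient ** 2 + (1. - beta2) * variance
    step = lr_t * mean_t / (math.sqrt(variance_t) + epsilon)
    return step, mean_t, variance_t, t1


step, mean, variance, time = adam_step(0.5, 0.0, 0.0, 0)
```

Starting from zero buffers, the bias correction makes the first step approximately `learning_rate` times the sign of the gradient, regardless of the gradient's magnitude.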


class RemoveNotFinite(StepRule):
    """A step rule that skips steps with non-finite elements.

    Replaces a step (the parameter update of a single shared variable)
    which contains non-finite elements (such as ``inf`` or ``NaN``) with a
    step rescaling the parameters.

    Parameters
    ----------
    scaler : float, optional
        The scaling applied to the parameter in case the step contains
        non-finite elements. Defaults to 1, which means that parameters
        will not be changed.

    Notes
    -----
    This rule should be applied last!

    This trick was originally used in the GroundHog_ framework.

    .. _GroundHog: https://github.com/lisa-groundhog/GroundHog

    """
    def __init__(self, scaler=1):
        self.scaler = scaler

    def compute_step(self, parameter, previous_step):
        step_sum = tensor.sum(previous_step)
        not_finite = (tensor.isnan(step_sum) +
                      tensor.isinf(step_sum))
        step = tensor.switch(
            not_finite > 0, (1 - self.scaler) * parameter, previous_step)
        return step, []
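
Because a step is subtracted from the parameter, replacing it with `(1 - scaler) * parameter` leaves the parameter at `scaler * parameter`. A minimal pure-Python sketch of that behaviour follows; `remove_not_finite` is a hypothetical stand-in for the Theano expression above.

```python
import math


def remove_not_finite(parameter, previous_step, scaler=1.0):
    """Hypothetical list-based version of compute_step."""
    # A single NaN or inf element makes the sum non-finite.
    step_sum = sum(previous_step)
    if math.isnan(step_sum) or math.isinf(step_sum):
        # Replace the step so that parameter - step == scaler * parameter.
        return [(1 - scaler) * p for p in parameter]
    return list(previous_step)


parameter = [1.0, 2.0]
bad_step = [0.1, float("nan")]
kept = remove_not_finite(parameter, bad_step)           # scaler = 1
shrunk = remove_not_finite(parameter, bad_step, 0.1)    # scaler = 0.1
after_shrink = [p - s for p, s in zip(parameter, shrunk)]
```

As in the original, the check is made on the sum of the step, so a sum of large finite values that overflows is also treated as non-finite.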


class Restrict(StepRule):
    """Applies a given :class:`StepRule` only to certain variables.

    Example applications include clipping steps on only certain parameters,
    or scaling a certain kind of parameter's updates (e.g. adding an
    additional scalar multiplier to the steps taken on convolutional
    filters).

    Parameters
    ----------
    step_rule : :class:`StepRule`
        The :class:`StepRule` to be applied on the given variables.
    variables : iterable
        A collection of Theano variables on which to apply `step_rule`.
        Variables not appearing in this collection will not have
        `step_rule` applied to them.

    """
    def __init__(self, step_rule, variables):
        self.step_rule = step_rule
        self.variables = frozenset(variables)

    def compute_steps(self, previous_steps):
        filtered_previous_steps = dict_subset(previous_steps, self.variables)
        steps, updates = self.step_rule.compute_steps(filtered_previous_steps)
        actual = OrderedDict((parameter, steps[parameter])
                             if parameter in steps
                             else (parameter, previous_steps[parameter])
                             for parameter in previous_steps)
        return actual, updates
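
The filter-then-merge pattern in `compute_steps` can be sketched without Theano. In the example below, `restrict` and `halve` are hypothetical names, string keys stand in for Theano variables, and the wrapped rule is reduced to a plain function from a dictionary of steps to a dictionary of steps (the `updates` return value is omitted).

```python
from collections import OrderedDict


def restrict(rule, variables, previous_steps):
    """Hypothetical helper: apply rule only to the selected keys."""
    variables = frozenset(variables)
    # Keep only the steps for the selected variables.
    filtered = OrderedDict((k, v) for k, v in previous_steps.items()
                           if k in variables)
    steps = rule(filtered)
    # Merge: transformed steps where available, original steps otherwise,
    # preserving the order of previous_steps.
    return OrderedDict((k, steps[k] if k in steps else previous_steps[k])
                       for k in previous_steps)


def halve(steps):
    """Toy rule that scales every step by 0.5."""
    return OrderedDict((k, 0.5 * v) for k, v in steps.items())


previous_steps = OrderedDict([("W", 2.0), ("b", 4.0)])
result = restrict(halve, ["W"], previous_steps)
```

Only the step for `"W"` passes through the toy rule; the step for `"b"` is returned unchanged and the original ordering is kept.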
873