Artificial Neural Networks with Random Weights are Baseline Models

Running untrained networks with random weights allows us to understand such models before training.

Where do the impressive performance gains of deep neural networks come from? Is their power due to the learning rules which adjust the connection weights or is it simply a function of the network architecture (i.e., many layers)? These two properties of networks are hard to disentangle. One way to tease apart the contributions of network architecture versus those of the learning regimen is to consider networks with randomised weights. To the extent that random networks show interesting behaviors, we can infer that the learning rule has not played a role in them. At the same time, examining these random networks allows us to evaluate what learning does add to the network’s abilities over and above minimising some loss function.

Figure 2A from Guest and Love (2017): "For the artificial neural network coding schemes, similarity to the prototype falls off with increasing distortion (i.e., noise). The models, numbered 1–11, are (1) vector space coding, (2) gain control coding, (3) matrix multiplication coding, (4), perceptron coding, (5) 2-layer network, (6) 3-layer network, (7) 4-layer network, (8) 5-layer network, (9) 6-layer network (10) 7-layer network, and (11), 8-layer network. The darker a model is, the simpler the model is and the more the model preserves similarity structure under fMRI."

In "What the Success of Brain Imaging Implies about the Neural Code", we examined an artificial deep neural network, Inception-v3 GoogLeNet. This trained deep network preserves the similarity structure of the input space and is thus functionally smooth. Importantly, however, we found that functional smoothness in this deep network breaks down at later layers. Is this because of the depth of the network (its many layers) or because of the specific learning regimen? We sought to explain why this happens by using a baseline: a model with random weights.

To answer this question, let us consider some much simpler plausible contenders for the neural code: a rudimentary set of models built from the components of artificial neural networks, namely matrix multiplication and some kind of squashing function (sigmoid, step, etc.; in our case, the hyperbolic tangent).

The first basic model, matrix multiplication, is how neural networks propagate activation from one layer to the next via the weights $w$. For simplicity, our toy network contains layers $m$ and $n$, which both contain three units. Thus to calculate the states for $n$, we take the matrix product of the previous layer $m$ and the weights $w$:

$$\begin{bmatrix} n_1 & n_2 & n_3 \end{bmatrix} = \begin{bmatrix} m_1 & m_2 & m_3 \end{bmatrix} \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix}$$

where the $m_i$ represent the units in layer $m$, $w_{ij}$ represents a weight in $w$ from unit $i$ in layer $m$ to unit $j$ in $n$, and $n_j$ is a unit in $n$. For example, $w_{31}$ is the weight on the connection between the third unit of the shallower/earlier layer and the first unit of the deeper/later layer (others use other notations).

Matrix multiplication calculates the states of a layer, which is easily done in Python using NumPy, specifically numpy.dot():

import numpy as np
m = np.asarray([0.1, 0.2, 1.3]) # layer m with some dummy input
w = np.random.randn(3, 3) # random weights from m to n
n = np.dot(m, w) # pre-synaptic states in n
print(n)
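
As a sanity check on the notation, the sketch below spells out the sum that the matrix product performs, with each unit in n receiving the weighted sum of all units in m, and confirms that it matches numpy.dot(). The seeded generator and the name n_explicit are assumptions added for illustration; they are not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
m = np.asarray([0.1, 0.2, 1.3])  # layer m, same dummy input as above
w = rng.standard_normal((3, 3))  # random weights from m to n

# Explicit form: unit j of n is the sum over i of m[i] * w[i, j],
# where w[i, j] is the weight from unit i of m to unit j of n.
n_explicit = np.zeros(3)
for j in range(3):
    for i in range(3):
        n_explicit[j] += m[i] * w[i, j]

n_dot = np.dot(m, w)  # the same computation as a matrix product
print(np.allclose(n_explicit, n_dot))
```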

To apply a squashing function to n above, we may use numpy.tanh():

n = np.tanh(n) # post-synaptic states in n
print(n)

Non-linear transformations like the hyperbolic tangent allow the network to have non-linear decision boundaries, e.g., between classes, making it able to capture the statistics of the training set.
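
One way to see why the squashing function matters is to note that stacking purely linear layers gains nothing: by associativity of matrix multiplication, two weight matrices applied in sequence are equivalent to a single one. The following is a minimal sketch of that point; the variable names and the seeded generator are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only so the sketch is repeatable
x = rng.standard_normal(3)  # some dummy input
w1 = rng.standard_normal((3, 3))  # random weights, first layer
w2 = rng.standard_normal((3, 3))  # random weights, second layer

# Two layers without squashing collapse into one linear map:
linear_two = np.dot(np.dot(x, w1), w2)
linear_one = np.dot(x, np.dot(w1, w2))
collapses = np.allclose(linear_two, linear_one)

# With tanh after each layer, the result is no longer that linear map:
nonlinear_two = np.tanh(np.dot(np.tanh(np.dot(x, w1)), w2))
differs = not np.allclose(nonlinear_two, linear_two)

print(collapses, differs)
```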

In Guest and Love (2017) we presented the above as two separate models as well as a combined model. Here I have cut to the part where they are combined to form a traditional two-layer network (also known as the perceptron model). As you might have guessed, from two layers we can generalise to many by continuing to take the matrix product of the output (n in the code above) with some new weights, and so on.
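
The generalisation to many layers can be sketched as a small helper that repeats the two steps above (matrix product, then tanh) once per weight matrix and keeps every layer's states for later inspection. The function name forward and the choice of four layers are hypothetical, chosen only for this illustration.

```python
import numpy as np


def forward(pattern, weights):
    """Propagate one pattern through every layer; return each layer's states."""
    states = []
    n = pattern
    for w in weights:
        n = np.tanh(np.dot(n, w))  # matrix product, then squashing
        states.append(n)
    return states


rng = np.random.default_rng(2)  # seeded only so the sketch is repeatable
units, layers = 3, 4  # hypothetical sizes for this toy example
weights = rng.standard_normal((layers, units, units))  # one matrix per layer

states = forward(np.asarray([0.1, 0.2, 1.3]), weights)
for depth, n in enumerate(states, start=1):
    print(depth, n)
```

Returning the states of every layer, rather than only the output, is a design choice that makes it easy to compare representations at different depths.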

Running an untrained neural network with random weights allows us to compare more complex (i.e., trained) models with their untrained selves. We can thus pick apart which aspects of the model are inherent to the architecture itself and which emerge as a function of training. Networks that have random weights can be given the same training and test sets, although importantly no training has happened yet, and we can examine their internal states and outputs. This can serve as a guide to understanding what the network "knows" a priori.

As we noted in Guest and Love (2017), networks naturally place items that are similar/proximal in the input space close together in their internal representational space. This is why artificial neural networks are a plausible candidate for the neural code: they give rise to functionally smooth representations. The simple network above can be made deeper and deeper, and we can inspect every layer in it for smoothness for every pattern. Extending the above, we can do just that, and run the network on two very simple categories:

import numpy as np

prototypes = np.random.randn(2, 100)  # two toy categories
members = 10  # how many items per category
patterns = []

for p in prototypes:
    for i in range(members):
        # for each item, create a pattern whose noise grows with the item's
        # index: the first item in a category has no noise, the next has
        # 0.01 SD of noise, then 0.02 SD, and so on.
        patterns.append(p + 0.01 * i * np.random.randn(len(p)))

layers = 20  # how many layers we want, i.e., how deep the network is
# random weights:
w = np.random.randn(layers, len(prototypes[0]), len(prototypes[0])) * 0.1

for pat in patterns:
    # for each pattern
    for i in range(layers):
        if i == 0:
            # if we are at the input layer, then set units to pattern
            n = pat
        # propagate through each layer
        n = np.dot(n, w[i])  # pre-synaptic states in n
        n = np.tanh(n)  # post-synaptic states in n
        if i == layers - 1:
            # print the layer, the first five features of the pattern applied
            # at input and the first five activations in the last layer
            print(i, pat[0:5], n[0:5])

Even just by eyeballing the output in the terminal using the code above, we can see that similar items (items within the same category) do indeed map to similar outputs, i.e., the network is functionally smooth without any training. We used a more complex version of the above to demonstrate this principle in Guest and Love (2017), where we calculated the correlations between the representations in the input space and in each layer. However, as we move deeper into the network, we see that functional smoothness breaks down and the network gives, for all intents and purposes, identical outputs for each item within a category, thus losing all structure within it. Looking just at the output, we cannot predict which input generated it, only which category it came from.
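
Rather than eyeballing the printed activations, smoothness can be quantified per layer. The sketch below is one plausible way to do so and is not necessarily the analysis used in Guest and Love (2017): it computes the pairwise Pearson correlations between patterns in the input space, does the same at every layer, and then correlates the two sets of similarities. The helper similarity_vector, the seeded generator, and the name smoothness are all assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded only so the sketch is repeatable
units, members, layers = 100, 10, 20

# Two toy categories; noise grows with the item index, as in the code above.
prototypes = rng.standard_normal((2, units))
patterns = np.asarray([p + 0.01 * i * rng.standard_normal(units)
                       for p in prototypes for i in range(members)])

w = rng.standard_normal((layers, units, units)) * 0.1  # random weights


def similarity_vector(states):
    """Pairwise Pearson correlations between patterns (upper triangle only)."""
    sim = np.corrcoef(states)  # each row of `states` is one pattern
    upper = np.triu_indices(len(states), k=1)
    return sim[upper]


input_sim = similarity_vector(patterns)

n = patterns  # propagate all patterns at once, one row per pattern
smoothness = []
for i in range(layers):
    n = np.tanh(np.dot(n, w[i]))
    layer_sim = similarity_vector(n)
    # how well does this layer preserve the input similarity structure?
    smoothness.append(np.corrcoef(input_sim, layer_sim)[0, 1])

for depth, r in enumerate(smoothness, start=1):
    print(depth, round(float(r), 3))
```

A value near 1 at a given depth would indicate that the layer preserves the similarity structure of the input; how the values change across depths is what the printed list lets you inspect.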

Using this result we can infer that the property of Inception-v3 GoogLeNet, and indeed any similar deep network, which causes it to both display (at early layers) and gradually lose functional smoothness (at deeper layers), is due to the nature of the architecture and not the learning rule. Because this property is present in simple untrained networks, it cannot be a byproduct of training.

Importantly, randomising weights can be done to any network with any topology, including to Inception-v3 GoogLeNet itself, to recurrent networks, and so on. We hope this idea proves to be a useful exercise to others too, as many connectionist and deep network accounts would benefit from an understanding of the inherent properties of the topological configuration versus the fully-trained model.