true," "dir"=">"/img/posts/ann_models_correlation"}.png"" > figure 2a from guest and love (2017): "for the artificial neural network coding schemes, similarity to prototype falls off with increasing distortion (i.e., noise). models, numbered 1–11, are (1) vector space coding, (2) gain control (3) matrix multiplication (4), perceptron (5) 2-layer network, (6) 3-layer (7) 4-layer (8) 5-layer (9) 6-layer (10) 7-layer (11), 8-layer network. darker a model is, simpler is more preserves structure under fmri." in what success of brain imaging implies about code, we examined an deep inception-v3 googlenet. this trained input thus functionally smooth. importantly, however, found that functional smoothness breaks down at later layers. because depth many layers, or specific learning regimen? sought explain why happens by using baseline, random weights. answer question, let us consider some much plausible contenders for code — rudimentary set models components networks: kind squashing (sigmoid, step, etc.) function (in our case, hyperbolic tangent). first basic model, multiplication, how networks propagate activation layer next via weights . simplicity, toy contains layers , which both contain three units. calculate states take product previous : where s represent units represents weight unit example, on connection between third shallower earlier deeper (others use other notations). calculates easily done python numpy, specifically numpy.dot(): import numpy as np m="np.asarray([0.1," 0.2, 1.3]) # dummy w="np.random.randn(3," 3) n w) pre-synaptic print(n) apply function, above, may numpy.tanh(): post-synaptic non-linear transformations like tangent allow have decision boundaries, e.g., classes, making it able capturing statistics training (more here here). (2017) presented above two separate well combined i cut part they form traditional two-layer (also known model). you might guessed, can generalise many, continuing output (n above) new weights, so on. 
running untrained allows compare complex models) their selves. pick apart aspects inherent architecture itself emerge training. be given same test sets, although importantly no has happened yet, examine internal outputs serve guide understand “knows” priori. noted (2017), naturally place items close together representational similar proximal space. hence candidate i.e., give rise smooth representations. simple made deeper, inspect every pattern. extending do just that, run very categories: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30import prototypes="np.random.randn(2," 100) categories members="10" per category patterns="[]" p prototypes: i, enumerate(range(members)): each item, create pattern noise number items. item noise, then 0.05 sd 0.1 sd, patterns.append(p + 0.01 * np.random.randn((len(p)))) want, weights: len(prototypes[0]), len(prototypes[0])) pat patterns: l enumerate(range(layers)): if 0: #if layer, through w[i]) layers-1: print five features applied activations last pat[0:5], n[0:5] even eye-balling terminal see indeed (items within category) map outputs, without any used version demonstrate principle correlations representations layer. move into broken gives all intents purposes identical category, losing it. cannot looking output, predict generated it, only category. result infer property googlenet, causes display (at early layers) gradually lose layers), due nature not rule. present networks, byproduct randomising topology, including googlenet itself, recurrent hope idea proves useful exercise others too, connectionist accounts would benefit understanding properties topological configuration versus fully-trained model. ">
true," "dir"=">"/img/posts/ann_models_correlation"}.png"" > figure 2a from guest and love (2017): "for the artificial neural network coding schemes, similarity to prototype falls off with increasing distortion (i.e., noise). models, numbered 1–11, are (1) vector space coding, (2) gain control (3) matrix multiplication (4), perceptron (5) 2-layer network, (6) 3-layer (7) 4-layer (8) 5-layer (9) 6-layer (10) 7-layer (11), 8-layer network. darker a model is, simpler is more preserves structure under fmri." in what success of brain imaging implies about code, we examined an deep inception-v3 googlenet. this trained input thus functionally smooth. importantly, however, found that functional smoothness breaks down at later layers. because depth many layers, or specific learning regimen? sought explain why happens by using baseline, random weights. answer question, let us consider some much plausible contenders for code — rudimentary set models components networks: kind squashing (sigmoid, step, etc.) function (in our case, hyperbolic tangent). first basic model, multiplication, how networks propagate activation layer next via weights . simplicity, toy contains layers , which both contain three units. calculate states take product previous : where s represent units represents weight unit example, on connection between third shallower earlier deeper (others use other notations). calculates easily done python numpy, specifically numpy.dot(): import numpy as np m="np.asarray([0.1," 0.2, 1.3]) # dummy w="np.random.randn(3," 3) n w) pre-synaptic print(n) apply function, above, may numpy.tanh(): post-synaptic non-linear transformations like tangent allow have decision boundaries, e.g., classes, making it able capturing statistics training (more here here). (2017) presented above two separate well combined i cut part they form traditional two-layer (also known model). you might guessed, can generalise many, continuing output (n above) new weights, so on. 
running untrained allows compare complex models) their selves. pick apart aspects inherent architecture itself emerge training. be given same test sets, although importantly no has happened yet, examine internal outputs serve guide understand “knows” priori. noted (2017), naturally place items close together representational similar proximal space. hence candidate i.e., give rise smooth representations. simple made deeper, inspect every pattern. extending do just that, run very categories: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30import prototypes="np.random.randn(2," 100) categories members="10" per category patterns="[]" p prototypes: i, enumerate(range(members)): each item, create pattern noise number items. item noise, then 0.05 sd 0.1 sd, patterns.append(p + 0.01 * np.random.randn((len(p)))) want, weights: len(prototypes[0]), len(prototypes[0])) pat patterns: l enumerate(range(layers)): if 0: #if layer, through w[i]) layers-1: print five features applied activations last pat[0:5], n[0:5] even eye-balling terminal see indeed (items within category) map outputs, without any used version demonstrate principle correlations representations layer. move into broken gives all intents purposes identical category, losing it. cannot looking output, predict generated it, only category. result infer property googlenet, causes display (at early layers) gradually lose layers), due nature not rule. present networks, byproduct randomising topology, including googlenet itself, recurrent hope idea proves useful exercise others too, connectionist accounts would benefit understanding properties topological configuration versus fully-trained model. ">
true," "dir"=">"/img/posts/ann_models_correlation"}.png"" > figure 2a from guest and love (2017): "for the artificial neural network coding schemes, similarity to prototype falls off with increasing distortion (i.e., noise). models, numbered 1–11, are (1) vector space coding, (2) gain control (3) matrix multiplication (4), perceptron (5) 2-layer network, (6) 3-layer (7) 4-layer (8) 5-layer (9) 6-layer (10) 7-layer (11), 8-layer network. darker a model is, simpler is more preserves structure under fmri." in what success of brain imaging implies about code, we examined an deep inception-v3 googlenet. this trained input thus functionally smooth. importantly, however, found that functional smoothness breaks down at later layers. because depth many layers, or specific learning regimen? sought explain why happens by using baseline, random weights. answer question, let us consider some much plausible contenders for code — rudimentary set models components networks: kind squashing (sigmoid, step, etc.) function (in our case, hyperbolic tangent). first basic model, multiplication, how networks propagate activation layer next via weights . simplicity, toy contains layers , which both contain three units. calculate states take product previous : where s represent units represents weight unit example, on connection between third shallower earlier deeper (others use other notations). calculates easily done python numpy, specifically numpy.dot(): import numpy as np m="np.asarray([0.1," 0.2, 1.3]) # dummy w="np.random.randn(3," 3) n w) pre-synaptic print(n) apply function, above, may numpy.tanh(): post-synaptic non-linear transformations like tangent allow have decision boundaries, e.g., classes, making it able capturing statistics training (more here here). (2017) presented above two separate well combined i cut part they form traditional two-layer (also known model). you might guessed, can generalise many, continuing output (n above) new weights, so on. 
running untrained allows compare complex models) their selves. pick apart aspects inherent architecture itself emerge training. be given same test sets, although importantly no has happened yet, examine internal outputs serve guide understand “knows” priori. noted (2017), naturally place items close together representational similar proximal space. hence candidate i.e., give rise smooth representations. simple made deeper, inspect every pattern. extending do just that, run very categories: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30import prototypes="np.random.randn(2," 100) categories members="10" per category patterns="[]" p prototypes: i, enumerate(range(members)): each item, create pattern noise number items. item noise, then 0.05 sd 0.1 sd, patterns.append(p + 0.01 * np.random.randn((len(p)))) want, weights: len(prototypes[0]), len(prototypes[0])) pat patterns: l enumerate(range(layers)): if 0: #if layer, through w[i]) layers-1: print five features applied activations last pat[0:5], n[0:5] even eye-balling terminal see indeed (items within category) map outputs, without any used version demonstrate principle correlations representations layer. move into broken gives all intents purposes identical category, losing it. cannot looking output, predict generated it, only category. result infer googlenet, causes display (at early layers) gradually lose layers), due nature not rule. present networks, byproduct randomising topology, including googlenet itself, recurrent hope idea proves useful exercise others too, connectionist accounts would benefit understanding properties topological configuration versus fully-trained model. ">
neuroplausible: Artificial Neural Networks with Random Weights are Baseline Models
Where do the impressive performance gains of deep neural networks come from?
Is their power due to the learning rules which adjust the connection weights or is it simply a function of the network architecture (i.e., many layers)?
These two properties of networks are hard to disentangle.
One way to tease apart the contributions of network architecture versus those of the learning regimen is to consider networks with randomised weights.
To the extent that random networks show interesting behaviors, we can infer that the learning rule has not played a role in them.
At the same time, examining these random networks allows us to evaluate what learning does add to the network’s abilities over and above minimising some loss function.
Figure 2A from Guest and Love (2017): "For the artificial neural network coding schemes, similarity to the prototype falls off with increasing distortion (i.e., noise). The models, numbered 1–11, are (1) vector space coding, (2) gain control coding, (3) matrix multiplication coding, (4) perceptron coding, (5) 2-layer network, (6) 3-layer network, (7) 4-layer network, (8) 5-layer network, (9) 6-layer network, (10) 7-layer network, and (11) 8-layer network. The darker a model is, the simpler the model is and the more the model preserves similarity structure under fMRI."
In What the Success of Brain Imaging Implies about the Neural Code, we examined an artificial deep neural network, Inception-v3 GoogLeNet.
This deep, trained network preserves the similarity of the input space and thus is functionally smooth.
Importantly, however, we found that functional smoothness in this deep network breaks down at later layers.
Is this because of the depth of the network, the many layers, or the specific learning regimen?
We sought to explain why this happens by using a baseline, a model with random weights.
To answer this question, let us consider some much simpler plausible contenders for the neural code — a rudimentary set of models — the components of artificial neural networks: matrix multiplication and some kind of squashing (sigmoid, step, etc.) function (in our case, the hyperbolic tangent).
The first basic model, matrix multiplication, is how neural networks propagate activation from one layer to the next via the weights w.
For simplicity, our toy network contains layers m and n, which both contain three units.
Thus to calculate the states for n, we take the matrix product of the previous layer m and the weights w, n = mw, i.e., n_j = Σ_i m_i w_ij,
where the m_i represent the units in layer m, w_ij represents a weight in w from unit i in layer m to unit j in layer n, and n_j is a unit in n. For example, w_31 is the weight on the connection between the third unit of the shallower/earlier layer and the first unit of the deeper/later layer (others use other notations).
Matrix multiplication calculates the states of a layer — easily done in Python using NumPy, specifically numpy.dot():
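For instance, a minimal sketch with dummy values for the three units in layer m:

```python
import numpy as np

m = np.asarray([0.1, 0.2, 1.3])  # dummy states for the three units in layer m
w = np.random.randn(3, 3)        # random weights between layers m and n

n = np.dot(m, w)  # pre-synaptic states in layer n
print(n)
```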
To apply a squashing function to n above, we may use numpy.tanh():
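A self-contained sketch, repeating the matrix product from above and then squashing the pre-synaptic states:

```python
import numpy as np

m = np.asarray([0.1, 0.2, 1.3])  # dummy states for layer m
w = np.random.randn(3, 3)        # random weights
n = np.dot(m, w)                 # pre-synaptic states in layer n

n = np.tanh(n)  # post-synaptic states, squashed into (-1, 1)
print(n)
```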
Non-linear transformations like the hyperbolic tangent allow the network to have non-linear decision boundaries, e.g., between classes, making it capable of capturing the statistics of the training set (more here and here).
In Guest and Love (2017) we presented the above as two separate models as well as a combined model; here I have cut to the part where they are combined to form a traditional two-layer network (also known as the perceptron model).
As you might have guessed, from two layers we can generalise to many, by continuing to take the matrix product of the output (n in the code above) with some new weights, and so on.
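For instance, a sketch that chains fresh random weights for each additional layer (the weight arrays here are illustrative, not from the paper):

```python
import numpy as np

# layer 2: matrix product of the input with random weights, then squashing
n = np.tanh(np.dot(np.asarray([0.1, 0.2, 1.3]), np.random.randn(3, 3)))
# layer 3: the same operation on the previous output, with new weights
n = np.tanh(np.dot(n, np.random.randn(3, 3)))
# layer 4, and so on
n = np.tanh(np.dot(n, np.random.randn(3, 3)))
print(n)
```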
Running an untrained neural network with random weights allows us to compare more complex (i.e., trained) models with their untrained selves.
We can thus pick apart what aspects of the model are inherent to the architecture itself and which emerge as a function of training.
Networks that have random weights can be given the same training and test sets, although importantly no training has happened yet, and we can examine their internal states and outputs.
This can serve as a guide to understand what the network “knows” a priori.
As we noted in Guest and Love (2017), networks naturally place items that are similar/proximal in the input space close together in their internal representational space. This is why artificial neural networks are a plausible candidate for the neural code, i.e., they give rise to functionally smooth representations.
The simple network above can be made deeper and deeper, and we can inspect every layer in it for smoothness for every pattern.
Extending the above, we can do just that, and run the network on two very simple categories:
import numpy as np

prototypes = np.random.randn(2, 100)  # two toy categories
members = 10  # how many items per category

patterns = []
for p in prototypes:
    for i in range(members):
        # For each item, create a pattern that has noise as a function of the
        # number of items. The first item in a category has no noise, then
        # 0.01 SD of noise, then 0.02 SD, and so on.
        patterns.append(p + 0.01 * i * np.random.randn(len(p)))

layers = 20  # how many layers we want, i.e., how deep the network is

# random weights:
w = np.random.randn(layers, len(prototypes[0]), len(prototypes[0])) * 0.1

for pat in patterns:  # for each pattern
    for i in range(layers):
        if i == 0:
            # if we are at the input layer, then set the units to the pattern
            n = pat
        # propagate through each layer
        n = np.dot(n, w[i])  # pre-synaptic states in n
        n = np.tanh(n)       # post-synaptic states in n
        if i == layers - 1:
            # print the layer, the first five features of the pattern applied
            # at the input, and the first five activations in the last layer
            print(i, pat[0:5], n[0:5])
Even just by eye-balling the output in the terminal using the code above, we can see that indeed similar items (items within the same category) map to similar outputs, i.e., the network is functionally smooth without any training. We used a more complex version of the above to demonstrate this principle in Guest and Love (2017), where we calculate the correlations between the representations in the input space and in each layer.
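The analysis in the paper is more involved, but the principle can be sketched (the seed, noise levels, and layer sizes here are illustrative assumptions) by correlating the pairwise similarity structure of the input patterns with that of each layer's activations:

```python
import numpy as np

rng = np.random.default_rng(0)
prototypes = rng.standard_normal((2, 100))  # two toy categories
patterns = np.asarray([p + 0.01 * i * rng.standard_normal(len(p))
                       for p in prototypes for i in range(10)])

layers = 20
w = rng.standard_normal((layers, 100, 100)) * 0.1  # random weights

n = patterns
for i in range(layers):
    n = np.tanh(np.dot(n, w[i]))  # propagate all patterns through layer i
    # correlate the pairwise similarity structure of the inputs with the
    # pairwise similarity structure of this layer's activations
    r = np.corrcoef(np.corrcoef(patterns).ravel(), np.corrcoef(n).ravel())[0, 1]
    print(i, round(r, 3))
```

As the printed correlation drops across layers, structure within each category is being washed out.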
However, as we move deeper into the network, we see that functional smoothness has broken down: the network gives, for all intents and purposes, identical outputs for the items within a category, thus losing all structure within it.
We cannot, looking just at the output, predict which input generated it, only which category it came from.
Using this result we can infer that the property of Inception-v3 GoogLeNet, and indeed any similar deep network, which causes it to both display (at early layers) and gradually lose functional smoothness (at deeper layers), is due to the nature of the architecture and not the learning rule.
Because this property is present in simple untrained networks, it cannot be a byproduct of training.
Importantly, randomising weights can be done to any network with any topology, including to Inception-v3 GoogLeNet itself, to recurrent networks, and so on.
We hope this idea proves to be a useful exercise to others too, as many connectionist and deep network accounts would benefit from an understanding of the inherent properties of the topological configuration versus the fully-trained model.