In the first place there is a problem with the notion of the visual neuron as a feature detector that operates by way of a hard-wired receptive field of excitatory and inhibitory synapses anchored to the tissue of the brain. For this concept is no different from a template theory, the limitations of which are well known. A template is a spatial map of the pattern to be matched, which is inherently intolerant to any variation in the stimulus pattern. For example a mismatch will be recorded if the pattern is presented at a different location, orientation, or spatial scale than that encoded in the template.
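The rigidity of the template concept can be made concrete with a minimal sketch. In the toy example below (the image size, the pattern, and the overlap score are illustrative assumptions of my own, not anything drawn from the literature), a template is anchored at one position and its match score collapses under a translation of a single pixel.

```python
# A minimal sketch of rigid template matching on toy binary images.
import numpy as np

def make_square(size=16, top=4, left=4, side=8):
    """Binary image containing the outline of a square."""
    img = np.zeros((size, size))
    img[top, left:left+side] = 1           # top edge
    img[top+side-1, left:left+side] = 1    # bottom edge
    img[top:top+side, left] = 1            # left edge
    img[top:top+side, left+side-1] = 1     # right edge
    return img

template = make_square()  # the "receptive field": a map anchored at one position

# Match score: pointwise overlap between template and stimulus.
same = np.sum(template * make_square())                  # identical stimulus
shifted = np.sum(template * make_square(top=5, left=5))  # one-pixel shift

print(same, shifted)  # -> 28.0 2.0: the match collapses under a one-pixel shift
```

Normalized correlation or any other match score suffers the same fate; the failure lies in the fixed spatial anchoring, not in the particular score.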
The solution to the problem of invariance commonly proposed in neural modeling is a feature based approach, i.e. to break the pattern into its component features, and detect those local features independently of the whole (Selfridge 1959, Marr 1982, Biederman 1987). Very simple features such as oriented edges, bars, or corners are sufficiently elemental that it would not be prohibitive to provide templates for them at every location and orientation across the visual field. In the purest form of this concept, the spatial match represented by the single global template is replaced by an enumerative match that tallies the number and type of local features present in some region of the visual field, and matches this list against the list of features characteristic of the global form. For example a square might be defined by the presence of four corners, each of which might be detected by a local corner detector applied at every location throughout a local region of the image. The enumerative listing of four corner features would be the same for squares of different rotations, translations, and scales, and therefore the feature list as a representation is invariant to rotation, translation, and scale.
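The enumerative scheme is easily sketched. In the toy example below (the L-junction corner rule and the raster image are assumptions of my own, not a mechanism proposed in the cited works), a square is recognized purely from the tally of detected corner features, a tally that is the same wherever the square happens to sit.

```python
# A toy sketch of enumerative feature matching.
import numpy as np

def detect_corners(img):
    """Tally pixels having exactly one horizontal and one vertical set
    neighbor: an L-junction, i.e. a corner of an axis-aligned outline."""
    count = 0
    for y in range(1, img.shape[0] - 1):
        for x in range(1, img.shape[1] - 1):
            if img[y, x]:
                horiz = img[y, x-1] + img[y, x+1]
                vert = img[y-1, x] + img[y+1, x]
                if horiz == 1 and vert == 1:
                    count += 1
    return count

img = np.zeros((16, 16))
img[4, 4:12] = img[11, 4:12] = 1   # top and bottom edges
img[4:12, 4] = img[4:12, 11] = 1   # left and right edges

# The feature list {corners: 4} is the same at any translation or scale of
# the square -- but four unrelated corners scattered across the region would
# produce the same tally.
print(detect_corners(img))  # -> 4
```

The invariance of the tally is thus bought at the cost of discarding all configural information, which is the objection developed in the next paragraph.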
Despite the current popularity of the feature detector concept in neural network models, the fundamental limitations of this approach to perception were pointed out decades ago by Gestalt theory. In the first place, local features cannot be reliably identified in the absence of the global context. For example a corner detector in computer simulations will typically generate countless corner responses in a natural scene, only a small fraction of which would be identified as legitimate corner features in the global context. Another problem with the feature based approach is that in the tally of detected features, it is impossible to determine reliably which features belong to which objects. Whatever local region is selected for the tally of detected features might just as well include features from several different objects to confound the feature list, and conversely, the object centered on that region will often extend beyond the region, and thereby lose critical features from its feature list. A pure feature based system would also be easily misled by spatial occlusions, which occur commonly in visual scenes, but appear to pose no serious problem to human visual recognition.
Hybrid solutions have also been proposed in which the object template is defined as a pattern of regions, each of which represents an approximate locus for a particular feature (Selfridge 1959, Biederman 1987). For example a square might be defined as four circular regions around a center, each of which defines the possible range of a corner feature at that point, which would be searched out by corner detectors applied through a range of orientations throughout each of those regions. The positional and orientational tolerance afforded by this scheme allows a multitude of different variations of a square to stimulate the same square template. While the object template thus remains somewhat sensitive to rotation and scale, the tolerance allowed in its component features permits a smaller number of object templates than would be required in a simple template model to recognize all possible variations of the square.
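The hybrid scheme, too, can be sketched in a few lines. In the following toy construction (the region geometry, radius, and angular tolerance are arbitrary assumptions of my own), a square template is defined as four tolerance regions, each searched for a corner of roughly the expected orientation; the same tolerance that admits distorted squares also admits configurations that are not squares at all, as argued in the next paragraph.

```python
# A toy sketch of the hybrid scheme: an object template made of tolerance
# regions, each searched for a corner feature of roughly the right orientation.
import math

# Template for a square centered at the origin: four regions, each expecting
# a corner oriented toward the center (orientation in degrees, my convention).
SQUARE_TEMPLATE = [
    ((-5, -5), 45), ((5, -5), 135), ((5, 5), 225), ((-5, 5), 315)
]
RADIUS, ANGLE_TOL = 3.0, 30.0

def matches_square(features):
    """features: list of (x, y, orientation) corner detections.
    The template fires if every region contains some roughly-correct corner."""
    for (cx, cy), expected in SQUARE_TEMPLATE:
        hit = any(math.hypot(x - cx, y - cy) <= RADIUS
                  and abs((o - expected + 180) % 360 - 180) <= ANGLE_TOL
                  for x, y, o in features)
        if not hit:
            return False
    return True

# Tolerance cuts both ways: a slightly distorted square is accepted, but so
# is any set of four mutually inconsistent corners, one per region.
print(matches_square([(-4, -5, 40), (5, -4, 130), (6, 5, 220), (-5, 6, 320)]))  # True
```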
However the hybrid scheme is also fundamentally flawed, because the feature-based detection of the individual component features suffers the same problems as a pure feature-based scheme, being easily confused by extraneous features in each region of interest, while the template-like detection of the global configuration of those local features suffers all the problems inherent in a template-based scheme, with hard limits to the range of variability of each component feature. A more fundamental problem with this concept is that the range of legitimate locations and orientations for any component feature cannot be defined in the abstract, but only relative to the other features actually present. For example if one corner of a square is detected, the exact location and orientation of that corner constrains the permissible locations of the other three corners much more precisely than would be encoded in the object template. Therefore there are many possible configurations of corners that would register to the hybrid model as a square, only a small fraction of which would correspond to legitimate squares.
The feature based approach to visual recognition can be implemented relatively easily in computer algorithms (Ballard & Brown 1982, Marr 1982). However despite decades of the most intensive research, no algorithm has ever been devised that can perform reliably except in the most controlled visual environments. The problem with both the feature based model and the hybrid model is that they confuse invariance to stimulus variation with a blindness to those variations. For the hybrid square detector that responds to a square knows little about the exact configuration of the corners of that particular square, and the enumerative feature detector knows even less. This is in contrast to our subjective experience, in which the region of the visual field that is recognized as belonging to a square is perceived at as high a resolution as the edges of the square itself, even when those edges are not actually present in the stimulus, like the illusory sides of a Kanizsa figure. Furthermore, we can easily indicate where an occluded or missing corner of a square or triangle ought to be located, based on the configuration of the rest of the figure.
The problems inherent in template and feature based detection apply not only to invariance in the perception of simple objects and their component features, but to the whole concept of a featural hierarchy, extending up to higher order complex objects or concepts. For the principle of invariance implies a many-to-one relation between the many possible stimulus variations that all indicate the one recognized object. What is required is a kind of top-down completion that makes use of the higher level recognition of the object to determine what its expected component parts should be. But this feedback is complicated by the many-to-one relation in the bottom-up direction, because a simplistic top-down feedback from the invariant recognition node would involve a one-to-many relation to activate every possible combination of local feature nodes that can ever trigger that invariant node. If on the other hand the top-down feedback is directed only to feature nodes which have actually detected some feature, this would preclude the perceptual filling-in of features absent from the stimulus, and thereby defeat the whole purpose of the feedback.
Although the idea of visual processing as a feed-forward progression through a hierarchical architecture represents the most direct or simplistic form of the neuron doctrine, there has been a growing awareness of the need for some kind of complementary top-down processing function, both on perceptual grounds, to account for expectation and perceptual completion as seen in the Kanizsa figure, and on neurophysiological grounds, to account for the reciprocal feedback pathways that run from higher to lower cortical areas. Several theorists have proposed neural network models, with greater or lesser degrees of computational specificity, that incorporate some kind of feedback function (Fukushima 1987, Carpenter & Grossberg 1987, Grossberg & Mingolla 1985, 1987, Damasio 1989). Unfortunately these models have been persistently handicapped by the template-like concept of the neural receptive field inherent in the neuron doctrine, which makes it impossible for them to provide an adequate account of the joint properties of invariance in recognition and specificity in completion phenomena.
Perhaps the most explicit model of neural feedback is seen in the Adaptive Resonance Theory (ART, Carpenter & Grossberg 1987). The principal focus of ART is on the manner in which a neural network model detects novelty in a stream of input patterns, and uses that information to categorize the input patterns on the basis of novelty. The significant property of this model in the present context is not in the details of its learning mechanism, but in the manner in which bottom-up information is mixed with top-down information stored in the learned synaptic weights, as a model of cognitive expectation or perceptual completion. The key feature of the ART model is that the pattern recognition nodes in what is called the F2 layer are equipped not only with bottom-up receptive fields for pattern recognition, but also with projective fields that propagate top-down back to the input, or F1, layer, and the pattern of synaptic weights in these projective fields generally matches the bottom-up weights used for recognition. If, after learning is complete, a partial or incomplete pattern is presented at the input, that pattern will stimulate the activation of the single F2 node whose synaptic weights best match the input pattern. Top-down feedback from that F2 node will in turn impose its pattern back on the F1 layer, filling in or completing even the missing portion of the pattern, in a manner that is suggestive of perceptual completion of missing or occluded portions of a recognized object. The fact that the F2 nodes encode whole categories of similar patterns, rather than exact single patterns, embodies a kind of invariance in the model to the variations between patterns of the same category.
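The recognition-and-completion loop described above can be caricatured in a few lines. The sketch below is my own drastic simplification: a hard winner-take-all stands in for ART's competitive dynamics, and learning, vigilance, and reset are omitted entirely; it keeps only the two properties at issue, bottom-up selection of the best-matching F2 node and top-down imposition of its stored pattern on F1.

```python
# A drastically simplified sketch of ART-style bottom-up recognition with
# top-down completion (learning, vigilance, and reset omitted).
import numpy as np

# Learned F2 weights after training: each row is one category's stored
# pattern; the top-down projective fields are taken to equal these weights.
F2_weights = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],   # category A
    [0, 0, 0, 0, 1, 1, 1, 1],   # category B
])

def recognize_and_complete(f1_input):
    # Bottom-up: the F2 node whose weights best match the input wins.
    winner = int(np.argmax(F2_weights @ f1_input))
    # Top-down: the winner's projective field is imposed back on F1,
    # filling in the portions of the pattern missing from the stimulus.
    completed = np.maximum(f1_input, F2_weights[winner])
    return winner, completed

partial = np.array([1, 1, 0, 0, 0, 0, 0, 0])  # half of category A's pattern
print(recognize_and_complete(partial))
# winner 0; completed pattern [1 1 1 1 0 0 0 0] -- but always the stored
# template at its fixed location, which is the limitation developed below.
```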
The invariance embodied in the principle of adaptive resonance differs fundamentally from the invariance observed in perception, because the synaptic weights of the F2 node, after learning several patterns, encode only a single pattern at a fixed location in the model, and that pattern is a kind of average, or blurring together, of all of the patterns that belong to that category. In other words the system would behave much the same if all of the patterns of a particular category were first averaged together and then learned as a single pattern, rather than presented in sequence as variations on a central theme. This imposes a severe restriction on the kind of variation that can be tolerated within a category, for it requires a significant overlap between patterns within a particular category; otherwise the average of the patterns in that category would produce only a featureless blur. As a model of learning and categorization this is not necessarily a fatal problem, as long as the features represented by the F1 nodes are presumed to be already invariant to stimulus variation, i.e. that they encode significant and stable characteristics of the stimulus pattern, so that significantly similar patterns would be expected to have considerable overlap in their F1 feature representation. However the principle of adaptive resonance is inadequate as a general model of top-down feedback for perceptual completion across an invariance relation, because the feedback in this model can only complete a single variant of the recognized pattern in a rigid template-like manner, and that pattern is no more than a blurred-together average of all of the patterns of that particular category.
Consider by contrast, the property of spatial invariance in visual recognition. A spatial pattern, for example the shape of the letter E, has very little overlap, point for point, with variations of that pattern at different orientations. And yet those rotated patterns are not perceived as approximate or imperfect letter E's with diminished recognition confidence, but each one is perceived as a perfect E shape, although it is also perceived to be rotated by some angle. If on the other hand the pattern is truly incomplete, like the shape of the letter F considered as an incomplete E shape, this does indeed register perceptually as a partial or imperfect match to the shape of the letter E. Furthermore, identification of the F shape as an incomplete E immediately highlights the exact missing segment, i.e. that segment is perceived to be missing from a very specific portion of the figure, and the exact location of that missing segment varies with the location, orientation, and scale of the F stimulus. This is a very different and more powerful kind of invariance and completion than that embodied in the ART model. And yet it is exactly the kind of invariance to stimulus variation that would be required in the F1 node representation to make the ART model at all viable as a model of recognition.
The problem can be traced to the central principle of representation in the model, which is a spatial template that is anchored to the tissue of the brain in the form of a fixed receptive field. This mechanism is therefore hard-wired to recognize only patterns that appear at exactly the same physical location as that template in the brain. The problem of invariance in the ART model becomes abundantly clear when attempting to apply its principle of invariance to the spatial variations of rotation, translation, and scale. Learning rotation invariance in the ART model for a pattern like E would be equivalent to learning the single pattern constructed by the superposition of Es at all orientations simultaneously, which creates nothing but a circular blur. And the model after training would respond more strongly to this circular blur than to any actual letter E. Adding translation and scale invariance to the system would involve learning the superposition of every rotation, translation and scale of the learned pattern across the visual field, which would produce nothing but a uniform blur.
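The blurring argument can be checked numerically. In the sketch below (my own construction; the raster size, the E shape, and the 10-degree sampling of orientations are arbitrary choices), an E-shaped pattern is superimposed at 36 orientations to form the putative rotation-invariant template, and the resulting weights are confirmed to respond more strongly to their own circular blur than to an actual letter E.

```python
# A numerical check of the blurring argument.
import numpy as np

def make_E(size=32):
    """Rasterize a crude letter E as a binary image."""
    img = np.zeros((size, size))
    img[6, 8:24] = img[15, 8:20] = img[25, 8:24] = 1  # three horizontal bars
    img[6:26, 8] = 1                                  # vertical spine
    return img

def rotate(img, deg):
    """Nearest-neighbor rotation about the image center."""
    size = img.shape[0]
    c = (size - 1) / 2.0
    th = np.deg2rad(deg)
    ys, xs = np.mgrid[0:size, 0:size]
    # Inverse-rotate each output pixel back into the source image.
    sx = np.round(c + (xs - c) * np.cos(th) + (ys - c) * np.sin(th)).astype(int)
    sy = np.round(c - (xs - c) * np.sin(th) + (ys - c) * np.cos(th)).astype(int)
    ok = (sx >= 0) & (sx < size) & (sy >= 0) & (sy < size)
    out = np.zeros_like(img)
    out[ok] = img[sy[ok], sx[ok]]
    return out

# The "rotation-invariant template": the E superimposed at 36 orientations.
blur = np.mean([rotate(make_E(), a) for a in range(0, 360, 10)], axis=0)

def response(weights, stim):
    """Dot-product response of the learned weights to a normalized stimulus."""
    return (weights * stim).sum() / np.linalg.norm(stim)

# The trained weights respond more strongly to their own circular blur than
# to an actual letter E, as the text argues.
print(response(blur, make_E()), response(blur, blur))
```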
The invariance embodied in the ART model to variations in the patterns within a particular category is not really an invariance, but is more of a blindness to those variations, because when detecting a pattern in the input, the F2 recognition mechanism cannot determine which of the allowable variations of the pattern is actually present on the input. What is required to account for invariance in perception is a system that can detect the characteristic pattern of the input despite stimulus variations, and yet have the capacity to complete a partial pattern with respect to the specific variation of the pattern present on the input field; in other words, invariance in recognition, but specificity in completion. The fact that this functionality is in principle beyond the capacity of the neural receptive field was already recognized by Lashley (1942), and was a central theme of Gestalt theory.
The subjective conscious experience exhibits a unitary and integrated nature that seems fundamentally at odds with the fragmented architecture identified neurophysiologically, an issue which has come to be known as the binding problem. For the objects of perception appear to us not as an assembly of independent features, as might be suggested by a feature based representation, but as an integrated whole, with every component feature appearing in experience in the proper spatial relation to every other feature. This binding occurs across the visual modalities of color, motion, form, and stereoscopic depth, and a similar integration also occurs across the perceptual modalities of vision, hearing, and touch. The question is what kind of neurophysiological explanation could possibly offer a satisfactory account of the phenomenon of binding in perception.
One solution is to propose explicit binding connections, i.e. neurons connected across visual or sensory modalities, whose state of activation encodes the fact that the areas that they connect are currently bound in subjective experience. However this solution merely compounds the problem, for it represents two distinct entities as bound together by adding a third distinct entity. It is a declarative solution, i.e. the binding between elements is supposedly achieved by attaching a label to them that declares that those elements are now bound, instead of actually binding them in some meaningful way.
Von der Malsburg proposes that perceptual binding between cortical neurons is signaled by way of synchronous spiking, the temporal correlation hypothesis (von der Malsburg & Schneider 1986). This concept has found considerable neurophysiological support (Eckhorn et al. 1988, Engel et al. 1990, 1991a, 1991b, Gray et al. 1989, 1990, 1992, Gray & Singer 1989, Stryker 1989). However although these findings are suggestive of some significant computational function in the brain, the temporal correlation hypothesis as proposed is little different from the binding label solution, the only difference being that the label is defined by a new channel of communication, i.e. by way of synchrony. In information theoretic terms, this is no different from saying that connected neurons possess two separate channels of communication, one to transmit feature detection, and the other to transmit binding information. The fact that one of these channels uses a synchrony code instead of a rate code sheds no light on the essence of the binding problem. Furthermore, as Shadlen & Movshon (1999) observe, the temporal binding hypothesis is not a theory about how binding is computed, but only about how binding is signaled, a solution that leaves the most difficult aspect of the problem unresolved.
I propose that the only meaningful solution to the binding problem must involve a real binding, as implied by the metaphorical name. A glue that is supposed to bind two objects together would be most unsatisfactory if it merely labeled the objects as bound. The significant function of glue is to ensure that a force applied to one of the bound objects will automatically act on the other one also, so that the bound objects move together through the world even when one or both of them are acted on by external forces. In the context of visual perception, this suggests that the cortical maps that represent perceptual information must be coupled to one another with bi-directional functional connections, in such a way that perceptual relations detected in one map, due to one visual modality, will have an immediate effect on the other maps that encode other visual modalities. The one-directional axonal transmission inherent in the concept of the neuron doctrine appears inconsistent with the immediate bi-directional relation required for perceptual binding. Even the feedback pathways between cortical areas are problematic for this function, due to the time delay inherent in the concept of spike train integration across the chemical synapse, which would seem to limit reciprocal coupling to cortical areas separated by only a small number of synaptic connections. Those same delays would seem to preclude the kind of integration apparent in the binding of perception and consciousness across all sensory modalities, a binding which suggests that the entire cortex is functionally coupled to act as a single integrated unit.