Neural Systematic Binder
Gautam Singh¹ Yeongbin Kim² Sungjin Ahn²
¹Rutgers University ²KAIST
[arXiv] [datasets] [openreview] [code]
ICLR 2023
The key to high-level cognition is believed to be the ability to systematically manipulate and compose knowledge pieces. While token-like structured knowledge representations are naturally provided in text, it is elusive how to obtain them for unstructured modalities such as scene images. In this paper, we propose a neural mechanism called Neural Systematic Binder or SysBinder for constructing a novel structured representation called Block-Slot Representation. In Block-Slot Representation, object-centric representations known as slots are constructed by composing a set of independent factor modules called blocks, to facilitate systematic generalization. SysBinder obtains this structure in an unsupervised way by alternatingly applying two different binding principles: spatial binding for spatial modularity across the full scene and factor binding for factor modularity within an object. SysBinder is a simple, deterministic, and general-purpose layer that can be applied as a drop-in module in any arbitrary neural network and on any modality. In experiments, we find that SysBinder provides significantly better factor disentanglement within the slots than the conventional object-centric methods, including, for the first time, in visually complex scene images such as CLEVR-Tex. Furthermore, we demonstrate factor-level systematicity in controlled scene generation by decoding unseen factor combinations.
Figure 1: Overview. Left: We propose a novel binding mechanism, Neural Systematic Binder, that represents an object as a slot constructed by concatenating multi-dimensional factor representations called blocks. Without any supervision, each block learns to represent a specific factor of the object such as color, shape, or position. Right: Neural Systematic Binder works by combining two binding principles: spatial binding and factor binding. In spatial binding, the slots undergo a competition for each input feature followed by iterative refinement, similar to Slot Attention. In factor binding, unlike Slot Attention, for each slot, the bottom-up information from the attended input features is split and routed to M independent block refinement pathways. Importantly, each pathway provides a representation bottleneck by performing dot-product attention on a memory of learned prototypes.
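To make the two binding principles concrete, the following is a minimal PyTorch sketch of one SysBinder layer as described in the caption above. It paraphrases the figure rather than the released code: the tensor shapes, the per-block GRU refinement, the scaling constants, and all module names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SysBinderSketch(nn.Module):
    """Illustrative block-slot binder; names and details are assumptions."""
    def __init__(self, num_slots=4, num_blocks=8, block_dim=64,
                 num_prototypes=64, num_iters=3):
        super().__init__()
        self.M, self.d = num_blocks, block_dim
        slot_dim = num_blocks * block_dim
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, slot_dim))
        self.to_q = nn.Linear(slot_dim, slot_dim, bias=False)
        self.to_k = nn.Linear(slot_dim, slot_dim, bias=False)
        self.to_v = nn.Linear(slot_dim, slot_dim, bias=False)
        # Factor binding: one independent refinement pathway per block.
        self.block_rnns = nn.ModuleList(
            [nn.GRUCell(block_dim, block_dim) for _ in range(num_blocks)])
        # Per-block memory of learned prototypes (the representation bottleneck).
        self.prototypes = nn.Parameter(
            torch.randn(num_blocks, num_prototypes, block_dim))
        self.num_iters = num_iters

    def forward(self, inputs):  # inputs: (B, num_features, slot_dim)
        B, num_slots = inputs.shape[0], self.slots_init.shape[1]
        slots = self.slots_init.expand(B, -1, -1)
        k, v = self.to_k(inputs), self.to_v(inputs)
        scale = (self.M * self.d) ** -0.5
        for _ in range(self.num_iters):
            # Spatial binding: slots compete for each input feature
            # (softmax over the slot axis, as in Slot Attention).
            attn = F.softmax(self.to_q(slots) @ k.transpose(1, 2) * scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
            updates = attn @ v  # (B, num_slots, slot_dim)
            # Factor binding: split the bottom-up update into M blocks and
            # refine each block along its own pathway.
            upd = updates.view(B, num_slots, self.M, self.d)
            cur = slots.reshape(B, num_slots, self.M, self.d)
            new_blocks = []
            for m in range(self.M):
                h = self.block_rnns[m](upd[:, :, m].reshape(-1, self.d),
                                       cur[:, :, m].reshape(-1, self.d))
                # Bottleneck: dot-product attention over this block's prototypes.
                w = F.softmax(h @ self.prototypes[m].t() * self.d ** -0.5, dim=-1)
                new_blocks.append((w @ self.prototypes[m]).view(B, num_slots, self.d))
            slots = torch.stack(new_blocks, dim=2).reshape(B, num_slots, -1)
        return slots  # block-slots: (B, num_slots, M * block_dim)
```

The softmax over the slot axis implements the spatial competition, while the per-block attention over a fixed set of prototypes forces each pathway to commit to a small number of factor values.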
Results
Slot Learning and Intra-Slot Disentanglement
Key Takeaways
Significantly better disentanglement of object properties within each slot compared to the baselines.
Shows property disentanglement without VAE-based modeling or auxiliary losses that encourage disentanglement.
Multi-dimensional blocks provide a more flexible representation of a factor than the single-dimensional representations of conventional methods.
Property disentanglement working for the first time in complex textured images such as CLEVR-Tex.
Property disentanglement working despite using only a few blocks per slot, making the slots more interpretable than those of previous methods.
Towards Interpretable Slots
Figure 2: Visualization of the within-slot feature importance matrix with respect to the true factors of object variation. We visualize the feature importance matrices obtained from a gradient-boosting classifier trained to predict various object properties from the slots. The rows correspond to object properties such as shape, color, or position. For our model, the columns correspond to blocks. For the baselines, the columns correspond to the individual dimensions of the slots, since in these models the property-level decomposition is expected to occur at the level of individual dimensions. We note that, with regard to object properties, the block-slots produced by our model are more interpretable than the slots produced by the baselines.
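For readers who want to reproduce this analysis, a minimal sketch follows. It assumes the slots have already been extracted and matched to ground-truth objects; the array names, the use of scikit-learn's GradientBoostingClassifier, and the within-block aggregation are our assumptions, not the authors' exact protocol.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def block_importance(slots, labels, num_blocks):
    """slots: (num_objects, M * d) array; labels: discrete property ids."""
    clf = GradientBoostingClassifier().fit(slots, labels)
    per_dim = clf.feature_importances_  # one weight per slot dimension
    # For block-slots, aggregate dimension-level importances within each block;
    # for the baselines, report the per-dimension weights directly.
    return per_dim.reshape(num_blocks, -1).sum(axis=1)

# One row of the Figure 2 matrix per property, e.g. (hypothetical labels):
# matrix = np.stack([block_importance(slots, y, M)
#                    for y in (shape_ids, color_ids, position_bins)])
```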
Emergence of Abstract Concepts in Blocks
Figure 3: Visualization of object clusters obtained by applying K-means to specific blocks in CLEVR-Tex, CLEVR-Easy, and CLEVR-Hard. Each block learns to specialize in a specific object property, e.g., shape or color, abstracting away the remaining properties.
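The clustering itself is a one-liner once the block-slots are extracted; the sketch below, with illustrative names, assumes they are stacked into a matrix with one row per discovered object.

```python
from sklearn.cluster import KMeans

def cluster_by_block(block_slots, block_idx, block_dim, n_clusters=8):
    """block_slots: (num_objects, M * block_dim) array of extracted slots."""
    block = block_slots[:, block_idx * block_dim:(block_idx + 1) * block_dim]
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(block)
```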
Object Property Swapping
Figure 4: Visualization of Property-level Scene Manipulation in CLEVR-Easy, CLEVR-Hard, and CLEVR-Tex. We manipulate a given scene by swapping a specific property of two of its objects: for each scene, we choose two objects and swap one of their properties, namely shape, color, position, material, or size. White arrows on the input images point to the two objects in each scene whose properties are swapped.
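Because a block-slot is just a concatenation of blocks, the swap reduces to exchanging one block between two slots and re-decoding. A minimal sketch, assuming the slots for one scene are stored as a (num_slots, M * block_dim) tensor; the function and index names are placeholders.

```python
def swap_block(slots, i, j, block_idx, block_dim):
    """Exchange block `block_idx` between slots `i` and `j` (torch.Tensor in/out)."""
    edited = slots.clone()
    lo, hi = block_idx * block_dim, (block_idx + 1) * block_dim
    edited[i, lo:hi], edited[j, lo:hi] = slots[j, lo:hi], slots[i, lo:hi]
    return edited  # decode with the trained decoder to render the edited scene
```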
Property-level Novel Scene Composition
CLEVR-Hard
Figure 5: Compositional Scene Generation in CLEVR-Hard. We are given 8 input images from which we extract block-slots. From these extracted block-slots, we compose new objects by recombining their properties in novel ways. By decoding the composed slots, we generate novel scenes.
CLEVR-Easy
Figure 6: Compositional Scene Generation in CLEVR-Easy. We are given 8 input images from which we extract block-slots. From these extracted block-slots, we compose new objects by recombining their properties in novel ways. By decoding the composed slots, we generate novel scenes.
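Composition generalizes the swap above: a novel slot is assembled by drawing each block from a possibly different source object and decoding the result. A minimal sketch under the same assumptions, with all names illustrative:

```python
import torch

def compose_slot(source_slots, choices, block_dim):
    """source_slots: list of (M * block_dim,) slots taken from different images;
    choices[m]: index of the source slot that contributes block m."""
    blocks = [source_slots[src][m * block_dim:(m + 1) * block_dim]
              for m, src in enumerate(choices)]
    return torch.cat(blocks)  # a block-slot with an unseen factor combination
```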