In the newer code (Java and JavaScript) it is perhaps easier to see that the output of each layer of the network is simply multiplied element-wise with a parameter vector, where each element of the parameter vector is selected by a binary decision. The fast Walsh Hadamard transform (WHT), a fixed pattern of additions and subtractions, is then applied.
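As a concrete illustration, here is a minimal Java sketch of one such layer. The decision rule used here (switching on the sign of each input element) and all the names are assumptions for illustration, not necessarily the repository's actual code.

    // Hypothetical sketch of one SwitchNet-style layer: a switched
    // element-wise multiply followed by the fast Walsh Hadamard transform.
    final class SwitchLayerSketch {
        final float[] positive; // parameters used when the decision is "true"
        final float[] negative; // parameters used when the decision is "false"

        SwitchLayerSketch(float[] positive, float[] negative) {
            this.positive = positive;
            this.negative = negative;
        }

        // Assumed decision rule: switch on the sign of each input element.
        void forward(float[] x) {
            for (int i = 0; i < x.length; i++) {
                x[i] *= (x[i] >= 0f) ? positive[i] : negative[i];
            }
            wht(x); // the patterns of addition and subtraction
        }

        // In-place fast Walsh Hadamard transform; x.length must be a power of 2.
        static void wht(float[] x) {
            for (int h = 1; h < x.length; h <<= 1) {
                for (int i = 0; i < x.length; i += h << 1) {
                    for (int j = i; j < i + h; j++) {
                        float a = x[j];
                        float b = x[j + h];
                        x[j] = a + b;
                        x[j + h] = a - b;
                    }
                }
            }
        }
    }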
There is an outline of a statistical argument that the WHT applied to this element-wise product can approximate any linear mapping.
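To make the composition explicit (the notation here is introduced only for illustration: H is the Hadamard matrix, p the selected parameter vector, and ∘ element-wise multiplication), a layer with its binary decisions held fixed computes

    y = WHT(p ∘ x) = H · diag(p) · x

so each fixed pattern of decisions yields one particular linear map, H · diag(p).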
Since the parameters are chosen by binary decisions, there is a very large number of potential linear mappings. For a SwitchNet neural network whose internal layer width is 1024, there is one binary choice per element, giving 2^1024 potential mappings.
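For scale, this count is easy to compute exactly (a small illustrative snippet, not part of the original code); 2^1024 is roughly 1.8 × 10^308.

    import java.math.BigInteger;

    // Illustrative only: count the parameter patterns of a width-1024 layer.
    public class MappingCount {
        public static void main(String[] args) {
            BigInteger count = BigInteger.valueOf(2).pow(1024);
            System.out.println(count);             // a 309-digit integer
            System.out.println(count.bitLength()); // 1025 bits
        }
    }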