Aperture is, essentially, a round function from which one could build a synchronous stream cipher and hash function.  It was not submitted for SHA-3 and probably will not be submitted for anything ever.  I am not a professional cryptographer or a mathematician; you should assume this is a terribly insecure design.  I'm posting it here purely to satisfy anyone's curiosity and for the sheer expressive thrill of it.

The main attraction is the optimized code for Aperture.  It uses SSE2 intrinsics, so each variable is actually a vector of four 32-bit words; if you're unfamiliar with SSE2, see documentation from Intel or Microsoft.

The design is heavily influenced by Daniel Bernstein's Salsa20 and, to some extent, his ChaCha variant of the function.

Important differences from ChaCha are:

  • It uses a 2048-bit block composed of four 512-bit blocks, each of which is composed of sixteen 32-bit words that can be seen as a 4x4 matrix.  The inter-round permutation is somewhat different from Salsa20's, in order to propagate changes efficiently across the whole 2048-bit block:
    • After even rounds, the four rows of each block are shifted right by 0/1/2/3 words, as in ChaCha.  Unlike in ChaCha, they are never shifted back.
    • After odd rounds, the 2048-bit block is treated as a 4x4 matrix of 128-bit quadwords (each quadword corresponds to one row of a 512-bit block); the four rows in the matrix are right-shifted by 0/1/2/3 quadwords; that lets changes propagate across blocks.
  • The quarter-round function is somewhat different: d = ((d + a) ^ c) >>> constant; (a, b, c, d) = (b, c, d, a).  Four quarter-rounds are applied to each of the sixteen columns in the block in parallel.  (This parallelism may improve efficiency in hardware and on future processors.  SSE2 can apply the column-round transform to four columns at once; Intel's AVX technology, due out in a few years, will allow operation on eight columns at once.)
  • 128 bits of message are injected into the block every two rounds, and if Aperture is being used as a synchronous stream cipher, 128 bits of keystream are extracted from the state as well.
  • A counter is XORed into part of the block every round.
The idea of injecting small amounts of message frequently -- instead of accepting, say, 512 bits of input then running eight rounds of the cipher -- is influenced by Alex Biryukov's idea of leak extraction as a way to create a stream cipher from a block cipher, and, one hopes, improve the level of security per unit of work done.  (Of course, Biryukov notes that the cryptanalyst must carefully choose the right leak, and I may have chosen the wrong one.)  The work done on cryptographic sponges is also relevant, but it's important to note that I don't claim that two rounds of Aperture are a pseudorandom function.
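
As a rough illustration of the quarter-round and the two shift steps listed above, here is a small scalar sketch in Python.  It is not the optimized SSE2 code and it rests on several assumptions of mine: ">>>" is read as a 32-bit right rotation, the rotation amounts in ROTATIONS are placeholders (the constants aren't given above), "shifted right" is taken to mean a cyclic shift toward higher indices, and the 4x4 quadword matrix is laid out so that its row i holds row i of each of the four 512-bit blocks.  All function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF

# Placeholder rotation amounts; Aperture's real constants are not listed here.
ROTATIONS = (7, 9, 13, 18)


def rotr32(x, n):
    """Rotate a 32-bit word right by n bits (assumed meaning of '>>>')."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32


def quarter_round(a, b, c, d, r):
    """d = ((d + a) ^ c) >>> r, then (a, b, c, d) = (b, c, d, a)."""
    d = rotr32(((d + a) & MASK32) ^ c, r)
    return b, c, d, a


def column_round(column, rotations=ROTATIONS):
    """Apply four quarter-rounds to one column of four 32-bit words.

    Each quarter-round updates one word and renames the variables, so
    after four of them every word has been updated once and the words
    are back in their original positions.
    """
    a, b, c, d = column
    for r in rotations:
        a, b, c, d = quarter_round(a, b, c, d, r)
    return [a, b, c, d]


def rotate_right(items, k):
    """Cyclically shift a list right by k positions."""
    k %= len(items)
    return list(items[-k:]) + list(items[:-k]) if k else list(items)


def shift_words(block):
    """Even-round step: rotate row i of a 4x4 word matrix right by i words."""
    return [rotate_right(row, i) for i, row in enumerate(block)]


def shift_quadwords(state):
    """Odd-round step on a list of four blocks, each a list of four rows.

    state[j][i] is row i of block j.  Row i of every block moves i blocks
    to the right (assumed layout), so only row 0 stays where it was.
    """
    return [[state[(j - i) % 4][i] for i in range(4)] for j in range(4)]


if __name__ == "__main__":
    print(column_round([1, 2, 3, 4]))
    print(shift_words([[0, 1, 2, 3], [4, 5, 6, 7],
                       [8, 9, 10, 11], [12, 13, 14, 15]]))
```

In the SSE2 version each of a, b, c and d would be a whole row of four words, so one call to the quarter-round processes four columns at once; the scalar form above handles one column at a time only to keep the data flow visible.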

I only partly defined, and didn't implement, the mode of operation -- the "paperwork" surrounding the round function that would be needed to turn it into a real hash function and stream cipher:
  • A random-looking initial state would need to be provided.  When used as a stream cipher, initialization would consist of hashing the key followed by some all-zero blocks.
  • The message could be padded in the conventional Merkle-Damgård way, perhaps with a configuration bitstring inserted at the end as well.
  • Finalization would consist of hashing some all-zero blocks after the message and padding have been hashed, then using a truncated version of the state as the hash value.
  • There's no fundamental reason an interleaved mode couldn't be created for greater parallelism, though a tree mode would be more complicated. 
Finally, I want to emphasize that: 
  • I make no claims of security; it's just an idea and even I haven't analyzed it that much.  
  • No one named here should be blamed for any of my mistakes.