Our key insight is that robust sim2real transfer can be achieved by learning to extract geometry from stereo images, rather than relying on an active depth sensor. In SimNet, each image of a stereo RGB pair is passed through a feature extractor before approximate stereo matching. The stereo cost volume network (SCVN) outputs a low-resolution disparity image, which is fed together with the left-image features into a ResNet-FPN backbone with output prediction heads. These heads predict high-level vision information such as room-level segmentation, OBBs, keypoints, and full-resolution disparity images.
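The data flow above can be sketched as follows. This is a minimal NumPy illustration of the pipeline's tensor shapes, not the actual implementation: the feature extractor is a hypothetical stand-in, the cost volume uses a simple absolute-difference matching cost over shifted features, and the ResNet-FPN backbone and prediction heads are elided.

```python
import numpy as np

def extract_features(img):
    # Hypothetical stand-in for the shared feature extractor:
    # maps an H x W x 3 image to an (H/4) x (W/4) x C feature map.
    h, w, _ = img.shape
    return np.zeros((h // 4, w // 4, 32))

def stereo_cost_volume(left_feat, right_feat, max_disp=16):
    # Approximate stereo matching: compare left features against
    # horizontally shifted right features to build a cost volume,
    # then take the argmin over candidate disparities.
    h, w, _ = left_feat.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        diff = left_feat[:, d:, :] - right_feat[:, : w - d, :]
        cost[d, :, d:] = np.abs(diff).sum(axis=-1)
    return cost.argmin(axis=0)  # low-resolution disparity map

def simnet_forward(left_img, right_img):
    left_feat = extract_features(left_img)
    right_feat = extract_features(right_img)
    low_res_disp = stereo_cost_volume(left_feat, right_feat)
    # Fuse the SCVN's low-resolution disparity (as an extra channel)
    # with the left-image features; in SimNet this fused tensor would
    # go to the ResNet-FPN backbone and the output prediction heads.
    backbone_in = np.concatenate(
        [left_feat, low_res_disp[..., None]], axis=-1
    )
    return backbone_in

left = np.zeros((64, 96, 3))
right = np.zeros((64, 96, 3))
out = simnet_forward(left, right)
print(out.shape)  # (16, 24, 33): quarter-resolution features + disparity
```

The sketch shows why the SCVN output is cheap to fuse: disparity is just one extra channel at the feature map's resolution, so the backbone consumes geometry and appearance jointly.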