Here we provide additional details on the robots as well as the framework we use for all our experiments.
The robotic platforms we use are DJI RoboMaster S1 robots. Each robot has an omnidirectional Mecanum-wheel drive and can track fixed-frame position and velocity state references using an onboard control stack. The robots have a radius of approximately 25 cm. We operate within a 3.8 m x 3.8 m subset of a larger 6 m x 4 m motion-capture arena. We use the standard North-East-Down (NED) frame convention (with the Down component always zero for ground robots), which is why some of the text prompts contain cardinal directions (e.g., "north edge").
The actions generated by our trained policies are fixed-frame velocity targets for the robot, commanded at 1 Hz and tracked by the onboard controller at 50 Hz. Since our policies take only a few milliseconds to generate actions, we could run them at substantially higher rates if necessary; however, with discretized actions, we find that an action frequency of 1 Hz is sufficient for the navigation tasks we consider.
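The rate decoupling above can be sketched as two nested loops: the policy refreshes the velocity target at 1 Hz, while the onboard controller tracks the latest target at 50 Hz. This is a minimal illustrative sketch, not the actual control stack; the function names and signatures are hypothetical.

```python
POLICY_HZ = 1.0    # rate at which the policy emits fixed-frame velocity targets
CONTROL_HZ = 50.0  # rate at which the onboard controller tracks them

def run(policy, controller, get_state, steps=5):
    """Run `steps` seconds of the decoupled loop: the policy updates the
    target once per second; the tracker runs 50 times per second."""
    ticks_per_action = int(CONTROL_HZ / POLICY_HZ)
    target = None
    for t in range(steps * ticks_per_action):
        if t % ticks_per_action == 0:
            target = policy(get_state())   # new velocity target at 1 Hz
        controller(target, get_state())    # 50 Hz tracking step
        # in a real loop, sleep to pace iterations at 1 / CONTROL_HZ seconds
```

In practice the two loops run as separate ROS nodes at their own rates; the single-loop form here is only to make the 1:50 ratio of policy calls to controller calls explicit.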
Our control stack, motion-capture setup, and policy execution are all wrapped in a ROS 2 ecosystem, and the same framework is also used for collecting real-world datasets.
We confine the agents to an arena; crossing its boundary is treated as a collision. During training, we approximate the collision radius of each agent as 40 cm. During deployment, actions are sampled from a Boltzmann policy with temperature τ = 0.01, which is nearly greedy with respect to the learned Q-values. All agents i share the same parameters for the policy π, and we update the Q-function using a Polyak soft update.
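The two deployment-time mechanisms above can be sketched compactly. The Boltzmann policy samples actions with probability proportional to exp(Q(s, a)/τ), so at τ = 0.01 it is almost always the argmax; the Polyak update blends target parameters toward the online parameters. The function names and the Polyak coefficient ρ below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def boltzmann_sample(q_values, tau=0.01, rng=None):
    """Sample a discrete action index with P(a) ∝ exp(Q(s, a) / tau)."""
    rng = rng or np.random.default_rng()
    logits = q_values / tau
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

def polyak_update(target_params, online_params, rho=0.995):
    """Soft target update: theta_target <- rho * theta_target + (1 - rho) * theta."""
    return [rho * t + (1.0 - rho) * o
            for t, o in zip(target_params, online_params)]
```

At τ = 0.01, a Q-value gap of even 0.1 between the best and second-best action makes the softmax probability of the runner-up vanishingly small, so the sampled policy behaves near-deterministically while remaining stochastic enough to break exact ties.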