Vision and force play an essential role in contact-rich robotic manipulation tasks. Current methods typically build feedback control around a single modality and underexploit the synergy between the two sensing channels. Coordinating multiple modalities from perception to control poses significant challenges, mainly because vision and force have disparate characteristics and because perception and control are tightly coupled. This paper proposes a multimodal integration mechanism, spanning perception to control, for a precision assembly task. First, a self-supervised encoder extracts multi-view visual features, and historical force signals are aggregated into force features. Next, vision and force are fused by curriculum reinforcement learning, which maps the multimodal features to end-effector motions executed by a unified motion/force/impedance controller. Experiments show that, with this control scheme, a robot can assemble pegs with 0.1 mm clearance in simulation. Furthermore, the system generalizes to varied initial configurations and unseen peg shapes, and it transfers robustly from simulation to reality without fine-tuning.
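The sketch below illustrates, at a purely conceptual level, how multi-view visual features and a window of historical force/torque readings could be fused into one observation for a policy that outputs an end-effector motion command. It is a minimal illustration, not the authors' implementation: the module names (`VisualEncoder`, `FusionPolicy`), dimensions, two-view setup, and the 6-DoF delta-pose action are assumptions made for clarity, and the self-supervised pretraining, curriculum reinforcement learning loop, and the motion/force/impedance controller are not reproduced here.

```python
# Minimal, hypothetical sketch of vision-force fusion for an RL policy.
# All names, dimensions, and interfaces are illustrative assumptions.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Stand-in for a self-supervised image encoder (assumed pretrained/frozen)."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.backbone(img)  # (B, feat_dim)


class FusionPolicy(nn.Module):
    """Concatenates per-view visual features with aggregated force history
    and maps them to a 6-DoF end-effector motion command."""

    def __init__(self, n_views: int = 2, feat_dim: int = 64,
                 force_window: int = 10, force_dim: int = 6):
        super().__init__()
        self.encoder = VisualEncoder(feat_dim)
        obs_dim = n_views * feat_dim + force_window * force_dim
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 6),  # delta pose of the end-effector
        )

    def forward(self, views: torch.Tensor, force_hist: torch.Tensor) -> torch.Tensor:
        # views: (B, n_views, 3, H, W); force_hist: (B, force_window, force_dim)
        b = views.shape[0]
        vis = self.encoder(views.flatten(0, 1)).view(b, -1)  # (B, n_views*feat_dim)
        frc = force_hist.flatten(1)                          # (B, force_window*force_dim)
        return self.mlp(torch.cat([vis, frc], dim=-1))       # motion command


if __name__ == "__main__":
    policy = FusionPolicy()
    views = torch.randn(1, 2, 3, 64, 64)   # two camera views
    force_hist = torch.randn(1, 10, 6)     # last 10 force/torque samples
    delta_pose = policy(views, force_hist)
    print(delta_pose.shape)                # torch.Size([1, 6]); such a command would
                                           # be handed to a downstream compliance controller
```

In a training setup along these lines, the policy head would be optimized with a reinforcement learning objective while the visual encoder could remain frozen, which is one plausible way to realize the perception-to-control fusion described above.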