Understanding lip movements and inferring speech from them is notoriously difficult for the average person. Accurate lip-reading is aided by various cues from the speaker and from the contextual or environmental setting. Every speaker has a distinct accent and speaking style, which can be inferred from their visual and speech features. This work aims to learn the correlation/mapping between speech and the corresponding sequence of lip movements of individual speakers in an unconstrained, large-vocabulary setting. We model the frame sequence as a prior to a transformer in an auto-encoder setting and learn a joint embedding that exploits the temporal properties of both audio and video through synchronization learning based on deep metric learning, which guides the decoder to generate speech in sync with the input lip movements. The predictive posterior then yields generated speech in the speaker's own speaking style. We train our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation from lip movements in unconstrained natural settings. Extensive evaluation using quantitative and qualitative metrics as well as human evaluation shows that our method outperforms the state of the art on the Lip2Wav Chemistry dataset (large vocabulary in an unconstrained setting) by a good margin across almost all evaluation metrics, and marginally outperforms the state of the art on the GRID dataset.
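To make the synchronization-learning idea concrete, the following is a minimal sketch (not the authors' released code) of a joint audio-video embedding trained with a contrastive, metric-learning style loss that pulls temporally aligned audio/video pairs together and pushes misaligned pairs apart. All module names, feature dimensions, and the margin value are illustrative assumptions.

```python
# Minimal sketch of synchronization learning via deep metric learning.
# Dimensions, architecture, and margin are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncEmbedder(nn.Module):
    """Projects per-frame video features and per-window audio features
    into a shared embedding space."""
    def __init__(self, video_dim=512, audio_dim=80, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Sequential(nn.Linear(video_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))

    def forward(self, video_feats, audio_feats):
        # video_feats: (batch, T, video_dim); audio_feats: (batch, T, audio_dim)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        return v, a

def sync_loss(v, a, margin=0.5):
    """Contrastive loss: aligned pairs are attracted, time-shifted pairs repelled."""
    # Positive pairs: audio and video at the same time step.
    pos = 1.0 - F.cosine_similarity(v, a, dim=-1)        # (batch, T)
    # Negative pairs: audio rolled by a random temporal offset.
    shift = torch.randint(1, v.size(1), (1,)).item()
    a_neg = torch.roll(a, shifts=shift, dims=1)
    neg = F.cosine_similarity(v, a_neg, dim=-1)          # (batch, T)
    return pos.mean() + F.relu(neg - margin).mean()

# Example usage with random tensors standing in for extracted features.
model = SyncEmbedder()
video = torch.randn(4, 30, 512)   # 4 clips, 30 frames, 512-d visual features
audio = torch.randn(4, 30, 80)    # matching 80-d mel-spectrogram windows
v, a = model(video, audio)
loss = sync_loss(v, a)
loss.backward()
```

In this sketch, the resulting embedding would serve as the conditioning signal that keeps the decoder's generated speech in sync with the input lip movements; the actual decoder and prior modeling are omitted.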