I hardly ever think about the base window system in my rasterization application. Now that I want to investigate measuring the fps of a full-screen application, I have to dig into the windowing.
I use Qt to create an application window. Qt probably calls the Windows window manager (DWM 1.2) API, which composes my window with the desktop and other app windows into a framebuffer, and then calls the Windows DXGI base window system API to blit that framebuffer to the GPU. The main idea at every level is that you copy from one framebuffer to the back buffer of another. It's no wonder there is so much latency in today's non-VR games, with 3 levels of framebuffers just to present a finished frame.
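A minimal sketch of the first level of that chain, assuming a Qt 5/6 setup with QOpenGLWindow available: each frame drawn here goes through Qt's buffer swap, then DWM composition, then the DXGI present described above.

```cpp
#include <QGuiApplication>
#include <QOpenGLWindow>
#include <QOpenGLFunctions>

class FrameWindow : public QOpenGLWindow, protected QOpenGLFunctions
{
protected:
    void initializeGL() override { initializeOpenGLFunctions(); }
    void paintGL() override
    {
        // Trivial frame: clear the back buffer; Qt swaps it when paintGL returns.
        glClearColor(0.1f, 0.1f, 0.1f, 1.0f);
        glClear(GL_COLOR_BUFFER_BIT);
        update();               // request the next frame immediately
    }
};

int main(int argc, char **argv)
{
    QGuiApplication app(argc, argv);
    FrameWindow window;
    window.showFullScreen();    // full-screen, as in the scenario above
    return app.exec();
}
```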
I was trying to figure out how to measure fps. Tracking the times of framebuffer swaps at any one of the 3 levels of framebuffers is sufficient to find the average frame time over a second. Common fps counters like Fraps track the swap times between the first 2 levels. (1) However, this may not be a good indicator of performance. In non-vsync mode, the front buffer may be swapped out before a full refresh period has elapsed while the next frame stays displayed for more than 1 full refresh period. In vsync mode, the second frame is displayed only after the first frame has been fully displayed, but it then stays on screen for more than 1 monitor refresh.

Graphics driver teams, and I'm sure game engine programmers, are trying to optimize against this exact case. I can speculate the changes could include having the final glDraw* call of a frame submit the accumulated gl calls to the command buffer even if there are fewer than the ideal number of gl* calls, so the current frame can be finished asap instead of waiting on part of the next frame's gl* calls. "Issuing a command to the internal rendering command buffer can be a fairly slow process, due to a CPU transition (on x86 hardware) out of protected mode and into unprotected mode. This transition eats up a lot of cycles, so if the internal driver can store 30 rendering commands and then issue all of them with only one transition, this is faster than making one transition for each of the 30 rendering calls." (2) In other words, the cost of the user-mode to kernel-mode transition is amortized across more gl* commands.
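For the measurement itself, a Fraps-style counter at the first 2 levels boils down to timestamping every buffer swap on the CPU side and averaging over a second. A minimal sketch, where onSwap() stands in for whatever present call the application actually makes (SwapBuffers, IDXGISwapChain::Present, etc.):

```cpp
#include <chrono>
#include <cstdio>

class FpsCounter
{
public:
    // Call once per buffer swap; prints fps and average frame time each second.
    void onSwap()
    {
        using namespace std::chrono;
        const auto now = steady_clock::now();
        ++frames_;
        if (now - windowStart_ >= seconds(1)) {
            const double ms =
                duration_cast<duration<double, std::milli>>(now - windowStart_).count()
                / frames_;
            std::printf("%d fps, %.2f ms average frame time\n", frames_, ms);
            frames_ = 0;
            windowStart_ = now;
        }
    }

private:
    int frames_ = 0;
    std::chrono::steady_clock::time_point windowStart_ =
        std::chrono::steady_clock::now();
};
```

Note that this only averages CPU-side swap times per second, which is exactly why it can miss the uneven on-screen frame durations described above.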
So the best measure of performance is tracking the times of the hardware framebuffer swap at the end of the GPU pipeline, where the content sent to the display is read from. If the GPU architecture can expose this information from its memory, then the software team can come up with heuristics that adjust the number of frames queued and other settings for maximum performance. Currently, to stay compatible with all graphics cards, FCAT instead analyzes the frame data sent to the display to determine how many GPU frames land in each display frame and the time until the next new frame.
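A rough sketch of that analysis step, under the assumption that each rendered frame has already been tagged with a distinct overlay color bar and that a capture tool has extracted one color ID per scanline of each display frame (the data layout here is hypothetical). Counting color transitions tells you how many GPU frames landed in each display frame:

```cpp
#include <cstdio>
#include <vector>

// One captured display frame: the overlay color ID seen on each scanline.
using DisplayFrame = std::vector<int>;

void analyzeCapture(const std::vector<DisplayFrame> &capture)
{
    int lastColor = -1;
    for (std::size_t i = 0; i < capture.size(); ++i) {
        int newFrames = 0;
        for (int color : capture[i]) {
            if (color != lastColor) {   // a new GPU frame starts on this scanline
                ++newFrames;
                lastColor = color;
            }
        }
        std::printf("display frame %zu: %d new gpu frame(s)\n", i, newFrames);
    }
}

int main()
{
    // Tiny synthetic capture: the first display frame shows one GPU frame,
    // the second shows a tear where a new GPU frame begins mid-refresh.
    std::vector<DisplayFrame> capture = {
        {7, 7, 7, 7},
        {7, 7, 8, 8},
    };
    analyzeCapture(capture);
    return 0;
}
```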
(1) http://www.anandtech.com/show/6862/fcat-the-evolution-of-frame-interval-benchmarking-part-1
(2) https://www.opengl.org/wiki/Synchronization