How MSAA works:
Imagine you're rendering at 1920x1080 with 4x MSAA.
That means internally you're rendering at 3840x2160 (4 samples per output pixel) and then downsampling.
That's it. You're done.
OK a bit more info: What I just described is called SSAA (Super Sampling Antialiasing).
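To make the "render big, then downsample" idea concrete, here is a minimal sketch in plain Python. It assumes a grayscale image stored as a list of rows and a simple 2x2 box filter; the name `downsample_2x` and the tiny 4x4 test image are made up for illustration, and real GPUs may use other filters.

```python
def downsample_2x(image):
    """Box-filter a 2W x 2H grid of grayscale values down to W x H.

    Each output pixel is the plain average of one 2x2 block of the input.
    """
    height, width = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(width)
        ]
        for y in range(height)
    ]

# A 4x4 "internal" render: white triangle (1.0) over a black background (0.0).
hi_res = [
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
low_res = downsample_2x(hi_res)  # 2x2 result; edge pixels get in-between greys
```

The pixels that straddle the triangle edge come out as intermediate greys, which is the antialiasing.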
MSAA is just SSAA with a lot of shortcuts: it relies on the fact that only triangle borders need antialiasing, so the GPU detects whether we're at a border.
If we're not, it runs the pixel shader once and broadcasts the result to all 4 subsamples (it actually uses compression: a hidden 2-bit variable indicates which of the 4 subsamples of the 3840x2160 texture are set to the same colour).
When we're at a border, it runs the pixel shader up to 4 times (once per triangle that covers part of the pixel).
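The interior vs. border behaviour can be sketched like this. It is a toy model, not how any particular GPU is implemented: `shade_pixel`, `flat_colour` and the triangle ids are hypothetical, and coverage is just handed in as a list of 4 triangle ids (one per subsample, `None` meaning nothing covers that subsample).

```python
def shade_pixel(coverage, shade):
    """Shade one pixel that has 4 subsamples.

    coverage: list of 4 triangle ids, or None where no triangle covers the subsample.
    shade:    function mapping a triangle id to a colour (stands in for the pixel shader).
    Returns (samples, shader_runs).
    """
    cache = {}     # colour per triangle, so each triangle is shaded once per pixel
    samples = []
    for tri in coverage:
        if tri is None:
            samples.append(None)
            continue
        if tri not in cache:
            cache[tri] = shade(tri)   # the only place the "shader" actually runs
        samples.append(cache[tri])    # broadcast the cached colour to this subsample
    return samples, len(cache)

def flat_colour(triangle_id):
    # Hypothetical "pixel shader": one flat colour per triangle.
    return {3: "red", 7: "blue"}[triangle_id]

# Interior pixel: one triangle covers all 4 subsamples, so one shader run.
interior_samples, interior_runs = shade_pixel([7, 7, 7, 7], flat_colour)
# Border pixel: two triangles and one uncovered subsample, so two shader runs.
border_samples, border_runs = shade_pixel([7, 7, 3, None], flat_colour)
```

With 4 subsamples the worst case is 4 different triangles in one pixel, which is where "up to 4 times" comes from.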
That's also why resource transitions (a Vulkan concept; in D3D11 this happens automatically behind the scenes, but the driver has to do heavy tracking) are needed and matter so much for MSAA. Before you render to an MSAA target, that hidden 2-bit state needs preparation; before you read back what you just rendered, it needs unpacking.
Some GPUs are hit harder than others by wrong Vulkan barriers on MSAA, because some GPUs don't need a separate unpack of that 2-bit variable: they can unpack on the fly when the shader reads.
On other GPUs, it must be unpacked first.
I think I may be wrong about the counter being 2-bit; maybe it's 3-bit. Well, that's an implementation detail. What matters is that there is a hidden mask to speed up rendering of non-borders which prevents copying the same colour 4 times.
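Here is a rough sketch of what such a hidden mask buys you. The encoding below is entirely made up for illustration (a tag plus either one colour or four); actual hardware formats are vendor-specific and not what is shown here. The point is only that a uniform pixel stores one colour, and that a reader which doesn't understand the packed form needs a `decompress` step first.

```python
def compress(samples):
    """Pack 4 subsample colours. Hypothetical format, not a real GPU layout."""
    if all(s == samples[0] for s in samples):
        return ("uniform", samples[0])        # interior pixel: store the colour once
    return ("expanded", tuple(samples))       # border pixel: store all 4 colours

def decompress(packed):
    """Unpack back to 4 explicit subsample colours (the 'unpacking' step)."""
    kind, payload = packed
    if kind == "uniform":
        return [payload] * 4
    return list(payload)

interior = compress(["blue", "blue", "blue", "blue"])
border = compress(["blue", "blue", "red", "red"])
```

A GPU that can read the packed form directly is, in this analogy, calling `decompress` per read; one that can't needs the whole texture unpacked before the shader touches it.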
This is also why some scenes take a much bigger hit from MSAA than others: if you use grass blades (not textured billboards, I mean detailed geometry) you will notice your framerate plummets when looking at the grass.
That's because the grass blades are thin, so the screen is full of borders and MSAA turns into almost SSAA.
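A back-of-the-envelope way to see this, assuming the simplified cost model from above (1 shader run per interior pixel, some number of runs per border pixel). The function name and parameters are made up, and real cost also depends on bandwidth and how well the compression holds up.

```python
def shader_runs(total_pixels, border_pixels, runs_per_border_pixel):
    """Count pixel shader runs under the simplified model:
    interior pixels cost 1 run, border pixels cost `runs_per_border_pixel`."""
    interior_pixels = total_pixels - border_pixels
    return interior_pixels * 1 + border_pixels * runs_per_border_pixel

total = 1920 * 1080

# No borders at all: same shading cost as rendering without antialiasing.
best_case = shader_runs(total, 0, 4)
# Every pixel is a border hit by 4 triangles: same shading cost as 4x SSAA.
worst_case = shader_runs(total, total, 4)
```

A screen full of thin grass blades pushes `border_pixels` towards `total`, so the count drifts from `best_case` towards `worst_case`.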
The "resolve step" is just downsampling the 4x data into a single image.