Multithreading with Boost, can't access renderTarget of view port in threads

Trammy
Gnoblar
Posts: 1
Joined: Mon Aug 20, 2018 2:12 pm

Multithreading with Boost, can't access renderTarget of view port in threads

Post by Trammy »

Ogre Version: 1.9.0
Operating System: Linux 4.8.0-58-generic #63~16.04.1-Ubuntu x86_64

Hi everyone,

I'm trying to build a plugin for RViz (ROS Visualization) to record videos directly from the windows being rendered with Ogre. If I sequentially loop over all open scene managers and grab the current frame from the render target, everything works fine; the video is generated from the buffered images after recording is done. However, the frame rate drops noticeably when more than two windows are actively rendering content. See the code below. I left out error handling and other unneeded lines for clarity.

Code: Select all

void VideoRecorderDisplay::update()
{
    if (m_recording)
    {
        grabAllFrames();
    }
}

void VideoRecorderDisplay::grabAllFrames()
{
    // TODO: start thread for each manager and parallelize
    const size_t sceneManagerSize = m_sceneManagers.size();

    for(size_t i = 0; i < sceneManagerSize; i++)
    {
        Ogre::Viewport* viewPort = m_sceneManagers[i]->getCurrentViewport();

        if(viewPort != nullptr)
        {
            Ogre::Image ogreImage = getOgreImageFromRenderTarget(viewPort->getTarget());
            m_ogreImageList.push_back(ogreImage);
        }
    }
}

Ogre::Image VideoRecorderDisplay::getOgreImageFromRenderTarget(Ogre::RenderTarget* const renderTarget)
{
    // create PixelBox and get data from target
    Ogre::PixelFormat pf = renderTarget->suggestPixelFormat();
    const uint32_t targetWidth = renderTarget->getWidth();
    const uint32_t targetHeight = renderTarget->getHeight();
    const size_t boxSize = targetWidth * targetHeight * Ogre::PixelUtil::getNumElemBytes(pf);

    // allocate with OGRE_ALLOC_T so Ogre::Image can take ownership and free the buffer later
    unsigned char* ogrePixelData = OGRE_ALLOC_T(unsigned char, boxSize, Ogre::MEMCATEGORY_GENERAL);
    Ogre::PixelBox pixelBox(targetWidth, targetHeight, 1, pf, ogrePixelData);

    renderTarget->copyContentsToMemory(pixelBox);

    // load data into Ogre::Image (autoDelete = true hands the buffer to the image)
    Ogre::Image ogreImage;
    ogreImage.loadDynamicImage(ogrePixelData, targetWidth, targetHeight, 1, pf, true);

    return ogreImage;
}
I tried to spawn threads using the Boost library so that the frames are grabbed asynchronously, but the returned frames are always black (or some weird pixel mess). To achieve this, I split the grabAllFrames function into separate calls, one per active scene manager, and attached each call to its own thread.

Code: Select all

static boost::shared_ptr<boost::thread> grabberThread0;
static boost::shared_ptr<boost::thread> grabberThread1;
static boost::shared_ptr<boost::thread> grabberThread2;
static boost::shared_ptr<boost::thread> grabberThread3;

void VideoRecorderDisplay::update()
{
    if (m_recording)
    {
        grabberThread0 = boost::shared_ptr<boost::thread>(new boost::thread(boost::bind(&VideoRecorderDisplay::grabAllFramesParallelized, this, m_sceneManagers[0], boost::ref(m_dataListPerThread[0]))));
        grabberThread1 = boost::shared_ptr<boost::thread>(new boost::thread(boost::bind(&VideoRecorderDisplay::grabAllFramesParallelized, this, m_sceneManagers[1], boost::ref(m_dataListPerThread[1]))));
        grabberThread2 = boost::shared_ptr<boost::thread>(new boost::thread(boost::bind(&VideoRecorderDisplay::grabAllFramesParallelized, this, m_sceneManagers[2], boost::ref(m_dataListPerThread[2]))));
        grabberThread3 = boost::shared_ptr<boost::thread>(new boost::thread(boost::bind(&VideoRecorderDisplay::grabAllFramesParallelized, this, m_sceneManagers[3], boost::ref(m_dataListPerThread[3]))));

        grabberThread0->join();
        grabberThread1->join();
        grabberThread2->join();
        grabberThread3->join();
    }
}

// dataList must be taken by reference; a by-value parameter copies the
// boost::ref'd vector and the grabbed images never reach m_dataListPerThread
void VideoRecorderDisplay::grabAllFramesParallelized(Ogre::SceneManager* scnMngr, std::vector<Ogre::Image>& dataList)
{
    Ogre::Viewport* viewPort = scnMngr->getCurrentViewport();

    if(viewPort != nullptr)
    {
        Ogre::Image ogreImage = getOgreImageFromRenderTarget(viewPort->getTarget());
        dataList.push_back(ogreImage);
    }
}
I don't know what the problem is. The pointers to the scene managers are identical to the sequential version, and the same goes for the viewport. Yet somehow getTarget() does not return anything useful. When I use

Code: Select all

viewPort->getTarget()->writeContentsToFile("test.bmp")
the image is also black. So my guess is that the problem lies within the getTarget() call.

I'm grateful for any hint. If somebody has had luck with another multithreading library, I'm also open to other solutions.

Thanks,

Trammy!
dark_sylinc
OGRE Team Member
Posts: 5292
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina

Re: Multithreading with Boost, can't access renderTarget of view port in threads

Post by dark_sylinc »

Efficiently streaming GPU -> CPU is unfortunately not a trivial task in Ogre.

Your threading code is not working because you can't map buffers or call OpenGL API functions from other threads (at least not without lots of pain and headaches, and exposing yourself to lots of driver bugs), and that is exactly what happens inside copyContentsToMemory.

There are workarounds to get relatively efficient streaming, though. The problem you have right now is that there is no streaming at all, i.e. all transfers are serialized, so the CPU is forced to wait for the GPU to finish before it can download the results for encoding.
The solution is to have the GPU draw frame N while the CPU is downloading and encoding frame N-2 (or N-1). This introduces some latency but greatly increases performance. But you may have to slightly modify how RViz renders to the window.

1. If you have multiple windows, check that they don't all have VSync on; otherwise things can get very slow. Either turn VSync off, or leave it on for just one window.
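For example (a minimal sketch; m_renderWindows is an assumed container holding your Ogre::RenderWindow pointers, not something RViz already provides):

Code: Select all

// Keep VSync on for only the first window so the others don't serialize on it
for (size_t i = 0; i < m_renderWindows.size(); ++i)
    m_renderWindows[i]->setVSyncEnabled(i == 0);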

2. What works best is to buffer the streaming. Rather than rendering directly to the RenderWindow, render to a RenderTexture instead and finish your rendering by copying (with a fullscreen triangle) the RTT to the RenderWindow.
Once you have done this, you can have 2 (or 3, for triple buffering) RenderTextures and cycle through them each frame while you download the texture of frame N-1 to the CPU. E.g.:

Code: Select all

RenderTexture *renderTexture[3];
size_t frameIdx;
everyFrame()
{
    //Render the scene to renderTexture[frameIdx]
    renderTexture[frameIdx]->update();
    //Copy renderTexture[frameIdx] into the window (Compositors can do this via a render_quad pass)
    drawFullscreenQuad( renderTexture[frameIdx], renderWindow );

    //Download the contents of the oldest frame
    size_t oldestFrameIdx = (frameIdx + 1) % 3;
    renderTexture[oldestFrameIdx]->copyContentsToMemory(pixelBox);

    frameIdx = (frameIdx + 1) % 3;
}
Back in my day when I used Ogre 1.x, I would use a few compositor scripts to help with this, particularly a render_quad pass to copy the RTT to the window.

3. If you want to improve this further via threading, you'll have to do by hand what copyContentsToMemory does for you: call map() in the main thread, perform the memcpy in the worker thread, and save the data (via Ogre::Image) in the worker thread. Ogre::Image is safe to use in another thread (as long as only one thread accesses the Image at a time) since it does not call OpenGL API functions; when the worker thread is done, signal the main thread to call unmap(). Note: you cannot start rendering to the RenderTexture while the texture is mapped.
If this is too hard for you, perform copyContentsToMemory in the main thread, but move the Ogre::Image part to a worker thread.
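A rough sketch of that hand-off (the member names m_rtTexture and m_ogreImageList and the function itself are assumptions; in a real implementation the worker would run across frames and signal the main thread instead of being joined immediately):

Code: Select all

#include <boost/thread.hpp>

void VideoRecorderDisplay::grabFrameThreaded()
{
    // Main thread: map the RTT's pixel buffer (this is the GL call, so it must stay here)
    Ogre::HardwarePixelBufferSharedPtr pixelBuffer = m_rtTexture->getBuffer();
    pixelBuffer->lock(Ogre::HardwareBuffer::HBL_READ_ONLY);
    const Ogre::PixelBox& mappedBox = pixelBuffer->getCurrentLock();

    // Worker thread: only touches the already-mapped memory, no GL calls
    boost::thread worker([&]()
    {
        const size_t bytes = Ogre::PixelUtil::getMemorySize(
            mappedBox.getWidth(), mappedBox.getHeight(), 1, mappedBox.format);
        Ogre::uchar* copy = OGRE_ALLOC_T(Ogre::uchar, bytes, Ogre::MEMCATEGORY_GENERAL);

        Ogre::PixelBox dstBox(mappedBox.getWidth(), mappedBox.getHeight(), 1,
                              mappedBox.format, copy);
        Ogre::PixelUtil::bulkPixelConversion(mappedBox, dstBox); // handles row pitch

        Ogre::Image image;
        image.loadDynamicImage(copy, dstBox.getWidth(), dstBox.getHeight(), 1,
                               dstBox.format, true); // the Image now owns 'copy'
        m_ogreImageList.push_back(image);
    });

    worker.join();         // placeholder for "signal the main thread when the worker is done"
    pixelBuffer->unlock(); // "unmap" only after the worker has finished copying
}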

Once you've done this, you'll be faced with the next bottlenecks: PCIe bandwidth, or most likely hard disk bandwidth (or PNG compression if you're saving as PNG; PNG compression is very slow!). The best non-SSD HDDs can write around 120 MB/s, so just saving one raw 1920x1080 RGB picture per frame already limits you to about 20 frames per second (120×1024×1024 / (1920×1080×3) ≈ 20).

If you want to achieve decent performance, you'll have to convert your RenderTexture to a more compact (lossy) format such as YUV420 via a shader on the GPU. See In-Game HD Video Recording using Real-Time YUYV-DXT Compression.
The shader basically needs to do the RGB888 to YUV420 conversion.
You could download the RGB888 texture and convert it to YUV420 on the CPU instead of on the GPU, but in my experience the performance difference is astronomical.

YUV420 is the native format that most modern codecs work with, so you could hand the raw YUV420 off to the ffmpeg or x264 libraries and convert it in real time.

Cheers
xrgo
OGRE Expert User
Posts: 1148
Joined: Sat Jul 06, 2013 10:59 pm
Location: Chile

Re: Multithreading with Boost, can't access renderTarget of view port in threads

Post by xrgo »

dark_sylinc wrote: Mon Aug 20, 2018 4:59 pm If you want to achieve decent performance, you'll have to convert your RenderTexture to a more (lossy) compact format such as YUV420 via a shader in the GPU. See In-Game HD Video Recording using Real-Time YUYV-DXT Compression.
The shader basically needs to do RGB888 to YUV420 conversion.
You could download the RGB888 texture and convert it to YUV420 in CPU instead of GPU, but in my experience the performance difference is astronomical.
I am also interested in this. I do myTexture->getBuffer()->getRenderTarget()->copyContentsToMemory for capturing 360° 4K video in my engine and it's very slow (~8 fps).
... so after converting to YUV via a shader, will copyContentsToMemory automatically be faster? Do I need "myTexture" to be in another format? I am currently using PF_R8G8B8, but there's no YUV PF.

thanks!
dark_sylinc
OGRE Team Member
Posts: 5292
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina

Re: Multithreading with Boost, can't access renderTarget of view port in threads

Post by dark_sylinc »

xrgo wrote: Sun Aug 26, 2018 6:34 pm I am also interested in this, I do myTexture->getBuffer()->getRenderTarget()->copyContentsToMemory for capturing 360 4k video in my engine and its very slow (~8fps)
... so after converting to yuv via shader the copyContentsToMemory should automatically be faster? do I need "myTexture" to be in another format?
Hi! Not exactly.

The speed advantages come from:
  1. Cycling the RenderTargets. By reading from a RenderTarget the GPU is currently not using, you prevent stalls. This can easily 2x or even 3x your framerate.
  2. If you're converting in realtime (e.g. using ffmpeg or x264 lib) having the footage already in YUV420 saves you from doing the conversion in CPU, and the GPU is much faster at that.
  3. If you're not converting in realtime and disk IO was the bottleneck: YUV420 is 62.5% smaller than RGBA8888 and 50% smaller than RGB888, so yes, it will transfer faster across PCIe and it will clog the IO much less.
xrgo wrote: Sun Aug 26, 2018 6:34 pmI am currently using PF_R8G8B8, but there's no YUV PF.
That's right, you have to do it yourself.
RGB888 has a triplet of RGB values per pixel (3 bytes per pixel), and YUV444 likewise has a triplet of YUV values per pixel. Conversion between RGB888 and YUV444 is lossless. You could store YUV444 data in an RGB888 surface and it will work just fine (obviously it won't display correctly if you send raw YUV data to the monitor as if it were RGB; you'll recognize the shapes in the image but the colours will all be messed up).

However, YUV420 contains one Y per pixel but only one pair of UV values for every 4 pixels (every 2x2 block of pixels), making it a lossy format. If you had a 1920x1080 RGBA8888 source texture, you would need one 1920x1080 R8 texture to store Y and one 960x540 RG88 texture to store UV. Because of this, YUV420 does not like source textures whose dimensions are not multiples of 2.
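To put rough numbers on it (a quick back-of-the-envelope sketch, assuming tightly packed planes):

Code: Select all

// Per-frame sizes for 1920x1080
const size_t w = 1920, h = 1080;
const size_t rgba8888Bytes = w * h * 4;             // 8,294,400 bytes
const size_t rgb888Bytes   = w * h * 3;             // 6,220,800 bytes
const size_t yuv420Bytes   = w * h                  // Y: 1 byte per pixel
                           + (w / 2) * (h / 2) * 2; // U+V: 1 byte each per 2x2 block
                                                    // = 3,110,400 bytes (1.5 bytes/pixel)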

One trick to perform the conversion on the GPU with a pixel shader is to use one 960x540 RGBA8888 texture to store the Y of each 2x2 block of pixels (that is, rgba = y0, y1, y2, y3) and one 960x540 RG88 texture to store the UV. Use MRT rendering to output both at once.

The texture decl in the script:

Code: Select all

texture myYUV420Rtt target_width_scaled 0.5 target_height_scaled 0.5 PF_A8B8G8R8 PF_RG8 no_fsaa
Then use a render_quad pass to render the input into the myYUV420Rtt target.

Your conversion shader would look like this:

Code: Select all

float3 src00 = srcTex.Load( gl_Position.xy * 2.0f ).xyz; //x2 because you're writing to an RTT that is half the resolution
float3 src10 = srcTex.Load( gl_Position.xy * 2.0f + uint2( 1u, 0u ) ).xyz;
float3 src01 = srcTex.Load( gl_Position.xy * 2.0f + uint2( 0u, 1u ) ).xyz;
float3 src11 = srcTex.Load( gl_Position.xy * 2.0f + uint2( 1u, 1u ) ).xyz;

float3 dst00 = rgbToYUV444( src00.xyz );
float3 dst10 = rgbToYUV444( src10.xyz );
float3 dst01 = rgbToYUV444( src01.xyz );
float3 dst11 = rgbToYUV444( src11.xyz );

outPs.colour0.xyzw = float4( dst00.x, dst10.x, dst01.x, dst11.x ); //Store YYYY
outPs.colour1.x = ( dst00.y + dst10.y + dst01.y + dst11.y ) * 0.25f; //Average U for better quality (some people just store dst00.y)
outPs.colour1.y = ( dst00.z + dst10.z + dst01.z + dst11.z ) * 0.25f; //Average V for better quality (some people just store dst00.z)
Google how to write the rgbToYUV444 conversion; it's all simple math operations. The Wikipedia article has the formula and coefficients.
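If it helps, here's that math sketched in plain C++ (BT.601 full-range coefficients as given on Wikipedia; the shader version is the same arithmetic on float3, and you should double-check the exact coefficients against the article):

Code: Select all

struct Float3 { float x, y, z; };

// RGB -> YUV (YCbCr), all channels normalized to [0, 1]
Float3 rgbToYUV444(const Float3& rgb)
{
    Float3 yuv;
    yuv.x =  0.299f    * rgb.x + 0.587f    * rgb.y + 0.114f    * rgb.z;        // Y
    yuv.y = -0.168736f * rgb.x - 0.331264f * rgb.y + 0.5f      * rgb.z + 0.5f; // U (Cb)
    yuv.z =  0.5f      * rgb.x - 0.418688f * rgb.y - 0.081312f * rgb.z + 0.5f; // V (Cr)
    return yuv;
}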

When you read from the CPU, just map both the RGBA8888 and the RG88 textures. This trick has the advantage that the Y plane already matches how YUV420 is laid out in memory (see the picture in https://en.wikipedia.org/wiki/YUV#Y%E2% ... conversion).

Cheers
Matias
xrgo
OGRE Expert User
Posts: 1148
Joined: Sat Jul 06, 2013 10:59 pm
Location: Chile

Re: Multithreading with Boost, can't access renderTarget of view port in threads

Post by xrgo »

Thank you so much for the detailed explanation, I am going to try implementing it.