I recently upgraded my workstation to 2x Xeon 2696v3 (36 cores / 72 threads total), which is the equivalent of the 2699v3.
I started testing which OS and workflow best utilize 100% of the processor power for V-Ray renderings. The OSes going up against each other are Windows 2012 Server and CentOS 7.1.
Let’s start with Windows for now.
WINDOWS 2012 Server
--------------------
Windows seems to put my CPUs into two groups, as I noticed in the Task Manager → group0 and group1 with 36 threads each. In total it displays 72 threads though. All good there.
Over a couple of days of testing I noticed that application processes tend to be assigned to one of those groups or the other. V-Ray in Maya seems to render by default on only one of those CPUs at 100 percent while the other sits idle. What a disappointment right there, right?
Mr. Petrov from this forum recently pointed me to an existing forum article which mentions that one can use vrayspawner.exe under Windows to utilize all the Xeon CPUs (NUMA nodes) and get maximum performance. So I did. But I noticed a small annoyance when setting up the vrayspawner. Not sure if I am doing something wrong here…
Have a look at my video recording. It demonstrates the usage of the vrayspawner and my issues with it.
Overall, the rendering of the test frame in my video under Windows 2012 Server took around 1 min 7 sec in the end. I cut the video short because it’s boring watching buckets.
Centos 7.1
--------------------
I noticed that Linux doesn’t have this group0/group1 crap and lists all the CPUs and threads as ONE unit! Great, I thought.
I rendered the same image as above (sorry, no video here for now :) and found that 100% of the CPU is being used (so basically all 72 threads are at 100%). Initially I thought → WOW → great! But somehow the image only finishes rendering after around 2 min. HOW COME? Any ideas why it takes so long? I expected around 1 min…
The vrayspawner has an option to launch the rendering on a specified processor group, so you don’t need to set this affinity mask manually. But AFAIK, it’s not in 3.0.
Hi there,
For Windows 2012 I am using 3.00.1 which is the official release.
For Linux I was using the latest nightly build from 2 days ago and the official 3.00.1 as well.
What really shocked me was the Linux performance of the render I used in the video: 2 min on CentOS vs 1 min on Windows.
That kinda left me baffled…
Btw, I checked whether CentOS was throttling the clock speed (I read in some forums that Linux likes to do that on occasion), but it wasn’t. All threads were running at around 2.8 GHz turbo. So yeah, no idea there. If I can’t find a solution for Linux, I will stick with Windows.
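For anyone who wants to repeat the throttling check, here is a minimal Python sketch. It assumes an x86 Linux box where /proc/cpuinfo reports a "cpu MHz" line per logical CPU; the function names are made up for illustration. Run it while a render is in progress so the clocks reflect the load.

```python
import os
import re


def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of every logical CPU found in the text."""
    pattern = r"^cpu MHz\s*:\s*([\d.]+)"
    return [float(v) for v in re.findall(pattern, cpuinfo_text, re.MULTILINE)]


def summarize(mhz_values):
    """Return (min, mean, max) in MHz, or None if nothing was parsed."""
    if not mhz_values:
        return None
    return (min(mhz_values), sum(mhz_values) / len(mhz_values), max(mhz_values))


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # only present on Linux
    if os.path.exists(path):
        with open(path) as f:
            print(summarize(parse_cpu_mhz(f.read())))
```

If the minimum stays close to the expected turbo clock during rendering, throttling is unlikely to be the cause of the slowdown.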
Hmm… the thing is that it’s quite random which group (either group0 or group1) Maya is on when running. It’s usually not fixed by default.
Only if I know that Maya is on group0 could I assign the spawner to group1, in my example.
Interesting. Generally the Linux builds are a bit slower because on Linux we use GCC, which generates slightly slower code, but the difference should not be 2x.
Could you please run some tests so we can check the scaling on your machine?
For the tests, you need to export the scene as a vrscene file to make testing easier.
The scene shouldn’t be very high resolution (720p would probably be best) and should render for at least 1-2 mins with all cores.
Then execute the following commands:
1. <path to vray>/vray -sceneFile=<path to your vrscene>
2. <path to vray>/vray -sceneFile=<path to your vrscene> -numThreads=36
3. <path to vray>/vray -sceneFile=<path to your vrscene> -numThreads=18
4. numactl -N 0 <path to vray>/vray -sceneFile=<path to your vrscene> -numThreads=36
5. numactl -N 0 <path to vray>/vray -sceneFile=<path to your vrscene> -numThreads=18
Run the commands several times to make sure that there is not much variation.
Post the times here.
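If running the five commands by hand several times gets tedious, something like the following Python sketch could automate it. It only uses the flags listed above (-sceneFile, -numThreads, numactl -N 0); the function names and the repeat count are assumptions for illustration. Note that it measures wall-clock time for the whole process, which will be slightly longer than the "Frame took" value in the V-Ray log.

```python
import shlex
import subprocess
import time


def build_commands(vray, scene):
    """Return the five test commands as argument lists."""
    base = [vray, f"-sceneFile={scene}"]
    pin = ["numactl", "-N", "0"]  # restrict execution to NUMA node 0
    return [
        base,                             # 1. all threads
        base + ["-numThreads=36"],        # 2. 36 threads, unpinned
        base + ["-numThreads=18"],        # 3. 18 threads, unpinned
        pin + base + ["-numThreads=36"],  # 4. 36 threads on node 0
        pin + base + ["-numThreads=18"],  # 5. 18 threads on node 0
    ]


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times; return the wall-clock seconds of each run."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times


def main(vray, scene):
    """Time every test command and print the results."""
    for cmd in build_commands(vray, scene):
        print(shlex.join(cmd), time_command(cmd))
```

Call main() with the path to your vray binary and your vrscene file; both are placeholders you have to fill in yourself.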
I’m running some tests on our 32 thread machine and I’m seeing linear scaling with the current build from 16 to 32 threads.
Tomorrow I’ll run some tests on the beefiest machine we have (40 threads) and also on another 32 thread windows machine.
We think that the easiest solution for this problem will be to create a bat file that will start both Maya and V-Ray Standalone in the proper groups.
You can do this by using the start command. You should pass the /node 0 or /node 1 parameter for Maya and V-Ray Standalone respectively.
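As a rough sketch of what such a bat file could contain, the Python snippet below just assembles the two start lines. The executable paths are hypothetical placeholders, and any arguments for V-Ray Standalone are left for you to fill in. The empty "" after start is the window title; without it, start would treat the quoted path as the title.

```python
def start_line(node, exe, args=""):
    """Build one `start` line that launches exe on the given NUMA node."""
    line = f'start "" /node {node} "{exe}"'
    return f"{line} {args}" if args else line


def make_bat(maya_exe, vray_exe, vray_args=""):
    """Return bat-file text: Maya on node 0, V-Ray Standalone on node 1."""
    lines = [
        "@echo off",
        start_line(0, maya_exe),
        start_line(1, vray_exe, vray_args),
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Hypothetical install paths, replace them with your own.
    print(make_bat(r"C:\path\to\maya.exe", r"C:\path\to\vray.exe"))
```

Save the printed text as a .bat file and launch both applications through it, so each one ends up on its own node.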
V-Ray Standalone, version 3.25.01 for x64
Build 25931 from Jun 23 2015, 02:43:50
V-Ray core version is 3.25.01
Note: I had a warning message with the newest nightly build and my vrscene.
→ warning: Unknown property “embreeHighPrec” in object “vraySettingsRaycaster”
1. vray -sceneFile=/home/user/Downloads/vrscene/myscene.vrscene
→ System Monitor CPU shows 100% utilization
[2015/Jun/23|05:10:10] Starting frame 1270.
[2015/Jun/23|05:10:10] Preparing scene for frame…: done [ 0h 0m 0.1s]
[2015/Jun/23|05:10:10] Compiling geometry…: done [ 0h 0m 0.1s]
[2015/Jun/23|05:10:10] Using embree ray tracing.
[2015/Jun/23|05:10:10] Building embree static trees took 244 milliseconds
[2015/Jun/23|05:10:10] Building embree static accelerator …: done [ 0h 0m 0.3s]
[2015/Jun/23|05:10:11] warning: Performance loss: number of rendering threads (72) is greater than number of light cache passes (64); some threads will be idle.
[2015/Jun/23|05:10:11] Tracing 1000000 image samples for light cache in 64 passes.
[2015/Jun/23|05:10:12] Building light cache…: done [ 0h 0m 1.1s]
[2015/Jun/23|05:10:13] Merging light cache passes…: done [ 0h 0m 0.3s]
[2015/Jun/23|05:10:13] Light cache contains 6875 samples.
[2015/Jun/23|05:10:13] Light cache takes 5.2 MB.
[2015/Jun/23|05:10:13] Prefiltering light cache…: done [ 0h 0m 0.0s]
[2015/Jun/23|05:10:13] Average rays per light cache sample: 122.46 (min 1, max 470)
[2015/Jun/23|05:11:59] Rendering image…: done [ 0h 1m 46.6s]
[2015/Jun/23|05:12:00] Number of raycasts: 1493356092 (89.01 per pixel)
[2015/Jun/23|05:12:00] Camera rays: 533206645 (31.78 per pixel)
[2015/Jun/23|05:12:00] Shadow rays: 539686353 (32.17 per pixel)
[2015/Jun/23|05:12:00] GI rays: 436907880 (26.04 per pixel)
[2015/Jun/23|05:12:00] Reflection rays: 0 (0.00 per pixel)
[2015/Jun/23|05:12:00] Refraction rays: 0 (0.00 per pixel)
[2015/Jun/23|05:12:00] Unshaded rays: 0 (0.00 per pixel)
[2015/Jun/23|05:12:00] Light cache utilization: 78.48%
[2015/Jun/23|05:12:00] Number of light evaluations: 456672841 (27.22 per pixel)
[2015/Jun/23|05:12:00] Number of intersectable primitives: 327819
[2015/Jun/23|05:12:00] SD triangles: 327818
[2015/Jun/23|05:12:00] MB triangles: 0
[2015/Jun/23|05:12:00] Static primitives: 0
[2015/Jun/23|05:12:00] Moving primitives: 0
[2015/Jun/23|05:12:00] Infinite primitives: 1
[2015/Jun/23|05:12:00] Static hair segments: 0
[2015/Jun/23|05:12:00] Moving hair segments: 0
[2015/Jun/23|05:12:26] Successfully written image file “/home/user/maya/projects/default/images/tmp/testrender2015.png”
[2015/Jun/23|05:12:26] Frame took 135.86 s.
Another run 143sec
2. vray -sceneFile=/home/user/Downloads/vrscene/myscene.vrscene -numThreads=36
→ System Monitor CPU shows 50% utilization
[2015/Jun/23|05:13:43] Starting frame 1270.
[2015/Jun/23|05:13:43] Preparing scene for frame…: done [ 0h 0m 0.1s]
[2015/Jun/23|05:13:43] Compiling geometry…: done [ 0h 0m 0.1s]
[2015/Jun/23|05:13:43] Using embree ray tracing.
[2015/Jun/23|05:13:43] Building embree static trees took 308 milliseconds
[2015/Jun/23|05:13:43] Building embree static accelerator …: done [ 0h 0m 0.3s]
[2015/Jun/23|05:13:44] Tracing 1000000 image samples for light cache in 64 passes.
[2015/Jun/23|05:13:45] Building light cache…: done [ 0h 0m 1.1s]
[2015/Jun/23|05:13:45] Merging light cache passes…: done [ 0h 0m 0.3s]
[2015/Jun/23|05:13:45] Light cache contains 6875 samples.
[2015/Jun/23|05:13:45] Light cache takes 5.2 MB.
[2015/Jun/23|05:13:45] Prefiltering light cache…: done [ 0h 0m 0.0s]
[2015/Jun/23|05:13:45] Average rays per light cache sample: 122.46 (min 1, max 470)
[2015/Jun/23|05:15:26] Rendering image…: done [ 0h 1m 40.6s]
[2015/Jun/23|05:15:27] Number of raycasts: 1492798547 (88.98 per pixel)
[2015/Jun/23|05:15:27] Camera rays: 532649081 (31.75 per pixel)
[2015/Jun/23|05:15:27] Shadow rays: 539686367 (32.17 per pixel)
[2015/Jun/23|05:15:27] GI rays: 436907889 (26.04 per pixel)
[2015/Jun/23|05:15:27] Reflection rays: 0 (0.00 per pixel)
[2015/Jun/23|05:15:27] Refraction rays: 0 (0.00 per pixel)
[2015/Jun/23|05:15:27] Unshaded rays: 0 (0.00 per pixel)
[2015/Jun/23|05:15:27] Light cache utilization: 78.48%
[2015/Jun/23|05:15:27] Number of light evaluations: 456672843 (27.22 per pixel)
[2015/Jun/23|05:15:27] Number of intersectable primitives: 327819
[2015/Jun/23|05:15:27] SD triangles: 327818
[2015/Jun/23|05:15:27] MB triangles: 0
[2015/Jun/23|05:15:27] Static primitives: 0
[2015/Jun/23|05:15:27] Moving primitives: 0
[2015/Jun/23|05:15:27] Infinite primitives: 1
[2015/Jun/23|05:15:27] Static hair segments: 0
[2015/Jun/23|05:15:27] Moving hair segments: 0
[2015/Jun/23|05:15:55] Successfully written image file “/home/user/maya/projects/default/images/tmp/testrender2015.png”
[2015/Jun/23|05:15:55] Frame took 132.21 s.
Another run 142.52 sec
3. vray -sceneFile=/home/user/Downloads/vrscene/myscene.vrscene -numThreads=18
→ System Monitor CPU shows 25% utilization
[2015/Jun/23|05:19:46] Starting frame 1270.
[2015/Jun/23|05:19:46] Preparing scene for frame…: done [ 0h 0m 0.1s]
[2015/Jun/23|05:19:46] Compiling geometry…: done [ 0h 0m 0.1s]
[2015/Jun/23|05:19:46] Using embree ray tracing.
[2015/Jun/23|05:19:47] Building embree static trees took 267 milliseconds
[2015/Jun/23|05:19:47] Building embree static accelerator …: done [ 0h 0m 0.3s]
[2015/Jun/23|05:19:47] Tracing 1000000 image samples for light cache in 64 passes.
[2015/Jun/23|05:19:49] Building light cache…: done [ 0h 0m 2.1s]
[2015/Jun/23|05:19:50] Merging light cache passes…: done [ 0h 0m 0.2s]
[2015/Jun/23|05:19:50] Light cache contains 6875 samples.
[2015/Jun/23|05:19:50] Light cache takes 5.2 MB.
[2015/Jun/23|05:19:50] Prefiltering light cache…: done [ 0h 0m 0.0s]
[2015/Jun/23|05:19:50] Average rays per light cache sample: 122.46 (min 1, max 470)
[2015/Jun/23|05:22:18] Rendering image…: done [ 0h 2m 28.2s]
[2015/Jun/23|05:22:18] Number of raycasts: 1492519827 (88.96 per pixel)
[2015/Jun/23|05:22:18] Camera rays: 532370311 (31.73 per pixel)
[2015/Jun/23|05:22:18] Shadow rays: 539686406 (32.17 per pixel)
[2015/Jun/23|05:22:18] GI rays: 436907901 (26.04 per pixel)
[2015/Jun/23|05:22:18] Reflection rays: 0 (0.00 per pixel)
[2015/Jun/23|05:22:18] Refraction rays: 0 (0.00 per pixel)
[2015/Jun/23|05:22:18] Unshaded rays: 0 (0.00 per pixel)
[2015/Jun/23|05:22:18] Light cache utilization: 78.48%
[2015/Jun/23|05:22:18] Number of light evaluations: 456672859 (27.22 per pixel)
[2015/Jun/23|05:22:18] Number of intersectable primitives: 327819
[2015/Jun/23|05:22:18] SD triangles: 327818
[2015/Jun/23|05:22:18] MB triangles: 0
[2015/Jun/23|05:22:18] Static primitives: 0
[2015/Jun/23|05:22:18] Moving primitives: 0
[2015/Jun/23|05:22:18] Infinite primitives: 1
[2015/Jun/23|05:22:18] Static hair segments: 0
[2015/Jun/23|05:22:18] Moving hair segments: 0
[2015/Jun/23|05:22:46] Successfully written image file “/home/user/maya/projects/default/images/tmp/testrender2015.png”
[2015/Jun/23|05:22:46] Frame took 179.42 s.
Another run 179.60 sec
I put the values into Excel.
It looks like the threads don’t scale well beyond 32 threads. Actually they get worse! It takes longer to render on 72 threads than on 36 threads.
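To put numbers on the scaling, here is a small Python sketch using the first-run "Frame took" times from the three logs above (18, 36 and 72 threads). The helper names are made up; efficiency here means the achieved speedup divided by the ideal speedup for the added threads.

```python
def speedup(t_base, t_new):
    """How many times faster the new run is compared to the base run."""
    return t_base / t_new


def efficiency(threads_base, t_base, threads_new, t_new):
    """Fraction of ideal linear scaling achieved between two thread counts."""
    return speedup(t_base, t_new) / (threads_new / threads_base)


# First-run "Frame took" times in seconds, keyed by thread count.
runs = {18: 179.42, 36: 132.21, 72: 135.86}

for base, new in [(18, 36), (36, 72)]:
    s = speedup(runs[base], runs[new])
    e = efficiency(base, runs[base], new, runs[new])
    print(f"{base} -> {new} threads: speedup {s:.2f}x, efficiency {e:.0%}")
```

With these times, going from 18 to 36 threads gives roughly a 1.36x speedup where 2x would be ideal, and going from 36 to 72 threads comes out slightly below 1x, i.e. slower.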
How difficult do you think it is to fix the scaling issue beyond 32 threads? Do you guys need a machine for that? I am willing to buy you guys a machine.
Interesting.
Yeah, I can see that it scales fine with your 32 and 40 threads.
The scaling seems to drop beyond that for some reason. Maybe my Linux setup is not optimized.
But if you compare my 32-thread time to your 40-thread time, they come quite close. It’s only beyond 40 threads that the values get really bad.
What’s interesting as well is comparing your 16 threads vs my 18 threads. There is a huge gap!
Do you think the Linux distribution makes a difference?
Do you maybe have a scene that I can test?
I didn’t install the OSes on either machine myself, so I don’t know if there is something special that needs to be done.
We have no CentOS 7.x machines around, so I cannot verify that it is not related to the distro/kernel.
Both machines I’ve tested are CentOS 6.x.
Can you run one last test with -display=0 and all threads?
I was able to improve it by another 10 sec with the DR setup when using 36 threads at a time.
$ numactl -N 0 ./vray -sceneFile=/home/user/Downloads/vrscene/myscene.vrscene -distributed=1 -renderHost=localhost -numThreads=36 -display=0
Frame took 84.64 sec (display=0) vs 95.04 sec (display=1)
It looks like the display doesn’t work as promised with more than 40 threads.
I ran three times with ‘all threads’ on -display=0 with results of 138.30 sec, 141.20 sec and 141.96 sec.
I ran again -display=1 and I got 141.10 sec (yesterday 135.86). So basically it makes virtually no difference.
$ ./vray -sceneFile=/home/user/Downloads/vrscene/myscene.vrscene
Frame took 138.30 s, 141.20 s, 141.96 s