Making DR slaves hop onto already running frame(s)

  • Making DR slaves hop onto already running frame(s)

    Hi,

    I'm rendering a frame sequence 1-24 on one machine and I have Distributed Rendering enabled in my render globals.

    If the slave(s) are not up and running by the time the master machine has started rendering, it seems that the master machine does not retry the slave connection(s) between frames (or while a frame is rendering). Please see the attached output for further explanation (the master machine is rendering all frames in one go).

    I think it would be a great idea if the master machine re-attempted slave connections to bring new DR slaves onto the job.

    We are now managing DR jobs through our render manager, queuing up DR jobs alongside "regular" non-DR jobs, which is incredibly powerful. The one missing piece of the puzzle is getting new DR slaves to hop onto an already running job.

    I'm on V-Ray for Maya 2.30.02 - Maya 2014.

    Regards,
    Fredrik

    Code:
    ...
    
    >>> V-Ray: Starting render
    >>> [2013/Nov/6|15:40:35] V-Ray: Loading plugins from "//192.168.0.225/Pipeline/bin/vray/builds/130910_official_v23002_maya2014_x64/maya_vray/vrayplugins/vray_*.dll"
    >>> [2013/Nov/6|15:40:35] V-Ray: 103 plugin(s) loaded successfully
    >>> [2013/Nov/6|15:40:35] V-Ray: Finished loading plugins.
    >>> [2013/Nov/6|15:40:35] V-Ray: Exporting scene to V-Ray.
    >>> [2013/Nov/6|15:40:35] V-Ray: Parsing light links time  0h  0m  0.0s (0.0 s)
    >>> [2013/Nov/6|15:40:35] V-Ray: Translating scene geometry for V-Ray
    >>> [2013/Nov/6|15:40:35] V-Ray: Total time translating scene for V-Ray  0h  0m  0.0s (0.0 s)
    >>> [2013/Nov/6|15:40:35] V-Ray: Rendering.
    >>> [2013/Nov/6|15:40:35] V-Ray: Preprocessing for distributed rendering (port 20207)!
    >>> [2013/Nov/6|15:40:35] V-Ray: Pre-render export between times 1 and 24.
    >>> [2013/Nov/6|15:40:35] V-Ray: Pre-render export
    >>> [2013/Nov/6|15:40:35] V-Ray: Pre-render export done.
    
    ...
    
    
    >>> [2013/Nov/6|15:40:37] V-Ray: dr host: 192.168.0.110
    >>> [2013/Nov/6|15:40:37] V-Ray: dr host: 192.168.0.119
    >>> [2013/Nov/6|15:40:38] V-Ray warning: Could not connect to host 192.168.0.110:20207: No connection could be made because the target machine actively refused it.
    >>> Warning: Could not connect to host 192.168.0.110:20207: No connection could be made because the target machine actively refused it.
    >>> [2013/Nov/6|15:40:43] V-Ray warning: Could not connect to host 192.168.0.119:20207: Connection timeout
    >>> Warning: Could not connect to host 192.168.0.119:20207: Connection timeout
    >>> [2013/Nov/6|15:40:43] V-Ray: Using 0 hosts for distributed rendering.
    
    ...
    
    >>> [2013/Nov/6|15:40:44] V-Ray: Rendering image...
    
    ...
    
    >>> [2013/Nov/6|15:42:22] V-Ray: Rendering image...
    
    ...
    
    >>> [2013/Nov/6|15:44:12] V-Ray: Rendering image...
    
    ...
    
    >>> [2013/Nov/6|15:46:09] V-Ray: Rendering image...
    
    ...
    Best Regards,
    Fredrik

  • #2
    Actually, when a server starts up, it broadcasts a UDP message, which the client may pick up and use to reconnect to the server. This works fine for me here, but it is possible that the UDP message is lost along the way, perhaps filtered by a firewall or something similar. The message is broadcast on UDP port 20205, so perhaps you can check whether that port is blocked.

    Best regards,
    Vlado
    I only act like I know everything, Rogers.
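One way to check whether that broadcast actually reaches the client (master) machine is to listen on UDP port 20205 there and then start the slave. The following is a minimal sketch, not part of V-Ray: the function name `wait_for_udp_datagram` is made up for illustration, the payload format is unknown so it is returned as raw bytes, and it assumes no V-Ray process on the same machine is already holding the port.

```python
import socket


def wait_for_udp_datagram(port=20205, timeout=30.0, bind_addr=""):
    """Wait for one UDP datagram on `port`.

    Returns (payload_bytes, (sender_ip, sender_port)), or None if nothing
    arrives within `timeout` seconds. Hypothetical helper for diagnostics.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Makes the bind less likely to fail if the port was in use moments ago.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((bind_addr, port))  # "" listens on all interfaces
        sock.settimeout(timeout)
        try:
            return sock.recvfrom(4096)
        except socket.timeout:
            return None
    finally:
        sock.close()


# Example use: call wait_for_udp_datagram() on the master, then start the slave.
# A None result suggests the datagram was filtered or never sent.
```

If this returns `None` while the slave is starting up, the message is either being dropped on the way or is not being sent at all; the listener alone cannot tell those two cases apart.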


    • #3
      That must be it!

      I've blocked everything on our Linux machines except for ports which I have explicitly opened.

      Right now I've opened up the following for V-Ray/DR (/etc/sysconfig/iptables):
      -A INPUT -m state --state NEW -m tcp -p tcp --dport 20207 -j ACCEPT
      -A INPUT -m state --state NEW -m udp -p udp --dport 20205 -j ACCEPT

      ... but I still can't get slaves to hop onto an existing job. Are you aware of any other ports which I should open?

      EDIT: A reboot seems to have fixed everything.
      EDIT 2: Hm, no it didn't get fixed... I was mistaken.
      Last edited by Fredrik Averpil; 06-11-2013, 02:38 PM.
      Best Regards,
      Fredrik
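To confirm from another machine that TCP port 20207 is really reachable after a firewall change, a plain connect test can help. This is a rough sketch with a hypothetical helper name (`tcp_port_status`); it separates the two failures seen in the master's log above ("actively refused" versus "Connection timeout"), and it only tests the TCP port, not the UDP broadcast on 20205.

```python
import socket


def tcp_port_status(host, port=20207, timeout=3.0):
    """Try a TCP connection and report 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered but nothing is listening (or a firewall sent a reject).
        return "refused"
    except socket.timeout:
        # No answer at all, which is typical when a firewall drops the packets.
        return "timeout"
    except OSError as exc:
        return "error: %s" % exc


# Example use, with the slave addresses from the log:
# for host in ("192.168.0.110", "192.168.0.119"):
#     print(host, tcp_port_status(host))
```

A "refused" result usually means the slave process is not running yet, while a "timeout" points more towards filtering or an unreachable host.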


      • #4
        Did you do that for the client machine as well?

        Best regards,
        Vlado
        I only act like I know everything, Rogers.


        • #5
          The Windows 7 machines now successfully hop onto a running DR job without issues. I had to enable vray.exe in the firewall (as a program rather than as individual TCP/UDP ports), and I did that for the machine serving as master too. I'm not sure why that was a more solid solution than adding just the ports to the firewall.

          But for Linux, I still can't get slaves to work unless they are present when the master starts rendering. I wonder if there may be other ports than 20207 and 20205 (both udp and tcp) that should be opened on the Linux machines?
          I actually turned the firewall off completely on the Linux machine I'm performing tests on, and it still wouldn't join in on an ongoing DR render (it would only join if already running when the master begins rendering), so it kind of seems that something's up with the master machine that should be advertising the job. But the master machine successfully gets Windows machines to join in.

          I'm kind of out of ideas. Do you have any suggestions on anything I should try next?
          Best Regards,
          Fredrik


          • #6
            Can you post the log from the slave running on the Linux machine?
            For more verbosity please start it with -verboseLevel=4 (see the -help for details about the option).
            V-Ray developer


            • #7
              I've attached the logs:

              success.txt - The slave was started before the master machine began rendering.
              fail.txt - The slave was started just after the master machine said it was unable to connect to it (because no slave was running when it tried to connect).

              This is being run through Pixar's Tractor, but even when starting the slave manually and with the firewall disabled, I'm getting the same results. If you wonder why the success.txt ends in a crash, that's because that's how I exit the slave server (killing the process).

              logs.zip
              Best Regards,
              Fredrik


              • #8
                Are you sure the fail.txt file is stored correctly in the archive? It is only 48.8 KB, which is smaller than success.txt, and to me fail.txt looks truncated.

                Also this is the message you should look for in the log:
                Code:
                [2013/Nov/7|15:03:25] Broadcasting TM_SERVER_STARTED after start up
                You can probably use tcpdump or Wireshark to check whether this message is sent and received correctly.
                V-Ray developer
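Checking a slave log for that message can be scripted. A small sketch follows, assuming the log is plain text and the message appears exactly as quoted above; the helper name `find_startup_broadcast` is made up for illustration.

```python
def find_startup_broadcast(log_lines, needle="Broadcasting TM_SERVER_STARTED"):
    """Return every log line that contains the startup broadcast message."""
    return [line.rstrip("\n") for line in log_lines if needle in line]


# Example use with one of the attached logs:
# with open("fail.txt") as log_file:
#     hits = find_startup_broadcast(log_file)
# print(hits if hits else "no startup broadcast found in this log")
```

An empty result means the slave never logged the broadcast, so there would be nothing for tcpdump or Wireshark to capture on the wire either.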


                • #9
                  These are just logs from the DR slave, running on Linux. The fail.txt is shorter than success.txt as success.txt reflects a slave taking on the job.

                  Did you want me to capture the logs from the master machine (running Mayabatch.exe, rendering a V-Ray DR job) instead?

                  I've emailed the text file logs to support@chaosgroup.com in case that works better.

                  Edit: none of them has the "Broadcasting TM_SERVER_STARTED .... "
                  Best Regards,
                  Fredrik


                  • #10
                    Hm,

                    It turns out that 2.30.02 doesn't have this feature on Linux and OS X.
                    What's worse, we've already built the 2.40 builds and they are on the download server, so that official release won't have it either.

                    I'll add the fix to the branch for the 2.40 stable nightly builds, and if you wish you can take it from there.

                    Sorry for the inconvenience this problem is causing.
                    V-Ray developer


                    • #11
                      In fact, after further investigation I've found that this fix should be in the official 2.40 build.
                      V-Ray developer


                      • #12
                        Ah, I see. Thank you. I'm glad we sorted it out. I was afraid something was wrong on my end.

                        I wouldn't mind using a 2.40 stable build with this feature implemented. Do you dare to give me an ETA of when this may be downloadable off the nightlies server?
                        Best Regards,
                        Fredrik


                        • #13
                          Originally posted by t.petrov:
                          "In fact, after further investigation I've found that this fix should be in the official 2.40 build."
                          Ah! Perfect. I will download that and let you know how it goes.
                          Best Regards,
                          Fredrik


                          • #14
                            I am now running the 2.40.01 builds off the download area for Linux, and I can confirm that with this release Linux slaves jump onto already ongoing DR renders.

                            Many thanks for the help to get this up and running for us!
                            Best Regards,
                            Fredrik


                            • #15
                              It's great to know it is working now.
                              V-Ray developer
