Swarm: Too many master nodes and black buckets

  • Swarm: Too many master nodes and black buckets

    We're using Swarm in our office to render on 7 dedicated servers from V-Ray for Rhino 6 (and occasionally Revit). On the most recent project, we've been having a great deal of trouble with Swarm nodes leaving black buckets.

    Looking at the Swarm UI, I can see that the nodes are confused about who is in charge - most of the nodes think they are the master, as shown in the first screenshot. The Swarm service is started and stopped by Deadline on the machines (per this Thinkbox article), and I suspect this is what's making them confused:

    When going to any of the starred nodes' Swarm config page (except one), they seem to think they are in a Swarm of one machine, as seen in the second screenshot.

    I tried setting the Swarm master node manually, but when I do that, machines only seem to be able to see themselves - they don't even see the master (V-Ray Swarm is allowed through the firewall, and the machines can ping each other).

    Is there some way of manually purging whatever data these nodes have cached so they can hold a real election and all get on the same page again?
    Attached Files

  • #2
    Update: I think I fixed it.

    I found a forum post (but NOT the documentation) stating that a specified master node needs to have its own configuration set to "Auto-detect". I think this qualifies as a bug; at the very least it should be documented, and it should probably be caught by the UI and explained to the user.

    Once I went to every single node and set a specific machine to be the master, they all agreed about what to do and the black buckets stopped.

    I do think this problem points to a flaw in how Swarm nodes conduct elections when in auto-detect mode, however.



    • #3
      Thank you for the note; I'll review the documentation and add some useful info to it. In your situation, several things might have caused the issue - a node with network issues can disrupt the auto-detect mechanism. Also, nodes on different networks are not visible to each other by default, since the TTL for the UDP packets is set to 1 (this can be adjusted).
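      For reference, the TTL value lives under the discover section of config.yaml (shown in full later in this thread); a minimal fragment raising it to 2 - the surrounding keys are copied from that example:

```yaml
discover:
  autoDiscover: true
  ttl: 2   # default is 1; raise only if nodes must cross a router
```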
      Ivan Slavchev

      SysOps

      Chaos Group



      • #4
        When the Swarm service stopped on the Master machine yesterday (not sure why; I was out of the office), everything went right back to all the servers electing themselves to be in charge again.

        Originally posted by ivan.slavchev View Post
        a node with network issues can disrupt the auto-detect mechanism.
        How would one go about diagnosing this? As far as I know, everything is working as expected on our LAN.

        Some of the servers have multiple NICs, each with its own IP. Would that confuse Swarm?

        Originally posted by ivan.slavchev View Post
        Also, nodes on different networks are not visible to each other by default, since the TTL for the UDP packets is set to 1 (this can be adjusted).
        All machines are on the same subnet. Does the TTL need to be set to 2 anyway?



        • #5
          One more question: a machine I am trying to use as a non-rendering Master node has two NICs on different subnets. In its Swarm configuration it lists the wrong IP as its address, and it appears the other nodes can't see it. How can I force Swarm to use a particular local IP to listen/talk on?



          • #6
            Originally posted by jskt View Post
            When the Swarm service stopped on the Master machine yesterday (not sure why; I was out of the office), everything went right back to all the servers electing themselves to be in charge again.

            How would one go about diagnosing this?
            SWARM has a log located in "%APPDATA%\Chaos Group\vray-swarm\work\vray-swarm\vray-swarm.log". If the service is running under the local system account, the log will be located in C:\Windows\System32\config\systemprofile\AppData\Roaming\Chaos Group\vray-swarm\work\vray-swarm
            Check for error messages there. Also, the above means that if you change the user running the service, SWARM will create a new folder in the new user's %APPDATA% with default settings, e.g. auto-discovery instead of a predefined master node.
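            If you want to resolve the log location from a script, a small Python sketch (paths as quoted above; the fallback to the LocalSystem profile when %APPDATA% is unset is an assumption about the service account):

```python
import os

# Resolve %APPDATA%; fall back to the LocalSystem profile path
# mentioned above when the variable is absent (e.g. service context).
appdata = os.environ.get(
    "APPDATA",
    r"C:\Windows\System32\config\systemprofile\AppData\Roaming",
)
log_path = os.path.join(
    appdata, "Chaos Group", "vray-swarm", "work", "vray-swarm", "vray-swarm.log"
)
print(log_path)
```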

            Originally posted by jskt
            Some of the servers have multiple NICs, each with its own IP. Would that confuse Swarm?
            It might. By default, SWARM will pick the interface with the highest metric - e.g. the first one listed in the ipconfig /all output.
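            On a multi-NIC machine, a quick way to list every IPv4 address the host resolves for itself (a plain Python sketch, not part of Swarm, just a diagnostic aid):

```python
import socket

# Resolve the host's own name to all of its IPv4 addresses; on a
# multi-NIC machine this shows the candidates Swarm may choose between.
hostname = socket.gethostname()
_, _, addresses = socket.gethostbyname_ex(hostname)
print(hostname, addresses)
```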

            Originally posted by jskt
            All machines are on the same subnet. Does TTL need to be set to 2 anyway?
            No, you can leave it at 1.

            Originally posted by jskt
            A machine I am trying to use as a non-rendering Master node has two NICs on different subnets. In its swarm configuration it lists the wrong IP as its address, and it appears the other nodes can't see it. How can I force Swarm to use a particular local IP to listen/talk on?
            You can set it in the "C:\Program Files\Chaos Group\V-Ray Swarm\config.yaml" file on the machine. To do that, add interface: 'Interface Name' as shown below; the word "interface" should start at the same indentation level (same column) as the "port" entry above it. After setting the value, restart the V-Ray Swarm service.

            Code:
            network:
              port: 24267
              interface: 'Ethernet0'
            discover:
              autoDiscover: true
              masterNodes:
                - ""
                - ""
                - ""
              ttl: 1
            logger:
              level: info
            firstRun:
              configFilePath:
            Ivan Slavchev

            SysOps

            Chaos Group



            • #7
              Thanks for the detailed reply. It's now working.

              Can you verify that both the primary and fallback master nodes should be set to Auto-detect?

              Also, Swarm was not writing logs to any of the places you listed - it was running as LocalSystem and the folder you mentioned did not exist, and when I changed it to a domain user with admin rights, the folder was not created in their AppData.

              For the sanity of future searchers:

              1. The formatting of the config file is extremely picky - each entry must have the proper number of leading spaces to create the indentation shown in Ivan's example (YAML is whitespace-sensitive).
              2. When you open the file in Notepad it may appear as one long line - use Notepad++ to see the line breaks.
              3. If your interface name has spaces in it, you need double quotes
              4. The CMD command to get the correct interface name in a nice copy-pasteable format is: getmac /FO csv /V
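              For point 3, a minimal fragment of the network section, assuming a hypothetical adapter named "Ethernet 2":

```yaml
network:
  port: 24267
  interface: "Ethernet 2"   # quotes needed because of the space
```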
              Last edited by jskt; 20-09-2018, 01:17 PM.



              • #8
                Originally posted by jskt View Post
                Thanks for the detailed reply. It's now working.

                Can you verify that both the primary and fallback master nodes should be set to Auto-detect?
                Auto-detect is recommended as long as there are no network issues - for example, forbidden UDP multicasts. If there are, you can set the master nodes to be masters to one another. The most important thing is NOT to assign a node as a master to itself.

                Originally posted by jskt View Post
                Also, swarm was not writing logs to any of the places you listed - it was running as LocalSystem and the folder you mentioned did not exist, and when I changed it to a domain user with admin rights, the folder was not created in their appdata.
                Usually a restart of the service fixes this. The log starts being written when a given instance gets a build (the first time a render job is submitted).

                Originally posted by jskt View Post
                1. The formatting of the config file is extremely picky - each entry must have the proper number of leading and trailing spaces to create indentation as shown in Ivan's example.
                Unfortunately, .yml files are that sensitive.

                Originally posted by jskt View Post
                2. When you open the file in Notepad it will appear as one long line - use Notepad++ to see line breaks
                WordPad might work too, but Notepad++ is much nicer to work with.

                Originally posted by jskt View Post
                3. If your interface name has spaces in it, you need double quotes
                That's a nice addition.

                Originally posted by jskt View Post
                4. The CMD command to get the correct interface name in a nice copy-pasteable format is: getmac /FO csv /V
                This PowerShell command does a nice job too:
                Code:
                Get-NetAdapter

                As a whole, thanks for initiating the discussion. I'm thinking of wrapping this up and placing it in the docs; hopefully there will be enough time next week.
                Ivan Slavchev

                SysOps

                Chaos Group



                • #9
                  Originally posted by ivan.slavchev View Post
                  Usually a restart of the service fixes this. The log starts being written when a given instance gets a build (the first time a render job is submitted).
                  In the case where you want a machine to be a non-rendering Swarm Master (for example if it's your Deadline repository machine, your license server, your PDC, your Citrix server, etc.), this means logs will never be generated.

                  More broadly, if network issues prevent renders from starting, it sounds like logs won't ever be generated.

                  Can one add an argument to the service startup to make it start logging immediately?



                  • #10
                    In case the Master node doesn't render at all, you can check "C:\Windows\System32\config\systemprofile\AppData\Roaming\Chaos Group\vray-swarm\work\vray-swarm\service-controller\service-controller.log".
                    It logs only the service events, not the V-Ray instance's behavior.
                    Ivan Slavchev

                    SysOps

                    Chaos Group

