Yet another DR workaround?

  • Yet another DR workaround?

    So I have been through about a million threads and half a million hours of struggling - and it just seems there is no clean and easy solution to some of the DR rendering issues.

    What I'm finding is that running the spawner manually works great, while running it as a service does not (at least not predictably), and when submitting multiple renders through Backburner, nodes occasionally stop responding. Manually quitting and restarting vraySpawner90.exe works fine to get them back in action.

    I'm now trying to find a software solution that would check for low processor activity and then quit and restart the spawner (effectively: kill the 3dsmax.exe and vrayspawner90.exe processes, then re-launch vrayspawner90.exe).

    I think this would solve most, if not all, of the DR networking issues I have run into and read about. Is anyone aware of a utility that could do that? I've done some googling but can't find anything that fits the bill; maybe there is something custom or more esoteric out there?
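
    Roughly what I'm picturing is a little watchdog like the sketch below (untested, just to illustrate the idea; it needs the third-party psutil module installed, and the spawner path and thresholds are guesses you'd have to adjust for your setup):

    Code:
    # Watchdog sketch: if the box sits idle too long, kill max + spawner and relaunch the spawner.
    import subprocess
    import time

    import psutil

    SPAWNER_PATH = r"C:\Program Files\Chaos Group\V-Ray\vrayspawner90.exe"  # guess - adjust to your install
    TARGETS = {"3dsmax.exe", "vrayspawner90.exe"}
    IDLE_CPU = 5.0       # percent CPU we treat as "doing nothing"
    IDLE_MINUTES = 10    # how long it must stay that low before we restart

    idle_since = None
    while True:
        cpu = psutil.cpu_percent(interval=60)   # average CPU over the last minute
        if cpu < IDLE_CPU:
            idle_since = idle_since or time.time()
            if time.time() - idle_since > IDLE_MINUTES * 60:
                # Kill 3dsmax.exe and vrayspawner90.exe, then bring the spawner back up.
                for proc in psutil.process_iter(["name"]):
                    if (proc.info["name"] or "").lower() in TARGETS:
                        try:
                            proc.kill()
                        except psutil.Error:
                            pass
                time.sleep(5)
                subprocess.Popen([SPAWNER_PATH])
                idle_since = None
        else:
            idle_since = None

    The obvious catch is that it can't tell "hung mid-render" apart from "legitimately idle", so it would also bounce the spawner on nodes that simply have no work at the moment - probably fine for my case, but worth knowing.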

    Thanks in advance,

    b
    Brett Simms

    www.heavyartillery.com
    e: brett@heavyartillery.com

  • #2
    We have to find the root problem with DR!

    I think it's most important to pinpoint why nodes are dropping. What is the reason for this problem? That has to be done by the coders, and I would greatly appreciate it!

    All this workarounding only deals with the symptoms; we have to find the cause! Any ideas at ChaosGroup?

    I have found no workaround that makes DR work properly. My current workaround is to run the spawner as a service and have the nodes restart automatically after the rendering stops. That ensures that a clean new instance of max is running with the spawner, but it doesn't prevent nodes from dropping. And in my experience: the harder the render job, the more nodes drop. That's fatal: with simple scenes I don't need DR, but with heavy jobs, where DR is needed, nodes drop.
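
    For reference, the restart part of that workaround boils down to something like this (just a sketch, not what I actually run; the service name is a guess and depends on how the spawner service was registered on your machines):

    Code:
    # Stop and restart the V-Ray spawner service so a fresh max instance comes up.
    import subprocess
    import time

    SERVICE_NAME = "VRaySpawner 2009"   # guess - check the real name in services.msc

    def restart_spawner_service():
        subprocess.run(["net", "stop", SERVICE_NAME], check=False)   # tears down the spawner and its max instance
        time.sleep(10)                                               # give it a moment to shut down cleanly
        subprocess.run(["net", "start", SERVICE_NAME], check=False)  # fresh spawner/max pair

    if __name__ == "__main__":
        restart_spawner_service()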

    What I'm interested in is: Vlado, do you have a DR environment that works properly? If yes, how did you set it up? What is your network configuration?

    I think it's a network-related problem. It seems that at some point the master loses contact with some of the nodes. Maybe the number of nodes involved also has something to do with it; it feels like the more nodes are involved, the higher the risk of losing them all. Maybe the whole problem comes from one master having to communicate with many nodes simultaneously.

    Maybe this is related: I also have problems with BB where the slaves sometimes lose contact with the manager: "Assuming manager is down. Application is terminating" (the error message from the BB server is something like that). The weaker the machine and the heavier the rendering, the higher the risk of this error. I have two types of render slaves: some old P4 single-core 2.8 GHz machines and some Core 2 Duo 2.0 GHz machines. If a frame has to render for several hours, the old P4 machines always fail with this error.

    Given the importance of DR, I beg you to solve these problems. Especially for hard render jobs at very high resolution, it's a deadline killer if one machine has to render for days when the job could be done overnight with DR.

    So please, do your very best at ChaosGroup! Thank you.

    Sascha



    • #3
      I agree with Sascha: DR is hugely important, and once you get beyond single renders launched directly, it runs into a ton of problems. Is there anything to be done at the Chaos Group end of things?

      b
      Brett Simms

      www.heavyartillery.com
      e: brett@heavyartillery.com



      • #4
        We have to find the root problem with DR!

        Originally posted by Sascha Selent
        I think it's most important to pinpoint why nodes are dropping. [...] So please, do your very best at ChaosGroup!

        Vlado, what do you say?

        Sascha

