Many things can go wrong when deploying a large cluster of nodes. Here are some tips and tricks for deploying clusters of 20 or more nodes when some of those nodes may cause you problems.
LEASE MORE NODES THAN YOU NEED
We never thought we would advocate this, but in a bare metal installation, where nodes are subject to a lot of wear and tear, node failures happen more often. The more nodes you ask for, the greater the probability that some of them will fail, so in this case asking for a few more can be justified. The advantage is that Chameleon can draw on every node in your lease when launching instances, so if one or two nodes fail you may still end up with the number you requested. For instance, say you leased 32 nodes in order to launch a 30 node cluster. If one of your nodes is faulty, Chameleon will skip over that node and attempt to launch the instance on one of the remaining nodes. If you had leased only 30 nodes, your only alternatives would be to settle for a smaller cluster or to file a ticket to have one of our team members replace the faulty node in your lease. However, please be cognizant that you are sharing resources with other users: do not lease unreasonably more nodes than you need to ensure you can launch your full cluster.
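To get a feel for how much a couple of spare nodes help, here is a rough back-of-the-envelope sketch. It assumes every node fails independently with the same probability, which is a simplification, and the 2% failure rate is a made-up number for illustration rather than a measured Chameleon statistic. The function name `prob_enough_nodes` is ours.

```python
from math import comb


def prob_enough_nodes(leased: int, needed: int, p_fail: float) -> float:
    """Probability that at least `needed` of `leased` nodes launch successfully.

    Assumes each node fails independently with probability `p_fail`
    (a simplifying assumption, not a property of any real testbed).
    """
    p_ok = 1.0 - p_fail
    # Sum the binomial probabilities of exactly k healthy nodes, k = needed..leased.
    return sum(
        comb(leased, k) * p_ok**k * p_fail ** (leased - k)
        for k in range(needed, leased + 1)
    )


if __name__ == "__main__":
    p = 0.02  # hypothetical per-node failure probability
    for leased in (30, 31, 32):
        chance = prob_enough_nodes(leased, needed=30, p_fail=p)
        print(f"lease {leased} nodes -> P(at least 30 healthy) = {chance:.3f}")
```

Under these assumptions the chance of getting a full 30 node cluster rises quickly with the first one or two spare nodes and then levels off, which is why a small surplus is usually enough.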
STAGGER YOUR DEPLOYMENTS
You don’t have to launch a cluster all in one go. So long as every instance is on the same network (e.g. sharednet1), your nodes can communicate with each other over their private network, so don’t give them all floating IP addresses! Instead of launching one 30 node cluster, you can launch three 10 node clusters and modify your control node accordingly. This both reduces the network congestion that is a frequent cause of failures and allows you to debug your deployment bit by bit as needed.
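As a minimal sketch of the staggering idea, the helper below splits a cluster into batches and launches them one at a time, stopping at the first batch that fails so you can debug it before continuing. The `launch_batch` callback is hypothetical: it stands in for whatever you actually use to launch instances (the GUI, the CLI, or a script) and is not a Chameleon API.

```python
from typing import Callable, List


def plan_batches(total_nodes: int, batch_size: int) -> List[int]:
    """Split `total_nodes` into launch batches of at most `batch_size` nodes."""
    if total_nodes < 0 or batch_size <= 0:
        raise ValueError("total_nodes must be >= 0 and batch_size must be > 0")
    full, rest = divmod(total_nodes, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def launch_staggered(
    total_nodes: int,
    batch_size: int,
    launch_batch: Callable[[int], bool],
) -> int:
    """Launch batches sequentially and return how many nodes came up.

    `launch_batch(n)` is a hypothetical callback that launches n instances
    and returns True on success. We stop at the first failing batch so it
    can be inspected before the rest of the cluster is attempted.
    """
    launched = 0
    for size in plan_batches(total_nodes, batch_size):
        if not launch_batch(size):
            break
        launched += size
    return launched


if __name__ == "__main__":
    # Dummy callback that always succeeds, just to show the call pattern.
    print(plan_batches(30, 10))
    print(launch_staggered(30, 10, lambda n: True))
```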
Chameleon’s Orchestration service (Heat) allows you to define and launch all the resources you need for your experiment in one HOT template. A Heat Orchestration Template (HOT) is a YAML file that allows you to describe your entire experiment, from something as simple as launching multiple nodes to something as involved as a complex network configuration, and then launch it with one click. Rather than launching multiple nodes and reconfiguring them each time your 7-day lease ends, you can simply upload your template by clicking the `+Launch Stack` button in the Stacks section of Chameleon’s Orchestration service in the GUI, and the experiment will be launched automatically.
The orchestrator can also be used to take advantage of the tips mentioned above. If your reservation is larger than the cluster defined in your HOT template, the orchestrator will retry instances on alternative nodes, or you can stagger your cluster across multiple stacks. YAML files can be frustrating to work with and difficult to debug, but defining your experiment configuration so that it launches in one click can save you considerable time in the long run and make your experiments more easily reproducible.
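To make the shape of such a template concrete, here is a hedged sketch that builds a minimal HOT structure as a Python dictionary. `OS::Heat::ResourceGroup` and `OS::Nova::Server` are standard Heat resource types, but the `baremetal` flavor name, the `reservation` scheduler hint, and the parameter names are assumptions on our part, so check them against the Chameleon documentation before relying on them. The output is printed as JSON only to keep the sketch dependency-free; to get the usual YAML file, convert it with a YAML library of your choice.

```python
import json


def build_cluster_template(node_count: int) -> dict:
    """Return a minimal HOT-style template describing `node_count` identical nodes.

    Flavor name, scheduler hint layout, and parameter names are assumptions
    made for illustration; verify them against your site's documentation.
    """
    server = {
        "type": "OS::Nova::Server",
        "properties": {
            "flavor": "baremetal",  # assumed flavor name
            "image": {"get_param": "image"},
            "key_name": {"get_param": "key_name"},
            # Private network only; no floating IP per node.
            "networks": [{"network": "sharednet1"}],
            # Assumed hint that ties instances to your lease.
            "scheduler_hints": {"reservation": {"get_param": "reservation_id"}},
        },
    }
    return {
        "heat_template_version": "2015-10-15",
        "description": f"{node_count} node cluster (illustrative sketch)",
        "parameters": {
            "reservation_id": {"type": "string"},
            "key_name": {"type": "string"},
            "image": {"type": "string"},
        },
        "resources": {
            "nodes": {
                "type": "OS::Heat::ResourceGroup",
                "properties": {"count": node_count, "resource_def": server},
            }
        },
    }


if __name__ == "__main__":
    # Staggering across stacks: three 10 node templates instead of one 30 node one.
    stacks = [build_cluster_template(10) for _ in range(3)]
    print(json.dumps(stacks[0], indent=2))
```

Keeping `count` smaller than the number of nodes in your reservation leaves spare nodes in the lease, and generating several small templates is one way to spread a large cluster across multiple stacks.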
If all else fails, we are here for you. Submit a ticket via the Help Desk and we will do our best to get your experiment on its way quickly.