Balance guests across a PVE cluster

Prerequisites

  1. You have an Ansible controller set up.
  2. All PVE nodes are set up in inventory with ansible_connection=local specified. (We'll be using the controller to carry out the API calls.)
  3. You know how to create PVE users, set permissions, and generate API tokens.

The Why?

Sysadmins are lazy creatures. We're also overworked creatures with a backlog of tasks longer than a CVS receipt. Hence we're notoriously terrible about maintaining static DNS entries and staying on top of documentation. As we shoulder more and more responsibilities like a proverbial pack-mule, automation becomes increasingly crucial. In my case, I've been working on playbooks for deploying guests on my HA-5 PVE cluster. When it came to balancing guests across the nodes, I wanted something a little "smarter" than simple round-robin placement. So in this post I'm going to show you how I designed a placement engine that takes resource availability into account.

What exactly is a placement engine?

Say you're creating a new VPS on an IaaS platform like DigitalOcean. When you provision a new droplet, DigitalOcean’s backend needs to decide which physical server will host your instance. The system responsible for making that decision is called the placement engine.

The placement engine first filters out any servers that don't have enough capacity to host your droplet, given its size and the servers' current occupancy. Then, it examines utilization attributes of the remaining servers, such as current memory, CPU, network, and I/O usage. For each attribute, the placement engine applies a corresponding weight that reflects its relative importance.
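As a sketch, that filtering step could be a single Ansible task. Note that servers, free_mem_gb, and requested_mem_gb are hypothetical names for illustration only, not part of the playbooks later in this post:

```yaml
# Hypothetical capacity pre-filter: drop servers that can't fit the request.
- name: Keep only servers with enough free memory
  set_fact:
    eligible_servers: >-
      {{ servers | dict2items
         | selectattr('value.free_mem_gb', 'ge', requested_mem_gb)
         | items2dict }}
```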

For example, memory usage might have a weight of 0.8. If a server is using 82 GB of memory, the placement engine multiplies 82 by 0.8 to get a weighted value. It performs this calculation for every other attribute, then sums the weighted values to produce an overall score for each server. The server with the lowest score is ultimately selected to host the new instance.

A weighted scoring system allows for greater granularity when load-balancing guests. In the case of my lab, memory and then CPU are my biggest constraints. I want the placement engine to pick the node with the lowest memory usage, but only if that node's CPU utilization isn't already high. I can express this logic quite simply using weights.
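To make the arithmetic concrete, here's a toy standalone playbook. The weights, the server names srv1 and srv2, and their stats are all made up for illustration; the real weights and inputs come later in this post:

```yaml
# Toy illustration of weighted scoring with hypothetical numbers.
- name: Score two hypothetical servers
  hosts: localhost
  gather_facts: false
  vars:
    mem_w: 0.8        # weight for memory usage (GB used)
    cpu_w: 0.2        # weight for CPU load
    servers:
      srv1: { mem_used_gb: 82, cpu_load: 1.5 }
      srv2: { mem_used_gb: 40, cpu_load: 3.0 }
  tasks:
    - name: Show each server's weighted score (lower = better candidate)
      debug:
        msg: "{{ item.key }}: {{ (mem_w * item.value.mem_used_gb) + (cpu_w * item.value.cpu_load) }}"
      loop: "{{ servers | dict2items }}"
```

Here srv2 ends up with the lower score despite its higher CPU load, because its much lighter memory usage dominates under these weights.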

The playbooks

Getting the stats:

To get started, we first need to get the current stats for each of the nodes in the cluster. We'll take advantage of Proxmox's RESTful API to do this.

If you haven't already done so, create an Ansible user on your Proxmox cluster. You'll also want to generate an API token for that user. Official PVE documentation covering this can be found here: https://pve.proxmox.com/pve-docs/pveum-plain.html

- name: Pull cluster utilization data
  hosts: localhost
  gather_facts: false
  vars:
    pve_host: node1
    pve_port: 8006
    pve_user: "ansible@pve"
    verify_ssl: false
    avail_node_mem: {}
    node_cpu_avg: {}

  vars_files:
    - pve_api_token.yaml

  tasks:
    - name: Connect to PVE API
      uri:
        url: "https://{{ item }}:{{ pve_port }}/api2/json/nodes/{{ item }}/status"
        method: GET
        headers:
          Authorization: "PVEAPIToken={{ pve_user }}!{{ pve_token_id }}={{ pve_token_value }}"
        validate_certs: "{{ verify_ssl }}"
        return_content: true
      loop: "{{ groups['primary-cluster'] }}"
      register: pve_node_status

    - name: Set cluster usage facts
      no_log: true
      set_fact:
        avail_node_mem: "{{ avail_node_mem | combine({ item.item: item.json.data.memory.available }) }}"
        node_cpu_avg: "{{ node_cpu_avg | combine({ item.item: item.json.data.loadavg}) }}"

      loop: "{{ pve_node_status.results }}"

What's happening:
  1. The playbook sends a GET request to each node's status endpoint on the PVE API.
  2. The responses are stored in pve_node_status as JSON.
  3. We loop through that JSON and extract each node's load averages and currently available memory.
  4. The extracted data is pushed into dictionaries keyed by node name.
  5. Note: The secure token for API access is kept in an Ansible vault.

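For reference, here's the general shape of the pve_api_token.yaml vars file loaded above. The values shown are placeholders; the real file should be encrypted with ansible-vault, and the variable names match the Authorization header in the playbook:

```yaml
# pve_api_token.yaml -- encrypt with: ansible-vault encrypt pve_api_token.yaml
pve_token_id: "ansible-token"                             # placeholder token ID
pve_token_value: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"   # placeholder secret
```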
Finding the laziest node

And finally the placement playbook:

- name: Get stats
  import_playbook: get-cluster-stats.yaml

- name: Find least busy cluster node
  hosts: localhost
  gather_facts: false
  vars:
    cpu_w: 0.7
    mem_w: 0.3
    node_scores: {}

  tasks:
    # loadavg[2] is the 15-minute average. Free memory is converted to GiB and
    # subtracted, so that more headroom lowers (improves) a node's score.
    - name: Calculate weighted node scores
      set_fact:
        node_scores: >-
          {{
            node_scores | combine({
              item.key: (
                (cpu_w * (node_cpu_avg[item.key][2] | float)) -
                (mem_w * ((avail_node_mem[item.key] | float) / 1073741824))
              )
            })
          }}
      loop: "{{ node_cpu_avg | dict2items }}"

    - name: Show weighted node scores
      debug:
        var: node_scores

    - name: Set lazy node
      set_fact:
        best_node: "{{ (node_scores | dict2items | sort(attribute='value') | first).key }}"

What's happening:
  1. Get stats imports the previous playbook to gather the stats from our cluster nodes.

  2. Find least busy cluster node combines the weighted CPU and memory values into a single score for each node. (Note: Jinja2 is not my strongest area, so if mine could be a little cleaner, a little slack would be much appreciated.)

  3. Set lazy node sorts the dictionary from low to high by score, returning the node with the lowest score (the least busy node).
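From there, best_node can feed straight into a provisioning task. Below is one sketch using the community.general.proxmox_kvm module, assuming the community.general collection is installed; the VM settings are placeholders, and the pve_* variables are the same ones defined earlier:

```yaml
# Sketch: hand the winner to a provisioning task (VM settings are placeholders).
- name: Create a guest on the least busy node
  community.general.proxmox_kvm:
    api_host: "{{ pve_host }}"
    api_user: "{{ pve_user }}"
    api_token_id: "{{ pve_token_id }}"
    api_token_secret: "{{ pve_token_value }}"
    node: "{{ best_node }}"   # winner from the placement playbook
    name: test-vm
    memory: 2048
    cores: 2
```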

Wrap-up

This playbook can easily be extended to include other evaluation parameters, such as I/O load or storage capacity. In the future I may also modify it to take guest configuration into account. In the next Ansible post, I'll be covering SIEM response automation by building a SOAR system.
