How to scale self-hosted agent pools up and down in Azure DevOps

Recently, we needed to prepare for a migration from self-hosted GitLab in a Kubernetes cluster to Azure DevOps (SaaS). I took the lead in setting up the environment (self-hosted agent pools), as it was new to me and I wanted to explore Azure DevOps.

I followed Microsoft's official documentation to set it up, and ran into some minor errors in it. I also made some improvements to the startup time of the self-hosted agents. I won't cover either in this post, but if you want the details, feel free to get in touch with me.

One of the very first questions after setting up an agent deployment as shown in the Microsoft documentation was how to scale the number of replicas (self-hosted agents) up and down based on demand. As you might already know, Kubernetes has the HPA (Horizontal Pod Autoscaler), so we need custom metrics to drive the scaling of the self-hosted agents. The difficulty is that I couldn't find any official Azure DevOps documentation on how to tell when the agents need to be scaled up or down.

After reading something on the Internet (I can't recall exactly where) and checking it myself, I realized that we need to scale up when a job request doesn't have an assignTime (the job request is still waiting for an agent). Here is the request that gets information about job requests from Azure DevOps:

curl -u $USER_NAME:$AZP_TOKEN https://dev.azure.com/$YOUR_ORGANIZATION/_apis/distributedtask/pools/$AGENT_POOL_ID/jobrequests

So we can easily calculate the number of jobs that are waiting for an agent with the following query:

curl -s -u $USER_NAME:$AZP_TOKEN https://dev.azure.com/$YOUR_ORGANIZATION/_apis/distributedtask/pools/$AGENT_POOL_ID/jobrequests | jq '[.value[] | select(.assignTime == null or .assignTime == "")] | length'
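If you would rather compute the number in a script than in a shell pipeline, the same filter can be sketched in Python. This is only a minimal sketch: the function name `count_waiting` and the sample payload are made up for illustration, and the only things taken from the query above are the `value` array and the `assignTime` field.

```python
import json


def count_waiting(payload: dict) -> int:
    """Count job requests whose assignTime is missing, null, or empty."""
    return sum(1 for job in payload.get("value", []) if not job.get("assignTime"))


# Hand-written sample shaped like the jobrequests response (illustrative only).
sample = json.loads("""
{
  "value": [
    {"requestId": 1, "assignTime": "2024-01-01T10:00:00Z"},
    {"requestId": 2},
    {"requestId": 3, "assignTime": ""}
  ]
}
""")

print(count_waiting(sample))
```

In real use, `payload` would be the parsed JSON body returned by the curl request above.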

If we send this metric to Prometheus, we can easily scale the agents up based on it through the HPA.
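As a rough idea of what "sending the metric" can look like, the sketch below renders the waiting-job count in the Prometheus text exposition format using only the standard library. The metric name `azp_waiting_jobs` and the `pool_id` label are assumptions chosen for this example, not names from Azure DevOps or from my setup, and label values are assumed to need no escaping.

```python
def render_gauge(name: str, help_text: str, value: int, labels: dict) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    # Labels are sorted so the output is stable; values are assumed to be
    # plain strings without quotes or backslashes.
    label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    selector = f"{{{label_str}}}" if label_str else ""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{selector} {value}\n"
    )


# Hypothetical metric name and label, for illustration only.
text = render_gauge(
    "azp_waiting_jobs",
    "Job requests without assignTime",
    3,
    {"pool_id": "12"},
)
print(text, end="")
```

A small HTTP endpoint that returns this text can be scraped by Prometheus; how the metric then reaches the HPA depends on the metrics adapter you use.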

After scaling up, the question becomes when we can scale the agents back down to save cost. There is no official documentation for that either. However, if a job request doesn't contain the field finishTime, it means the job is still running (provided the job request already has an assignTime, because a job request without assignTime won't have finishTime either).

(You can calculate the number of running jobs by taking the jobs without finishTime and subtracting the jobs without assignTime, or by adding another condition to the JSON select filter.) However, if we let Prometheus-based autoscaling scale down, we need to wait until all running jobs have completed, because we cannot tell it to terminate only the idle replicas.
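Both ways of counting running jobs can be sketched as follows. The function names and the sample list are hypothetical; the subtraction variant only gives the right answer under the assumption stated above, namely that a job request without assignTime never has a finishTime.

```python
def count_running(jobs: list) -> int:
    """Running = has an assignTime but no finishTime yet."""
    return sum(
        1 for job in jobs if job.get("assignTime") and not job.get("finishTime")
    )


def count_running_by_subtraction(jobs: list) -> int:
    """Same number, computed as (jobs without finishTime) - (jobs without assignTime)."""
    unfinished = sum(1 for job in jobs if not job.get("finishTime"))
    waiting = sum(1 for job in jobs if not job.get("assignTime"))
    return unfinished - waiting


# Illustrative entries as they might appear in the "value" array.
jobs = [
    {"requestId": 1, "assignTime": "10:00", "finishTime": "10:05"},  # finished
    {"requestId": 2, "assignTime": "10:02"},                         # running
    {"requestId": 3},                                                # waiting
]

print(count_running(jobs), count_running_by_subtraction(jobs))
```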

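To overcome that, we scale down ourselves (without using Prometheus). We can find out which pods are running jobs by filtering the jobrequests response for requests that have an assignTime but no finishTime and reading which agent each of them is assigned to.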

Then we can easily find the pods of the agent deployment that do not appear in the result of the above query. Those pods are ready to be terminated.
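The selection of pods that can be terminated can be sketched like this. It rests on two assumptions that you should verify in your own environment: that each job request carries the agent it is assigned to under `reservedAgent.name`, and that the agent name equals the pod name (which is the case when the agent registers itself under its hostname). The function names and pod names are made up for illustration.

```python
def busy_agent_names(jobs: list) -> set:
    """Names of agents that currently run a job (assignTime set, no finishTime)."""
    names = set()
    for job in jobs:
        if job.get("assignTime") and not job.get("finishTime"):
            # Assumption: the assigned agent is reported as reservedAgent.name.
            agent = job.get("reservedAgent") or {}
            if agent.get("name"):
                names.add(agent["name"])
    return names


def idle_pods(pod_names: list, jobs: list) -> list:
    """Pods of the agent deployment that do not run any job right now."""
    busy = busy_agent_names(jobs)
    return sorted(pod for pod in pod_names if pod not in busy)


# Hypothetical pod names and job requests.
pods = ["azp-agent-abc", "azp-agent-def", "azp-agent-ghi"]
jobs = [
    {"assignTime": "10:02", "reservedAgent": {"name": "azp-agent-abc"}},
    {"assignTime": "09:00", "finishTime": "09:10",
     "reservedAgent": {"name": "azp-agent-def"}},
]

print(idle_pods(pods, jobs))
```

Keep in mind that a job can be assigned to a pod between this check and the moment the pod is deleted, so it is worth re-checking right before terminating a pod.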
