Devops on a Tiny Cluster
I was recently involved in an interesting project: deploying a full production and development environment on a very budget-constrained Kubernetes cluster, managed through GKE. This was a big departure from my usual work, where I have a nearly unlimited budget for my cluster. The issues I ran into, and the solutions for them, were actually the inspiration to start this blog, just so I could write this post.
The Cluster and the Requirements
The requirements were, for the most part, pretty standard. Frontend and backend Node.js deployments, MongoDB, and a machine learning model hosted on a Django API, all of which would be deployed in both a production and staging environment. On the same cluster, there would also be a Jenkins instance for CI/CD.
With the budget provided, I came up with the following configuration: 3 x g1-small machines (shared vCPU, 1.75 GB RAM), 1 x preemptible n1-standard-1 (1 x vCPU, 3.75 GB RAM). The preemptible machine is actually the most important part of the cluster, and hopefully you'll understand why through the rest of this blog post.
For those out of the loop, a preemptible node is one that is offered at a much cheaper price by Google, but at the cost of it being ephemeral. It has a maximum lifetime of 24 hours, and it can be destroyed and recreated at any point before that.
Problems and Solutions
The Machine Learning API
The ML API would use about 1 GB of RAM on its own, just idling. This usage would go up whenever it was queried, which led to it getting evicted very often. At one point, I didn't look at the cluster for an entire weekend, and came back to 4 evicted pods. Eventually it got spun up on the same node as the backend, and it prevented the backend from running, crippling the functionality of the website.
To fix this, I forced the Machine Learning model to run on the preemptible node. The model would simply analyze certain user content submitted to the website, and it was deemed acceptable that, during the several minutes when the node was preempted, any user content submitted could simply be given an "average" score. This way, the website remained functional, and user experience was not impacted in any noticeable way. While using a preemption policy would also work here, this gave the model a higher uptime.
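The "average" score fallback can be sketched roughly as follows. The real backend is Node.js, so this Python version only illustrates the logic. The endpoint URL, the request and response shape, and the value of the neutral score are all assumptions for illustration and do not come from the actual project.

```python
import json
import urllib.error
import urllib.request

# Assumed values for illustration only; the post does not name the endpoint or the score scale.
ML_API_URL = "http://ml-api.default.svc.cluster.local/score"  # hypothetical in-cluster Service
AVERAGE_SCORE = 0.5  # hypothetical neutral score


def score_content(text, url=ML_API_URL, timeout=2.0, opener=urllib.request.urlopen):
    """Return the model's score, or AVERAGE_SCORE if the ML API cannot answer."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            return float(json.loads(response.read())["score"])
    except (urllib.error.URLError, OSError, ValueError, KeyError, TypeError):
        # Node preempted, pod still starting, or a malformed reply:
        # degrade gracefully instead of failing the user's submission.
        return AVERAGE_SCORE
```

The short timeout matters as much as the fallback itself: while the node is being recreated, requests should fail fast rather than hold up the backend.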
Jenkins was the next problem that needed tackling. A requirement was that the Jenkins UI would have to remain accessible at all times, and every branch and PR would have to be built and tested. However, depending on which node the slaves would be spun up on, the process of npm install && npm build would often lead to evictions.
This was fixed by assigning the slaves to run on the preemptible node. Again, with 3.75 GB of RAM available, that node could easily handle the compilations required by NPM as well as the Machine Learning model. This does occasionally lead to some jobs failing due to a preemption happening part way through, but this is easily identifiable, and jobs can simply be restarted.
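One way to pin a pod to the preemptible node is a nodeSelector on the label that GKE sets on preemptible nodes, cloud.google.com/gke-preemptible. The sketch below builds the pod spec as a Python dict (kubectl accepts JSON as well as YAML). The container name, image, and memory request are placeholders, and the same selector would go into the pod template used for the Jenkins build pods.

```python
import copy
import json

# GKE sets this label on nodes in preemptible node pools.
PREEMPTIBLE_SELECTOR = {"cloud.google.com/gke-preemptible": "true"}


def pin_to_preemptible(pod_spec):
    """Return a copy of pod_spec that can only be scheduled on preemptible nodes."""
    pinned = copy.deepcopy(pod_spec)
    pinned.setdefault("nodeSelector", {}).update(PREEMPTIBLE_SELECTOR)
    return pinned


# Hypothetical pod spec for the ML API; name, image and request are placeholders.
ml_api_spec = {
    "containers": [
        {
            "name": "ml-api",
            "image": "gcr.io/example-project/ml-api:latest",
            "resources": {"requests": {"memory": "1Gi"}},
        }
    ]
}

pinned_spec = pin_to_preemptible(ml_api_spec)
print(json.dumps(pinned_spec, indent=2))
```

A nodeSelector is a hard constraint, so while the preemptible node is gone these pods simply stay Pending instead of landing on one of the small nodes.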
Even with this, I added some more optimization. I took the existing JNLP image, and added the Node.js binaries to it. This saved a bit of resources - instead of builds requiring two containers, they could now run in a single one.
The website is deployed as 4 Kubernetes deployments - the frontend, the backend, the database, and the machine learning model. We'll approach the issue with the ML model later; for now we'll focus on the front and backend deployments.
I initially approached these in the most sensible way, in my mind: first, an init container would git clone the repository into a mounted volume. The next init container, using the same volume, would npm install && npm build, and then the final container would npm start. This worked, but again, evictions. It turns out that with only 1.75 GB of RAM on each node, the process of building the application was too much.
This required a slightly more creative approach. With Jenkins working, though, I had a solution. First, I deployed a simple NFS server to the cluster, exporting a 10 GB volume. This volume would be mounted on Jenkins slaves. Then, I made a parametrized Jenkins job to build an "artifact". This job would run the usual steps - npm build, and npm test. However, after the final step, it would copy over the entire repository, with dist included, into the mounted NFS directory. Depending on the parameters passed, it would copy them into the directory for the matching environment, such as prod. The directory name given to each artifact was set to the current date/time, and a "latest" symlink would then be pointed at it. Some small optimizations, such as comparing the HEAD commit ref of the new build against that of the latest artifact, were also included, to prevent storing duplicates of the same build. If there were more than 5 artifacts present, the oldest would be deleted at this point.
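The publishing step of that job can be sketched like this. It is not the actual Jenkins job: the .commit marker file, the timestamp format, and the function names are my own assumptions for illustration. Only the timestamped directory, the "latest" symlink, the duplicate check, and the limit of 5 come from the description above.

```python
import os
import shutil
import subprocess
import time
from pathlib import Path

MAX_ARTIFACTS = 5


def head_commit(repo):
    """Return the HEAD commit hash of the Git checkout at `repo`."""
    out = subprocess.check_output(["git", "-C", str(repo), "rev-parse", "HEAD"], text=True)
    return out.strip()


def publish_artifact(workspace, env_dir, commit, stamp=None, max_artifacts=MAX_ARTIFACTS):
    """Copy a built workspace into env_dir/<timestamp> and repoint env_dir/latest.

    Returns the new artifact path, or None if `latest` was built from `commit`.
    """
    env_dir = Path(env_dir)
    env_dir.mkdir(parents=True, exist_ok=True)
    latest = env_dir / "latest"

    # Skip the copy if the latest artifact already comes from this commit.
    marker = latest / ".commit"  # hypothetical marker file written below
    if marker.is_file() and marker.read_text().strip() == commit:
        return None

    # Timestamped names sort chronologically, which the pruning step relies on.
    target = env_dir / (stamp or time.strftime("%Y%m%d-%H%M%S"))
    shutil.copytree(workspace, target, symlinks=True)
    (target / ".commit").write_text(commit + "\n")

    # Build the new symlink under a temporary name, then rename it over `latest`
    # so readers never observe a missing link.
    tmp_link = env_dir / ".latest.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(target.name)
    os.replace(tmp_link, latest)

    # Keep only the newest `max_artifacts` artifact directories.
    artifacts = sorted(p for p in env_dir.iterdir() if p.is_dir() and not p.is_symlink())
    for old in artifacts[:-max_artifacts]:
        shutil.rmtree(old)
    return target
```

In a job, this would be called as something like publish_artifact(workspace, "/mnt/artifacts/prod", head_commit(workspace)), where the mount path is again a placeholder.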
Next, the init containers were modified. Instead of pulling and building, the only init container would mount the NFS share and the application volume, copying the artifact to the application volume. The pod would then mount the application volume, and run a simple npm start. No more building on anything besides the preemptible node now.
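A pod spec along those lines might look like the sketch below, again built as a Python dict that can be dumped to JSON for kubectl. The images, volume names, NFS export path, and the use of an emptyDir for the application volume are assumptions for illustration; the post only fixes the overall shape (one init container that copies, one container that runs npm start).

```python
import json


def app_pod_spec(name, environment, nfs_server):
    """Build a pod spec whose init container copies the latest artifact into place."""
    return {
        "volumes": [
            # NFS share holding the artifacts (server address and export path are placeholders).
            {"name": "artifacts", "nfs": {"server": nfs_server, "path": "/", "readOnly": True}},
            # Scratch volume shared between the init container and the app container.
            {"name": "app", "emptyDir": {}},
        ],
        "initContainers": [
            {
                "name": "fetch-artifact",
                "image": "busybox:1.36",  # any small image with `cp` would do
                "command": ["sh", "-c", f"cp -a /artifacts/{environment}/latest/. /app/"],
                "volumeMounts": [
                    {"name": "artifacts", "mountPath": "/artifacts", "readOnly": True},
                    {"name": "app", "mountPath": "/app"},
                ],
            }
        ],
        "containers": [
            {
                "name": name,
                "image": "node:12-alpine",  # placeholder Node.js image
                "workingDir": "/app",
                "command": ["npm", "start"],
                "volumeMounts": [{"name": "app", "mountPath": "/app"}],
            }
        ],
    }


spec = app_pod_spec("backend", "prod", "nfs-server.default.svc.cluster.local")
print(json.dumps(spec, indent=2))
```

Copying from the "latest" symlink means a redeploy picks up the newest artifact without the manifest changing, at the cost of not recording which build a given pod is running.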
Pod Preemption Policy
Even with all this, particularly busy days would still lead to evictions. Luckily, Kubernetes includes pod priorities, which let the user define which pods are more important, and which are okay to evict. Once those were set up, the order was rather clear. Staging environments on the bottom, since those aren't customer facing. Jenkins next, since it is required to build, but still not customer facing. The frontend, backend, and database all share equal priority at the top, since they are all essential to the product running. Since this was implemented, no amount of traffic has led to the eviction of any critical pods.
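As a rough sketch, the three tiers map onto three PriorityClass objects. The class names and numeric values below are made up for illustration; only their relative order comes from the setup described above. Each pod then opts in by setting priorityClassName in its spec.

```python
import json

# Hypothetical names and values; higher value means more important.
PRIORITY_TIERS = [
    ("staging", 1000, "Staging workloads, not customer facing"),
    ("jenkins", 2000, "CI/CD, needed for builds but not customer facing"),
    ("production", 3000, "Frontend, backend and database"),
]


def priority_class(name, value, description):
    """Build a PriorityClass manifest as a plain dict."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


# A v1 List lets all three classes be applied from a single JSON file.
manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [priority_class(*tier) for tier in PRIORITY_TIERS],
}

print(json.dumps(manifest, indent=2))
```

The absolute numbers don't matter much, only the gaps between them, which leave room to slot in another tier later.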
So far, everything has been by the book - using generic images and configuring them through init containers for fast, easy deployments. I wanted to do the same for the machine learning model, but the issue was the size - about 1 GB for just the model.
One option would obviously have been hosting it on the NFS server, but I was trying to keep costs low, and I didn't want a larger hard drive. I also didn't want our model hosted on the internet, as it had to be kept internal to the Google Cloud project. Finally, I threw in the towel for this one. I didn't have any solution more elegant than just building an image that included the model and the server. I deemed this acceptable for this case and this case only, because this application would be updated much less frequently than the main website.
Monitoring
Just... Don't. I tried to set up some Stackdriver metrics, but that turned out to lead to even more evictions. The only monitoring in place is uptime checks on the frontend and backend endpoints.
Other Lessons Learned
Don't try and be fancy. Kubernetes offers a ton of great features, but they shouldn't take priority over a stable, working product. I messed around with daemon sets, a full cluster of only preemptible nodes (there will be a future post about that), and even at one point, a single-node cluster. I tried deploying ArgoCD, but it was too resource intensive. The best solution was really just using the standard best practices, but adapting them to fit the size and scope of the cluster.
Stress testing was also important. I went from seeing critical pods getting evicted left and right to the network giving out before the deployments, which is exactly what I wanted to see.
Next, I'm going to look at how to cut costs even more on this cluster. At our current scope, we have the two deployments, build server, and backups coming out to about $75/month. Stick around for that.