A solution I'm partially proud of.

Sometimes coming up with the proper solution to a problem doesn’t depend just on your ability to code or how witty you are. Sometimes a little bit of patience and some soft skills are needed. As you might have read in earlier posts, I’m part of an SRE team at a consulting company with its HQ in Argentina. I’m currently working for an e-commerce company based in the USA, and all of the site’s “Level 1” monitoring relies on us, or at least a very big part of it.

It was January 2018 when that hopeless ticket was opened. Its goal was clearly displayed in the title:

“Monitor disk space usage of new EC2 instances under the auto scaling group [Asg Name]”

For those who are not familiar with the AWS world and don’t know what an Auto Scaling group or an EC2 is, let me briefly explain. EC2s under an ASG are something like container clusters, but with “bare metal” machines. Basically, an EC2 instance is what AWS calls its servers. There are several types of them and you can purchase them in ways you wouldn’t imagine: per hour, per use, per instance, per color, shape and all. There’s even something like stock prices for these things. If you have several EC2 instances serving a single purpose, you might want to launch them all under an ASG (Auto Scaling group). This way, you ensure the right number of EC2s are running so that the app doesn’t get overloaded. Everything scales automatically if an EC2 fails or if more instances are needed, and if they are, you’d better prepare your wallet.

After some 13 months without being solved, there it was: the impossible ticket, waiting to be solved. On the other side, a hopeless young IT guy, managing systems he didn’t fully understand and working with people who had even less insight into the matter. As you would expect, an unsolved ticket that old began to make some noise among the suit n’ tie guys. So almost 1 year and 2 months later, at our weekly catch-up meeting, our manager took the floor and said:

– “OK, guys. You might have heard about MON-5379.” (Everybody obviously knew what he was talking about.)
– “It’s about to celebrate its first birthday,” he continued sarcastically. “So guess what? We are stopping all ongoing projects until we get this done. Now, who wants to take care of the ticket?” he asked.

Believe it or not, by that time, I hadn’t even read the ticket. But since everyone was looking sideways in order to avoid the situation, I thought it would be a nice opportunity to stand out. I raised my hand. Big mistake. Everybody looked at me with their best wtf faces while I said:

– “I think I can take care of it.” Oh boy, what was I getting into.
– “GREAT!” shouted my manager. End of the meeting.

Truth is, that ticket hadn’t been solved because the team didn’t have the required skills at the time. The ticket was fully based on cloud infrastructure, and since our client had recently migrated to the cloud, none of them were actually familiar with AWS back then. I was no expert either.
If you’ve paid attention, you might have realized what the problem was. How do we monitor anything on a machine that may die at any moment while another one is waiting to take its place? It’s 2019, this is an already addressed issue. We’ve already solved monitoring on clusters, haven’t we?

Yes, in a way. After some really intense research (literally the first post on Stack Overflow), I found that AWS already has a solution for that. Auto Scaling groups have this thing called ASG Webhooks (spoiler: they are regular webhooks) where you can define actions to be taken every time the thing scales. But since the Auto Scaling group was not in an account we had access to, adding webhooks was not an option. Trying to explain this to the client seemed pointless: they don’t hire you to discuss technical stuff, they just want to have the thing monitored in case it fails, or at least that was what I thought. By the way, if the ASG takes care of scaling the instances up and down, is it really necessary to add monitoring to it? Sadly, that was not a question to be asked in these circumstances. I already had a one-year-old ticket and I needed to get it done so as not to delay my other projects.

After a day or two of searching the web, I went to my team leader with the following draft of an approach to monitoring the instances across accounts. Since we are responsible for the monitoring infrastructure, our aim was to have all resources be almost fully independent of the client’s maintenance, or at least to deploy as few of them as possible on their account. Deploying the whole solution on our account was impossible, since the instances and the cluster were all on theirs. So I managed to devise the following solution, which required the client to set up just an event bus. (AWS event buses are great for these cases, but for the sake of making this post less boring I’ll have to ask you to Google every AWS service you don’t know.)

Rough draft I made of the cross account workaround
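
To give a rough idea of what travels over an event bus in a setup like this, here is a minimal Python sketch. It is not the code from the draft above: the account-free sample event, the ASG name and the `add_disk_monitor` / `remove_disk_monitor` action names are all made up for illustration, and the matcher only mimics the simplest exact-match behaviour of event patterns. The `source`, `detail-type` and `detail` fields follow the shape of the events EC2 Auto Scaling emits when an instance launches or terminates.

```python
# Hypothetical ASG name, used only for this illustration.
ASG_NAME = "example-asg"


def build_event_pattern(asg_name):
    """Event pattern selecting successful launches/terminations of one ASG."""
    return {
        "source": ["aws.autoscaling"],
        "detail-type": [
            "EC2 Instance Launch Successful",
            "EC2 Instance Terminate Successful",
        ],
        "detail": {"AutoScalingGroupName": [asg_name]},
    }


def matches(pattern, event):
    """Simplified matcher: only exact-value lists and nested dicts.

    Real event patterns support more operators (prefix, numeric, ...);
    this is just enough to reason about the filtering locally.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict):
                return False
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def handle(event):
    """Decide what the monitoring side should do for a scaling event.

    The returned action names are placeholders, not a real API.
    """
    instance_id = event["detail"]["EC2InstanceId"]
    if event["detail-type"] == "EC2 Instance Launch Successful":
        return ("add_disk_monitor", instance_id)
    return ("remove_disk_monitor", instance_id)


# A trimmed-down, made-up event in the shape Auto Scaling sends.
SAMPLE_EVENT = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance Launch Successful",
    "detail": {
        "AutoScalingGroupName": ASG_NAME,
        "EC2InstanceId": "i-0123456789abcdef0",
    },
}

if __name__ == "__main__":
    pattern = build_event_pattern(ASG_NAME)
    if matches(pattern, SAMPLE_EVENT):
        print(handle(SAMPLE_EVENT))
```

The point of filtering on the ASG name is that the monitoring side never needs to know instance IDs in advance: every launch adds a disk check for the new instance and every termination removes one.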

Since I was rather new to the team, I didn’t get to talk to the client that much, but what I had inferred from my co-workers and managers was that the client was an evil monster that didn’t want to collaborate or even work with us on solving anything. So I was asked to develop this solution with two of our own accounts first in order to convince “the evil monster”, otherwise they wouldn’t trust us. After a long week of automating this as much as possible so I wouldn’t have to replicate it by hand on their account, I got on a call with them to show them what we had. All of this just to find out they were super nice and knowledgeable guys who literally said:

– “There’s no way we are letting you guys do all of that work when you can just add the resources on our account. We have no problem with that, you could just have asked. I’m giving you permission so you can submit a pull request to our IaC template.”

I was two lines of code away from murdering someone. Luckily, I contained myself and wrote this blog post instead.

Finally, I submitted the PR to their CloudFormation (AWS IaC service) template and managed to solve the problem with half the resources and a third of the code.

Conclusion

There’s more than code and good practices in the IT world. Sharing ideas, being empathetic and keeping communication continuous and fluid are as important as writing clean code.

Author: Lucas Contreras
