Postmortem: Sitemetrics

Over the last month, I’ve been working side by side with my friend and coworker Lucas Contreras on a web application. In this post, I will try to highlight both the things that went great and the things that went not so great.

The problem

Everything we do in our daily lives arises from a problem we want to solve. This is no less true in the field of software development.

One day, I was finishing another project of my own when Lucas asked me if I wanted to be part of a brand-new monitoring project. It was the pinnacle of the SRE discipline: we were asked to monitor both the latency and the error budgets of a large-scale e-commerce site, and for that we had to develop, from scratch, a fully automated dashboard displaying all of that data. It aimed to replace an existing tool that, due to its lack of maintenance, had suddenly stopped working. We all know that things don’t just stop working for no reason, but in this case tracing the cause and fixing the tool would have been a waste of time. The thing is, the old tool had been written by a person who was no longer part of the team, and it was rather difficult to understand in terms of both code and design.

As if the situation wasn’t complicated enough already, we were told that this tool was used by very important people with high ranks, fine suits, and expensive shoes. People whose decisions can determine the future of our company.

As you might guess, it seemed kind of unlikely that the two (not even Junior, but “Shadowing”) members of the team would be put in charge of such a task. But as you might have experienced, seniority is (sometimes) something companies like to play with.

So… what do we do?

After a short meeting with the client, we got a bit (and just a bit) more knowledge of the whole thing. We were given three weeks to recreate the system that had been working before, but with added functionality.

We started discussing a plan that would allow us to get it done in time. The first idea was to use React for the frontend and a RESTful API for the backend, but we ditched it rather quickly: the time constraint didn’t allow us to use a technology we were inexperienced with.

In the end, we settled on using AdonisJS, a fully-fledged full-stack JavaScript web framework made for when you need to deliver software as fast as bread comes out of the oven. This full-stack monster comes packed with a handful of solutions that made our development much easier and let us focus on the design of the application. Things like authentication, connecting to the DB and integrating an ORM were not an issue, so the next step was to think about how we were going to get the data from the site (latency metrics and error budget). Luckily, the old tool already used Splunk for that.

Having these scheduled Splunk reports, which could easily be integrated with our dashboard over webhooks, all that was left was to build a friendly REST API where Splunk could POST all of this data.
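
For illustration, here is a minimal sketch of what that endpoint might look like in AdonisJS. The route, controller, model and field names are all made up, and the real payload depends on how the scheduled Splunk report webhook is configured:

```js
// start/routes.js — hypothetical webhook route for the scheduled Splunk reports
const Route = use('Route');

Route.post('webhooks/splunk', 'SplunkReportController.store');

// app/Controllers/Http/SplunkReportController.js — persists each report with Lucid
const Metric = use('App/Models/Metric'); // hypothetical Lucid model

class SplunkReportController {
  async store({ request, response }) {
    // Field names are illustrative; a scheduled Splunk alert POSTs a JSON payload here.
    const { service, latency_p99, error_budget } = request.only([
      'service',
      'latency_p99',
      'error_budget',
    ]);

    const metric = await Metric.create({ service, latency_p99, error_budget });

    return response.status(201).json(metric);
  }
}

module.exports = SplunkReportController;
```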

After some faulty merge requests and pushes directly to master, we managed to have the dashboard set up except for the login. The large-scale company we were developing this app for uses LDAP. Yes, LDAP for everything. Dealing with it is kind of a pain, but not by accident we had already abstracted that complexity away by building an API for this LDAP instance (you can read the post on building a microservice stack for more info). This way, something that could have taken us weeks was solved in a matter of days. Nevertheless, it added a dependency for nothing but the authentication, which meant that if the API went down for some reason we wouldn’t be able to access our app. We finally solved this by persisting the user login in the dashboard database and updating it (if needed) every time a user changes their LDAP password.
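
Roughly, the login flow ended up looking like the sketch below. This is not the exact code: `ldapApiLogin` and `isLdapApiDown` stand in for calls to our internal LDAP microservice, and the `User` model and its fields are made up.

```js
// Hypothetical sketch of the login flow: authenticate against the internal LDAP API,
// mirror the credentials locally, and fall back to the local copy if the API is down.
const Hash = use('Hash');            // Adonis password hashing provider
const User = use('App/Models/User'); // hypothetical Lucid model

async function login(username, password) {
  try {
    // 1) Normal path: validate the credentials against the LDAP API (placeholder helper).
    await ldapApiLogin(username, password); // throws if the credentials are wrong

    // 2) Mirror (or refresh) the user locally so the dashboard keeps working
    //    even if the LDAP API goes down later.
    const user = await User.findOrCreate({ username }, { username });
    user.password = await Hash.make(password);
    await user.save();
    return user;
  } catch (err) {
    if (isLdapApiDown(err)) {
      // 3) Fallback path: the LDAP API is unreachable, so check the local copy.
      const user = await User.findByOrFail('username', username);
      if (await Hash.verify(password, user.password)) return user;
    }
    throw err;
  }
}
```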

Conclusion

The most practical solution is usually the best one, especially when you have limited time (and that’s always). Adonis allowed us to quickly develop a working solution that met all the requirements we were given.

Remember to check out Lucas Contreras’ blog, where he talks about more techy things.


A solution I'm partially proud of

Sometimes coming up with the proper solution to a problem doesn’t depend just on your ability to code or how witty you are. Sometimes a little bit of patience and some soft skills are needed. As you might have read in earlier posts, I’m part of an SRE team at a consulting company headquartered in Argentina. I’m currently working for an e-commerce company based in the USA, and all of the site’s “Level 1” monitoring relies on us, or at least a very big part of it.

It was January 2018 when that hopeless ticket was opened. Its goal was clearly displayed in the title:

“Monitor disk space usage of new EC2 instances under the auto scaling group [Asg Name]”

For those who are not familiar with the AWS world and don’t know what an Auto Scaling Group or an EC2 instance is, let me briefly explain. EC2 instances under an ASG are something like a container cluster, but with “bare metal” machines. Basically, EC2 is what AWS calls its servers; there are several types of them and you can purchase them in ways you wouldn’t imagine: per hour, per use, per instance, per color, shape and all. There’s even something like stock prices for these things. If you have several EC2 instances serving a single purpose, you might want to launch them all under an ASG (Auto Scaling Group). This way, you ensure the right number of EC2 instances is running so the app doesn’t overload. Everything scales automatically if an instance fails or if more instances are needed, and if they are, you’d better prepare your wallet.
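
If you have never poked at one, here is a quick sketch (AWS SDK for JavaScript; the group name and region are made up) of listing the instances currently attached to an ASG:

```js
// List the EC2 instances currently attached to an Auto Scaling Group.
// "my-asg-name" is a placeholder; you need credentials with autoscaling:Describe* permissions.
const AWS = require('aws-sdk');

const autoscaling = new AWS.AutoScaling({ region: 'us-east-1' });

autoscaling
  .describeAutoScalingGroups({ AutoScalingGroupNames: ['my-asg-name'] })
  .promise()
  .then(({ AutoScalingGroups }) => {
    for (const group of AutoScalingGroups) {
      for (const instance of group.Instances) {
        console.log(instance.InstanceId, instance.LifecycleState, instance.HealthStatus);
      }
    }
  })
  .catch(console.error);
```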

Roughly 13 months without being solved, and there it was: the impossible ticket, waiting to be closed. On the other side, a hopeless young IT guy, managing systems he didn’t fully understand and working with people who had even less visibility into the matter. As you would expect, an unsolved ticket that old began to make some noise among the suit n’ tie guys. So, almost a year and two months later, at our weekly catch-up meeting, our manager took the floor and said:

– “Ok guys. You might have heard about MON-5379.” (Everybody obviously knew what he was talking about.)
– “It’s about to celebrate its first birthday,” he continued sarcastically. “So guess what? We are stopping all ongoing projects until we get this done. Now, who wants to take care of the ticket?” he asked.

Believe it or not, by that time I hadn’t even read the ticket. But since everyone was looking away trying to avoid the situation, I thought it would be a nice opportunity to stand out. I raised my hand. Big mistake. Everybody looked at me with their best WTF faces while I said:

– “I think I can take care of it.” (Oh boy, what was I getting into.)
– “GREAT!” shouted my manager. End of the meeting.

Truth is, that ticket hadn’t been solved because the team didn’t have the required skills at the moment. The ticket was fully based on cloud infrastructure, and since our client had only recently migrated to the cloud, none of the team was actually familiar with AWS at the time. I was no expert either.

If you’ve paid attention you might have realized what the problem was: how do we monitor anything on a machine that may die at any moment while another one is waiting to take its place? It’s 2019, this is an already-addressed issue, we’ve already solved monitoring on clusters… haven’t we?

Yes, in a way. After some really intense research (literally the first post on Stack Overflow), I found that AWS already has a solution for this. Auto Scaling Groups have this thing called lifecycle hooks and scaling notifications (spoiler: they end up behaving like regular webhooks) that let you define actions to be taken every time the group scales. But since the Auto Scaling Group was not in an account we had access to, adding those hooks was not an option. Trying to explain this to the client felt pointless; they don’t hire you to discuss technical stuff, they just want the thing monitored in case it fails, or at least that’s what I thought. By the way, if the ASG takes care of scaling their instances up and down, is it really necessary to add monitoring to it? Sadly, not a question to be asked under these circumstances. I already had a one-year-old ticket and I needed to get it done so as not to delay my other projects.
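
For reference, this is roughly what wiring the notification flavor up looks like with the AWS SDK for JavaScript. The group name and the SNS topic ARN are placeholders, and this is exactly the step we could not perform ourselves, because the ASG lived in the client’s account:

```js
// Minimal sketch: subscribe an SNS topic to an Auto Scaling Group's scaling events.
// Both the group name and the topic ARN are placeholders.
const AWS = require('aws-sdk');

const autoscaling = new AWS.AutoScaling({ region: 'us-east-1' });

autoscaling
  .putNotificationConfiguration({
    AutoScalingGroupName: 'client-asg-name',
    TopicARN: 'arn:aws:sns:us-east-1:123456789012:asg-scaling-events',
    NotificationTypes: [
      'autoscaling:EC2_INSTANCE_LAUNCH',
      'autoscaling:EC2_INSTANCE_TERMINATE',
    ],
  })
  .promise()
  .then(() => console.log('Scaling notifications enabled'))
  .catch(console.error);
```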

After a day or two of searching the web, I went to my team leader with the following draft of an approach for monitoring the instances cross-account. Since we are responsible for the monitoring infrastructure, our aim was to keep all resources almost fully independent of their maintenance, or at least to deploy as few of them as possible in their account. Deploying the whole solution in our account was impossible, since the instances and the cluster were all in theirs. So I managed to devise the following solution, which required the client to set up just an event bus. (AWS event buses are great for these cases, but for the sake of making this post less boring I’ll have to ask you to Google every AWS service you don’t know.)

Rough draft I made of the cross account workaround
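
The draft boiled down to having the client forward Auto Scaling events from their account to an event bus we owned. Below is a rough sketch of one possible wiring with the AWS SDK for JavaScript; the account IDs, names and region are placeholders, and the real deployment involved more resources than this:

```js
// Rough sketch of the cross-account wiring, not the exact resources we deployed.
const AWS = require('aws-sdk');
const events = new AWS.CloudWatchEvents({ region: 'us-east-1' });

// 1) In OUR (monitoring) account: allow the client's account to put events on our default bus.
async function allowClientAccount() {
  await events.putPermission({
    Action: 'events:PutEvents',
    Principal: '111111111111', // client's AWS account id (placeholder)
    StatementId: 'AllowClientAsgEvents',
  }).promise();
}

// 2) In THEIR account: forward every Auto Scaling event to our event bus.
//    (Depending on how permissions are granted, a RoleArn on the target may also be needed.)
async function forwardAsgEvents() {
  await events.putRule({
    Name: 'forward-asg-events',
    EventPattern: JSON.stringify({ source: ['aws.autoscaling'] }),
  }).promise();

  await events.putTargets({
    Rule: 'forward-asg-events',
    Targets: [{
      Id: 'monitoring-account-bus',
      Arn: 'arn:aws:events:us-east-1:222222222222:event-bus/default', // our bus (placeholder)
    }],
  }).promise();
}
```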

Since I was rather new on the team I didn’t get to talk to the client that much, but what I had inferred from my co-workers and managers was that the client was an evil monster that didn’t want to collaborate or even work with us on solving anything. So I was asked to develop this solution with two of our own accounts first, in order to convince “the evil monster”, because otherwise they wouldn’t trust us. After a long week dealing with automating this as much as possible, so I wouldn’t have to replicate it by hand in their account, I got on a call with them to show what we had. All of this just to find out they were super nice and knowledgeable guys who literally said:

– “There’s no way we are letting you guys do all that work when you can just add the resources in our account. We have no problem with that, you could just have asked. I’m giving you permission so you can submit a pull request to our IaC template.”

I was two lines of code away from murdering someone. Luckily, I contained myself and wrote this blog post instead.

Finally, I submitted the PR to their CloudFormation template (AWS’s IaC service) and managed to solve the problem with half the resources and a third of the code.

Conclusion

There’s more to the IT world than code and good practices. Sharing ideas, being empathetic and keeping up continuous, fluid communication are as important as writing clean code.


The Shhlack-bot

If you haven’t read my other post on microservices… don’t worry, it’s not necessary for understanding this one, but it will make more sense if you have.

Back in the days when the #Microservices&APIs revolution was a trending topic in my team, we thought that some critical alerts could be sent via Slack. We all agreed that an API-Chat would be needed if we wanted to do things right. Luckily, we had two trainees whose skills needed to be tested, so we put them in charge of developing a RESTful API-Chat able to abstract away the various ways of sending messages to different chat platforms. Since Slack was the main chat platform, we decided to go with that integration first.

Once the API was finished, we no longer needed to deal with Slack apps, attaching Slack bots to channels or setting up tedious webhooks. We became completely independent of modules like python-slackclient or node-slack-sdk. Everyone in the office could send a message through Slack with a single curl.
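
Something roughly like this; the host, path and payload fields are invented for illustration, and the point is simply that the caller never touches a Slack SDK:

```js
// Minimal sketch of what a call to the internal API-Chat might look like.
// The URL and the payload fields are made up; the service abstracts the platform away.
const https = require('https');

const payload = JSON.stringify({
  platform: 'slack',        // hypothetical field: which chat backend to use
  channel: '#office-noise', // hypothetical field
  message: 'Shhh! Meeting in progress, please keep it down.',
});

const req = https.request(
  {
    hostname: 'api-chat.internal.example.com', // placeholder host
    path: '/v1/messages',
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(payload),
    },
  },
  (res) => console.log('API-Chat responded with', res.statusCode)
);

req.on('error', console.error);
req.write(payload);
req.end();
```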

At the time, the office was getting more and more crowded as new people were hired. We were running out of quiet places, and every time someone had an important meeting we hopelessly asked for silence.

It was time to demonstrate how powerful this kind of abstraction could be, so my friend and I decided to build the Shhlack-bot. Out of a NodeMCU (ESP8266) and a sound detection sensor (KY-037) we built our toy: a simple device that would send a message via Slack whenever the office got too noisy.

At first sight, the spinning wheel on the sensor (a small potentiometer you use to regulate the sound threshold) looks like it has infinite turns. We thought our little buddy wasn’t working until we read that it only has ten turns; after that you hear a gentle *click* letting you know you are at the maximum/minimum possible level, even though you can still turn the wheel either way. Only then did we figure out how to properly set up the sound levels.

Spamming Slack be like...

In case you want to replicate our Shhlack-bot, I’ll leave the code and the tech used in this project at the end of the post. Please consider leaving a comment or upvoting this post on the subreddit.


My experience with the Microservices approach

Introduction

It was a few months ago when a friend and co-worker first told me about the microservices approach. As you might or might not know, I’m part of a Site Operations team where we provide service to a client that is sometimes complicated in terms of bureaucracy, which means we have to adapt to their policy and permission changes all the time, making much of our work’s complexity quite arbitrary. That’s when we thought a service-oriented approach might save us some time on day-to-day tasks. Before telling you about my experience, I invite you to read about what service-oriented modeling (SOM) is.

My experience

The idea of making our daily work more service-oriented was already rumbling around in our heads when we first set up a meeting with our manager and told him what we had in mind. A few words with him and one or two rough drafts opened the way to a more formal meeting with the whole team to present what was going to be “the future of SiteOps”, a “new way of working”, “a game changer”, “the microservices approach”: the SaaS (SiteOps as a Service) project.

We had one week to prepare some slides and think about how we were going to convince the rest of the team that this was the way to go. After a long session with quite a lot of arguments and motivational quotes on how this would be a great solution to our problems, the meeting came to an end. It was settled: from that moment on, our job was to think in a more service-oriented way.

Along with the now urgent need to make everything-as-a-service, my friend and I got assigned a new development project. We had two months to develop a full MVP that would interact with our client’s ticket manager in order to manage all of the site’s incidents quickly and easily. The aim of the tool was to centralize, in just a few clicks, all the ways our client had to report an incident across their different communication software. After a quick analysis we realized our app would have to communicate with at least three of our client’s services/tools. It was the perfect excuse not to let our microservices idea die along with all the words said in that meeting. Now let’s make a list of the problems we had at the time.

Problems to solve

1. Our main difficulty was that we had quite a lot of scripts interacting in very different ways with the client’s different applications. Whenever anything changed on their side, most of our monitoring failed, and debugging where the error came from was a pain in the ass. Worst of all were those critical alerts that are supposed to fire only on very specific occasions, but when they do you know the thing’s fucked up; well, those always failed silently and we rarely realized in time.
2. Another big issue was that a significant amount of the code we had running was kind of legacy, and “legacy” is a polite way of saying it was written in Perl [*takes a deep breath… keeps writing*]. And as you might know, Perl doesn’t have that “from problem import solution” magic Python does. Making these large, complex codebases easier to manage was something that needed to be solved.
3. Even if we got the opportunity to do it in a language of our preference, some of their APIs were quite difficult to work with, and the lack of consistency between them was something we needed to get rid of. Oh! And they also expected us to hand in a fully functional MVP in two months.

How we faced the problem

After telling everyone how we were planning to solve this stack of problems with a single .pop(), the most talked-about and trendy tech to use was serverless. Since we didn’t have a clue what serverless actually was, we decided to go for it. We established that if we couldn’t get a simple serverless API up and running by the end of the week, we would evaluate doing it with containers or plain old web servers. The POC was a success, and within a day or two we had our first AWS Lambda function up and running in the cloud, completely serverless.
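
To give you an idea, the functions were thin wrappers of this kind. This is only a sketch: the event shape assumes an API Gateway proxy integration, and the upstream ticket-manager URL and fields are made up.

```js
// Sketch of a Lambda that hides one of the client's inconsistent APIs behind a clean endpoint.
const https = require('https');

function fetchJson(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

// API Gateway (proxy integration) style handler.
exports.handler = async (event) => {
  const ticketId = event.pathParameters && event.pathParameters.id;
  if (!ticketId) {
    return { statusCode: 400, body: JSON.stringify({ error: 'missing ticket id' }) };
  }

  // Placeholder upstream: the client's ticket manager API.
  const ticket = await fetchJson(`https://tickets.client.example.com/api/v2/tickets/${ticketId}`);

  // Normalize the upstream payload into the shape the rest of our services expect.
  return {
    statusCode: 200,
    body: JSON.stringify({ id: ticket.id, status: ticket.status, summary: ticket.subject }),
  };
};
```
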
In something like two months we had a single app consuming three independent services (also developed and deployed along with the app), which were able to scale quickly and could be reused by the whole team for their daily operations. Of course, as each service is context-agnostic, they each have their own repo and are deployed independently.

Our app infrastructure looked something like this:

App Infrastructure + 3 microservices
By the end of the project we had the MVP ready, and with it a fully composable application whose pieces can be selected and assembled in various combinations to solve a wide variety of problems.

I’m proud to say that at least three problems were addressed:

1. We abstracted away the complexity of the apps and tools that don’t depend on us, making them consumable from literally anywhere.
2. We centralized the problem of the unexpected mutability of our client’s apps, making it easier to solve and debug.
3. We handed over a fully scalable, highly reliable system which is running proof that a service-oriented modeling approach can prevent quite a lot of headaches down the road.

Conclusion

Sometimes working with new technologies means getting out of our comfort zone, leaving us in an insecure place in a field where we have little or no knowledge at all. But we have to keep in mind that we are not alone: we are rarely pioneers in what we are doing, and if you are lucky you can find people who have written good documentation addressing the same problem you are going through right now. Then it’s just a matter of learning what’s new and convincing ourselves that it’s better to solve the problem the way it’s meant to be solved rather than the way we would like to solve it.

If you made it this far, thank you very much for reading. If you have any questions or opinions, or you’ve found any mistake in the text, please feel free to write to me; my contact details can be found in the /about section.


My other blog

Canuslector Site

Introduction

For my first post I would like to talk about one of my first projects. This post won’t be exactly about code; nevertheless, it might have a bit of it. Canuslector Blog is my personal blog where I post different philosophical ideas I find interesting. I also share some of my favorite YouTube videos. All of the content revolves more or less around the same topic, or that’s what I intend. One thing’s for sure: it all revolves around me.

The Anagram

Yeah… you might have realized by now that the word canuslector is an anagram of lucascontre. My actual name is “Lucas Contreras” but my nickname is “Lucas Contre”, or just “Contre”. Now, why “Canus Lector”? Well, it’s Latin for “old reader”… well, not really. Its correct translation according to Google Translate is “aging reader”, but it was the closest to a coherent anagram of my nickname I could get.

Is it even a blog?

Back in July 2018, when I first sat my ass down to learn some decent Node.js, was when I realized what my next project was going to be. Of course, it needed to be something that kept my head occupied for the next month or so, so as not to fall again into that stupid winter depression that comes along with finals. Besides, I’ve always wanted to share what I read in those old and boring philosophy books, but I never had the guts to write some profound and inspiring essay of my own. So I decided the best way of sharing was to copy&paste the raw content, but this time not from Stack Overflow but from my books.

So I gathered all of my favorite books from under my bed (yeah, that’s where I keep ’em) and started flipping through pages looking for good content. Luckily, I’ve never cared much about books being written in with ink or having a few unnecessary pages ripped out. People tend to think that the unalterable, sacred purity of a book must not be corrupted just because it’s a book; I don’t really know what the deal with that is, so I had all of them underlined, which made the process of selecting which parts to upload much easier. I remember the first thing I uploaded was about philosophy itself. I hate what people think about what philosophy is and what it takes to be a philosopher, so I posted a quote where one of my favorite authors rails against every living philosopher of his time, saying they are nothing but teachers, and that instead the name philosopher should be reserved for those who love wisdom and live according to its dictates.

Why from scratch ?

It’s 2019, there’s nothing more mainstream than owning a blog, so why the f** did I need to code it all from scratch in a language that was completely new to me? Clearly to learn, but mostly to see what kind of problems I would stumble upon and to prove to myself that I was able to solve them “the good way”, fully understanding the root cause instead of just copying and pasting code from elsewhere to make the thing work at any expense. A simple system that just needs to provide basic CRUD for a blog entry couldn’t be that hard, right? Well, at first it was.
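
For context, this is more or less the surface area I’m talking about: a minimal CRUD sketch using Express and an in-memory array (the real blog persists entries in a database and also renders Twig templates), just to show how small the problem looks on paper.

```js
// Minimal sketch of the CRUD a blog entry needs. Express + an in-memory array stand in
// for the real thing; route and field names are illustrative.
const express = require('express');

const app = express();
app.use(express.json());

let nextId = 1;
const entries = []; // { id, title, body }

// Create
app.post('/entries', (req, res) => {
  const entry = { id: nextId++, title: req.body.title, body: req.body.body };
  entries.push(entry);
  res.status(201).json(entry);
});

// Read (list and single)
app.get('/entries', (req, res) => res.json(entries));
app.get('/entries/:id', (req, res) => {
  const entry = entries.find((e) => e.id === Number(req.params.id));
  return entry ? res.json(entry) : res.status(404).end();
});

// Update
app.put('/entries/:id', (req, res) => {
  const entry = entries.find((e) => e.id === Number(req.params.id));
  if (!entry) return res.status(404).end();
  Object.assign(entry, { title: req.body.title, body: req.body.body });
  res.json(entry);
});

// Delete
app.delete('/entries/:id', (req, res) => {
  const index = entries.findIndex((e) => e.id === Number(req.params.id));
  if (index === -1) return res.status(404).end();
  entries.splice(index, 1);
  res.status(204).end();
});

app.listen(3000);
```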

Even if I succeeded at the first task (learning), I totally screwed up the second one. I ended up just doing things I did not understand, and the code got a bit dirty by the end of the project. Nevertheless, a few months after having the blog up and running I wrote a long README.md on how to deploy the monster, and along with it refactored the code, which led to a much better understanding and a loud “Oh! So that’s what a model meant!”. It’s still not the best code on earth; it might still be a Frankenstein, half API frontend and half a backend rendering Twig templates, but I’m happy with it.

Videos in a blog?

Blame him: Exurb1a

Conclusion

Having personal projects is one of the best things you can do to learn, and since you get to choose what technology to use, it’s a good way to catch up with the latest tech available. It’s a good way of proving yourself and testing the things you won’t get to test on a company’s production system. But my advice is to put that personal project to good use. Give it a purpose rather than just coding for the sake of learning. Make good use of the project, or at least build something that’s worth something to someone; this way you will also learn about the “arbitrary complexity” you would probably avoid if you only code for yourself.

Finally, I invite you to visit my other blog (https://canuslector.site). I know it might not be for you, either because you probably got here being more into tech than anything else, or because it is written in Spanish. Either way, you would get to think outside the box or learn another language :D

