Friday, 15 January, 2021 UTC


Summary

Sqreen’s architecture has evolved a lot over the years. As one of the main protagonists in all these changes, I’m often asked about the steps we took along the way and the rationale behind them. It’s an interesting, albeit long, conversation, so I thought I’d take a trip down memory lane and share some of the decisions we made as we built Sqreen and why.
As I mentioned, it tends to be a long conversation, so I’m going to break this into three posts to make it more digestible. Let’s start with the beginning of Sqreen.
First, a more general note. The Sqreen backend was mostly built following a few constraints & principles commonly shared across Sqreen engineering:
  • Antifragility: No system is ever perfect, but when a part of the system goes bad, we need to fix it and improve it so it won’t be a problem for the future.
  • Scale as we go: The solution we choose at any given time needs to work for our current needs. It should not be sized too small or too big.
  • Maximize our people power: At Sqreen we are all in this together, but there are never enough people or hours for all the things we want to do, so we strongly prefer solutions that don’t require frequent human intervention.
Episode I: The Sqreen Proof of Concept
When Pierre and Jb started Sqreen, they had a dream. They wanted to tighten the feedback loop between developers & security people. The solution they envisioned was adding a small agent inside web applications that would detect issues as they happen, stop them, and report back. The agent configuration had to be controlled remotely so users could choose between a monitoring-only mode (no destructive actions like blocking requests) or a protection mode (including actions like blocking requests).
What was needed to do an early proof of concept of Sqreen then was:
  • A backend system able to send commands to the agent to activate the different modes
  • An agent, packaged as a Ruby gem, that launches with the host application and connects to the backend
  • A dashboard to visualize reports
To quickly get something off the ground, the Sqreen founders chose to use Meteor to build the dashboard. Meteor enables developers to easily create rich real-time applications by writing JavaScript code that can run either in the frontend (browser) or on the backend (server-side). It also requires the use of MongoDB for data persistence.
So, since users would configure the agent through the dashboard, our backend needed to read this information from MongoDB. Jb & Pierre, being familiar with Python and Flask from previous projects at Apple, chose to use this stack to build our backend for agents.
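To make this concrete, here is a minimal sketch of what such a Flask endpoint serving agent configuration out of MongoDB could look like. This is purely illustrative: the route, collection, and field names (applications, mode, set_mode) are hypothetical, not Sqreen’s actual API.

```python
# Illustrative sketch only: a Flask endpoint that returns the mode configured
# in the dashboard (stored in MongoDB) as a command for the agent.
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017")["sqreen"]

@app.route("/agent/commands")
def agent_commands():
    token = request.headers.get("X-Api-Key")
    application = db.applications.find_one({"token": token})
    if application is None:
        return jsonify({"error": "unknown application"}), 404
    # "protection" blocks attacks, "monitoring" only reports them
    mode = application.get("mode", "monitoring")
    return jsonify({"commands": [{"name": "set_mode", "params": {"mode": mode}}]})
```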
For the first weeks of Sqreen, this merry stack was tested locally and most of the time we invested was in polishing the agent. Eventually, we reached the point where we felt that the agent was polished enough for an alpha test in an actual production environment. We reached out to our network (actually to my previous company) and they agreed to deploy in production.
Deploying in production
To be able to deploy in production we needed to host the solution somewhere. Now, to ease our local testing, we had packaged both the dashboard and backend in Docker containers. So we chose to start using AWS, more precisely their ECS service, to deploy our PoC. ECS actually works by controlling an “agent” container installed on each server of the cluster. Capacity for the cluster (at the time) had to be provided externally, for example using EC2. Given the idiosyncrasies of ECS at the time, we launched two t2.large burstable instances to run our PoC.
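For a rough idea of what deploying a container on ECS entails, here is a hedged sketch using boto3. The cluster name, image, and resource values are placeholders, not our actual setup at the time.

```python
# Hedged sketch: register a task definition and run it as a service on an
# existing ECS cluster. All names and sizes below are illustrative.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

ecs.register_task_definition(
    family="sqreen-backend",
    containerDefinitions=[{
        "name": "backend",
        "image": "example.registry/sqreen/backend:latest",
        "cpu": 256,            # CPU units
        "memory": 512,         # MiB
        "essential": True,
        "portMappings": [{"containerPort": 5000, "hostPort": 0}],
    }],
)

ecs.create_service(
    cluster="sqreen-poc",          # backed by the EC2 instances we provisioned
    serviceName="backend",
    taskDefinition="sqreen-backend",
    desiredCount=2,
)
```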
Since we were first-time users of AWS, we got some help from an AWS architect to set up the early infrastructure. When everything seemed to work smoothly, we started the alpha test. As with all good alpha tests though, the agent started crashing in the first few minutes of meeting an actual production environment (though fortunately our safety net functioned correctly and the website wasn’t affected; it was just not protected). It turned out that we had failed to account for all the versatility of SQL syntax and missed a case. This case, as Murphy’s law predicts, was being used on a heavily trafficked page of the target website. The agent had to be disabled while we fixed this.
The fix itself was actually pretty simple, but our nice friends working with us on deploying the alpha were not (yet) into continuous deployment, so the next window of opportunity for a second test had to wait a week or so. Humbled by this first failure, we knew there was no way that by the time of the second test we would have covered all the corner cases of SQL. We needed to fix this. As it happened, we already had a plan to adopt dynamic security rules in the future.
Adding dynamic security rules
Dynamic security rules were a tradeoff for us. On the plus side, it would be a bit simpler for us to change the rules when issues came up; on the negative side, it would make the system a tad more complex, as the agent itself would become more complex and rules would need to be cryptographically signed offline so they could not be tampered with.
We weighed the tradeoffs and decided to take this path. Since Rails applications already had facilities to work with JavaScript assets, we chose to write the rules in JavaScript. The reasoning was that we could then share these rules across all agents once we built agents for other languages (Java can execute JavaScript natively thanks to Nashorn, Node.js obviously runs it, etc.).
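I won’t go into the details of the actual signing scheme here, but the general shape (sign the JavaScript rule offline with a private key, and have the agent verify it with an embedded public key before executing it) can be sketched as follows. This sketch uses RSA via the Python cryptography package purely for illustration; the real implementation may differ.

```python
# Illustrative sketch only: offline signing of a JavaScript rule, and the
# verification an agent could perform before executing it. RSA keys assumed.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def sign_rule(rule_source: bytes, private_key_pem: bytes) -> bytes:
    """Run offline, on the machine that holds the private key."""
    private_key = serialization.load_pem_private_key(private_key_pem, password=None)
    return private_key.sign(rule_source, padding.PKCS1v15(), hashes.SHA256())

def verify_rule(rule_source: bytes, signature: bytes, public_key_pem: bytes) -> bool:
    """Run inside the agent, which only ever ships the public key."""
    public_key = serialization.load_pem_public_key(public_key_pem)
    try:
        public_key.verify(signature, rule_source, padding.PKCS1v15(), hashes.SHA256())
        return True
    except InvalidSignature:
        return False
```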
A week later, when we redeployed at our alpha test user, everything went smoothly! I mean, yes, as planned, our new JS-powered SQL rule was again missing some part of the syntax… But this time we just had to fix the rule code, re-sign it, upload it to production, and send a command to the agent to reload the rules. Proof of concept achieved!
Episode II: The first regular users
Having validated our alpha test, it was time to open the doors to more of our interested potential users. The Paris tech scene being rich in Rails startups at the time, it wasn’t too hard for us to find people who would be interested in a turnkey product that would improve their application security posture. We quickly had a few companies happily testing the solution. It was also time for us to start expanding to more technologies. The first one was Python so that, as we are fond of saying, Sqreen could sqreen Sqreen! This was quickly followed by Node.js, the up-and-coming tech that startup founders were using to start companies.
Creating a user-ready dashboard
On the architecture side, the PoC dashboard was starting to show its weaknesses. Meteor was very nice for getting something off the ground quickly, but it was very resource-intensive. In particular, the data access layer was struggling with data-heavy views like the attack log, which would get refreshed multiple times a second because new attacks were constantly being detected. Since it was clear to all of us that Sqreen was going to be very data-centric, we decided to invest in a sounder foundation for our needs and removed Meteor.
At the time, the best candidate for our needs was to create a Single Page Application using React. This React application would need to connect to a backend system that would, in turn, connect to the database. Since we already had a data model expressed in Python for the backend for agents, we decided to reuse it and add a new set of endpoints (actually a new Flask application) to the backend project we already had. As such, the Backend For Frontend (BFF for short) was born. This backend is tightly coupled with the new Dashboard, and its API contracts are designed specifically for the needs of the React application, so the new Dashboard + BFF combo made for an excellent replacement for the experience we had with Meteor.
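In practice that meant two Flask applications living in the same project and sharing the same data layer, with the BFF exposing endpoints shaped exactly for what a given React view needs. A hedged sketch, with invented route, collection, and field names:

```python
# Sketch of a BFF endpoint: it lives next to the agent-facing Flask app and
# reuses the same MongoDB data, but its response is shaped for one React view.
# Route, collection, and field names here are hypothetical.
from flask import Flask, jsonify
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["sqreen"]  # same database as the agent backend
bff = Flask("bff")

@bff.route("/bff/applications/<app_id>/attack-log")
def attack_log(app_id):
    attacks = db.attacks.find({"app_id": app_id}).sort("timestamp", -1).limit(50)
    # Return exactly the fields the attack-log React component renders
    return jsonify([
        {"rule": a["rule_name"], "ip": a["source_ip"], "at": a["timestamp"].isoformat()}
        for a in attacks
    ])
```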
Tracking activity in Sqreen
Another growing concern was tracking the activity of the platform. For example, which attacks were just reported, and in which users’ applications? Or simply, what was the content of the last error captured by the agent safety net, so we could act on it? In the very beginning this information was a simple MongoDB query away, but that was not a scalable approach: new agent owners or dashboard team members could not be expected to connect to the production database to check their data. Also, since we developed new features first in the Ruby agent (Python & Node were catching up fast but were not yet there), we wanted a production application on which to dogfood the Ruby agent. That meant it was time to create an internal back-office application, the AdminInterface. Since it’s an internal-only app that we would also use for dogfooding, it was created using Ruby on Rails and connected to our MongoDB database.
Upgrading our infrastructure by breaking things
For a while everything went smoothly. We steadily added new beta testers and features following early product feedback. Until one day, production suddenly appeared to stop working (as the story goes, it happened while Jb & Pierre were pitching Sqreen to investors to raise a seed funding round)! Up until this point, we didn’t have much monitoring in place. We had installed an APM (New Relic) on the backend but not much beyond that. Apart from showing that everything was slow, our monitoring didn’t really give us a clear cause. After some digging, it turned out that we had missed something when building our infrastructure…
If you recall, we used t2 instances because they were the cheapest, but we had failed to understand just why that was the case. The T-class of AWS EC2 instances is actually called “burstable”, meaning that you are not supposed to use the full power of the CPU all the time. On average, CPU consumption must stay under a threshold (determined by the size of the instance you chose). When usage goes over this threshold the instance burns credits, and when the credit balance reaches 0, AWS throttles the instance significantly to make up for the overconsumption! This is what happened to us. Lesson learned, we stopped using T-class machines in favor of M-class ones (which are meant for general-purpose computing). We were now ready to meet the needs of our first set of regular users. Thanks to the safety mechanisms designed into our agents, such failures never impact the performance of our customers’ applications – only that of our backend.
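Had we been watching the right metric, the problem would have been visible well before production slowed down. As a small hedged sketch (the instance ID and threshold are placeholders), the remaining CPU credit balance of a burstable instance can be read from CloudWatch like this:

```python
# Sketch: read the CPUCreditBalance metric of a burstable (T-class) instance.
# When this balance reaches 0, AWS throttles the instance to its baseline CPU.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

def cpu_credit_balance(instance_id: str) -> float:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=15),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else 0.0

if cpu_credit_balance("i-0123456789abcdef0") < 50:  # placeholder instance ID and threshold
    print("Warning: CPU credits are running low, throttling is imminent")
```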
Up next in part two, we’ll take a look at how we scaled up from handling the needs of our first users to handling the needs of our first customers and beyond, and all of the new features we developed along the way. Stay tuned!