Optimizing Cloud Stability: A Q&A with Nx Director of Platform Product Management on Resolving Downtime Challenges
Recognizing the frustration that Nx partners and customers have experienced due to frequent Cloud downtimes and disruptions, Network Optix is taking a proactive approach to address these issues directly.
To provide open and honest communication to our customers and the Nx community, we sat down with Tagir Gadelshin, Director of Platform Product Management at Nx, for an in-depth conversation about the areas of concern and the proactive measures being taken to improve Cloud scalability and stability and ensure an efficient, reliable Cloud experience.
To begin, can you share some insights into past challenges and the proactive measures the development team is implementing to enhance Cloud scalability and reliability?
Sure. We are investing in Cloud Stability more than ever before. We've established a dedicated SLA team tasked with continuously monitoring Cloud functionality, swiftly responding to and resolving issues as they occur, and proactively implementing measures to prevent potential disruptions. We're actively expanding this team under the leadership of the SLA team lead, following a monitor, react, and prevent approach, with a primary emphasis on proactive prevention.
On the other side of things, we've designed a new architecture with scalability and reliability at its core, with new functions and services specifically designed to lower the risk of downtime. For example, if one instance, or server, encounters an issue, another one seamlessly takes over to ensure uninterrupted reliability. This forward-looking mindset is the basis of our design approach.
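To make the failover idea above concrete, here is a minimal, purely illustrative sketch (not the actual Nx Cloud implementation; all class and instance names are hypothetical): requests go to the active instance, and when it fails, a standby seamlessly takes over.

```python
# Hypothetical sketch of active/standby failover, not actual Nx Cloud code:
# requests are routed to the first healthy instance in priority order,
# so the standby takes over transparently if the primary goes down.

class Instance:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"

class FailoverCluster:
    """Route each request to the first healthy instance in the list."""
    def __init__(self, instances):
        self.instances = instances

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fall through to the next (standby) instance
        raise RuntimeError("all instances are down")

primary = Instance("primary")
standby = Instance("standby")
cluster = FailoverCluster([primary, standby])

print(cluster.handle("req-1"))  # served by primary
primary.healthy = False         # simulate an outage on the primary
print(cluster.handle("req-2"))  # standby takes over seamlessly
```

From the caller's point of view, `cluster.handle` succeeds either way; which instance served the request is an internal detail.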
In terms of past challenges, the downtime experienced during updates and maintenance is linked to legacy components of the Cloud. These legacy components were developed ten or so years ago, when the Cloud was considered optional and supplemental. Now, in 2024, the Cloud has become central and is increasingly expanding into an endless number of applications. To address this, we're working on a major refactoring project for one of our cloud databases and authentication services to align with modern scalability standards. Specifically, we're moving it from vertical to horizontal scalability: instead of only scaling up by adding CPU, memory, and so on to a single instance, we'll have the flexibility to scale out by adding multiple instances.
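As a rough illustration of the scale-out model described above (a toy sketch, not the Nx database itself; all names are hypothetical), capacity grows by registering more instances and spreading requests across them, rather than by making one instance bigger:

```python
# Illustrative sketch of horizontal scaling (scale-out), not actual Nx code:
# load is spread round-robin across however many instances are registered,
# so capacity is added by adding instances, not by enlarging one machine.

from itertools import cycle

class InstancePool:
    def __init__(self):
        self.instances = []
        self._rr = None

    def add_instance(self, name):
        # Scaling out: register another instance to absorb more load.
        self.instances.append(name)
        self._rr = cycle(self.instances)  # restart round-robin over all

    def route(self, request):
        # Distribute each request to the next instance in rotation.
        return next(self._rr), request

pool = InstancePool()
pool.add_instance("db-1")
pool.add_instance("db-2")  # under load, simply add another instance
print(pool.route("query-1")[0])  # db-1
print(pool.route("query-2")[0])  # db-2
```

With vertical scaling, the only option when `db-1` saturates is a bigger machine; here, adding `db-3` is a one-line change.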
Another thing worth mentioning here is that we addressed a significant issue on the VMS server side that came about with the introduction of two-factor authentication in v5.0. Previously, a short unavailability of our Cloud authentication service would result in the disconnection of all users, making immediate log-in impossible. Essentially, any brief downtime had a direct impact on all currently connected users.
With the latest version, v5.1.2, users now remain connected even during downtimes of the authentication service, ensuring that such interruptions go unnoticed by users already connected. v5.1.2 is available for download now at networkoptix.com/my/download.
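The general technique behind this fix can be sketched as follows (a hypothetical illustration, not the actual VMS implementation; class and method names are invented): the server validates already-issued sessions locally, so only *new* logins depend on the cloud authentication service being reachable.

```python
# Hypothetical sketch, not the actual VMS code: already-connected users
# hold locally cached sessions, so a brief outage of the cloud auth
# service only affects new logins, not existing connections.

import time

class CloudAuthService:
    def __init__(self):
        self.available = True

    def issue_session(self, user):
        if not self.available:
            raise ConnectionError("auth service unreachable")
        return {"user": user, "expires": time.time() + 3600}

class VmsServer:
    def __init__(self, auth):
        self.auth = auth
        self.sessions = {}  # cloud-issued sessions, cached locally

    def login(self, user):
        # New logins require the cloud auth service.
        self.sessions[user] = self.auth.issue_session(user)

    def is_connected(self, user):
        # Existing sessions are checked locally, so a short cloud outage
        # goes unnoticed by users who are already connected.
        token = self.sessions.get(user)
        return token is not None and token["expires"] > time.time()

auth = CloudAuthService()
server = VmsServer(auth)
server.login("alice")
auth.available = False               # simulate a brief cloud outage
print(server.is_connected("alice"))  # True: alice stays connected
```

In the pre-5.1.2 behavior described above, `is_connected` would effectively have asked the cloud service on every check, so the same outage would have disconnected everyone at once.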
So, with Gen 6, it seems there's a dual strategy at play—a redesigned architecture for enhanced scalability and reliability, and a specialized SLA team dedicated to prevention and monitoring. Is that correct?
Yes, exactly.
Regarding reactive strategies, what improvements in response times or strategies can users anticipate with the establishment of the SLA team?
So, again, the SLA team is built around the monitor, react, and prevent approach. This coordinated effort aims to eliminate potential downtime proactively, entirely behind the scenes. Keeping in mind that no service has 100% uptime and downtime can still happen, our SLA team is now working on improving our internal incident response processes.
As part of it, we’re focused on reducing our Time To Detect (TTD) and Time To Mitigate (TTM) by supercharging our monitoring capabilities. For example, Prometheus is our go-to monitoring toolkit, and Grafana is used to visualize and analyze these metrics. All data points and logs are funneled into Elasticsearch, a powerful search and analytics engine. This setup empowers us to conduct thorough analyses of our system’s performance and health.
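To make Time To Detect concrete: TTD is the delay between a fault first appearing in the metrics and the monitoring system raising an alert. The toy function below (an illustration only, not Nx's monitoring code) mimics a Prometheus-style alerting rule that fires after a metric stays above a threshold for several consecutive samples:

```python
# Illustrative sketch of Time To Detect (TTD): the gap between a fault
# first appearing in a metric stream and the alert firing. A toy
# stand-in for a Prometheus alerting rule, not actual Nx tooling.

def time_to_detect(samples, threshold, for_intervals):
    """samples: list of (timestamp, error_rate) pairs, in time order.
    An alert fires once error_rate exceeds `threshold` for
    `for_intervals` consecutive samples (like a Prometheus `for:` clause).
    Returns (fault_start, alert_time, ttd), or None if no alert fires."""
    fault_start = None
    streak = 0
    for ts, value in samples:
        if value > threshold:
            if fault_start is None:
                fault_start = ts
            streak += 1
            if streak >= for_intervals:
                return fault_start, ts, ts - fault_start
        else:
            fault_start, streak = None, 0
    return None

# Error rate sampled every 30 seconds; the fault begins at t=60.
samples = [(0, 0.01), (30, 0.02), (60, 0.25), (90, 0.30), (120, 0.28)]
fault_start, alert_time, ttd = time_to_detect(
    samples, threshold=0.05, for_intervals=2)
print(ttd)  # 30: the alert fires one sampling interval after the fault
```

Sampling more frequently or shortening the confirmation window lowers TTD, at the cost of more false alarms; tuning that trade-off is one part of "supercharging" monitoring.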
In the future, we plan to integrate machine learning models into our monitoring system. These models will be trained to spot errors and anomalies that might elude traditional monitoring methods. With these initiatives and tools in play, we aim to reduce response times drastically and ensure a smoother, more efficient operation.
Understandably, uptime percentage is a major area of interest for customers. What level of uptime can be expected as a result of these initiatives?
Our SLA team aims to equal the uptime of our cloud host, AWS, as stated in the AWS SLA.