Optimizing Cloud Stability: A Q&A with Nx Director of Platform Product Management on Resolving Downtime Challenges

Network Optix recognizes the frustration that Nx partners and customers have experienced from frequent Cloud downtime and disruptions, and is addressing these issues directly.

To communicate openly and honestly with our customers and the Nx community, we spoke in depth with Tagir Gadelshin, Director of Platform Product Management at Nx, about the areas of concern and the measures being taken to improve Cloud scalability, stability, and reliability.

To begin, can you share some insights into past challenges and the proactive measures the development team is implementing to enhance Cloud scalability and reliability?

Sure. We are investing in Cloud stability more than ever before. We've established a dedicated SLA team tasked with continuously monitoring Cloud functionality, responding to and resolving issues as they occur, and implementing measures to prevent potential disruptions. We're expanding this team under the leadership of the SLA team lead, following a monitor, react, and prevent approach, with the primary emphasis on prevention.

On the architecture side, we've designed a new architecture with scalability and reliability at its core, with new functions and services specifically designed to lower the risk of downtime. For example, if one instance (that is, one server) encounters an issue, another one takes over so that service continues uninterrupted. This forward-looking mindset is the basis of our design approach.
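As an editorial illustration of the failover idea described above (not Nx's actual implementation), the following minimal Python sketch tries a list of service instances in order and returns the first successful response. The names `call_with_failover`, `broken_instance`, and `healthy_instance` are hypothetical and exist only for this example.

```python
from typing import Callable, Sequence


class AllInstancesDown(Exception):
    """Raised when no instance could serve the request."""


def call_with_failover(instances: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each instance in order and return the first successful response."""
    last_error = None
    for instance in instances:
        try:
            return instance(request)
        except ConnectionError as error:
            # This instance is unavailable; fall through to the next one.
            last_error = error
    raise AllInstancesDown("no healthy instance available") from last_error


# Hypothetical stand-ins for two service instances.
def broken_instance(request: str) -> str:
    raise ConnectionError("instance A is unreachable")


def healthy_instance(request: str) -> str:
    return f"handled {request} on instance B"


result = call_with_failover([broken_instance, healthy_instance], "GET /status")
print(result)
```

In a real deployment this role is usually played by a load balancer with health checks rather than by client code; the sketch only shows the principle that a failed instance does not have to mean a failed request.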

In terms of past challenges, the downtime experienced during updates and maintenance is linked to legacy components of the Cloud. These components were developed roughly ten years ago, when the Cloud was considered optional and supplemental. Now, in 2024, the Cloud has become central and is used in a growing range of applications. To address this, we're working on a major refactoring project for one of our Cloud databases and authentication services to align it with modern scalability standards. Specifically, we're changing its scalability from vertical to horizontal: instead of only scaling up by adding CPU, memory, etc., we'll have the flexibility to scale out by adding multiple instances.
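To make the scale-out idea concrete, here is a small editorial sketch in Python of the capacity arithmetic behind horizontal scaling: given a peak load and the load one instance can handle, it computes how many equal-sized instances are needed, plus spares for redundancy. The function name and all figures are assumptions chosen for illustration, not Nx numbers.

```python
import math


def instances_needed(peak_requests_per_second: float,
                     capacity_per_instance: float,
                     spare_instances: int = 1) -> int:
    """Return how many equal-sized instances cover the peak load, plus spares."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    base = math.ceil(peak_requests_per_second / capacity_per_instance)
    # Always run at least one instance, even with no load.
    return max(base, 1) + spare_instances


# Illustrative figures only: 4,500 requests/s at peak, 1,000 requests/s per instance.
print(instances_needed(4500, 1000))
```

With vertical scaling, growth in peak load forces a larger single machine; with horizontal scaling, the same growth is absorbed by raising the instance count.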

Another thing worth mentioning is that we addressed a significant issue on the VMS server side that arose with the introduction of two-factor authentication in v5.0. Previously, even a short outage of our Cloud authentication service would disconnect all users and make immediate login impossible. In other words, any brief downtime directly affected every currently connected user.

With the latest version, v5.1.2, users now remain connected even during downtimes of the authentication service, ensuring that such interruptions go unnoticed by users already connected. v5.1.2 is available for download now at networkoptix.com/my/download.
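The interview does not describe how v5.1.2 achieves this. As one possible approach, sketched here purely for illustration, a server could remember when each session was last validated and keep trusting that result for a grace period while the authentication service is unreachable. The `SessionValidator` class, the `grace_seconds` parameter, and the fake service below are all hypothetical.

```python
import time
from typing import Callable, Dict


class SessionValidator:
    """Keeps already-validated sessions alive while the auth service is down."""

    def __init__(self,
                 check_with_auth_service: Callable[[str], bool],
                 clock: Callable[[], float] = time.monotonic,
                 grace_seconds: float = 900.0) -> None:
        self._check = check_with_auth_service
        self._clock = clock
        self._grace_seconds = grace_seconds
        self._last_ok: Dict[str, float] = {}  # session id -> time of last success

    def is_valid(self, session_id: str) -> bool:
        now = self._clock()
        try:
            ok = self._check(session_id)
        except ConnectionError:
            # Auth service unreachable: trust a recent successful check only.
            last = self._last_ok.get(session_id)
            return last is not None and now - last <= self._grace_seconds
        if ok:
            self._last_ok[session_id] = now
        else:
            self._last_ok.pop(session_id, None)
        return ok


# Hypothetical auth service that can be switched off to simulate an outage.
service_state = {"up": True}


def fake_auth_service(session_id: str) -> bool:
    if not service_state["up"]:
        raise ConnectionError("auth service unavailable")
    return session_id == "alice-session"


fake_now = [0.0]
validator = SessionValidator(fake_auth_service, clock=lambda: fake_now[0], grace_seconds=60)

connected_before_outage = validator.is_valid("alice-session")
service_state["up"] = False
fake_now[0] = 30.0
still_connected = validator.is_valid("alice-session")
new_login_during_outage = validator.is_valid("bob-session")
print(connected_before_outage, still_connected, new_login_during_outage)
```

In this sketch, sessions validated before the outage stay connected, while sessions never seen before are still rejected until the service returns.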

So, with Gen 6, it seems there's a dual strategy at play: a redesigned architecture for enhanced scalability and reliability, and a specialized SLA team dedicated to prevention and monitoring. Is that correct?

Yes, exactly.
 
Regarding reactive strategies, what improvements in response times or strategies can users anticipate with the establishment of the SLA team?

As mentioned, the SLA team follows a monitor, react, and prevent approach. This coordinated effort aims to eliminate potential downtime proactively, entirely behind the scenes. Keeping in mind that no service has 100% uptime and that downtime can still happen, our SLA team is now working on improving internal incident response processes.

As part of this, we're focused on reducing our Time To Detect (TTD) and Time To Mitigate (TTM) by strengthening our monitoring capabilities. For example, Prometheus is our go-to monitoring toolkit, and Grafana is used to visualize and analyze the metrics it collects. All data points and logs are funneled into Elasticsearch, a search and analytics engine. This setup lets us conduct thorough analyses of our system's performance and health.
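For readers unfamiliar with these metrics, the short Python sketch below shows one way TTD and TTM could be computed from incident timestamps. It assumes both are measured from the moment the incident started (some teams instead measure mitigation time from detection), and the incident records are invented sample data, not real Nx incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable


@dataclass
class Incident:
    started: datetime    # when the problem actually began
    detected: datetime   # when monitoring or a person noticed it
    mitigated: datetime  # when user impact was stopped

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_mitigate(self) -> timedelta:
        # Assumption: measured from incident start, not from detection.
        return self.mitigated - self.started


def mean_minutes(durations: Iterable[timedelta]) -> float:
    """Average a series of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


# Invented sample incidents for illustration only.
incidents = [
    Incident(datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 6), datetime(2024, 1, 10, 9, 30)),
    Incident(datetime(2024, 2, 3, 14, 0), datetime(2024, 2, 3, 14, 2), datetime(2024, 2, 3, 14, 10)),
]

mean_ttd = mean_minutes(i.time_to_detect for i in incidents)
mean_ttm = mean_minutes(i.time_to_mitigate for i in incidents)
print(f"mean TTD: {mean_ttd:.1f} min, mean TTM: {mean_ttm:.1f} min")
```

Tracking these averages over time is what makes it possible to tell whether better monitoring is actually shortening detection and mitigation.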

In the future, we plan to integrate machine learning models into our monitoring system. These models will be trained to spot errors and anomalies that might elude traditional monitoring methods. With these initiatives and tools in play, we aim to reduce response times drastically and ensure a smoother, more efficient operation.

Understandably, uptime percentage is a major area of interest for customers. What level of uptime can be expected as a result of these initiatives?

Our SLA team aims to equal the uptime of our cloud host, AWS, as stated in the AWS SLA.
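To put uptime percentages in perspective, the following editorial sketch converts an uptime target into the downtime it permits over a period. The percentages used are illustrative only; they are not taken from the AWS SLA or from any Nx commitment, and AWS SLA figures vary by service.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that an uptime target permits per period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


# Illustrative targets only, evaluated over a 30-day month.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime allows about {allowed_downtime_minutes(target):.1f} minutes of downtime")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why small differences in an uptime target matter so much in practice.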

Lastly, when it comes to communicating issues with users, do we have plans for a customer portal to display Cloud status and potential disruptions?

Absolutely. The SLA team is actively engaged in Cloud monitoring, using tools that provide both real-time and historical data on how the Cloud is behaving. As part of the roadmap for this team, we plan to incorporate the public-facing features of these tools into our Cloud interfaces to enhance transparency and allow customers to view Cloud status at any time.

In the meantime, we regularly send detailed communications to our direct Partners, notifying them of any unexpected incidents or scheduled maintenance, and constantly work to improve this process based on feedback we receive.