August 9, 2023

Modern Data Platform and the Cloud. Part 2

By Mikhail Stolpner · 5 minute read

Photo credit https://www.pexels.com/@vishal-shah-1238477

This is the third article in the series on building the Modern Data Platform. The first article was on understanding the modern data platform and its objectives. The second article focused on the challenges of the lift-and-shift approach and the benefits of cloud elasticity. In this article, I cover many other aspects of the modern data platform in the cloud.

To quickly refresh why lift-and-shift is often not a good approach and why we want to use cloud elasticity, consider this example. An on-prem Data Lake cluster runs on 100 nodes similar to AWS r5d.4xlarge. With a basic lift-and-shift approach, running 100 of these instances around the clock will cost you about $1M per year. If you only use this cluster during 8 business hours on weekdays, you can save about $750,000 per year by using cloud elasticity properly.
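The arithmetic behind these figures can be sketched as follows. This is a rough estimate under stated assumptions: the hourly rate of $1.152 is an assumed on-demand price for r5d.4xlarge (actual prices vary by region and change over time), and business hours are assumed to mean 8 hours on each of 5 weekdays.

```python
HOURS_PER_YEAR = 24 * 365
HOURS_PER_WEEK = 24 * 7


def annual_cost(nodes, hourly_rate, hours_per_week=HOURS_PER_WEEK):
    """Yearly cost of a cluster that is up `hours_per_week` hours every week."""
    utilization = hours_per_week / HOURS_PER_WEEK
    return nodes * hourly_rate * HOURS_PER_YEAR * utilization


ASSUMED_RATE = 1.152  # assumed on-demand $/hour for r5d.4xlarge

always_on = annual_cost(100, ASSUMED_RATE)
business_hours = annual_cost(100, ASSUMED_RATE, hours_per_week=8 * 5)
savings = always_on - business_hours

print(f"always on:      ${always_on:,.0f}")
print(f"business hours: ${business_hours:,.0f}")
print(f"savings:        ${savings:,.0f}")
```

With these assumptions the always-on cluster comes to roughly $1.01M per year and the savings to roughly $0.77M, which is in line with the rounded figures above.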

In addition to lift-and-shift, there are many other pitfalls in using the cloud. Organizations usually, and rightly so, give their employees significantly more freedom with cloud infrastructure than with on-prem infrastructure. However, this sometimes leads to inefficient cloud usage, such as leaving compute instances running when they are not needed, making countless copies of data, and so on. A proven way to approach this challenge is to implement cloud governance and ensure that expenses can be attributed down to specific teams or projects and that those teams or projects are held accountable through their reporting metrics. A governance lifecycle framework is another great way to review projects and provide feedback to the teams. Utilizing data federation and data virtualization tools combined with data discovery tools will further reduce the need for copying data and running ETL processes.
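Attributing expenses to teams typically relies on tagging resources. A minimal sketch of the idea, assuming billing records have already been exported as dictionaries with a `cost` and a `tags` field (a hypothetical record layout, not any specific cloud billing format):

```python
from collections import defaultdict


def cost_by_team(records, tag="team"):
    """Sum cost per value of `tag`; resources without the tag are grouped as UNTAGGED."""
    totals = defaultdict(float)
    for record in records:
        owner = record.get("tags", {}).get(tag, "UNTAGGED")
        totals[owner] += record["cost"]
    return dict(totals)


# Hypothetical sample data for illustration only.
sample_records = [
    {"resource": "cluster-a", "cost": 1200.0, "tags": {"team": "marketing"}},
    {"resource": "cluster-b", "cost": 800.0, "tags": {"team": "risk"}},
    {"resource": "bucket-x", "cost": 150.0, "tags": {"team": "marketing"}},
    {"resource": "vm-forgotten", "cost": 400.0, "tags": {}},
]

report = cost_by_team(sample_records)
print(report)
```

The UNTAGGED bucket is the interesting part of such a report: it shows spend that nobody is accountable for yet.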

The choice of tools and their implementation on the cloud is also a critical factor. For example, open source tools will not automagically become ephemeral or elastic and scale with the workload. The Data Platform must include some sort of control plane and ensure that these tools are managed efficiently. I will write a dedicated article on tools and vendor selection. Please stay tuned.

As with any other project, it’s important to design before you build, and it’s just as important not to overengineer. Clouds provide many tools that can be used to do the same job in different ways. Don’t get stuck in analysis paralysis and do not create countless levels of indirection to isolate yourself from vendors and tools. It is important not only to choose the right tool but also to use it in the simplest way possible. There are some very complex solutions out in the wild for simple requirements. They are difficult to understand and to manage. Code and design reviews will help to address this challenge. Having expertise on your team for all your key tools is absolutely critical. Don’t be shy about hiring consultants to fill the gaps in expertise.

Cluster elasticity will be limited by the characteristics of the actual jobs. It’s a best practice to design, develop, and deploy your code and data in a way that enables linearly distributable processing. For example, well-developed code that takes 1 hour to run on a 100-node cluster should take 30 minutes on a 200-node cluster. This will allow you to accommodate any SLA and help with managing the critical path in a data processing pipeline, a challenge that you will face sooner or later. While creating a well-distributable job, you might have to overcome significant challenges, such as data skew. Again, I will dedicate a separate article to this topic.
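The linear-scaling example above can be written down as a small calculation. This is an idealized sketch that assumes perfectly linear scaling; real jobs with data skew or serial stages will fall short of it. The function names are illustrative.

```python
import math


def runtime_hours(baseline_hours, baseline_nodes, nodes):
    """Ideal runtime on `nodes` nodes, assuming perfectly linear scaling."""
    return baseline_hours * baseline_nodes / nodes


def nodes_for_sla(baseline_hours, baseline_nodes, sla_hours):
    """Smallest cluster that meets the SLA, assuming perfectly linear scaling."""
    return math.ceil(baseline_hours * baseline_nodes / sla_hours)


# The job from the text: 1 hour on 100 nodes.
half_hour_run = runtime_hours(1, 100, 200)      # runtime on 200 nodes
needed_for_15_min = nodes_for_sla(1, 100, 0.25)  # cluster size for a 15-minute SLA

print(half_hour_run, needed_for_15_min)
```

Under this assumption the total node-hours stay constant (100 in this example), so on an elastic cluster meeting a tighter SLA costs about the same and simply finishes sooner.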

It’s very important to put in place a process for productizing the code. As a general practice, most organizations implement CI/CD, and the code sometimes goes into production right from a developer’s desk, whether that developer is a data engineer or a data scientist. Too often, code review and code optimization are not given sufficient consideration. My team was assisting one of my customers, and as part of the Data Platform Health Check I reviewed a Spark-based ETL process that was executed hourly. After spending two hours optimizing the Spark job configuration, I was able to reduce the cost of this job by $1M a year! I have had similar experiences with other customers. I hope that is convincing enough to establish a proper productizing process. I will cover optimization in one of the future articles.

Security is another aspect of Governance that covers the cloud as well as the data platform. Many studies have found the cloud to be more secure than corporate data centers. However, security requirements on-prem and on the cloud can be different. I have seen security folks transferring technical security requirements that were developed for on-prem solutions to the cloud. These requirements are not always transferable and sometimes pose a significant limitation on cloud adoption and on choosing vendors. Make sure that your security experts understand why certain technical security requirements are in place and whether they apply in the cloud environment. In many cases, your vendor can help with the security review.

One of the common security attacks is to acquire AWS keys via GitHub or other sources and launch hundreds or thousands of nodes on your account for bitcoin mining or similar purposes. It’s amazing how many companies have fallen into that trap. An easy measure is to use cloud limits and reasonably limit the IP space on your subnets to minimize the damage in case of such a breach.
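To see how subnet size caps the number of nodes an attacker could launch, here is a minimal sketch using Python’s standard `ipaddress` module. The CIDR blocks are hypothetical, and the figure of 5 reserved addresses per subnet reflects AWS VPC behavior; other clouds reserve a different number.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four and the last address


def max_instances(cidr):
    """Upper bound on instances that can receive a private IP in this subnet."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


# Hypothetical subnets of different sizes.
for cidr in ("10.0.0.0/16", "10.0.1.0/24", "10.0.2.0/26"):
    print(cidr, max_instances(cidr))
```

A /16 subnet leaves room for tens of thousands of rogue nodes, while a /24 or /26 sized to the real workload bounds the blast radius to a few hundred or a few dozen.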

Data security is another aspect of Governance on the modern data platform. While it can stop some organizations from going to the cloud due to regulations, internal beliefs, or superstition, it can usually be achieved with proper use of cloud security capabilities, a good governance process, and tools like Apache Ranger. However, keep in mind that data governance management should be centralized and should cover all the tools that make up your Data Platform.

In today’s global economy, it’s not unusual to have data on the cloud in more than one region. I wrote an article on some best practices with Multi-Region implementation.

Automation is King on the cloud. The cloud infrastructure was designed from the ground up to be programmatically managed and automated. Automate code deployment, job schedules, auto-scaling, enforcement of governance rules, etc. Kubernetes is a great new tool available on any cloud. Without going into too much detail, it allows you to deploy your stack on any cloud, which reduces switching costs and your dependency on a specific cloud vendor.

It is commonly assumed that the cloud has unlimited resources. However, that is a myth. Cloud vendors, such as AWS, must manage resources in a smart way to make their cloud profitable, which means that resources on the cloud are limited. I have seen many customers hitting actual cloud capacity limits in various regions. The best way to address this challenge is to make your clusters able to utilize heterogeneous resources. For example, you can make clusters use compute instances of various families and acquire whichever resources are available.
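One way to picture a heterogeneous cluster is as a preference-ordered list of instance types that is filled greedily until the target capacity is reached. The sketch below is illustrative only: the availability numbers are made up, and a real system would learn availability from failed or partially fulfilled launch requests rather than from a dictionary.

```python
def plan_cluster(target_vcpus, instance_types, available):
    """Greedily fill `target_vcpus` from a preference-ordered list of instance types.

    instance_types: list of (name, vcpus_per_instance), most preferred first
    available: dict of name -> instances currently obtainable (hypothetical input)
    Returns (plan, shortfall_in_vcpus).
    """
    plan = {}
    remaining = target_vcpus
    for name, vcpus in instance_types:
        if remaining <= 0:
            break
        needed = -(-remaining // vcpus)  # ceiling division
        count = min(needed, available.get(name, 0))
        if count:
            plan[name] = count
            remaining -= count * vcpus
    return plan, max(remaining, 0)


preferences = [("r5d.4xlarge", 16), ("r5.4xlarge", 16), ("m5.8xlarge", 32)]
availability = {"r5d.4xlarge": 40, "r5.4xlarge": 30, "m5.8xlarge": 50}  # made-up numbers

plan, shortfall = plan_cluster(1600, preferences, availability)
print(plan, shortfall)
```

A cluster pinned to a single instance type would have stalled at 40 nodes here. Allowing fallbacks across families is what lets it reach the full 1,600 vCPUs.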

Spot instances are one of the most misunderstood cloud capabilities. In some cases, spot instances can reduce your cost by 80% when used wisely. Make sure to mix them with standard on-demand instances, allow heterogeneous clusters, and utilize spot instances for processes that are not bound by extremely strict SLAs.
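The effect of mixing spot and on-demand instances can be estimated with a short calculation. This is a sketch under stated assumptions: the 80% discount is the best case mentioned above and is applied uniformly, the on-demand rate of $1.00 per hour is a placeholder, and the cost of re-running work after spot interruptions is ignored.

```python
def blended_hourly_cost(nodes, on_demand_rate, spot_fraction, spot_discount=0.8):
    """Hourly cost of a cluster where `spot_fraction` of the nodes are spot instances."""
    spot_nodes = round(nodes * spot_fraction)
    on_demand_nodes = nodes - spot_nodes
    spot_rate = on_demand_rate * (1 - spot_discount)
    return on_demand_nodes * on_demand_rate + spot_nodes * spot_rate


all_on_demand = blended_hourly_cost(100, 1.0, spot_fraction=0.0)
mixed = blended_hourly_cost(100, 1.0, spot_fraction=0.7)
saving_pct = 100 * (1 - mixed / all_on_demand)

print(f"on-demand only: ${all_on_demand:.2f}/h, 70% spot: ${mixed:.2f}/h, saving {saving_pct:.0f}%")
```

Keeping 30% of the nodes on demand gives up part of the discount, but it leaves a stable core that survives a spot reclamation.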

Another pitfall that I have often seen is transferring the traditional mindset to the cloud. It’s a variation of lift-and-shift. For example, setting up cluster-local HDFS and using it for multi-stage processes that depend on the data state in that HDFS makes the cluster a non-ephemeral resource that cannot be brought down. Cost is one outcome of that. Another outcome is the inability to update the software on that cluster, as your vendor most likely did not plan for non-ephemeral usage. It’s not that rare to see a cluster that has been up for over a year.

Establishing a Center of Excellence (CoE) with information on stakeholders, best practices, information sources, the communication plan, latest news, projects, schedules, and much more is a great way to enable your team on the cloud quickly and efficiently. A CoE is not only a web portal; it’s a team of people and a set of processes. A CoE is a very simple, inexpensive, and efficient tool that is very often overlooked.

Do not discard your existing on-prem data platform just yet. Keep using it for the appropriate use cases and consider it a part of the modern data platform that you are building. However, do not let the sunk cost of past investment prevent your organization from moving to the Cloud and jeopardize the future of your business.

As always, I hope this article was informative. Whether your experience correlates with mine or not, please share your thoughts. The next article will be about the Tools and Vendors Selection. Stay tuned!