August 9, 2023

The Cloud Data Platform Pendulum

By Mikhail Stolpner · 6 minute read

Like anything else in life, the cloud is subject to the pendulum law of nature or, as Gartner likes to put it, the Gartner Hype Cycle. That makes the Modern Data Platform a moving target. In this article I will cover the tendencies, challenges, and best practices of the Modern Data Platform. I welcome you to disagree or add your knowledge, experience, and educated opinion in the comments to the article, or just shoot me an email.

After migrating to the cloud, many of us learned that there are advantages and disadvantages to the cloud. If used wisely, it’s a valuable tool that enables innovation and expedites time to market. However, it can become an expensive burden and a barrier for certain use cases. While learning the downside of the cloud infrastructure, many architects and managers started looking back into their own data centers for an alternative.

We at UCE Systems have been leading data platform modernization for our customers through implementing Modern Hybrid Data Platforms. Below, I share some of our valuable experiences.

Two Sides of Two Coins

Honeymoon. A typical cloud adoption journey starts with a blue-sky CEO mandate to move to the cloud at any cost. Cloud vendors are very accommodating and give away large credits. We happily accept them and are excited about breaking out of the on-prem data center limitations that constrain our innovation.

We start moving our data and our workload to the cloud. Having heard horror stories of how expensive the cloud can be, we opt out of the lift-and-shift approach, follow best practices, and decouple data and compute.

Hangover Morning. However, the cloud vendor credits dry up in just a few months, and we get the first large bill. Our CFO makes us move to the next cloud adoption phase: getting our cloud costs under control. We hire a vendor who helps us successfully implement a holistic approach to cloud cost management. Now we procure cloud infrastructure with cost in mind and can even do chargebacks to justify our cloud expenses.

Maturity. Despite all our efforts, our cloud expenses are still high, and we move to the next phase: workload optimization. It turns out to be a very complex issue to tackle because we lack the required talent and organizational structure. We make some progress and establish CI/CD with a cost control gate, but realize that most of the value is still on the table and that it will take time to grind through the existing workload.

Wisdom. While we made significant progress in our cloud cost management journey, the economy took a turn and took our budget with it. At this point we look back at our own data center, and it does not look that bad anymore. It provides a significant pool of compute and storage. And that brings us to the next phase of our cloud adoption journey: the Hybrid Data Platform.

We need to decide what is going back to the data center and what is staying on the cloud. It’s a simple exercise, because cloud and on-prem infrastructure differ in clear ways. On-prem infrastructure is inexpensive but static: it has a limited pool of resources, and procuring more takes a lot of time. Cloud, on the other hand, is costly but has a large pool of resources readily available, i.e., it’s scalable. Now we need to match our workload to either cloud or on-prem capabilities. If our workload has a pattern similar to those depicted below, it’s a good match for the cloud. Otherwise, it’s a good match for on-prem infrastructure.

[Figure: workload patterns that are a good match for the cloud]

For example, an Apache Spark ETL job that takes 20 minutes on a 200-node cluster is a great candidate for staying on the cloud. Implementing a new proof of concept or a customer validation application is another great use case for the cloud. Both of these use cases utilize infrastructure only when needed and can easily scale out to satisfy demand. The customer validation application also benefits from getting to market much faster on the cloud. Running this type of workload on-prem, however, could be challenging from a cost and SLA perspective. It could also jeopardize our go-to-market strategy.

A data analytics cube, an in-memory database, or a data analytics cluster with a relatively steady workload is a better fit for on-prem infrastructure.
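The matching rule above can be sketched in a few lines of Python. This is only an illustration, assuming each workload is described by its hourly node demand; the `peak_to_mean` metric, the `suggest_placement` helper, and the threshold of 3.0 are hypothetical choices, not figures from our projects.

```python
from statistics import mean


def peak_to_mean(hourly_nodes):
    """Ratio of peak demand to average demand; 1.0 means perfectly steady."""
    avg = mean(hourly_nodes)
    if avg == 0:
        return 1.0  # an idle workload is treated as steady
    return max(hourly_nodes) / avg


def suggest_placement(hourly_nodes, burst_threshold=3.0):
    """Suggest 'cloud' for bursty demand and 'on-prem' for steady demand.

    burst_threshold is a hypothetical cut-off chosen for illustration.
    """
    if peak_to_mean(hourly_nodes) >= burst_threshold:
        return "cloud"
    return "on-prem"


# A 200-node Spark ETL job that runs for 20 minutes once a day,
# rounded up to one busy hour out of 24.
spark_etl = [200] + [0] * 23
# An analytics cluster that stays between 40 and 50 nodes all day.
steady_cluster = [40, 45, 50, 45] * 6

print(suggest_placement(spark_etl))       # cloud
print(suggest_placement(steady_cluster))  # on-prem
```

In practice the decision would also weigh data location, SLAs, and time to market, so a ratio like this is at best a first filter.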

Challenges

Now that we wisely decided to modernize our Data Platform into a Hybrid Data Platform, we need to understand the challenges and have a roadmap. We also need to make sure that we can move applications, tools, and data from the cloud to our data center and back easily.

The very first challenge in implementing a Hybrid Data Platform is the incompatibility between on-prem infrastructure and the cloud. For example, we cannot stand up an EMR cluster on-prem. We cannot run a Databricks instance in our data center either. Conversely, our on-prem HDFS data source is not compatible with the S3 service.

Data Location is another challenge. Do we keep 100PB of data on-prem or on the cloud or on both? Do we need to keep a part of the data on-prem and part on the cloud? How do we keep the data in sync then?

While our on-prem solution was designed to take advantage of co-location, we had to decouple data and compute when we migrated to the cloud. Now that we have decided to bring some workload back to our data center, we find that it was never designed for such use cases and that its network cannot support large, decoupled workloads.

Best Practices

Streamlining tools, CI/CD, and code is one of the core objectives of the Hybrid Data Platform best practices.

Cloud infrastructure provides us with Infrastructure-as-a-Service (IaaS) capability and allows us to stand up the entire solution, no matter how complex it is, with a single click of a button. It would be great to start by implementing a Private Cloud with a similar capability in our data center. However, that would be a heavy undertaking and would take a very long time. On top of that, some cloud services would be impossible to implement on-prem. While we still need to start this initiative, we also need a “shortcut” to expedite time to value for our Hybrid Data Platform.

Cloud provides an enormous variety of services. We used a good number of them when we migrated to the cloud. Now it’s time to reduce services to those that we can also support on-prem. This approach allows us to streamline our code base and CI/CD and provides an ability to migrate our workload to our data center and back to the cloud in a quick and easy way without code change.

Thankfully, the variety of core services that most Data Platforms need can be surprisingly small. It can really be down to just a few pillars: MinIO for the storage layer, Kubernetes for the compute layer, and Apache Spark and Dremio for the workload:

At the very core of any Modern Data Platform is a data layer, usually implemented with an object store such as S3. MinIO is a great option for providing such a service on-prem, as we recently proved on one of our projects. Not only did it prove to be a scalable and performant tool, but the MinIO team was also a real enabler that helped us discover deficiencies in the infrastructure.
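Because MinIO exposes an S3-compatible API, the same Spark job can be pointed at either store through configuration rather than code. Here is a minimal sketch, assuming Spark reads data through the Hadoop S3A connector; the `s3a_conf` helper and the endpoint URL are hypothetical.

```python
def s3a_conf(endpoint=None):
    """Build Spark settings for the Hadoop S3A connector.

    With endpoint=None the connector keeps its default target (AWS S3).
    Passing an endpoint points the same job at an S3-compatible store
    such as MinIO. Credentials are deliberately left out of this sketch.
    """
    conf = {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
    if endpoint is not None:
        conf["spark.hadoop.fs.s3a.endpoint"] = endpoint
        # Address buckets as host/bucket instead of bucket.host,
        # which is the usual setup for an on-prem object store.
        conf["spark.hadoop.fs.s3a.path.style.access"] = "true"
    return conf


cloud = s3a_conf()
on_prem = s3a_conf("https://minio.example.internal:9000")  # hypothetical endpoint

for key, value in sorted(on_prem.items()):
    print(f"--conf {key}={value}")
```

The job itself keeps reading `s3a://bucket/path` URIs in both environments; only the settings passed at submit time differ.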

Compute is another mandatory part of any Modern Data Platform. Kubernetes can easily abstract the compute of your Data Platform from underlying infrastructure and streamline compute across your data center and any cloud.

For Kubernetes deployment, Helm is a real enabler. Using Kubernetes without Helm is as challenging as using cloud without templates. Fortunately, Helm can work with different flavors of Kubernetes on-prem and on the cloud.

Apache Spark is an indispensable tool for ETL, ML, and many other use cases. It can run on Kubernetes on-prem or on a cloud-managed Kubernetes service, which allows for streamlined CI/CD.
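To illustrate what that streamlining can look like, the sketch below builds the same `spark-submit` command for two clusters, so that only the Kubernetes API address and the executor count change between environments. The `spark_submit_args` helper, the URLs, the image name, and the application path are all hypothetical.

```python
def spark_submit_args(k8s_api_url, image, app_file, executors):
    """Return the spark-submit argv for running one job on a Kubernetes cluster."""
    return [
        "spark-submit",
        "--master", f"k8s://{k8s_api_url}",
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_file,
    ]


IMAGE = "registry.example.internal/etl:1.0"  # hypothetical image
APP = "local:///opt/spark/app/etl.py"        # hypothetical path inside the image

on_prem_cmd = spark_submit_args("https://k8s.dc.example.internal:6443", IMAGE, APP, 20)
cloud_cmd = spark_submit_args("https://k8s.cloud.example.com:443", IMAGE, APP, 200)

print(" ".join(on_prem_cmd))
print(" ".join(cloud_cmd))
```

A CI/CD pipeline could keep the per-environment values in its own configuration and leave the job code and container image untouched.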

Dremio is another indispensable tool for Hybrid Data Platform that provides a spectrum of Data Analytics capabilities. Dremio can also run on Kubernetes.

While cubes are often at the core of data analytics, they’re also often an overlooked part of the modern data platform. This is probably because cubes are often perceived as an outdated technology, and because few tools are capable of implementing them as a part of the Data Lake. Fortunately, there is a tool from ActiveViam that is cloud-agnostic and can also be deployed on-prem.

One of the main constraints of most data centers is networking. Our recent Dremio implementation reads petabytes of data to serve hundreds of thousands of queries a day. Data centers easily supported workloads of that scale with a co-located Hadoop architecture, but they were never designed for a modern architecture that decouples data and compute, and they cannot handle network traffic of that scale. We developed a solution that allows us to overcome this obstacle and repurpose hardware that is already available in the data center. We will be publishing a separate article dedicated to this topic. Feel free to reach out for more information.
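A quick back-of-the-envelope calculation shows why the network becomes the bottleneck once data and compute are decoupled. The sketch assumes 1 PB read per day, spread evenly over 24 hours; that volume is picked for illustration only, and real traffic peaks well above the daily average.

```python
def sustained_gbps(bytes_per_day):
    """Average network throughput, in gigabits per second, for a daily read volume."""
    seconds_per_day = 24 * 60 * 60
    bits_per_second = bytes_per_day * 8 / seconds_per_day
    return bits_per_second / 1e9


PB = 10**15  # one petabyte in bytes (decimal)

print(round(sustained_gbps(1 * PB), 1))  # 92.6
```

With co-located storage most of those reads stay on local disks; in a decoupled design every byte crosses the network between the object store and the compute nodes.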

Finally, streamlining CI/CD for infrastructure and code across the entire Hybrid platform is a key success factor.

SaaS

I decided to dedicate a separate section to SaaS not because I need to cover a lot of information, but because of how important it is.

There is a reason why SaaS offerings are losing their “cool kid” appeal. No matter how good the first impression is, most SaaS-based solutions are locked to a cloud-only option. That makes them a significant obstacle to streamlining the infrastructure, tools, and CI/CD, and to extending the capabilities of the Data Platform with other tools.

Roadmap

Here is a short list of items we normally need to have in our Roadmap:

Design cloud-ready on-prem infrastructure that accommodates data center constraints.
Plan for implementing IaaS on-prem long term.
Select a limited number of tools capable of supporting streamlined hybrid architecture.
Decide on what workload to keep on cloud and what on-prem.
Select the low-hanging fruit: the most expensive workloads that are reasonably simple to migrate.
Ensure workload cost optimization is a part of infrastructure and code CI/CD.
Develop migration plan.
Implement first use cases.
Close the governance loop by applying the lessons learned from the experience.
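The roadmap item on making workload cost optimization a part of CI/CD can be as simple as a gate that estimates the cost of a run and blocks the pipeline when it exceeds a budget. Below is a minimal sketch; the `estimated_cost` and `cost_gate` helpers, the price of 0.50 per node-hour, and the budget of 50.00 are hypothetical values used only for illustration.

```python
def estimated_cost(nodes, minutes, price_per_node_hour):
    """Estimated cost of one run: node-hours multiplied by the hourly node price."""
    return nodes * (minutes / 60) * price_per_node_hour


def cost_gate(nodes, minutes, price_per_node_hour, budget):
    """Return (passed, cost); a pipeline would fail the build when passed is False."""
    cost = estimated_cost(nodes, minutes, price_per_node_hour)
    return cost <= budget, cost


# The 200-node, 20-minute Spark ETL job from earlier in the article,
# at a hypothetical price and budget per run.
passed, cost = cost_gate(nodes=200, minutes=20, price_per_node_hour=0.50, budget=50.0)
print(f"estimated cost per run: {cost:.2f}, gate passed: {passed}")
```

A real gate would take the node count and runtime from job history or from the submitted configuration rather than from hard-coded numbers.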

About UCE Systems Corporation

UCE Systems Corp. is a professional services and consulting firm focused on data analytics platform engineering projects. Modern environments typically are described as on-prem, cloud or hybrid cloud, although we at UCE think of “cloud” as an operating model and an accelerator, not a location. We have decades of experience in this area. Our focus is on working with you on developing a synergistic data platform strategy with consideration to your current architecture and tech investments and alignment with your strategy to accelerate time to value and future proof against a dead-end architecture. We have deep expertise in implementing and managing these solutions and we are also proficient in many of the cloud enabled technologies such as Dremio, Databricks, MinIO and others. Our onshore/offshore business model allows us to be agile and competitive and we can structure teams based on your requirements.