Migrating from Hadoop to a Cloud-Ready Architecture for Data Analytics
This post was a collaboration between Kevin Lambrecht of UCE Systems and Raghav Karnam
The cloud operating model and specifically Kubernetes have become the standard for large scale infrastructure today. More importantly, they are evolving at an exceptional pace with material impacts to data science, data analytics and AI/ML.
This transition has a significant impact on the Hadoop ecosystem. While there is considerable credit to that ecosystem for ushering in the era of big data - the architecture underlying it is no longer relevant. New architectures have replaced it and those are continually evolving - requiring enterprises to upgrade as well to remain competitive. The Hadoop ecosystem had its good run as part of an evolving phase, and the time has come to re-evaluate how the systems are built and are adapting to the fast-changing landscape of compute and storage, processing, and querying in a cloud-native and Kubernetes-native way.
This post focuses on how to architect a hybrid cloud strategy that leverages existing Hadoop hardware - something many enterprises have hanging around. While not ideal, particularly for high-performance use cases, this approach can serve as a bridge to a cloud-native architecture and a foundation for more ambitious deployments.
The cloud operating mode defines today's data analytics landscape. Organizations are increasingly adopting the principles of the cloud, whether it's a Software-as-a-Service (SaaS) data platform for turnkey analytics or leveraging cloud infrastructure and software tools for efficient scalability.
Multi-cloud and hybrid-cloud compatible data infrastructure is effectively a requirement for the enterprise. This approach ensures a consistent and portable interface for accessing data and applications, allowing them to be seamlessly deployed across various environments without needing code modifications. As a result, organizations gain the flexibility to run their workloads anywhere, ranging from edge devices to public cloud environments and private cloud deployments, while maintaining a consistent user experience and operational efficiency.
Like any technology change, however, the cloud-native approach raises several critical considerations and challenges that must be addressed.
- Data Governance: Organizations must determine which data is permissible to be stored in the cloud and if all data can be migrated there.
- Migration Approach: Choosing between a "lift and shift" methodology or a comprehensive data platform restructuring requires careful evaluation of the impact on existing systems and applications.
- User Base: The transition to the cloud has implications for user access, permissions, and overall user experience, necessitating careful planning and communication.
- Disaster Recovery: Ensuring adequate disaster recovery control becomes crucial. Organizations must assess whether they can trust the cloud vendor's capabilities or if additional measures are required.
- Migration Timeline: The migration process involves multiple stages, including data transfer, application testing, and user acceptance testing. Understanding the timeline for each phase is essential for effective planning.
- Existing Data Lake Infrastructure: Organizations must evaluate the fate of their investments in the current Data Lake infrastructure and determine how it aligns with the new cloud-based architecture.
- Data Pipeline Adjustments: The migration may require modifications to data pipelines, applications, ETL processes, and scripting to ensure seamless integration with the cloud environment.
- Cost Management: Adopting a "pay-as-you-go" model in the cloud requires careful cost control strategies to avoid unexpected expenses. Organizations should analyze long-term cost implications and plan accordingly.
- Cost vs. Benefit: It is essential to assess whether the challenges and investment of person-hours in the migration process result in cost savings and overall benefits for the organization.
For organizations heavily invested in the Hadoop Data Lake model, a direct shift to the cloud remains a complex and time-consuming project with technical and non-technical challenges.
However, an alternative approach exists, adopting a hybrid cloud strategy that leverages existing Hadoop hardware. By combining on-premises infrastructure with cloud capabilities, organizations can achieve a smoother transition while circumventing many challenges mentioned above.
Implementing a hybrid cloud approach allows organizations to embrace the modern trends in data analytics architecture while capitalizing on their current investments. It provides a pragmatic solution that balances the benefits of the cloud with the need for a gradual and manageable transition.
Case Study: Streamlining a Financial Institution's Data Architecture
A major financial institution recognized the need to simplify operations and reduce costs associated with its large Hadoop cluster. With the technology becoming outdated and costs escalating as the industry moved away from Hadoop towards more fragmented technology stacks, the institution faced numerous challenges. Although transitioning to the cloud seemed like a viable solution, internal enterprise requirements and regulatory requirements hindered progress, leaving the feasibility uncertain.
In the existing environment, multiple query tools were employed, with Dremio serving as the core query engine to decouple compute and storage in Hadoop (HDFS). Dremio proved to be a suitable option, offering impressive performance against Hadoop with its compute engine coexisting as a Yarn-based application within the Hadoop ecosystem. This architecture allowed Dremio's internal metadata collection to route queries to Hadoop Worker nodes that contained the relevant data locally. By leveraging massively parallel processing on Hadoop commodity hardware, which comprised low-cost spinning disks, limited network speeds, and moderately sized servers, exceptional speeds could be achieved when adequately architected. However, transitioning to an architecture with decoupled data and compute required faster drives, high-speed networks, and more powerful servers to maintain comparable performance.
The first challenge involved replacing the HDFS storage in Hadoop. Considering that object-based storage options such as AWS S3, Azure ADLS, and Google GCS were preferred in the cloud, the financial institution focused on object-based and S3-compatible storage. After testing various options, they chose MinIO as the storage platform for several reasons, including the cost of ownership, stability, and the crucial factor that it could run on Hadoop hardware. This last aspect significantly reduced the overall cost of migrating to this modern data storage platform.
Architecturally, when migrating off of Hadoop, it is critical to envision compute and storage segregation to not only be able to choose the hardware appropriate for the workloads but also to plan for the elasticity of workloads. While compute needs are highly elastic, it is essential to plan for massive expansion and contraction for burst workloads and compute spin-up for need-based workloads giving a massive financial incentive purely for economic reasons. Storage needs are almost near linear expansion, which should allow for 100TB to an exabyte horizontal scalability, and that is where MinIO as object storage shines deployable on commodity hardware.
Although the hardware was suboptimal for S3 storage due to commodity spinning disks (average 180 Mib/s read and write) and 20 Mib/s network speed, MinIO's storage architecture enabled massively parallel reads and writes across the four-node server with 24 5.5TB drives, resulting in significantly higher throughput.
MinIO recommends that the storage be explicitly deployed on storage-optimized servers, enabling that compute segregation to grow from 100TB to Exabyte scale. However, in this case study, we needed an out-of-the-box solution to the given networking challenges.
The final piece of the puzzle was co-locating two Dremio worker nodes on each of the four servers running MinIO, with the Dremio data endpoint set as the local MinIO node. While this setup was suboptimal compared with a modern data platform architecture, it offered several advantages:
- Significantly reduced network traffic between Dremio and MinIO.
- Enabled multiple threads of data requests (32 CPU threads per Dremio node) to a local MinIO endpoint.
- More efficient utilization of CPU, memory, and available network bandwidth.
MinIO's minimal processing footprint allowed Dremio to leverage most of the local node's computing power (CPU) and memory. However, the Dremio worker nodes were ring-fenced to 16 CPU cores and 128 GB of RAM, leaving the remaining resources sufficient for MinIO and the operating system to function without performance degradation. Consequently, performance metrics demonstrated at least parity in CPU/memory requirements (Dremio/Hadoop vs. Dremio/MinIO) and an overall 20% improvement in clock time performance.
It is important to note that the provisioning time for Hadoop in Yarn was the differentiating factor. Since the compute was dedicated to MinIO and Dremio, Hadoop setup time was minimized. Additionally, query planning time was reduced, as Dremio did not need to consider data locality with MinIO as it did with Hadoop/HDFS.
All of the above achievements were made without any other performance tuning on the nodes or query modifications between Hadoop and MinIO.
The existing Hadoop hardware, intended to be migrated to MinIO, lacks the NVMe drives necessary for Dremio to leverage their Cloud Columnar Cache (C3). In our future endeavors, we will explore the implementation of the C3 cache, which has demonstrated the ability to service up to 80% of queries through this disk cache layer. This optimization significantly enhances speed by reducing Disk I/O and network traffic.
Additionally, we will conduct tests to assess the feasibility of co-locating other applications, such as Spark, on the same or additional MinIO nodes.
MinIO does not recommend this approach as a standard and this should be treated as one-off to offset the network throttling limitations with an enterprise. Ideally, the organizations are better off solving these network glitches that account for degraded performance across all the applications in the organization.