
Modern Data Platform: Tool Selection

The selection of tools today is the largest it has ever been, and the market is very dynamic, with new tools popping up almost every day. It’s impossible to consider all of them in any depth, so you need a framework to decide which tools to evaluate. I suggest starting from a functional view of your data platform. It may look similar to the diagram below: Storage is separated from the Workloads, reflecting the principle of decoupling data and compute. As we discussed in the previous article, this decoupling is the key to success with your cloud-based modern data platform.

[Diagram: functional view of the data platform, with Storage decoupled from the Workloads]

The data sources may be IoT devices producing streaming data, or OLTP databases with operational data that feeds fraud detection workloads or provides feedback to the machine learning models trying to detect behavioral patterns on your website or your mobile app. Your vendors may supply data in daily batches with various events from your marketing campaigns or with some reference data. The list of potential sources goes on and on. It’s important to understand what your data sources are: their nature, technology, integration mechanism, and priority for your business.

The Data Consumers can also vary significantly. Business people and Data Analysts would like to consume data with tools like Tableau, Power BI, or Excel. Data Scientists are likely to use tools like R, Python, or others capable of running their Machine Learning (ML) models with frameworks such as TensorFlow and XGBoost. Your websites might need to consume rankings produced by serving pre-trained ML models. Fraud Detection consumers need timely information to be able to raise a useful alert. Marketing people are probably the ones who don’t let you sleep at night, as they need near real-time data from all data sources to drive consumer behavior.

[Diagram: data sources, the Data Platform, and data consumers]

The Data Platform is the environment between data sources and data consumers; it must be able to ingest all this variety of data at scale and produce the desired output data in a shape and time that is acceptable to consumers. The white blocks on the diagram are the set of capabilities that enable the data platform to do this job. Tools should be selected so that they match these functional blocks.

Now that we know what we need to put in place, we can start searching for the tools that do the job in the most efficient way. I could have finished this article right here and not imposed my bias towards specific tools on you. However, I would like to list a few tools, not only as examples, but also because I have had exceptionally successful experience with them on many projects.

Let’s start with Storage. Remember, the data must be decoupled from compute. With a cloud-based data platform, the choice of storage is fairly simple: it’s S3 on AWS, ADLS on Azure, and GCS on Google Cloud. For hybrid data platforms and for high-performance use cases, MinIO is likely to be at the core of your data platform.
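
To illustrate the decoupling, here is a minimal PySpark sketch of a compute cluster reading directly from S3-compatible object storage. The bucket name and MinIO endpoint are hypothetical; the s3a settings shown are the standard Hadoop-AWS options and require the hadoop-aws package on the classpath.

```python
from pyspark.sql import SparkSession

# A minimal sketch of compute reading directly from S3-compatible object storage.
# The endpoint and bucket are hypothetical placeholders.
spark = (
    SparkSession.builder
    .appName("decoupled-storage-sketch")
    # Point s3a at a MinIO endpoint instead of AWS S3 (omit these for native S3).
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.internal:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Compute is independent of where the data lives: the same code works whether
# the path points at S3, MinIO, ADLS (abfss://), or GCS (gs://).
orders = spark.read.parquet("s3a://analytics-landing/orders/")
orders.printSchema()
```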

Apache Spark is often the tool of choice for ETL and Data Science. It’s an extremely flexible and tunable engine when used properly. It lets you use SQL, Python, R, and Scala as programming languages, and it integrates easily with Jupyter Notebook, which started out as a data science tool and quickly became popular among data engineers as a development tool. Apache Spark has a modern, highly scalable architecture, can consume a variety of data formats, and is highly extensible. With these powerful architectural capabilities, Spark is quickly replacing a variety of tools from the Apache Hadoop ecosystem.
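
As a small illustration of that flexibility, the sketch below expresses the same aggregation with both the DataFrame API and SQL. The file path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical input: daily click events delivered by a vendor as CSV.
clicks = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://vendor-feed/clicks/2019-08-22/")
)

# The same aggregation expressed with the DataFrame API...
clicks_by_campaign = clicks.groupBy("campaign_id").agg(F.count("*").alias("clicks"))

# ...and with SQL, which is handy when analysts contribute to the pipeline.
clicks.createOrReplaceTempView("clicks")
clicks_by_campaign_sql = spark.sql(
    "SELECT campaign_id, COUNT(*) AS clicks FROM clicks GROUP BY campaign_id"
)

# Write the result back to object storage in a columnar format.
clicks_by_campaign.write.mode("overwrite").parquet("s3a://analytics/clicks_by_campaign/")
```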

[Diagram: ML model training and model serving pipeline on an Apache Spark cluster]

Let’s review a practical scenario that involves ML model training and model serving to produce marketing campaigns, as depicted in the diagram above.

The input data comes from a vendor that tracks Clicks and Views generated by marketing emails, delivered as CSV files stored in S3 buckets. Information on viewing various website pages and adding products to the shopping cart comes in ORC files via a legacy on-prem HDFS. Finally, data on actual orders and customer profiles comes from an Oracle database. All this information allows data scientists to build and train various ML models with the TensorFlow and XGBoost libraries running on an Apache Spark cluster. Data scientists save trained models to S3 buckets for CI/CD. These models are consumed by data engineers and ETL processes. Eventually, the ETL processes generate new marketing campaigns based on the ranked profiles for consumption by the marketing email vendor. Both data scientists and data engineers use Jupyter Notebooks to develop their code; however, while data scientists prefer Python (PySpark) and R, data engineers are likely to use Scala. Note that the entire process is supported by the Apache Spark cluster.
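
A condensed PySpark sketch of this pipeline might look like the following. All paths, table names, and credentials are hypothetical placeholders, and a Spark ML PipelineModel stands in for whichever TensorFlow or XGBoost model the data scientists actually publish.

```python
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("campaign-pipeline-sketch").getOrCreate()

# 1. Vendor feed: email Clicks and Views as CSV files on S3.
clicks = spark.read.option("header", "true").csv("s3a://vendor-feed/clicks/")

# 2. Web behavior: page views and add-to-cart events as ORC files on legacy HDFS.
web_events = spark.read.orc("hdfs://legacy-cluster/data/web_events/")

# 3. Orders and customer profiles from the Oracle database via JDBC.
profiles = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")
    .option("dbtable", "CRM.CUSTOMER_PROFILES")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Join the three sources into a feature table keyed by customer.
features = (
    profiles
    .join(clicks, "customer_id", "left")
    .join(web_events, "customer_id", "left")
)

# Apply a model the data scientists trained and published to S3.
model = PipelineModel.load("s3a://ml-models/propensity/latest/")
ranked_profiles = model.transform(features)

# Publish ranked profiles for the marketing email vendor to pick up.
ranked_profiles.select("customer_id", "prediction").write.mode("overwrite").csv(
    "s3a://campaigns/outbound/ranked_profiles/", header=True
)
```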

While Spark can also be used for ingesting streaming data or as a SQL Query engine for data analytics, there are many other excellent options on the market. 

The data platform will likely need a Metastore to keep information about your data, also known as metadata. Without a metastore, the data platform will be limited to a few specific data formats. For example, in the scenario above, Apache Spark would not be able to treat the ORC files as shared, queryable tables without a metastore. While Apache Hive itself seems to be slowly displaced from modern data platforms, the Hive Metastore remains widely used as the metastore of choice, and Apache Iceberg is gaining popularity due to its powerful capabilities. Interoperability is one of the metastore’s core capabilities: when choosing a tool for the metastore, make sure that it is not locked in by a vendor and can be integrated with all the tools that you choose for your data platform.
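
As a sketch of what that integration looks like, the snippet below points a Spark session at an external Hive Metastore and registers the ORC data from the scenario above as a table that any engine sharing the metastore can query by name. The metastore URI, database, and table layout are hypothetical.

```python
from pyspark.sql import SparkSession

# A sketch of connecting Spark to an external Hive Metastore (URI is hypothetical).
spark = (
    SparkSession.builder
    .appName("metastore-sketch")
    .config("hive.metastore.uris", "thrift://metastore.internal:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Describe the ORC data once, as an external table in the metastore...
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_events (
        customer_id STRING,
        page        STRING,
        event_time  TIMESTAMP
    )
    STORED AS ORC
    LOCATION 'hdfs://legacy-cluster/data/web_events/'
""")

# ...and any engine that talks to the same metastore can now query it by name.
web_events = spark.table("analytics.web_events")
web_events.printSchema()
```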

Dremio is a SQL query engine for the modern data platform that provides high performance when working with datasets at scale. It covers Data Federation and Data Virtualization scenarios, and it provides Data Discovery, Data Lineage, and Data Governance capabilities. However, Dremio is not an ETL tool; it complements rather than competes with Apache Spark.

Let’s review a typical scenario for a business analytics platform in motion, where some data must be migrated from on-prem HDFS to a cloud-based data lake. In a traditional environment, where Tableau or any other visualization tool is connected to the data source directly, without relying on data virtualization tools, moving a data source is a very impactful event. As depicted in the diagram below, the data consumers will need to update thousands of their reports and dashboards. You might also need a new data processing engine to serve SQL queries against the data in the cloud data lake.

[Diagram: direct connections from reports and dashboards impacted by the data source migration]

With data virtualization tools such as Dremio, the impact of moving data sources can be reduced to a minimum. As depicted in the diagram below, the only impacted layer will be the virtual tables defined in Dremio. Note that moving a data source has no impact on reports, dashboards, or data consumers when using Dremio’s Data Virtualization and Data Federation capabilities.

[Diagram: the data source migration isolated behind Dremio’s virtual tables]
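
To make the idea concrete, here is a sketch of the virtualization pattern using Spark SQL views; Dremio’s virtual tables play the same role in its own SQL dialect. The database and table names are hypothetical; the point is that reports query a stable view name while only the view definition is repointed during the migration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("virtual-layer-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Before the migration: the view exposes the table backed by on-prem HDFS.
spark.sql("CREATE OR REPLACE VIEW marts.orders AS SELECT * FROM hdfs_raw.orders")

# Reports and dashboards always query the stable view name.
spark.sql("SELECT region, SUM(amount) FROM marts.orders GROUP BY region").show()

# After the migration: only the view definition is repointed to the cloud copy;
# every report that selects from marts.orders keeps working unchanged.
spark.sql("CREATE OR REPLACE VIEW marts.orders AS SELECT * FROM cloud_lake.orders")
```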

Together, Apache Spark, the Hive Metastore, and Dremio may cover much of a modern data platform’s core functionality.

It’s critical to use open-source tools for the modern data platform, as this reduces your chances of vendor lock-in and lets you implement a multi-cloud or hybrid approach. Most of the modern tools, including Apache Spark, the Hive Metastore, and Dremio, are open source. If creating data platforms is not part of your core business, implementing open-source tools from scratch will distract you from that core business and consume valuable resources. Instead, it would be wise to develop weighted criteria for the areas and functions that are important to you and make a short list of vendors providing tools that cover the functional blocks in your diagram. That will help you choose the tools and vendors that fit your needs in the most efficient way.

Don’t settle for the big data or data warehousing tools available out of the box from your cloud vendor, such as AWS or Azure. There are other, probably much better, options. Running a POC will likely help you find clear winners and enable successful implementation and growth of your modern data platform.

Other articles in this series:

Characteristics, Whats, and Whys of the Modern Data Platform

Modern Data Platform and the Cloud. Part 1

Modern Data Platform and the Cloud. Part 2