Kubernetes native BigData platform with S3

A wide range of tools allows the analysis of BigData, such as Hadoop or NoSQL databases like MongoDB. What they all have in common is that they provide functionality for preparing (enriching, analysing) and persisting data for exploration and visualization.

With the use of cloud services like AWS, cloud-based persistence layers are increasingly used to store the data. Amazon S3 (Simple Storage Service) in AWS is one of these layers. Nowadays, there are further providers of S3-compatible storage apart from AWS. Cloudian and Ceph, for example, provide private (self-hosted) S3-compatible storage.

S3 is web-based, managed, extremely stable, secure, cheap and easy to connect to (simply via HTTPS requests). You don’t need a file system adapter like the one required to attach network storage or other storage devices. This decouples the storage of the objects from the applications that use them.
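To illustrate what "simply via HTTPS requests" means, here is a minimal sketch that builds (but does not send) a path-style GET request for a single object, using only the Python standard library. The endpoint, bucket and key are hypothetical. Note that objects which are not publicly readable additionally require signed requests, which S3 client libraries normally handle for you.

```python
from urllib.parse import quote
from urllib.request import Request


def build_object_request(endpoint: str, bucket: str, key: str) -> Request:
    """Build (but do not send) a path-style GET request for one S3 object."""
    # Path-style addressing: https://<endpoint>/<bucket>/<key>
    # quote() leaves "/" untouched, so key prefixes remain path segments,
    # while characters such as spaces are percent-encoded.
    url = f"https://{endpoint}/{quote(bucket)}/{quote(key)}"
    return Request(url, method="GET")


# Hypothetical endpoint, bucket and key, for illustration only.
request = build_object_request(
    "s3.example.internal", "sales-data", "2023/orders 01.parquet"
)
print(request.get_method(), request.full_url)
```

No mount point or storage driver is involved: the application only needs to know the endpoint, the bucket and the object key.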

The project: BigData platform with Superset, Trino, Hive Metastore and Druid

Our customer uses a self-hosted S3 object store for their BigData. They needed a tool to explore and visualize the data in their S3 buckets. We developed a BigData platform based purely on open-source components:

Superset for exploration and visualization,
Trino together with
Hive Metastore to process queries against data stored in S3 or other data sources, and
Druid for data analytics in S3.

We recommended running the platform on a managed Kubernetes service. The S3 buckets were reachable from within Kubernetes. All files were in CSV or Parquet format. We developed the CI/CD process to build the Docker container (Hive Metastore) and customizable Helm charts (for Hive Metastore, Druid and Trino), allowing a flexible deployment that can connect to different data sources.
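As a rough illustration of what connecting a data source can boil down to, the sketch below renders a Trino catalog file for the Hive connector. It is not the project's actual configuration: the host names are hypothetical, and the `hive.s3.*` property names are those of the legacy S3 support in Trino's Hive connector, so newer Trino releases may expect different property names. Credentials are deliberately left out.

```python
def render_hive_catalog(
    metastore_host: str, s3_endpoint: str, metastore_port: int = 9083
) -> str:
    """Render the contents of a Trino catalog properties file (Hive connector)."""
    properties = {
        "connector.name": "hive",
        # Trino talks to the Hive Metastore via Thrift; 9083 is its default port.
        "hive.metastore.uri": f"thrift://{metastore_host}:{metastore_port}",
        # Assumption: legacy S3 property names of the Hive connector.
        "hive.s3.endpoint": s3_endpoint,
        # Self-hosted S3-compatible stores are commonly addressed path-style.
        "hive.s3.path-style-access": "true",
    }
    return "\n".join(f"{key}={value}" for key, value in properties.items()) + "\n"


# Hypothetical service name and endpoint, for illustration only.
print(render_hive_catalog("hive-metastore", "https://s3.example.internal"))
```

Generating one such file per data source from deployment values is one way a chart can stay generic while the connected sources vary.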

The result

We successfully delivered the platform on Kubernetes. The customer could create dashboards in Superset to visualize their original data stored in the S3 buckets (and other data sources) for their stakeholders without using Hadoop or other databases.

Where available, we used only the original Docker and Helm components for Superset, Trino, …; the missing components we created ourselves. This approach allowed more flexibility in deploying the platform. We heavily customized Superset with a new layout and an integration of the company's single sign-on (SSO) process, so that users can access Superset and Trino via SSO. We also rebuilt the Hive Metastore Docker container to allow connections to the S3 buckets using the company's own TLS certificates.

The platform also shows that Kubernetes is well suited to running a platform that relies on heavy parallelization. We could easily increase the number of replicas of Kubernetes Pods (Trino workers, for example) to speed up analytics processing and visualization in Superset dashboards. We used an Ingress controller (NGINX) in Kubernetes for name-based routing of requests and for HTTPS termination. A PostgreSQL database stored the application metadata (Superset, Hive Metastore).
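Scaling the workers can be as small as one command. The following sketch only assembles a `kubectl scale` command line and prints it, without executing anything. The deployment name and namespace are hypothetical, and a setup managed through Helm would more likely change the replica count in the chart values instead.

```python
import shlex
from typing import List


def scale_command(deployment: str, replicas: int, namespace: str) -> List[str]:
    """Assemble the kubectl arguments for scaling a Deployment."""
    if replicas < 0:
        raise ValueError("replicas must not be negative")
    return [
        "kubectl",
        "scale",
        f"deployment/{deployment}",
        f"--replicas={replicas}",
        "--namespace",
        namespace,
    ]


# Hypothetical deployment name and namespace, for illustration only.
print(shlex.join(scale_command("trino-worker", 6, "bigdata")))
```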

The customer was more than happy with this innovative platform.

Need more information? Let's arrange a call.

Jörn Kleinbub
