How Kubernetes, the back-end tool, powers a data science team's end-to-end ML life cycle, from model development to deployment
When I started my new role as Manager of Data Science, I knew little about setting up a data science platform for a team. In all my previous roles, I had worked on building models and, to some extent, deploying them (or at least supporting the team that did), but I had never needed to set up the infrastructure from scratch. The data science team did not exist then.
So my first objective was to set up a platform, not just for the data science team in a silo, but one that could be integrated with the data engineering and software teams. This is when I was introduced to Kubernetes (k8s) directly. I had heard of it earlier, but I had never gone beyond creating Docker images that someone else would deploy on some infrastructure.
Now, why is Kubernetes required for the data science team? What are some of the challenges faced by data science teams?
- Scalable compute on demand — as data scientists, we work on different problems every day, and each has different resource requirements. There is no one-size-fits-all machine, and even if there were, it couldn't be given to everyone on the data science team
- Version issues — Python and package version conflicts when working in a team or when deploying to production
- Different technologies and platforms — some pre-processing and model building require Spark, while some can be done in pandas. So again, there is no one-size-fits-all local machine
- Sharing work within the team — model results are tracked in an Excel spreadsheet and circulated after each iteration
- And most importantly, production deployment — how do I get the finished model to production? Models often never reach production for real-time use cases, because as data scientists we don't know how to build an API or system around a model. Eventually, we end up scoring the model in batch
I explored solutions, including cloud platform services (AWS SageMaker, GCP AI Platform, Azure Machine Learning), but our main deciding factor was cost, followed by being cloud-agnostic. If cost is not a factor, the cloud services mentioned above are worth using.
We identified Kubernetes as an ideal platform that satisfies most of these requirements: it can scale workloads and serve containerized images. It also keeps us cloud-agnostic; if we have to move to a different vendor, we just lift and shift everything with minimal changes.
Many tools provide complete or similar solutions, such as Kubeflow, Weights & Biases, Kedro, …, but I ended up deploying the three services below as the first version of the data science platform. Though these don't provide a complete MLOps framework, they get us started building the data science platform and the team.
- JupyterHub — Containerized user environments for developing models in interactive Jupyter Notebooks
- MLflow — Experiment tracking and storing model artifacts
- Seldon Core — Simplified way to deploy models in Kubernetes
With these three services, my team builds models (including big data processing) in JupyterHub, tracks tuned parameters and metrics and stores artifacts using MLflow, and serves models in production using Seldon Core.
JupyterHub
Deploying this was the trickiest of all. A standalone JupyterHub setup is easy compared to the Kubernetes installation, but most of the required configuration is documented here —
Zero to JupyterHub with Kubernetes
Since we want to use Spark for some of our data processing, we created two Docker images —
- Basic Notebook — extended from jupyter/minimal-notebook:python-3.9
- Spark Notebook — extended from the above with an additional Spark setup.
The code for these notebook Docker images, and the Helm values for installing JupyterHub with them, are available here.
GitHub – avinashknmr/data-science-tools
A lot of tweaks were needed to enable Google OAuth, start the notebook as the root user but run it as the individual user, retrieve the username, set user-level permissions, persistent volume claims, service accounts, … all of which took me days to get working, especially the auth. But the code in the repo can give you a skeleton to get started.
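To give a sense of how the Spark notebook image is used day to day, here is a minimal sketch of starting a Spark session from a notebook pod. The app name, master setting, and data path are illustrative assumptions, not part of the repo above.

from pyspark.sql import SparkSession

# Local-mode session inside the notebook pod; a Kubernetes master URL
# (e.g. k8s://https://<api-server>) could distribute executors instead.
spark = (
    SparkSession.builder
    .appName("exploration")        # hypothetical app name
    .master("local[*]")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/raw/events/")  # hypothetical path
df.groupBy("event_type").count().show()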
MLflow
Setting up MLflow was easy.
MLflow offers model tracking, model registry, and model serving capabilities. For model serving, however, we use the next tool (Seldon Core).
Build a Docker image with the required Python packages.
FROM python:3.11-slim
RUN pip install mlflow==2.0.1 boto3==1.26.12 awscli==1.27.22 psycopg2-binary==2.9.5
EXPOSE 5000
Once the Docker image is created and pushed to the container registry of your choice, we create a Deployment and Service file for Kubernetes (similar to any other Docker image deployment). A snippet of the deployment YAML is given below.
containers:
- image: avinashknmr/mlflow:2.0.1
  imagePullPolicy: IfNotPresent
  name: mlflow-server
  command: ["mlflow", "server"]
  args:
    - --host=0.0.0.0
    - --port=5000
    - --artifacts-destination=$(MLFLOW_ARTIFACTS_LOCATION)
    - --backend-store-uri=postgresql+psycopg2://$(MLFLOW_DB_USER):$(MLFLOW_DB_PWD)@$(MLFLOW_DB_HOST):$(MLFLOW_DB_PORT)/$(MLFLOW_DB_NAME)
    - --workers=2
There are two main configurations here that took me time to understand and configure —
- artifact location
- backend store
The artifact location is blob storage where your model files are stored and from which they can be picked up for model serving. In our case this is AWS S3, where all models are stored, and it acts as our model registry. There are other options, such as storing the model locally on the server, but whenever the pod restarts that data is gone, and a PersistentVolume is accessible only via the server. By using cloud storage, we can integrate with other services — for example, Seldon Core can pick up the model from this location to serve it.
The backend store holds all the metadata required to run the application, including model tracking — the parameters and metrics of each experiment/run.
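To make this split concrete, here is a minimal sketch of how a training run might talk to this MLflow server: parameters and metrics land in the Postgres backend store, while artifacts go to the S3 artifact location. The tracking URI and experiment name are assumptions based on the deployment above, not values from the repo.

import mlflow

# Assumed in-cluster service URL for the MLflow deployment shown above
mlflow.set_tracking_uri("http://mlflow-server.mlflow.svc.cluster.local:5000")
mlflow.set_experiment("iris-classifier")    # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)   # written to the Postgres backend store
    mlflow.log_metric("f1_score", 0.92)     # written to the Postgres backend store
    mlflow.log_artifact("model.joblib")     # uploaded to the S3 artifact location

Since the server is started with --artifacts-destination, it proxies artifact uploads to S3, so the notebook typically only needs network access to the tracking URL rather than its own S3 credentials.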
Seldon Core
The second trickiest of the three is Seldon Core.
Seldon Core is like a wrapper around your model that can package, deploy, and monitor ML models. It removes the dependency on ML engineers to build the deployment pipelines.
We did the installation using a Helm chart, with Istio for ingress. There are two options for ingress — Istio and Ambassador. I'm not getting into setting up Istio, as the DevOps team did that setup. Seldon is installed with the Helm and kubectl commands below.
kubectl create namespace seldon-system
kubectl label namespace seldon-system istio-injection=enabled
helm repo add seldonio https://storage.googleapis.com/seldon-charts
helm repo update
helm install seldon-core seldon-core-operator \
    --repo https://storage.googleapis.com/seldon-charts \
    --set usageMetrics.enabled=true \
    --set istio.enabled=true \
    --set istio.gateway=seldon-system/seldon-gateway \
    --namespace seldon-system
Assuming you have Istio set up, below is the YAML to create the Gateway and VirtualService for Seldon.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: seldon-gateway
  namespace: seldon-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: seldon-vs
  namespace: seldon-system
spec:
  hosts:
  - "*"
  gateways:
  - seldon-gateway
  http:
  - match:
    - uri:
        prefix: /seldon
    route:
    - destination:
        host: seldon-webhook-service.seldon-system.svc.cluster.local
        port:
          number: 8000
Below is a sample k8s deployment file to serve the iris model from GCS. If you use the scikit-learn package for model development, the model should be exported using joblib and named model.joblib (a sketch of this export follows the example).
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
  namespace: prod-data-science
spec:
  name: iris
  predictors:
  - graph:
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/v1.16.0-dev/sklearn/iris
      name: classifier
    name: default
    replicas: 1
In this example we use SKLEARN_SERVER, but there are also integrations such as MLFLOW_SERVER and TF_SERVER for MLflow and TensorFlow respectively.
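As noted above, the SKLEARN_SERVER expects a scikit-learn estimator serialized with joblib and named model.joblib. A rough sketch of producing and staging that artifact (the model and bucket path here are illustrative, not the ones from the example URI):

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model and serialize it in the format SKLEARN_SERVER expects
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)
joblib.dump(model, "model.joblib")
# Upload model.joblib to e.g. gs://<your-bucket>/models/iris/ and point modelUri at that folder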
Seldon Core supports not only REST but also gRPC for seamless server-to-server calls.
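Once the SeldonDeployment is running behind the Istio gateway, a REST call looks roughly like the sketch below. The path follows Seldon's usual /seldon/<namespace>/<deployment-name>/api/v1.0/predictions pattern; the ingress host placeholder depends on how your Istio gateway is exposed.

import requests

# <ingress-host> is whatever your Istio ingress gateway is exposed on
url = "http://<ingress-host>/seldon/prod-data-science/iris-model/api/v1.0/predictions"
payload = {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}  # one iris sample

response = requests.post(url, json=payload)
print(response.json())  # predictions come back under data.ndarray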
Conclusion
These tools are open source and deployable in Kubernetes, which makes them cost-effective for small teams and keeps us cloud-agnostic. They cover most of the challenges a data science team faces, such as a centralized Jupyter Notebook environment for collaboration without version issues, and serving models without dedicated ML engineers.
JupyterHub and Seldon Core leverage Kubernetes capabilities: JupyterHub spins up a pod for each user when they log in and kills it when idle, and Seldon Core wraps a model and serves it as an API in a few minutes. MLflow is the only standalone installation, and it connects model development and model deployment, acting as a model registry to track models and store artifacts for later use.