How Kubernetes, the back-end tool, powers a data science team's end-to-end ML life cycle, from model development to deployment
When I started my new role as Manager of Data Science, I knew little about setting up a data science platform for a team. In all my previous roles, I had worked on building models and, to some extent, deploying them (or at least supporting the team that did), but I had never needed to set up the infrastructure from scratch. The data science team did not exist then.
So my first objective was to set up a platform, not just for the data science team in a silo, but one that could be integrated with the data engineering and software teams. This is when I was introduced to Kubernetes (k8s) directly. I had heard of it earlier, but I had never gone beyond creating Docker images that someone else would deploy on some infrastructure.
Now, why is Kubernetes required for the data science team? What are some of the challenges faced by data science teams?
- Scalable compute on demand — as data scientists, we work on different problems every day, and each has different resource requirements. There is no one-size-fits-all machine, and even if there were, it couldn't be given to everyone on the data science team
- Version issues — Python and package version conflicts when working in a team or when deploying to production
- Different technologies and platforms — some pre-processing and model building require Spark, while some can be done in pandas. So again, there is no one-size-fits-all local machine
- Sharing work within the team — model results are tracked in an Excel spreadsheet and circulated after each iteration
- And most importantly, production deployment — how do I get the finished model to production? Models often never reach production for real-time use cases, because as data scientists we don't know how to build an API or system around a model. Eventually, we end up scoring the model in batch
I explored solutions, including cloud platform services (AWS SageMaker, GCP AI Platform, Azure Machine Learning), but our main deciding factor was cost, followed by being cloud-agnostic. If cost is not a factor, the cloud services mentioned above are worth using.
We identified Kubernetes as an ideal platform that satisfies most of these requirements: it can scale workloads and serve containerized images. It also keeps us cloud-agnostic; if we have to move to a different vendor, we just lift and shift everything with minimal changes.
Many tools provide complete or similar solutions, such as Kubeflow, Weights & Biases, Kedro, …, but I ended up deploying the three services below as the first version of the data science platform. Though these don't provide a complete MLOps framework, they get us started building the data science platform and the team.
- JupyterHub — Containerized user environments for developing models in interactive Jupyter Notebooks
- MLflow — Experiment tracking and storing model artifacts
- Seldon Core — Simplified way to deploy models in Kubernetes
With these three services, my team builds models (including big data processing) in JupyterHub, tracks tuned parameters and metrics and stores artifacts using MLflow, and serves models in production using Seldon Core.
JupyterHub
Deploying this was the trickiest of all. A standalone JupyterHub setup is easy compared to the Kubernetes installation, but most of the required configuration is documented here —
Zero to JupyterHub with Kubernetes
Since we want to use Spark for some of our data processing, we created two Docker images —
- Basic Notebook — extended from jupyter/minimal-notebook:python-3.9
- Spark Notebook — extended from the above with an additional Spark setup.
The code for these notebook Docker images, and the Helm values for installing JupyterHub with them, are available here.
GitHub – avinashknmr/data-science-tools
A lot of tweaks were needed to enable Google OAuth, start the notebook as the root user but run it as the individual user, retrieve the username, set user-level permissions, persistent volume claims, service accounts, … all of which took me days to get working, especially the auth. But the code in the repo can give you a skeleton to get started.
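To give a sense of how the Spark notebook image is used day to day, here is a minimal sketch of starting a Spark session from a notebook pod. The app name, master setting, and data path are illustrative assumptions, not part of the repo above.

from pyspark.sql import SparkSession

# Local-mode session inside the notebook pod; a Kubernetes master URL
# (e.g. k8s://https://<api-server>) could distribute executors instead.
spark = (
    SparkSession.builder
    .appName("exploration")        # hypothetical app name
    .master("local[*]")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/raw/events/")  # hypothetical path
df.groupBy("event_type").count().show()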
MLflow
Setting up MLflow was easy.
MLflow offers model tracking, model registry, and model serving capabilities. For model serving, however, we use the next tool (Seldon Core).
Build a Docker image with the required Python packages.
FROM python:3.11-slim
RUN pip install mlflow==2.0.1 boto3==1.26.12 awscli==1.27.22 psycopg2-binary==2.9.5
EXPOSE 5000
Once the Docker image is created and pushed to the container registry of your choice, we create a Deployment and Service file for Kubernetes (similar to any other Docker image deployment). A snippet of the deployment YAML is given below.
containers:
- image: avinashknmr/mlflow:2.0.1
  imagePullPolicy: IfNotPresent
  name: mlflow-server
  command: ["mlflow", "server"]
  args:
    - --host=0.0.0.0
    - --port=5000
    - --artifacts-destination=$(MLFLOW_ARTIFACTS_LOCATION)
    - --backend-store-uri=postgresql+psycopg2://$(MLFLOW_DB_USER):$(MLFLOW_DB_PWD)@$(MLFLOW_DB_HOST):$(MLFLOW_DB_PORT)/$(MLFLOW_DB_NAME)
    - --workers=2
There are two main configurations here that took me time to understand and configure —
- artifact location
- backend store
The artifact location is blob storage where your model files are stored and from which they can be picked up for model serving. In our case this is AWS S3, where all models are stored, and it acts as our model registry. There are other options, such as storing the model locally on the server, but whenever the pod restarts that data is gone, and a PersistentVolume is accessible only via the server. By using cloud storage, we can integrate with other services — for example, Seldon Core can pick up the model from this location to serve it.
The backend store holds all the metadata required to run the application, including model tracking — the parameters and metrics of each experiment/run.
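To make this split concrete, here is a minimal sketch of how a training run might talk to this MLflow server: parameters and metrics land in the Postgres backend store, while artifacts go to the S3 artifact location. The tracking URI and experiment name are assumptions based on the deployment above, not values from the repo.

import mlflow

# Assumed in-cluster service URL for the MLflow deployment shown above
mlflow.set_tracking_uri("http://mlflow-server.mlflow.svc.cluster.local:5000")
mlflow.set_experiment("iris-classifier")    # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)   # written to the Postgres backend store
    mlflow.log_metric("f1_score", 0.92)     # written to the Postgres backend store
    mlflow.log_artifact("model.joblib")     # uploaded to the S3 artifact location

Since the server is started with --artifacts-destination, it proxies artifact uploads to S3, so the notebook typically only needs network access to the tracking URL rather than its own S3 credentials.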
Seldon Core
The second trickiest of the three is Seldon Core.
Seldon Core is like a wrapper around your model that can package, deploy, and monitor ML models. It removes the dependency on ML engineers to build the deployment pipelines.
We did the installation using a Helm chart, with Istio for ingress. There are two options for ingress — Istio and Ambassador. I'm not getting into setting up Istio, as the DevOps team did that setup. Seldon is installed with the Helm and kubectl commands below.
kubectl create namespace seldon-system
kubectl label namespace seldon-system istio-injection=enabled
helm repo add seldonio https://storage.googleapis.com/seldon-charts
helm repo update
helm install seldon-core seldon-core-operator \
    --repo https://storage.googleapis.com/seldon-charts \
    --set usageMetrics.enabled=true \
    --set istio.enabled=true \
    --set istio.gateway=seldon-system/seldon-gateway \
    --namespace seldon-system
Assuming you have Istio set up, below is the YAML to create the Gateway and VirtualService for Seldon.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: seldon-gateway
  namespace: seldon-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: seldon-vs
  namespace: seldon-system
spec:
  hosts:
  - "*"
  gateways:
  - seldon-gateway
  http:
  - match:
    - uri:
        prefix: /seldon
    route:
    - destination:
        host: seldon-webhook-service.seldon-system.svc.cluster.local
        port:
          number: 8000
Below is a sample k8s deployment file to serve the iris model from GCS. If you use the scikit-learn package for model development, the model should be exported using joblib and named model.joblib (a sketch of this export follows the example).
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
  namespace: prod-data-science
spec:
  name: iris
  predictors:
  - graph:
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/v1.16.0-dev/sklearn/iris
      name: classifier
    name: default
    replicas: 1
In this example we use SKLEARN_SERVER, but there are also integrations such as MLFLOW_SERVER and TF_SERVER for MLflow and TensorFlow respectively.
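As noted above, the SKLEARN_SERVER expects a scikit-learn estimator serialized with joblib and named model.joblib. A rough sketch of producing and staging that artifact (the model and bucket path here are illustrative, not the ones from the example URI):

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model and serialize it in the format SKLEARN_SERVER expects
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)
joblib.dump(model, "model.joblib")
# Upload model.joblib to e.g. gs://<your-bucket>/models/iris/ and point modelUri at that folder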
Seldon Core supports not only REST but also gRPC for seamless server-to-server calls.
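Once the SeldonDeployment is running behind the Istio gateway, a REST call looks roughly like the sketch below. The path follows Seldon's usual /seldon/<namespace>/<deployment-name>/api/v1.0/predictions pattern; the ingress host placeholder depends on how your Istio gateway is exposed.

import requests

# <ingress-host> is whatever your Istio ingress gateway is exposed on
url = "http://<ingress-host>/seldon/prod-data-science/iris-model/api/v1.0/predictions"
payload = {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}  # one iris sample

response = requests.post(url, json=payload)
print(response.json())  # predictions come back under data.ndarray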
Conclusion
These tools are open source and deployable in Kubernetes, which makes them cost-effective for small teams and keeps us cloud-agnostic. They cover most of the challenges a data science team faces, such as a centralized Jupyter Notebook environment for collaboration without version issues, and serving models without dedicated ML engineers.
JupyterHub and Seldon Core leverage Kubernetes capabilities: JupyterHub spins up a pod for each user when they log in and kills it when idle, and Seldon Core wraps a model and serves it as an API in a few minutes. MLflow is the only standalone installation, and it connects model development and model deployment, acting as a model registry to track models and store artifacts for later use.