Adarga Open Sources its Kubernetes Operator for Flyte Workflow Registration
Adarga deploys gold-standard MLOps processes, enabling our customers to leverage cutting-edge machine learning solutions at speed and with confidence.
We use Flyte to run our Machine Learning (ML) pipelines and Kubernetes to host all our services. One of the challenges we faced when adopting Flyte was managing workflows across our different Kubernetes environments. Each workflow can exist in multiple versions, so we had to devise a process for promoting workflows through the environments.
To streamline the process of promoting these workflows across environments, we built a custom Kubernetes Operator for Flyte workflow registration. Today, we’re excited to announce that we are open-sourcing this operator to benefit the broader Kubernetes and Flyte community. In this blog post, we’ll walk you through the key concepts, implementation details, and the journey of building the operator.
Background: Why a Kubernetes Operator?
Flyte is an open-source platform designed for orchestrating machine learning and data workflows at scale, leveraging Kubernetes for deployment. However, the conventional approach to registering workflows in Flyte, which involves using flytectl or pyflyte in CI pipelines, had a significant drawback: it posed a security risk as it required external access to the Flyte cluster.
In our case, we wanted to:
- Keep workflows registered securely and entirely within the Kubernetes environment.
- Reduce the overhead of manual registration.
- Improve scalability and manage the growing number of Flyte workflows efficiently.
The solution? A custom Kubernetes operator to automate Flyte workflow registration within the cluster.
What is a Kubernetes Operator?
Before diving into the implementation, let’s quickly review what a Kubernetes Operator is.
A Kubernetes Operator is a method for managing complex Kubernetes applications by extending the Kubernetes API. Operators automate application lifecycle management tasks, such as deployment, scaling, and failure recovery, by leveraging Kubernetes’ native tooling (kubectl) and APIs. Operators typically consist of two components:
- Custom Resource Definitions (CRDs): These define new types of resources within Kubernetes (e.g. our custom FlyteRegistration CRD).
- Controller: The controller watches instances of these resources and reconciles the system's actual state with the desired state. In our case, it registers Flyte workflows when the associated FlyteRegistration resources are created or modified.
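To make the reconciliation idea concrete, here is a minimal sketch in plain Go with no Kubernetes dependencies. The `Registration` type and both function names are hypothetical and exist only for illustration; a real controller built with Kubebuilder receives events from the API server rather than a slice of desired items.

```go
package main

import "sort"

// Registration identifies one workflow version that should exist in Flyte.
// Hypothetical type, used only to illustrate reconciliation.
type Registration struct {
	Name    string
	Version string
}

// pendingRegistrations compares the desired state (what the custom resources
// declare) with the actual state (what is already registered) and returns the
// registrations still to be performed, sorted so the order is deterministic.
func pendingRegistrations(desired []Registration, actual map[Registration]bool) []Registration {
	var pending []Registration
	for _, r := range desired {
		if !actual[r] {
			pending = append(pending, r)
		}
	}
	sort.Slice(pending, func(i, j int) bool {
		if pending[i].Name != pending[j].Name {
			return pending[i].Name < pending[j].Name
		}
		return pending[i].Version < pending[j].Version
	})
	return pending
}

// reconcile performs one pass: it registers whatever is missing and records
// each success in the actual state. The register callback stands in for the
// real side effect.
func reconcile(desired []Registration, actual map[Registration]bool, register func(Registration) error) error {
	for _, r := range pendingRegistrations(desired, actual) {
		if err := register(r); err != nil {
			return err // a real controller would requeue and retry later
		}
		actual[r] = true
	}
	return nil
}
```

Running `reconcile` a second time with the same inputs performs no further registrations, which is the property that makes a reconcile loop safe to repeat.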
The Problem: Traditional Flyte Workflow Registration
In Flyte, workflows need to be registered before they can be executed. Traditionally, this is done through a CI pipeline using the flytectl command-line tool. However, this method requires external access to the Flyte instance, making it susceptible to security risks. Additionally, this process is not ideal for managing the growing number of workflows we need to promote across different environments.
Our goal was to create a Kubernetes-native solution that automates this process while maintaining security.
Building the Flyte Workflow Registration Kubernetes Operator
Step 1: Using Kubebuilder to Scaffold the Operator
Building Kubernetes operators from scratch can be quite complex, so we turned to Kubebuilder, a framework that streamlines the process of building operators. Kubebuilder provides a solid foundation by automatically generating the necessary boilerplate code to interact with Kubernetes APIs. This allowed us to focus on the specific business logic around Flyte workflow registration.
Step 2: Defining the Custom Resource Definition (CRD)
Our custom resource is the FlyteRegistration CRD, which defines the state for a Flyte workflow that needs to be registered. Here’s an example of how this looks:
This CRD includes the following fields:
- workflowName: The name of the Flyte workflow.
- workflowVersion: The version of the workflow.
- workflowProject: The Flyte project the workflow belongs to.
- workflowDomain: The Flyte domain under which the workflow will be registered.
- workflowPackageUri: The URL of the workflow package, stored in JFrog Artifactory or any OCI artifact compatible repository.
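In a Kubebuilder project, fields like these are usually declared as a Go struct from which the CRD schema is generated. The sketch below shows roughly what such a spec type might look like. The Go field names, the JSON tags, and the `Validate` helper are assumptions made for illustration and may not match the types in the published operator.

```go
package main

import (
	"errors"
	"strings"
)

// FlyteRegistrationSpec mirrors the five fields described in this post.
// Field names and JSON tags are assumed, not copied from the operator.
type FlyteRegistrationSpec struct {
	WorkflowName       string `json:"workflowName"`
	WorkflowVersion    string `json:"workflowVersion"`
	WorkflowProject    string `json:"workflowProject"`
	WorkflowDomain     string `json:"workflowDomain"`
	WorkflowPackageURI string `json:"workflowPackageUri"`
}

// Validate returns an error naming the first field that is empty.
// Hypothetical helper: the real operator may rely on CRD schema validation.
func (s FlyteRegistrationSpec) Validate() error {
	fields := []struct{ name, value string }{
		{"workflowName", s.WorkflowName},
		{"workflowVersion", s.WorkflowVersion},
		{"workflowProject", s.WorkflowProject},
		{"workflowDomain", s.WorkflowDomain},
		{"workflowPackageUri", s.WorkflowPackageURI},
	}
	for _, f := range fields {
		if strings.TrimSpace(f.value) == "" {
			return errors.New("spec." + f.name + " must not be empty")
		}
	}
	return nil
}
```

Rejecting incomplete specs before any download or registration is attempted keeps failures cheap and easy to diagnose.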
Step 3: Automating the Workflow Registration
The operator’s primary function is to watch for changes to FlyteRegistration custom resources. When a new or modified resource is detected, the operator performs the following steps:
- Downloads the Workflow Package: The operator fetches the corresponding Flyte workflow package from the repository where our workflows are stored.
- Registers the Workflow: Using flytectl, the operator registers the workflow with the Flyte instance running in the Kubernetes cluster.
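The two steps can be sketched in Go as follows. This is an illustration under stated assumptions rather than the operator's actual code: the `RegistrationRequest` type and function names are hypothetical, the package is assumed to be reachable with a plain HTTP GET (pulling from an OCI registry would need a registry client instead), and the `flytectl register files` invocation follows the publicly documented form, which the operator may or may not use.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"os/exec"
)

// RegistrationRequest carries the values needed for one registration.
// Hypothetical type for illustration.
type RegistrationRequest struct {
	Project, Domain, Version, PackageURI string
}

// downloadPackage fetches the workflow archive into a temporary file and
// returns its path. Assumes the URI is served over plain HTTP(S).
func downloadPackage(ctx context.Context, uri string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, uri, nil)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("download %s: unexpected status %s", uri, resp.Status)
	}
	f, err := os.CreateTemp("", "flyte-package-*.tgz")
	if err != nil {
		return "", err
	}
	defer f.Close()
	if _, err := io.Copy(f, resp.Body); err != nil {
		os.Remove(f.Name())
		return "", err
	}
	return f.Name(), nil
}

// registerArgs builds the flytectl arguments. The archive path is passed
// positionally and --archive marks it as a packaged archive.
func registerArgs(r RegistrationRequest, archivePath string) []string {
	return []string{
		"register", "files", archivePath, "--archive",
		"--project", r.Project,
		"--domain", r.Domain,
		"--version", r.Version,
	}
}

// registerWorkflow downloads the package and then shells out to flytectl,
// which must be on PATH and configured to reach the in-cluster Flyte instance.
func registerWorkflow(ctx context.Context, r RegistrationRequest) error {
	archive, err := downloadPackage(ctx, r.PackageURI)
	if err != nil {
		return fmt.Errorf("download package: %w", err)
	}
	defer os.Remove(archive)
	out, err := exec.CommandContext(ctx, "flytectl", registerArgs(r, archive)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("flytectl register: %w: %s", err, out)
	}
	return nil
}
```

Keeping argument construction in its own function means the command line can be checked in a unit test without invoking flytectl or needing a Flyte cluster.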
Step 4: Deployment Using Helm Chart
Kubernetes operators can be complex to deploy due to the various configurations involved. To simplify this, we used Helm to package the operator and its associated Kubernetes resources. Since Kubebuilder generates a dynamic configuration, we used helmify to automatically generate the Helm chart. The Helm chart is then stored in our public Docker Hub repository and can be retrieved with the following command:
This deployment workflow ensures that the operator is deployed consistently across environments.
Step 5: Integrating the Operator with Flyte Workflow
Once deployed, the operator continuously monitors FlyteRegistration resources and ensures that workflows are correctly registered with the Flyte instance. This integration improves the operational efficiency of Flyte workflow promotion across environments without the need for manual intervention.
Challenges Faced
While building this operator, we encountered several challenges:
1. Integration with Flyte Golang SDK
Our initial attempt was to call the Flyte API directly from Go, but that proved cumbersome. The approach required unpacking the entire workflow into raw protobuf files, which made the process more complex than necessary. Flyte does not yet provide a Go SDK; during one of the community syncs, the Flyte maintainers mentioned that they would love to have one but need help building it.
We opted to use flytectl in the operator to streamline the workflow registration process.
2. Testing the Operator
Testing the operator against a live Flyte instance proved challenging due to the complexities of setting up a fully functional Flyte environment. Flyte's CLI tool (flytectl) builds a mini Kubernetes cluster dynamically, which interfered with our ability to run integration tests. As a result, we used mocks and tested the operator’s reconcile logic directly.
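One common way to test reconcile logic without a live Flyte instance is to hide the registration side effect behind an interface and substitute a fake in tests. The sketch below illustrates that pattern only. The `Registrar` interface, `fakeRegistrar`, `Result`, and `reconcileOnce` are hypothetical names, not the operator's actual test code.

```go
package main

import "errors"

// errRegistration is a sentinel error the fake can be told to return.
var errRegistration = errors.New("registration failed")

// Registrar abstracts the side effect of registering a package with Flyte,
// so the production implementation (flytectl) can be swapped out in tests.
type Registrar interface {
	Register(project, domain, version, packageURI string) error
}

// fakeRegistrar records each call instead of contacting Flyte.
type fakeRegistrar struct {
	calls []string
	err   error // returned from every Register call when non-nil
}

func (f *fakeRegistrar) Register(project, domain, version, packageURI string) error {
	f.calls = append(f.calls, project+"/"+domain+"/"+version)
	return f.err
}

// Result is a simplified stand-in for what a reconciler reports back.
type Result struct {
	Registered bool
	Requeue    bool
}

// reconcileOnce attempts one registration and asks for a retry on failure.
func reconcileOnce(r Registrar, project, domain, version, packageURI string) (Result, error) {
	if err := r.Register(project, domain, version, packageURI); err != nil {
		return Result{Requeue: true}, err
	}
	return Result{Registered: true}, nil
}
```

A test can then assert both the success path and the retry path by constructing `fakeRegistrar` with and without an injected error, with no cluster involved.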
Conclusion
The development of a custom Kubernetes operator for Flyte workflow registration has been an exciting and rewarding project. By automating the process of registering workflows within a Kubernetes cluster, we’ve significantly reduced the complexity of managing Flyte workflows across different environments.
Our operator has reduced our Machine Learning model workflow deployment times, increased model trustworthiness, and enabled us to quickly adapt to changing customer needs. This is vital, especially as our work in defence and national security requires us to deploy to a range of different environments.
By open-sourcing this operator, we hope to contribute to the Flyte and Kubernetes ecosystems, enabling other teams to automate their workflow registration process while maintaining security and efficiency.
You can access the code for the Flyte Workflow Registration Kubernetes Operator on GitHub.
We encourage you to contribute, raise issues, and help us improve it for the community.
Thank you to our engineering, MLOps, and platform teams for making this available. To find out more about our AI and Machine Learning services, get in touch here.