github: environment mlflow client repository
MLFlow is an open-source product from Databricks that supports part of the machine learning model lifecycle. Its tracking components support data scientists during experimentation and allow models to be selected for use in ML systems based on their performance metrics. The model registry allows model versions to be registered under model names to facilitate model deployment. We want to use one MLFlow instance and one Databricks workspace to support multiple deployment targets (acceptance, staging, production, etc.) while providing security guarantees for production models. To that end we developed an MLFlow client that inherits from the vanilla MLFlow client, modifying and extending it to manage multiple environments in one model registry in a secure manner. Our client works in tandem with Databricks permissions, for which we will show the terraform snippets.
1 Why one Databricks workspace?
A secure Databricks workspace is deployed in virtual networks, which require separate subnets for the control and the compute planes. The IP range of the compute plane subnet constrains the maximum number of compute nodes that can run in parallel. Due to company-wide limits on IP range size, we chose to use one large subnet for one Databricks workspace instead of multiple smaller subnets tied to multiple Databricks workspaces. Thus, we want to manage several logical environments (acceptance, preproduction and production) within one Databricks workspace.
1.1 Scaling with Databricks pools
Databricks pools allow idle instances (virtual machines) to be reused during cluster creation. Pools provide moderate benefits in start-up and auto-scaling speed. Each instance in the pool requires one IP address from the compute plane subnet. If the maximum is reached, the next instance creation request fails without any graceful error handling or backoff mechanism. This limitation of Databricks pools makes a large subnet important: you can hit the limit at any time, and when you do, your applications crash.
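The relation between subnet size and the instance ceiling can be estimated up front. The sketch below is only an illustration: the function names are hypothetical, the CIDR ranges are made up, and it assumes one IP address per instance as described above plus the five addresses that Azure reserves in every subnet.

```python
import ipaddress

# Azure reserves 5 addresses in every subnet (the first four and the last one).
AZURE_RESERVED_ADDRESSES = 5


def max_pool_instances(subnet_cidr: str) -> int:
    """Upper bound on simultaneously running instances in a compute subnet,
    assuming each instance takes exactly one IP address."""
    subnet = ipaddress.ip_network(subnet_cidr, strict=True)
    return max(subnet.num_addresses - AZURE_RESERVED_ADDRESSES, 0)


def can_start(requested: int, running: int, subnet_cidr: str) -> bool:
    """Check capacity before asking for instances, because the creation
    request itself fails hard once the subnet is exhausted."""
    return running + requested <= max_pool_instances(subnet_cidr)
```

With these assumptions a /24 subnet tops out at 251 instances and a /26 at 59, which shows how quickly several small subnets would cap each workspace.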
2 Identity based security
Our security approach is based on identities, mostly application registrations in Azure and Databricks users backed by AAD users. For instance, we run our acceptance tests in the same Databricks workspace on production data, using separate identities with read-only permissions that copy production data to transient acceptance test storage accounts. Identities are very convenient on Azure because Databricks supports credential passthrough, which leverages all the Azure AD roles and groups that we set on the identities. Because we work with multiple identities in one Databricks workspace, we also rely on Databricks permissions and groups.
2.1 Databricks groups
Each logical environment in our single Databricks workspace has a Databricks group for its application principals named “apps_{env_name}” (e.g. “apps_acc”, “apps_prod”). A further group, “apps_all”, contains all active application principals. The groups and their permissions are managed with terraform.
3 Environment MLFlow client
3.1 MLFlow experiment tracking
The MLFlow experiment storage is adapted to support multiple logical environments by managing the storage location per environment. Our solution assigns directories such as “/experiments/acc” and “/experiments/prod” to store experiment data. Databricks permissions management is used to give directory rights to the respective Databricks groups (in this example “apps_acc” and “apps_prod”). This allows experiments and models to be logged securely for each of the logical environments, without the user having to think about it.
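The mapping from an environment name to its group and experiment directory is mechanical, so it can live in one small helper. This is a minimal sketch, not our actual client code: the class name `EnvironmentLayout` and the experiment name “churn” are hypothetical, while the group and directory patterns follow the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentLayout:
    """Derives the per-environment names used for groups and experiments."""

    env_name: str

    @property
    def group_name(self) -> str:
        # Databricks group holding the application principals of this environment.
        return f"apps_{self.env_name}"

    @property
    def experiment_root(self) -> str:
        # Directory on which the environment's group gets its rights.
        return f"/experiments/{self.env_name}"

    def experiment_path(self, experiment_name: str) -> str:
        # Full experiment location, always below the environment's own root.
        return f"{self.experiment_root}/{experiment_name.strip('/')}"
```

A client built on this could pass `EnvironmentLayout("acc").experiment_path("churn")` to `mlflow.set_experiment`, so the caller only ever supplies the short experiment name.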
3.2 MLFlow model registry
The MLFlow model registry is a central place to register models and model versions for use in ML systems. Models logged and registered during data science experiments in different environments end up in the same model registry. The MLFlow model registry is harder to adapt for use with multiple logical environments. The deployment target concept in MLFlow assumes that each version of a specific model can have a different “stage”. The model stage is a property of a model version registered in the MLFlow model registry. It can be set to “None”, “Staging”, “Production” or “Archived”. Although helpful, the model stage is quite limiting because we cannot define our own values for it, so it does not fully support our need to define various logical environments. We do use it to manage permissions on our model versions with Databricks, which we will describe later.
Model naming
We decided that model naming would be our first layer of differentiation between any number of logical environments. Every registered model name is suffixed with an environment identifier: “{model_name}_{env_name}”. Our MLFlow client manages this naming transparently during model registration and retrieval, based on the environment name it gets from a system environment variable or from the environment name passed into its constructor. As with experiment management, the interface is mostly unchanged compared to vanilla MLFlow, thanks to the abstraction within our environment MLFlow client.
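The naming rule can be sketched as a pair of small functions. Everything here is illustrative: the variable name `MLFLOW_ENV_NAME` and the function names are hypothetical, and letting the explicit argument win over the environment variable is an assumption, since the precedence is not fixed by the description above.

```python
import os
from typing import Optional

# Hypothetical name of the system environment variable holding the environment.
ENV_VAR = "MLFLOW_ENV_NAME"


def resolve_env_name(env_name: Optional[str] = None) -> str:
    """Return the explicit environment name, else the one from the system
    environment (assumed precedence: the explicit argument wins)."""
    name = env_name or os.environ.get(ENV_VAR)
    if not name:
        raise ValueError(f"No environment name given and {ENV_VAR} is not set")
    return name


def registered_name(model_name: str, env_name: str) -> str:
    """Name under which the model is stored in the shared registry."""
    return f"{model_name}_{env_name}"


def base_name(registered: str, env_name: str) -> str:
    """Inverse of registered_name, used when handing names back to the caller."""
    suffix = f"_{env_name}"
    if not registered.endswith(suffix):
        raise ValueError(f"{registered!r} does not belong to environment {env_name!r}")
    return registered[: -len(suffix)]
```

Applying `registered_name` on the way in and `base_name` on the way out is what would keep the caller's interface identical to vanilla MLFlow.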
Model permissions
Our goal is to separate production models from those of any other environment and to prevent any non-production principal from registering or retrieving production model versions. As mentioned, the model Stage can be set to “None”, “Staging”, “Production” or “Archived”. Databricks permission management hooks into the values of the model Stage. Using the “CAN_MANAGE_STAGING_VERSIONS” and “CAN_MANAGE_PRODUCTION_VERSIONS” permissions, we can control who can set the “Staging” and “Production” model Stage values and who can manage models with these values. The “Production” Stage permission also allows principals to access “Staging” Stage model versions and transition them to production model versions.
To separate Production models from the models of other environments, we assign the model Stage automatically when registering a model version. All non-production environments set the model stage to “Staging”. Model versions registered from the production environment are assigned the “Production” stage. We leverage the Databricks permissions assigned to our Databricks groups shown below to securely manage our Production models separately from other logical environments.
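The stage assignment rule and the permission each stage requires can be written down in a few lines. This sketch only models the rule locally for illustration; the actual enforcement happens in Databricks. The function names are hypothetical, and treating “prod” as the production environment identifier is an assumption based on the group and directory names used earlier.

```python
# Assumed identifier of the production environment (matches "apps_prod").
PRODUCTION_ENV = "prod"

# Permission needed to move a model version into each stage.
REQUIRED_PERMISSION = {
    "Staging": "CAN_MANAGE_STAGING_VERSIONS",
    "Production": "CAN_MANAGE_PRODUCTION_VERSIONS",
}


def stage_for_env(env_name: str) -> str:
    """Production registers as "Production"; every other environment as "Staging"."""
    return "Production" if env_name == PRODUCTION_ENV else "Staging"


def may_transition(permission: str, target_stage: str) -> bool:
    """Local model of the rule: the production permission also covers staging."""
    if permission == "CAN_MANAGE_PRODUCTION_VERSIONS":
        return target_stage in REQUIRED_PERMISSION
    return REQUIRED_PERMISSION.get(target_stage) == permission
```

Under this rule a principal holding only the staging permission can never produce a “Production” version, which is the separation we are after.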
3.3 Getting it together
Apart from the abstractions on top of the vanilla MLFlow client, we have added a few methods that wrap various actions for your convenience. One of the extra methods is “log_model_helper”, which handles the steps needed to log and register a model version and set the appropriate model stage.
Registering a model version is commonly done in two steps: first we log a model during an experiment run, and then we register the logged model as a model version of a registered model. Logging a model during an experiment run returns a ModelInfo object, which contains a model_uri pointing to the local artifact location. We use the model_uri to register a model version under a registered model name and upload it to the MLFlow registry. Registering a model version returns a ModelVersion object that tells us the auto-incremented model version and the current_stage of the model, which is always “None” right after creation.
Our third step during model version registration is to transition the model version stage from “None” to “Staging” or “Production”, depending on the logical environment. Everyone can create model versions, but the Stage transitions are restricted by the Databricks permissions. These three steps are wrapped in the “log_model_helper” method. By using this helper method, we can assume that all registered model versions have an environment-aware name and an appropriate stage value.
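The register and transition steps can be sketched against a small in-memory stand-in for the registry. This is not the real “log_model_helper”: `register_and_stage`, `Version` and `InMemoryRegistry` are hypothetical names, the stand-in only borrows the method names `create_model_version` and `transition_model_version_stage` from the MLFlow client, and the logging step is assumed to have already produced the `model_uri`. A real registry additionally requires the registered model itself to exist before versions are created.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Version:
    """Stand-in for the ModelVersion object returned by the registry."""
    name: str
    version: int
    current_stage: str = "None"  # always "None" right after creation


class InMemoryRegistry:
    """Toy registry that mimics auto-incremented versions per model name."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Version]] = {}

    def create_model_version(self, name: str, source: str) -> Version:
        versions = self._versions.setdefault(name, [])
        created = Version(name=name, version=len(versions) + 1)
        versions.append(created)
        return created

    def transition_model_version_stage(self, name: str, version: int, stage: str) -> Version:
        target = self._versions[name][version - 1]
        target.current_stage = stage
        return target


def register_and_stage(client: InMemoryRegistry, model_uri: str,
                       model_name: str, env_name: str) -> Version:
    """Steps two and three: register under the environment-aware name,
    then move the new version out of "None"."""
    name = f"{model_name}_{env_name}"
    created = client.create_model_version(name, model_uri)
    # Assumption: "prod" identifies the production environment.
    stage = "Production" if env_name == "prod" else "Staging"
    return client.transition_model_version_stage(name, created.version, stage)
```

Because the transition is part of the same call, a principal without the matching stage permission would fail here rather than leave a version behind in an unexpected stage.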
3.4 Testing our MLFlow client
To maintain our own client, we need to test it in detail to check compliance with our permissions design. Any changes in the upstream MLFlow client are tested at the same time, for instance a restructuring of the MLFlow modules. We start a local MLFlow server within a PyTest fixture that is scoped to the whole testing session. The Python tempfile module is used to generate a temporary artifact location and a temporary sqlite database file. The goal of our test session is to log and register a model and then perform various mutations and retrievals on it. The first step is to log and register a model version in the empty model registry. We perform this inside a PyTest fixture because all other tests depend on it, although technically it is a test itself, as it includes asserts.
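The server setup described above might look roughly as follows. This is a sketch under assumptions, not our actual fixture: the function names, the port and the file names inside the temporary directory are hypothetical, and it relies on the `mlflow server` command with its `--backend-store-uri`, `--default-artifact-root`, `--host` and `--port` options.

```python
import contextlib
import subprocess
import tempfile
from pathlib import Path
from typing import Iterator, List


def server_command(workdir: Path, port: int) -> List[str]:
    """Build the command line for a throwaway local MLFlow server."""
    db_file = workdir / "mlflow.db"      # temporary sqlite database file
    artifacts = workdir / "artifacts"    # temporary artifact location
    return [
        "mlflow", "server",
        "--backend-store-uri", f"sqlite:///{db_file}",
        "--default-artifact-root", str(artifacts),
        "--host", "127.0.0.1",
        "--port", str(port),
    ]


@contextlib.contextmanager
def local_mlflow_server(port: int = 5000) -> Iterator[str]:
    """Start the server in a temporary directory and stop it on exit.
    Note: this does not wait until the server accepts connections."""
    with tempfile.TemporaryDirectory() as tmp:
        process = subprocess.Popen(server_command(Path(tmp), port))
        try:
            yield f"http://127.0.0.1:{port}"
        finally:
            process.terminate()
            process.wait()
```

A session-scoped PyTest fixture can then enter `local_mlflow_server()` once, yield the tracking URI to all tests, and let the context manager clean up the database and artifacts when the session ends.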
Conclusion
Our extended environment MLFlow client allows us to register models from multiple logical environments in the same model registry. It leverages the minimal permission options in Databricks to securely separate development environments from production. In addition, the MLFlow API is largely unchanged and is the same across environments.