Machine Learning DevOps


Machine learning models make predictions for new data based on the data they have been trained on. It is essential that the data is clean, correct, and safe to use without any privacy or bias issues. Real-world data can also continuously change, so inputs and predictions have to be monitored for any shifts that may be problematic for the model. These are complex challenges that are distinct from those found in traditional DevOps.

  • Motivation

DevOps practices are centred on the “build and release” process and continuous integration. Traditional development builds are packages of executable artifacts compiled from source code. Non-code supporting data in these builds tends to be limited to relatively small static config files. In essence, traditional DevOps is geared to building programs consisting of sets of explicitly defined rules that give specific outputs in response to specific inputs.

In contrast, machine-learning models make predictions by capturing patterns from data rather than by following explicitly formulated rules. A characteristic machine-learning problem involves making predictions for new data based on patterns learned from known data. Machine-learning builds run a pipeline that extracts patterns from data and creates a weighted machine-learning model artifact. This makes these builds far more complex and the whole data science workflow more experimental. As a result, a key part of the MLOps challenge is supporting multi-step machine learning model builds that involve large data volumes and varying parameters.
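
As a rough illustration, a minimal model build might chain data extraction, feature preparation and training into one pipeline step that emits a versioned model artifact. The sketch below assumes scikit-learn; load_training_data() is a hypothetical stand-in for the data-extraction step, and the parameters are illustrative.

```python
# A minimal sketch of a multi-step model build, assuming scikit-learn.
# load_training_data() is a hypothetical helper standing in for the
# data-extraction step of a real pipeline.
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def build_model(params, artifact_path="model.joblib"):
    X, y = load_training_data()              # step 1: extract the training data
    pipeline = Pipeline([                    # step 2: define the processing steps
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(C=params["C"])),
    ])
    pipeline.fit(X, y)                       # step 3: extract patterns (train)
    joblib.dump(pipeline, artifact_path)     # step 4: emit the model artifact
    return artifact_path
```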

To run projects safely in live environments, we need to be able to monitor for problem situations and see how to fix things when they go wrong. There are well-established DevOps practices for recording code builds so that old versions can be restored. But MLOps does not yet have a standardised way to record, and roll back to, the data that was used to train a particular version of a model.
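
There is no single standard yet, but one lightweight approach is to record a fingerprint of the training data and the code revision next to each model artifact, so a given model version can be traced back to the exact data it was trained on. A minimal sketch, assuming the training data sits in a local file and the pipeline code lives in git; the file paths and metadata layout are illustrative:

```python
# Record which data and code produced a model, so the build can be traced later.
# The file paths and metadata layout here are illustrative assumptions.
import hashlib, json, subprocess

def record_build_metadata(data_path, model_path, out_path="model_metadata.json"):
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()    # fingerprint of the training data
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).decode().strip()      # code revision of the pipeline
    metadata = {"model": model_path, "data_sha256": data_hash, "git_commit": commit}
    with open(out_path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```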

There are also special MLOps challenges to face in the live environment. There are largely agreed DevOps approaches for monitoring for error codes or an increase in latency. But it’s a different challenge to monitor for bad predictions. You may not have any direct way of knowing whether a prediction is good, and may have to instead monitor indirect signals such as customer behaviour. It can also be hard to know in advance how well your training data represents your live data. For example, it might match well at a general level but there could be specific kinds of exceptions. This risk can be mitigated with careful monitoring and cautious management of the rollout of new versions.

  • Tool landscape

The effort involved in solving MLOps challenges can be reduced by building on a platform rather than starting from scratch. Many organisations face a choice between an off-the-shelf machine-learning platform and an in-house platform assembled from open-source components. Some machine-learning platforms are part of a cloud provider’s offering, which may or may not appeal depending on the organisation’s cloud strategy. Other platforms are not cloud-specific and instead offer self-install or a custom hosted solution. Assembling an in-house platform may be the preferred route when requirements are too niche to fit a current platform, for example when integrations with other in-house systems are needed or when data has to be stored in a particular location or format. Choosing to assemble an in-house platform also requires learning to navigate the ML tool landscape, which is complex: different tools specialise in different niches, and in some cases competing tools approach similar problems in different ways.

  • Governance

Challenges around reproducibility and monitoring of machine learning systems are governance problems. For many projects these are not the only challenges, as customers might reasonably expect to be able to ask why a prediction concerning them was made. Explainability is a data science problem in itself. Modelling techniques can be divided into “black-box” and “white-box”, depending on whether the method can naturally be inspected to provide insight into the reasons for particular predictions. With black-box models, such as proprietary neural networks, the options for interpreting results are more restricted and more difficult to use than the options for interpreting a white-box linear model. In highly regulated industries, it can be impossible for AI projects to move forward without supporting explainability. For example, medical diagnosis systems may need to be highly interpretable so that they can be investigated when things go wrong or so that the model can aid a human doctor. This can mean that projects are restricted to working with models that offer acceptable interpretability. Making black-box models more interpretable is a fast-growing area, with new techniques rapidly becoming available.
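
To make the distinction concrete, the sketch below contrasts reading the coefficients of a white-box linear model with applying permutation importance (one of many model-agnostic techniques) to a black-box ensemble. It assumes scikit-learn, and the dataset and feature names are synthetic placeholders.

```python
# White-box vs. black-box inspection, on a small synthetic dataset.
# A linear model's coefficients already explain which features push a
# prediction up or down; a black-box ensemble needs an extra technique
# such as permutation importance to get comparable (and weaker) insight.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = [f"feature_{i}" for i in range(4)]

# White-box: read the model's own coefficients.
white_box = LogisticRegression().fit(X, y)
for name, coef in zip(feature_names, white_box.coef_[0]):
    print(f"{name}: {coef:+.3f}")

# Black-box: layer a model-agnostic technique on top.
black_box = GradientBoostingClassifier().fit(X, y)
result = permutation_importance(black_box, X, y, n_repeats=10, random_state=0)
for name, importance in zip(feature_names, result.importances_mean):
    print(f"{name}: {importance:.3f}")
```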


  • Product teams, CI/CD, automated testing and IT are working closely with the business. However, where are your data scientists and ML engineers? Bring them closer as well - but there is no need to call it DevMLOps, ok?

  • Data scientists usually come from an academic background and are afraid of sharing models that they do not consider good enough - create a safe environment for them to fail!

  • Continuous integration and continuous deployment (CI/CD) are amazing best practices that can also be applied to machine learning components.

  • More than CI/CD, we should do continuous evaluation of the models - versioning the algorithms, parameters, data, and results.

  • Machine learning bugs are not just functions returning wrong values; they can cause bias, accuracy drift, and model fragility.

  • The fact that machine learning development focuses on hyperparameter tuning and data pipelines does not mean that we need to reinvent the wheel or look for a completely new way. DevOps lays a strong foundation, and data science practitioners should also absorb a lot of the industry gains of the last few years, a direct result of the DevOps culture:

    • a culture change that supports experimentation and failure at its core,
    • continuous evaluation,
    • deployable artifacts,
    • a sharing culture,
    • abstraction layers,
    • observability, and
    • working in products and services.
  • What challenges do artificial intelligence and machine learning bring when it comes to integrating them with development and deployment?

They are the same problems that we solved for traditional development, but seen from a different perspective now: version control, packaging, deployment, collaboration and serving. The main issue is that we are trying to force the solutions we used before in software development into this ecosystem. In the last year, we have seen a significant increase in the number of products (especially open source) trying to address the ML development lifecycle: Spark running on top of Kubernetes, TensorFlow, Kubeflow, MLflow, Uber's Michelangelo, and cloud providers offering tools for training and serving models. We are witnessing the maturation of this ecosystem, and it is still growing.
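
As one illustration of that tooling, an experiment-tracking run in MLflow records the parameters, metrics and model of each training attempt so that later runs can be compared against it. The sketch below is a minimal, illustrative use of MLflow's tracking API; the parameter, metric and model choices are arbitrary examples rather than a recommended setup.

```python
# A minimal sketch of experiment tracking with MLflow: each run logs its
# parameters, metrics and model artifact so runs can be compared later.
# The parameter and metric names here are arbitrary examples.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X, y)
    mlflow.log_param("alpha", alpha)                                   # hyperparameter of this run
    mlflow.log_metric("mse", mean_squared_error(y, model.predict(X)))  # result of this run
    mlflow.sklearn.log_model(model, "model")                           # the trained artifact itself
```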

  • What about bugs and testing... how does that work with ML components?

Concerning bugs, it is important to watch out for three kinds of machine learning bugs: bias, drift, and fragility.

Bias comes from bias that exists in the datasets used to build the feature and can have catastrophic results, especially with black-box models. Cathy O'Neil's Weapons of Math Destruction, a book from 2016, raised a lot of these problems with algorithms making important decisions about hiring, classifying people, and more.

Drift occurs after models are built, working well, and deployed. You may consider the job over and think nothing else is needed, right? Unfortunately not. The model must be recalibrated and resynced with the actual usage and data to keep its accuracy. Otherwise, it will drift and get worse over time.

Fragility is related to bias, but has more to do with changes outside the team's reach: a change in a definition, data that becomes unavailable, a null value that should not be there… How does your model cope with these issues? How fragile is it?

The worst part is that the majority of these ML bugs cannot be identified before production. That is why monitoring and observability, other pillars of DevOps, play a gigantic role for machine learning components. You must measure proxies for the business value that your ML components are supposed to impact. For example, have you created a recommendation engine, and are you applying an A/B test strategy to the rollout? You cannot track the quality of ML components directly, but you may be able to analyze proxy measures of their effect, such as changes in customer behaviour. Focusing on these kinds of metrics helps you detect and address the ML bugs early on: bias, drift, and fragility.
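
As one concrete monitoring tactic, a simple drift check compares the distribution of a live input feature against the distribution seen at training time and flags when they diverge. The sketch below assumes SciPy's two-sample Kolmogorov-Smirnov test; the threshold and the synthetic data are illustrative only.

```python
# A simple drift check: compare a live feature's distribution with the
# training-time distribution using a two-sample Kolmogorov-Smirnov test.
# The 0.01 threshold is an arbitrary example, not a recommendation.
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(training_values, live_values, p_threshold=0.01):
    statistic, p_value = ks_2samp(training_values, live_values)
    drifted = p_value < p_threshold       # a low p-value means the distributions differ
    return drifted, statistic, p_value

# Example with synthetic data: the live distribution has shifted slightly.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
live = rng.normal(loc=0.3, scale=1.0, size=5_000)
print(check_feature_drift(train, live))   # expected to flag drift
```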

  • Distance between data science and operations. What is causing this distance?

The same problem that affected (and still affects) the business world and "gave birth" to the DevOps movement - a distance between the business and the actual industrialization/operationalization of what is built.

This gap is a result of three things:

  • slowness (things flowing from idea to production take an enormous amount of time),

  • lots of handovers (X talks to client, A writes the user story, B builds, C validates, D approves, E deploys, F operates, G fixes bugs, H rebuilds,...),

  • and clustered teams working on projects, not products.

  • What can we do to decrease distances and improve collaboration?

The hardest thing that any organization can do: change the culture.

In the case of ML engineers and data scientists, some cultural aspects have a big impact, but the most compelling one I have seen is related to the professionals' backgrounds. The majority of them have a very academic background, meaning that they are used to spending long periods working on one problem until it is good enough to be accepted in a publication. The bar for "good enough" there is extremely high, not just on the metrics but also on the design of the experiments, the mathematical rigor, and so on. In a business context this is important, but less so... It means that it is OK to publish a model with 60% accuracy and have it in a deployable state. It is better to have that ready and consider putting it in production today than to wait months for something "good enough". Maybe in three months it will not be a problem worth solving anymore. Moving fast with flexible possibilities is the best way to go.

  • What’s your advice for companies who want to reap the benefits from applying artificial intelligence and machine learning? What should they do or not do?

Some cultural characteristics I have seen that support a short time-to-market and allow a lot of value to be generated from data science include:

  • Data science: the "science" part implies experimentation and failed tryouts. Not all experiments succeed, so data science will not produce mind-blowing solutions every time. Sometimes you can go down a rabbit hole. Alternatively, the business may change. If you are a data scientist working on a project for a few days and you see no future in it, do you have the courage and autonomy to tell that to your boss/stakeholders? Likewise, the other way around... can they come to you at any time and say that the business has changed and we must pivot?

  • More than CI/CD, we need to talk about CE - continuous evaluation. Every time a new model is tested - it can be new hyperparameters in the same algorithm or a completely new algorithm - we must be able to compare it with previous runs. How accurate was it? Why was the result different? Are we using the same dataset?

  • Share not only your good models, but also the ones that are a total flop! Version control your code and your models, at all times! Learn to use git at every moment! Why? Because when someone else sees that, they will not try the same thing again with the same datasets and parameters... stop the waste!

  • Provide platforms and tools for data scientists that abstract away the things they do not know (and do not need to know). A data scientist may not know whether they want to serve their models over REST or gRPC, how to package the model, whether it should be deployed on the Kubernetes cluster, or how many requests per second it should withstand - this is not their expertise. A team must provide a platform and a pipeline for that, and let the decisions be taken, experimented with, and changed; a minimal serving sketch follows this list. Every company has its own flow, ways of working and ideas... do not bend the culture to the tool.

  • Work on products and services, not projects. Developers, security specialists, SREs... everyone should be involved and help. By doing this, you can make sure that you have deployable artifacts from day one! After the model is deployed, the job is not over... You have to operate, monitor, refactor, calibrate and do several other things with ML models that are running in production.
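
Returning to the point about abstraction above, the sketch below shows one possible shape of the serving wrapper a platform team might provide: the data scientist hands over a serialised model artifact, and the platform exposes it over REST without the data scientist needing to know the serving details. It assumes Flask and a joblib-serialised scikit-learn model; the file name and route are illustrative.

```python
# A minimal sketch of a generic serving wrapper a platform team might provide:
# the data scientist hands over a serialised model, and the platform exposes it
# over REST without the data scientist needing to know the serving details.
# "model.joblib" and the /predict route are illustrative names.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")      # artifact produced by the training pipeline

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]      # e.g. {"features": [[1.0, 2.0, 3.0]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```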