CI/CD4ML or: How we learned to stop worrying and love Machine Learning

Enabling organizations to embrace Data Science workflows by adopting an MLOps lifecycle.

Ilias Sarantopoulos
Beat Engineering Blog

--

At Beat we do Machine Learning and Data Science with a product engineering mindset. We work in cross-functional teams to translate product features into predictive modeling and machine learning problems and vice versa. Exploratory analysis and hypothesis testing using the tools provided by Beat’s Big Data capabilities team help us build a deep understanding of the behavior of our millions of daily users (passengers & drivers).

Then comes the best part: taking it to production. We care deeply about the scalability and performance of training and inference. We put effort into productionizing and monitoring pipelines, as well as setting up the telemetry data collection required to monitor our models in production. This way, we take full ownership of our models running in production.

It’s really important to understand how AI/ML projects work and what that means for every organization. Let’s go deeper to find out more about it.

Why do AI/ML projects fail?

There have been numerous reviews, articles, and blog posts claiming that roughly 85% of Artificial Intelligence projects will fail in the coming years. On the other hand, according to the Harvard Business Review, companies capitalizing on AI are expected to add around $13 trillion to the global economy over the next decade or so. Despite how the term is commonly used, when people refer to AI today, the industry's focus is really on Machine Learning and Data Science applications. Real Artificial Intelligence, as in Artificial General Intelligence, is not here yet, although that is the end goal.

Taken together, these two seemingly contradictory statements suggest that the way organizations incorporate data-driven methods alongside their standard business operations still leaves room for a lot more growth in the global economy.

There are numerous reasons why AI projects fail. Some have to do with corporate culture and some have to do with processes. As this article is written from the viewpoint of a Data Scientist/Machine Learning Engineer, we focus on the latter. Transforming a traditional corporate structure into a data-driven organization is a matter of culture and company values, and it is the second (looong) step after digital transformation.

Will unicorns bring success?

If we narrow it down to processes, there are a couple of pitfalls that can indeed be tackled. Recently we gave a talk at PyData Eindhoven 2020 where we introduced “Kathy”. Kathy is a fictional character who represents the AI unicorn, the type of employee every company is looking for when hiring for such positions. Below, you can see the skills that such an individual would have.

Who is Kathy?

Kathy: a Data Science unicorn 🦄

Get to know more about her skills:

  • Probability/Statistics
  • Deep understanding of ML algorithms
  • Experience with training/debugging Deep Learning Models
  • Software Engineering skills
  • Data Engineering (aka Big Data)
  • DevOps skills

Of course, people like Kathy are really hard to find, and that’s why they are considered unicorns. An alternative is building cross-functional teams whose members together cover the diverse skill set above. That list refers to hard skills, but there is something equally important: business acumen. The projects we are discussing aim to intervene in business operations, automate manual processes, and eventually replace them with better alternatives. “Better” means that they should add value, either as direct revenue or as an investment in the future. The reason we chose a unicorn for this job is to make it clear that even if you get the best people, they will struggle to succeed if they don’t follow the appropriate processes.

Carve Success with CI/CD pipelines

Example of a continuous training workflow

Data Science and Machine Learning in organizations sit in the sweet spot between Engineering and Business, and need to borrow practices from both “worlds”. Over the last couple of years, especially after Google published “Hidden Technical Debt in Machine Learning Systems”, industry pioneers have been preaching about CI/CD for Machine Learning. Traditional software engineering solved long ago some of the challenges that Data Science is facing today. ML workflows need to integrate a CI/CD lifecycle model by adapting it to the circular nature of data science tasks. Indeed, this circular nature is the main factor that differentiates ML workflows from traditional software.

At Beat, many of our operations involve tasks that translate into complex data science problems. This is due to the spatial and temporal nature of our data, but also to the two-sided marketplace we operate in. Beat’s operations have to serve both passengers and drivers in the best possible way.

That’s why we focus on two things when it comes to data science in production:

  • CI/CD4ML: Our focus is on productionizing the whole process in a seamless way. Automation is a basic component of Continuous Delivery, and automating model training, evaluation, and re-deployment is essential. That means that our ML engineers are in fact “Full Stack Data Scientists”. We really like the approaches laid out in CD4ML and the Full Stack Deep Learning course.
  • Continuous Feedback: As we mentioned earlier, Data Science teams live somewhere between the Product (Business) and Engineering (IT) sides of an organization. Breaking down silos and bringing cross-functional teams to work together is essential for AI projects to succeed. The dynamic nature of data is the main reason these projects cannot be solved by just analysing requirements and then building the solution. Hard requirements (e.g. a classification model only goes to production if it achieves an F score above 90%) are really tough to set, and setting them up front would be wrong in the first place, so feedback should be available throughout the whole process.
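
To make these two points a bit more concrete, here is a minimal Python sketch of one iteration of a continuous training loop. It is an illustration under stated assumptions, not Beat's actual pipeline: the names `train_threshold_model`, `continuous_training_step`, and `Report` are hypothetical, the "model" is a toy one-feature threshold classifier, and the promotion rule (promote the retrained candidate only if it beats the current production model on fresh evaluation data) is just one possible alternative to a fixed F score requirement.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]      # (feature value, binary label)
Model = Callable[[float], int]   # maps a feature value to a predicted label


def f1_score(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def train_threshold_model(data: Sequence[Example]) -> Model:
    """Toy 'training' step: pick the cut-off that maximises F1 on the training data."""
    labels: List[int] = [y for _, y in data]
    thresholds = sorted({x for x, _ in data})
    best = max(
        thresholds,
        key=lambda t: f1_score(labels, [int(x >= t) for x, _ in data]),
    )
    return lambda x: int(x >= best)


@dataclass
class Report:
    """Feedback shared with the cross-functional team after every run."""
    candidate_f1: float
    production_f1: float
    promoted: bool


def continuous_training_step(
    production: Model,
    train_data: Sequence[Example],
    eval_data: Sequence[Example],
) -> Tuple[Model, Report]:
    """Retrain, evaluate both models on the same fresh data, then decide on re-deployment."""
    candidate = train_threshold_model(train_data)
    labels = [y for _, y in eval_data]
    candidate_f1 = f1_score(labels, [candidate(x) for x, _ in eval_data])
    production_f1 = f1_score(labels, [production(x) for x, _ in eval_data])
    promoted = candidate_f1 > production_f1  # relative gate, not a fixed threshold
    serving = candidate if promoted else production
    return serving, Report(candidate_f1, production_f1, promoted)


# Hypothetical usage: the production model was fitted before the data drifted.
stale_production: Model = lambda x: int(x >= 0.8)
fresh_train = [(0.1, 0), (0.3, 0), (0.5, 1), (0.6, 1), (0.9, 1)]
fresh_eval = [(0.2, 0), (0.4, 0), (0.55, 1), (0.7, 1), (0.95, 1)]
serving_model, report = continuous_training_step(stale_production, fresh_train, fresh_eval)
print(report)
```

In a real setup each function would be a pipeline stage (data extraction, training job, evaluation job, deployment), and the `Report` would feed dashboards or reviews so that product and engineering can give feedback on every iteration rather than only at the end.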

In 2021, Beat will continue to innovate, collaborate, and contribute to Beat services through the application of machine learning and data science across our business. Data science is fundamental to Beat’s growth as a company and to its ability to deliver safer and more reliable experiences on our app. Below you can find the video recording of our talk at PyData Eindhoven 2020 to learn more about “Kathy”.

Beat at PyData Eindhoven 2020

For more related articles, be sure to check out Beat Engineering Blog. To join us on the ride, check out all our open positions and apply.

Ilias Sarantopoulos is a Machine Learning Engineer in the Pricing Domain and a member of the Machine Learning Chapter at Beat. He is a Data Scientist with a CS background who likes to solve real-world problems with data. He is also a mentor for Data Science & Data Analytics coding bootcamps. He tries to find the sweet spot between Data Science, software engineering, and, of course, life.
