Ask HN: How does your data science or machine learning team handle DevOps?
2 by mlthoughts2018 | 0 comments on Hacker News.
Machine learning teams often face operational needs not seen in many other domains. Some examples:

- Instrumenting observability that monitors not only data quality and upstream ETL job status, but also domain-specific aspects of training ML models, such as overfitting, confusion matrices, business use case accuracy or validation checks, ROC curves and more (all of which need to be customized and centrally reported for each model training task).

- Standardizing end-to-end tooling for special resources, e.g. queueing and batching to keep utilization high on production GPU systems, high-RAM use cases like approximate nearest neighbor indexes, and run-of-the-mill work like taking a trained model and deploying it behind a microservice in a way that bakes in logging, tracing, alerting, and more.

Machine learning engineers and data scientists tend to have a comparative advantage when they can focus on understanding the data, running experiments to decide which models are best, pairing with product managers or engineers to understand constraints around the user experience, and designing software tools and abstractions around unique training or serving architectures (like the GPU queueing example).

Increasingly, teams of data scientists are required to do DevOps work: configuring and maintaining e.g. Kubernetes and CI/CD workloads, alerting and monitoring, logging, and instrumenting security or data access control compliance solutions. This is harmful because it reduces the time and effort these engineers can spend on their comparative advantages, which is a direct loss to the customer or user. In exchange, they do DevOps jobs that they are not trained for and not interested in (which often leads data scientists to burn out) and that many non-specialists could do.

How do you structure teams, build tools, and establish compliance or operations expectations that allow data scientists, related statistical scientists, and ML backend engineers to flourish?
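
To make the GPU queueing and batching example above a bit more concrete, here is a minimal sketch of the kind of abstraction I have in mind. It only illustrates the idea and is not a description of any particular team's system: `drain_batch`, `run_model`, `max_batch_size` and `max_wait_s` are all hypothetical names, and `run_model` is a stand-in for a real batched GPU forward pass.

```python
import queue
import time


def drain_batch(requests, max_batch_size=8, max_wait_s=0.01):
    """Collect up to max_batch_size requests from a queue.

    Blocks until the first request arrives, then waits at most
    max_wait_s for more requests before returning a partial batch.
    """
    batch = [requests.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough; ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


def run_model(batch):
    # Hypothetical stand-in for one batched GPU forward pass.
    return [x * 2 for x in batch]


if __name__ == "__main__":
    pending = queue.Queue()
    for i in range(20):
        pending.put(i)
    while not pending.empty():
        batch = drain_batch(pending)
        print(len(batch), run_model(batch))
```

The trade-off sits in the two parameters: a larger `max_batch_size` keeps the GPU busier per call, while a smaller `max_wait_s` bounds the latency added to each request. Tuning and monitoring those values per model is exactly the sort of work that tends to land on the data science team.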
