ml-engineer
Build production ML systems with PyTorch 2.x, TensorFlow, and modern ML frameworks. Implements model serving, feature engineering, A/B testing, and monitoring. Use PROACTIVELY for ML model deployment, inference optimization, or production ML infrastructure.
ML Engineer - Production-Grade Machine Learning Systems Expert
Skills Overview
The ML Engineer skill is an intelligent assistant focused on building production-grade machine learning systems. It is proficient in modern ML frameworks such as PyTorch 2.x and TensorFlow 2.x, and provides end-to-end support for model deployment, feature engineering, A/B testing, and monitoring.
Use Cases
1. Model Deployment and Serving
Deploy trained machine learning models to production environments using model serving frameworks such as TensorFlow Serving, TorchServe, and MLflow. Build highly available inference APIs that support both real-time and batch inference modes.
2. Inference Performance Optimization
Optimize inference speed and throughput in production using techniques such as model quantization, pruning, batching, and caching. Reduce latency and resource consumption, with support for hardware acceleration on GPUs and TPUs.
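Of these techniques, dynamic quantization is one of the cheapest to try for CPU inference. A minimal sketch with PyTorch, using a toy model (the architecture and shapes are illustrative):

```python
# Dynamic quantization sketch with PyTorch (illustrative toy model).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).eval()

# Dynamic quantization converts Linear weights to int8 once, up front;
# activations are quantized on the fly, so no calibration data is needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(4, 16)
with torch.no_grad():
    out = quantized(x)  # same interface, smaller weights, faster CPU matmuls
```

Quantization trades a small amount of numerical precision for memory and latency savings, so accuracy should be re-validated on a held-out set before rollout.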
3. Building ML Infrastructure
Build an end-to-end MLOps pipeline covering feature storage, model monitoring, A/B testing, and continuous training. Use containerization technologies such as Docker and Kubernetes to implement scalable ML infrastructure.
Core Capabilities
1. Modern Frameworks and Distributed Training
Support mainstream frameworks including PyTorch 2.x (torch.compile, FSDP), TensorFlow 2.x, and JAX/Flax. Provide distributed training capabilities (DDP, DeepSpeed, Horovod), hyperparameter optimization (Optuna, Ray Tune), and experiment tracking (MLflow, Weights & Biases).
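Libraries such as Optuna and Ray Tune wrap a search loop like the one below with smarter samplers and trial pruning. As a dependency-free sketch of the underlying idea, here is random search over a toy objective (the objective function stands in for a real training-and-validation run):

```python
# Random-search hyperparameter optimization sketch. The objective is a toy
# stand-in for a training run; Optuna/Ray Tune add smarter samplers and pruning.
import random

def objective(lr: float, batch_size: int) -> float:
    """Toy validation loss: pretend the optimum is lr=0.01, batch_size=64."""
    return (lr - 0.01) ** 2 + ((batch_size - 64) / 64) ** 2

def random_search(n_trials: int, seed: int = 0) -> tuple[dict, float]:
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-4, -1),           # log-uniform learning rate
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best, loss = random_search(200)
```

Sampling the learning rate log-uniformly is the standard choice, since good values typically span several orders of magnitude.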
2. Model Serving and Deployment
Cover model serving platforms such as TensorFlow Serving, TorchServe, MLflow, and BentoML. Support containerized deployment (Docker, Kubernetes), cloud ML services (AWS SageMaker, Azure ML, GCP Vertex AI), API frameworks (FastAPI, gRPC), and edge deployment solutions.
3. Feature Engineering and Data Management
Provide comprehensive feature engineering solutions including feature stores (Feast, Tecton), data processing (Spark, Pandas, Polars), data validation (Great Expectations, TFDV), and pipeline orchestration (Airflow, Kubeflow, Prefect). Support both batch and real-time feature serving.
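A batch feature-materialization job often reduces to a group-by aggregation plus validation checks. A pandas sketch (the table and column names are illustrative; a feature store like Feast would run something similar on a schedule):

```python
# Batch feature engineering sketch with pandas (illustrative data).
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount":  [10.0, 30.0, 5.0, 5.0, 20.0],
})

# Aggregate raw events into per-user features.
features = events.groupby("user_id").agg(
    txn_count=("amount", "size"),
    txn_total=("amount", "sum"),
    txn_mean=("amount", "mean"),
).reset_index()

# Lightweight validation in the spirit of Great Expectations:
assert features["txn_count"].min() >= 1
assert not features.isna().any().any()
```

Tools like Great Expectations or TFDV formalize the inline assertions above into declarative, versioned expectation suites that run on every pipeline execution.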
Common Questions
How do you deploy a PyTorch model to production?
There are multiple ways to deploy PyTorch models: use TorchServe as a dedicated model server; build a custom inference service with FastAPI/gRPC; or use cloud platforms such as AWS SageMaker. It’s recommended to export the model to TorchScript or ONNX formats to improve compatibility and performance, deploy it with Docker containers, and enable elastic scaling via Kubernetes. For high-concurrency scenarios, combine batching and caching strategies to optimize throughput.
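The export step mentioned above can be sketched with TorchScript tracing (ONNX export via `torch.onnx.export` is analogous); the toy model and shapes are illustrative:

```python
# TorchScript export sketch (illustrative model; serialized in-memory here,
# but in practice saved to a file for TorchServe or a C++ runtime to load).
import io
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1)).eval()
example = torch.randn(1, 8)

# Tracing records the ops executed for this input into a portable module
# that can be loaded without the original Python model code.
scripted = torch.jit.trace(model, example)

buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
restored = torch.jit.load(buffer)

with torch.no_grad():
    assert torch.allclose(model(example), restored(example))
```

Tracing only captures the control flow exercised by the example input, so models with data-dependent branching may need `torch.jit.script` instead.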
Which is better: TensorFlow Serving or TorchServe?
Both have their strengths. TensorFlow Serving has a mature ecosystem and supports model versioning, multi-model serving, and hot updates, making it suitable for TensorFlow models. TorchServe is the official PyTorch solution, offering similar model management features with support for multi-model serving, batching, and logging. The choice mainly depends on your model framework: prefer TensorFlow Serving for TensorFlow models and TorchServe for PyTorch models. If your team uses both frameworks, consider framework-agnostic solutions such as MLflow or BentoML.
How do you detect model drift in production?
Model drift detection requires a robust monitoring system: use Evidently AI, Arize, or custom monitoring tools to track changes in data distributions (feature drift) and changes in model predictions (prediction drift). Key metrics include the Population Stability Index (PSI) and KL divergence. It’s recommended to set alert thresholds and trigger a retraining workflow when drift exceeds them. Also track business metrics (such as conversion rate and click-through rate), since drift in statistical metrics does not always translate immediately into business impact, and degradation can sometimes surface in business metrics first.
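PSI compares the binned distribution of a feature at training time against its distribution in production. A minimal NumPy sketch (the bin count, epsilon floor, and synthetic data are illustrative choices):

```python
# Population Stability Index (PSI) sketch. Bin edges come from the reference
# distribution; the clip floor avoids log(0) for empty bins.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)    # feature at training time
stable = rng.normal(0.0, 1.0, 10_000)   # same distribution in production
shifted = rng.normal(0.5, 1.0, 10_000)  # mean shift -> drift
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as moderate drift, and above 0.25 as significant drift warranting investigation or retraining.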