ml-engineer

Build production ML systems with PyTorch 2.x, TensorFlow, and modern ML frameworks. Implements model serving, feature engineering, A/B testing, and monitoring. Use PROACTIVELY for ML model deployment, inference optimization, or production ML infrastructure.

Install

Download and extract to your skills directory

Copy the command below and send it to OpenClaw to auto-install:

Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-ml-engineer&locale=en&source=copy

ML Engineer - Production-Grade Machine Learning Systems Expert

Skills Overview

An ML Engineer is an intelligent assistant focused on building production-grade machine learning systems. It is proficient in modern ML frameworks such as PyTorch 2.x and TensorFlow 2.x, and provides end-to-end support covering model deployment, feature engineering, A/B testing, and monitoring.

Use Cases

1. Model Deployment and Serving


Deploy trained machine learning models to production environments, including using model serving frameworks such as TensorFlow Serving, TorchServe, and MLflow. Build highly available inference APIs supporting both real-time and batch inference modes.
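The two serving modes can be sketched in plain Python. The `InferenceService` wrapper and `toy_model` below are hypothetical stand-ins for a real serving stack and trained model:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a trained model: any callable mapping
# a batch of feature vectors to a batch of predictions.
Model = Callable[[List[List[float]]], List[float]]

@dataclass
class InferenceService:
    """Minimal serving wrapper exposing real-time and batch modes."""
    model: Model

    def predict_one(self, features: List[float]) -> float:
        # Real-time path: wrap the single request as a batch of one.
        return self.model([features])[0]

    def predict_batch(self, batch: List[List[float]]) -> List[float]:
        # Batch path: forward the whole batch in a single model call,
        # which is typically far cheaper than N single-row calls.
        return self.model(batch)

# Toy "model": sum of the features (placeholder for a real network).
toy_model: Model = lambda batch: [sum(row) for row in batch]

service = InferenceService(model=toy_model)
print(service.predict_one([1.0, 2.0]))             # 3.0
print(service.predict_batch([[1.0], [2.0, 3.0]]))  # [1.0, 5.0]
```

In a real deployment, `predict_one` would sit behind a FastAPI or gRPC endpoint and `predict_batch` behind a scheduled job; a dedicated server such as TorchServe or TensorFlow Serving provides both paths out of the box.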

2. Inference Performance Optimization


Optimize inference speed and throughput in production using techniques such as model quantization, pruning, batching, and caching. Reduce latency and resource consumption, with support for hardware acceleration on GPUs and TPUs.
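Batching and caching, two of the techniques above, can be illustrated without any framework. The names `_forward`, `cached_predict`, and `micro_batch` are invented for this sketch:

```python
from functools import lru_cache

def _forward(batch):
    # Toy scoring function standing in for an expensive model forward pass.
    return [2.0 * x for x in batch]

@lru_cache(maxsize=1024)
def cached_predict(x: float) -> float:
    # Cache repeated inputs so hot keys never hit the model twice.
    return _forward((x,))[0]

def micro_batch(requests, max_batch_size=4):
    # Group incoming requests into fixed-size batches so each model
    # call amortizes per-call overhead across several requests.
    out = []
    for i in range(0, len(requests), max_batch_size):
        out.extend(_forward(requests[i:i + max_batch_size]))
    return out
```

Production servers apply the same ideas with extra machinery: dynamic batching waits a few milliseconds to collect requests, and caches are typically external (e.g. Redis) so they survive restarts and are shared across replicas.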

3. Building ML Infrastructure


Build an end-to-end MLOps pipeline covering feature stores, model monitoring, A/B testing, and continuous training. Use containerization technologies such as Docker and Kubernetes to implement scalable ML infrastructure.
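As a rough illustration, the stages of such a pipeline can be wired together in plain Python. Every step function here is a toy placeholder; in practice an orchestrator such as Airflow or Kubeflow would run each stage as a separate task:

```python
def ingest():
    # Placeholder for reading raw data from a warehouse or stream.
    return [0.2, 0.4, 0.6, 0.8]

def featurize(xs):
    # Placeholder feature engineering: raw value plus its square.
    return [[x, x * x] for x in xs]

def train(feats):
    # Toy "training": store the mean of each feature column.
    n = len(feats)
    return [sum(f[i] for f in feats) / n for i in range(2)]

def evaluate(model):
    # Validation gate: refuse to promote a model that fails a sanity check.
    return abs(model[0] - 0.5) < 0.2

def run_pipeline():
    feats = featurize(ingest())
    model = train(feats)
    if not evaluate(model):
        raise RuntimeError("model failed validation; not promoting")
    return model
```

The key design point is the explicit validation gate before promotion: continuous training only helps if a bad retrain cannot silently replace a good model.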

Core Capabilities

1. Modern Frameworks and Distributed Training


Support mainstream frameworks including PyTorch 2.x (including torch.compile, FSDP), TensorFlow 2.x, and JAX/Flax. Provide distributed training capabilities (DDP, DeepSpeed, Horovod), hyperparameter optimization (Optuna, Ray Tune), and experiment tracking (MLflow, Weights & Biases).
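Tools such as Optuna and Ray Tune automate hyperparameter search with smarter strategies; the core loop they build on can be sketched as a minimal random search. The quadratic `objective` is a hypothetical stand-in for a real train-and-validate step:

```python
import random

def objective(params):
    # Hypothetical validation loss as a function of two hyperparameters;
    # in practice this would train a model and return its validation metric.
    lr, wd = params["lr"], params["weight_decay"]
    return (lr - 0.01) ** 2 + (wd - 0.001) ** 2

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            # Sample log-uniformly, the usual choice for scale-like
            # hyperparameters such as learning rate and weight decay.
            "lr": 10 ** rng.uniform(-4, -1),
            "weight_decay": 10 ** rng.uniform(-5, -2),
        }
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss
```

Optuna adds pruning of unpromising trials and Bayesian samplers on top of this loop, and experiment trackers like MLflow or Weights & Biases record each trial's parameters and metrics.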

2. Model Serving and Deployment


Cover model serving platforms such as TensorFlow Serving, TorchServe, MLflow, and BentoML. Support containerized deployment (Docker, Kubernetes), cloud ML services (AWS SageMaker, Azure ML, GCP Vertex AI), API frameworks (FastAPI, gRPC), and edge deployment solutions.

3. Feature Engineering and Data Management


Provide comprehensive feature engineering solutions including feature stores (Feast, Tecton), data processing (Spark, Pandas, Polars), data validation (Great Expectations, TFDV), and pipeline orchestration (Airflow, Kubeflow, Prefect). Support both batch and real-time feature serving.
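As a small example of batch feature engineering, a trailing-window feature (e.g. a user's average purchase amount over recent transactions) can be computed in plain Python; the function name and window size are illustrative:

```python
from collections import deque

def rolling_mean_features(values, window=3):
    # Trailing mean over the last `window` events; early positions
    # use however many events have been seen so far.
    buf = deque(maxlen=window)
    feats = []
    for v in values:
        buf.append(v)
        feats.append(sum(buf) / len(buf))
    return feats
```

A feature store such as Feast materializes exactly this kind of computation twice: in batch over historical data for training, and incrementally online for serving, keeping the two consistent.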

Common Questions

How do you deploy a PyTorch model to production?

There are multiple ways to deploy PyTorch models: use TorchServe as a dedicated model server; build a custom inference service with FastAPI/gRPC; or use cloud platforms such as AWS SageMaker. It’s recommended to export the model to TorchScript or ONNX formats to improve compatibility and performance, deploy it with Docker containers, and enable elastic scaling via Kubernetes. For high-concurrency scenarios, combine batching and caching strategies to optimize throughput.

Which is better: TensorFlow Serving or TorchServe?

Both have their strengths. TensorFlow Serving has a mature ecosystem and supports model versioning, multi-model serving, and hot updates, making it a natural fit for TensorFlow models. TorchServe is the official PyTorch solution, offering similar model management features, including multi-model serving, batching, and logging. The choice mainly depends on your model framework: prefer TensorFlow Serving for TensorFlow models and TorchServe for PyTorch models. If your team uses both frameworks, consider framework-agnostic solutions such as MLflow or BentoML.

How do you detect model drift in production?

Model drift detection requires a robust monitoring system: use Evidently AI, Arize, or custom tooling to track changes in input data distributions (feature drift) and in model outputs (prediction drift). Key metrics include PSI (Population Stability Index) and KL divergence. It’s recommended to set alert thresholds and trigger a retraining workflow when drift exceeds them. Also track business metrics (such as conversion rate and click-through rate), since statistical drift does not always translate immediately into business impact, and vice versa.
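PSI, mentioned above, is straightforward to compute. Here is a minimal pure-Python sketch using quantile bins derived from the baseline sample; the bin count and smoothing constant are illustrative choices:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample (`expected`)
    and a production sample (`actual`), using baseline quantile bins."""
    exp_sorted = sorted(expected)
    # Bin edges at baseline quantiles, so baseline mass is ~uniform.
    edges = [exp_sorted[int(len(exp_sorted) * i / bins)] for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = sum(1 for e in edges if x >= e)  # which bin x falls into
            counts[idx] += 1
        # Smooth empty bins to keep the logarithm finite.
        return [(c + 1e-6) / (len(sample) + 1e-6 * bins) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1–0.25 as moderate drift, and above 0.25 as significant drift warranting investigation or retraining.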