server-management
Server management principles and decision-making. Process management, monitoring strategy, and scaling decisions. Teaches thinking, not commands.
Server Management: Principles and Operations Decision Guide
Skill Overview
Server Management provides the core principles and decision framework for managing servers in production environments. It helps you build the right operational mindset—learning how to think rather than memorizing commands.
Applicable Scenarios
When you need to deploy applications to production, this skill offers a complete decision framework—from process management to monitoring and alerting. It helps you choose the right tools (PM2, systemd, Docker, Kubernetes) and establish a stable service management plan.
When services behave abnormally, performance drops, or resources become tight in production, it provides a systematic troubleshooting priority order: check process status first, then analyze logs, evaluate resource usage next, and finally check the network and dependent services.
When facing performance bottlenecks such as excessive CPU usage, memory exhaustion, or slow responses, it helps you decide when to scale vertically versus horizontally, and how to design automatic scaling strategies to handle traffic fluctuations.
Core Features
Based on the application type and tech stack, it provides decision guidance for process management: PM2 for Node.js applications (supports clustering and hot reloads), native Linux systemd for general applications, Docker/Podman for containerized environments, and Kubernetes or Docker Swarm for large-scale orchestration. These tools enable automatic restart on crash, zero-downtime reloads, multi-core clustering, and keeping services running persistently, including across reboots.
Guides you to build a complete monitoring system: availability monitoring (uptime, health checks), performance monitoring (response time, throughput), error monitoring (error rates and types), and resource monitoring (CPU, memory, disk). Depending on your needs, choose suitable monitoring setups: for simple scenarios, use PM2 metrics and htop; for full-stack observability, use Grafana or Datadog; for error tracking, use Sentry; and for availability monitoring, use UptimeRobot or Pingdom.
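As a toy illustration of error monitoring, the following shell snippet computes an error rate from an access log whose last field is an HTTP status code (the log contents and format are made up for the example; adapt the pattern to your server's format):

```shell
# Toy example: compute an error rate from an access log whose last field
# is the HTTP status code. Log contents and format are hypothetical.
log=$(mktemp)
cat > "$log" <<'EOF'
GET /api/users 200
GET /api/users 500
GET /api/orders 200
GET /api/orders 200
EOF
total=$(wc -l < "$log")
errors=$(grep -c ' 5[0-9][0-9]$' "$log")
echo "requests=$total errors=$errors error_rate=$((100 * errors / total))%"
rm -f "$log"
```

Here the error rate works out to 25% (1 of 4 requests); real monitoring stacks compute the same ratio continuously over a sliding window.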
Clarifies log categories (application logs for debugging and auditing, access logs for traffic analysis, error logs for issue detection) and core principles: rotate logs so they cannot fill the disk, prefer structured (JSON) logs because they are easier to parse, set appropriate log levels, and never record sensitive data.
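A minimal sketch of structured JSON logging from a shell script (the field names are illustrative; a real service would use its logging library's JSON formatter, which also handles escaping):

```shell
# Minimal structured-logger sketch: one JSON object per line.
# Assumes the message contains no double quotes (a real logger must escape).
log_json() {
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2"
}
log_json info "service started"
log_json error "db connection refused"
```

One object per line keeps the logs both greppable and machine-parseable, which is the point of structured logging.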
Matches symptoms with responses: high CPU usually calls for adding instances (horizontal scaling), high memory calls for adding memory or fixing leaks, slow responses require performance analysis before any scaling, and traffic spikes call for automatic scaling. Understand the use cases and trade-offs of vertical scaling (a quick fix, single instance) versus horizontal scaling (sustainable, distributed).
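The symptom-to-response mapping can be sketched as a tiny shell dispatcher (the symptom labels and responses are illustrative, not a standard):

```shell
# Illustrative symptom -> first-response dispatcher.
# Symptom names are made-up labels for this example.
scaling_action() {
  case "$1" in
    high_cpu)      echo "scale horizontally: add instances" ;;
    high_memory)   echo "scale vertically: add memory, then look for leaks" ;;
    slow_response) echo "profile first: find the bottleneck before scaling" ;;
    traffic_spike) echo "enable automatic scaling" ;;
    *)             echo "unknown symptom: $1" >&2; return 1 ;;
  esac
}
scaling_action high_cpu
```

The point of the mapping is that scaling is not always the answer: slow responses route to profiling, not to more hardware.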
Defines multi-dimensional checks for service health: HTTP 200 responses, normal database connections, reachable dependent services, and no resource exhaustion. Choose simple checks (return 200 only) or deep checks (verify all dependencies) based on the needs of your load balancer.
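A deep health check can be sketched as an aggregation of dependency probes; in this shell sketch all three probes are stubs standing in for real checks such as a database ping, an HTTP call to a dependency, or a disk-usage threshold:

```shell
# Sketch of a deep health check aggregating dependency probes.
# All three probes are stubs; replace them with real checks.
check_db()   { true; }  # stub for e.g. `pg_isready -q`
check_deps() { true; }  # stub for e.g. `curl -fsS -m 2 http://dep/health`
check_disk() { true; }  # stub for e.g. df usage below 95%
health() {
  if check_db && check_deps && check_disk; then
    echo healthy; return 0
  fi
  echo unhealthy; return 1
}
health
```

A simple check would skip the probes and return 200 unconditionally; the deep variant is more truthful but can cause cascading restarts if a shared dependency flaps, which is why the choice depends on your load balancer's behavior.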
Common Questions
What tools should be used for server management?
Tool selection depends on your application type and tech stack:

- Node.js applications: PM2 (clustering, hot reloads)
- General Linux services: native systemd
- Containerized environments: Docker or Podman
- Large-scale orchestration: Kubernetes or Docker Swarm
The key is not to memorize every tool, but to understand the applicable scenarios and trade-offs for each.
How can I implement automatic server process restarts?
Different tools have different configurations:
- PM2: start the app under supervision with `pm2 start app.js --name myapp`; PM2 restarts it automatically on crash.
- systemd: set `Restart=always` and `RestartSec=10` in the unit file to enable automatic restarts.
- Docker: use the `--restart=unless-stopped` policy, or configure a Kubernetes `restartPolicy`.

Automatic restarts are the foundation of service stability, but more importantly, you must find and fix the root cause of the crashes to avoid restart loops.
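For the systemd case, a minimal hypothetical unit file with automatic restarts enabled might look like this (the service name, paths, and user are placeholders):

```ini
# Hypothetical unit file: /etc/systemd/system/myapp.service
[Unit]
Description=My application
After=network.target

[Service]
ExecStart=/usr/bin/node /opt/myapp/app.js
Restart=always
RestartSec=10
User=myapp

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload` and `systemctl enable --now myapp`, systemd restarts the process 10 seconds after any exit.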
What metrics need to be monitored in production?
Production monitoring should cover four dimensions: availability (uptime, health checks), performance (response time, throughput), errors (error rates and types), and resources (CPU, memory, disk usage).
Set response priorities based on alert severity: Critical handled immediately, Warning investigated as soon as possible, and Info reviewed daily. Monitoring should be set up from day one—not added only after problems occur.
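The severity-to-response policy can be expressed as a small lookup, sketched here in shell (the severity names follow the text above; the responses are examples of one possible policy):

```shell
# Severity -> response lookup, sketched in shell.
# The responses are examples of a policy, not a standard.
alert_response() {
  case "$1" in
    critical) echo "page on-call immediately" ;;
    warning)  echo "investigate as soon as possible" ;;
    info)     echo "review in the daily summary" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
alert_response critical
```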
Should server scaling be vertical or horizontal?
Both scaling approaches have their own use cases:
| Scaling Type | Suitable Scenarios | Advantages | Limitations |
|---|---|---|---|
| Vertical scaling | Single-instance performance bottlenecks, quick emergency response | Simple configuration, no architecture change needed | High cost, capped ceiling, single point of failure |
| Horizontal scaling | Long-term growth, distributed architecture | Sustainable scaling, high availability | Requires load balancing and more complex state management |
Practical recommendation: For small projects, start with vertical scaling to ship quickly; for medium to large projects, prefer horizontal scaling to ensure long-term sustainability. For scenarios with significant traffic fluctuations, configure automatic scaling strategies.
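For the autoscaling case, a hypothetical Kubernetes HorizontalPodAutoscaler that keeps between 2 and 10 replicas at a 70% average CPU target might look like this (all names are placeholders):

```yaml
# Hypothetical HPA: scale a Deployment named "myapp" between 2 and 10
# replicas, targeting 70% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The `minReplicas: 2` floor also preserves the redundancy needed for zero-downtime restarts during quiet periods.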
How do you implement zero-downtime service restarts?
Zero-downtime reload methods:
- PM2 cluster mode: the `pm2 reload` command restarts the processes in the cluster one at a time.

The core idea is to always keep some instances running, gradually replace them all, and ensure the service remains continuously available. This requires at least 2 instances and a correctly configured load balancer.
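The "always keep some instances running" idea behind a rolling restart can be illustrated with a toy simulation (the numbers are arbitrary; a real rollout is driven by PM2 or your orchestrator):

```shell
# Toy simulation of a rolling restart across 3 instances: take one out,
# replace it, bring it back, and track the minimum number serving.
up_old=3   # instances still running the old version
up_new=0   # instances already running the new version
min_up=$((up_old + up_new))
i=0
while [ "$i" -lt 3 ]; do
  up_old=$((up_old - 1))            # take one old instance out of rotation
  up=$((up_old + up_new))
  if [ "$up" -lt "$min_up" ]; then min_up=$up; fi
  up_new=$((up_new + 1))            # it comes back on the new version
  i=$((i + 1))
done
echo "minimum instances serving during rollout: $min_up"
```

The minimum never drops below 2 here, which is why the text requires at least 2 instances: with a single instance, the minimum during any restart is 0.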
What is the priority order for troubleshooting server failures?
A systematic troubleshooting order helps you locate issues quickly:

1. Check process status (is the service running at all?)
2. Analyze the logs (what was it doing when it failed?)
3. Evaluate resource usage (CPU, memory, disk)
4. Check the network and dependent services
Following this order prevents wasting time on issues that may have already been ruled out. Remember: most clues can be found in the logs.
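The four-step order can be sketched as a shell triage helper; the service name and health-check URL are placeholders to adapt to your environment:

```shell
# Sketch of the four-step triage; "myapp" and the health URL are
# placeholders to adapt to your environment.
triage() {
  app="${1:-myapp}"
  echo "[1] process status"
  pgrep -af "$app" 2>/dev/null || echo "  -> no matching process found"
  echo "[2] recent logs"
  journalctl -u "$app" -n 20 --no-pager 2>/dev/null \
    || echo "  -> journalctl unavailable; tail the app's log file instead"
  echo "[3] resource usage"
  uptime
  df -h / | tail -n 1
  echo "[4] network and dependencies"
  curl -fsS -m 2 "http://localhost:8080/health" 2>/dev/null \
    || echo "  -> health endpoint unreachable"
}
triage myapp
```

Each step narrows the search space before the next, which is exactly what the ordering above is for.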
Who is this skill for?
This skill is suitable for anyone responsible for deploying and operating services in production.
This skill emphasizes “teaching people how to fish”: it teaches you how to think, not just memorize specific commands and configurations.