server-management

Server management principles and decision-making. Process management, monitoring strategy, and scaling decisions. Teaches thinking, not commands.

Category: Other Tools

Install

Download and extract to your skills directory, or copy the command below and send it to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-server-management&locale=en&source=copy

Server Management - Principles and Operations Decision Guide

Skill Overview


Server Management provides the core principles and decision framework for managing servers in production environments. It helps you build the right operational mindset—learning how to think rather than memorizing commands.

Applicable Scenarios

  • Production deployment management: When you need to deploy applications to production, this skill offers a complete decision framework, from process management to monitoring and alerting. It helps you choose the right tools (PM2, systemd, Docker, Kubernetes) and establish a stable service management plan.

  • Troubleshooting server operations issues: When services misbehave, performance drops, or resources run short in production, it provides a systematic troubleshooting sequence: process status first, then logs, then resource usage, and finally the network and dependent services.

  • Scaling out and architecture optimization: When facing bottlenecks such as high CPU, memory exhaustion, or slow responses, it helps you decide when to scale vertically versus horizontally and how to design autoscaling strategies to handle traffic fluctuations.

    Core Features

  • Choosing process management tools: Based on the application type and tech stack, it recommends PM2 for Node.js applications (clustering and hot reloads), native Linux systemd for general applications, Docker/Podman for containerized environments, and Kubernetes or Docker Swarm for large-scale orchestration. These tools enable automatic restart after crashes, zero-downtime reloads, multi-core clustering, and processes that survive reboots.

  • Monitoring strategy design: Guides you in building a complete monitoring system covering availability (uptime, health checks), performance (response time, throughput), errors (error rates and types), and resources (CPU, memory, disk). Match the setup to your needs: PM2 metrics and htop for simple scenarios, Grafana or Datadog for full-stack observability, Sentry for error tracking, and UptimeRobot or Pingdom for availability monitoring.

  • Log management best practices: Clarifies log categories (application logs for debugging and auditing, access logs for traffic analysis, error logs for issue detection) and core principles: rotate logs so disks never fill up, prefer structured JSON logs because they are easier to parse, set appropriate log levels, and never record sensitive data.

  • Scaling decision framework: High CPU calls for more instances (horizontal scaling), high memory for more memory or a leak fix, slow responses for performance analysis before any scaling, and traffic spikes for autoscaling. It also explains the use cases and trade-offs of vertical scaling (quick fix, single instance) versus horizontal scaling (sustainable, distributed).

  • Health check design: Defines what "healthy" means across several dimensions, such as HTTP 200 responses, working database connections, reachable dependent services, and no resource exhaustion. Choose shallow checks (return 200 only) or deep checks (verify all dependencies) based on what your load balancer needs.
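The structured-log principle above can be sketched with a tiny helper; the field names (`ts`, `level`, `msg`) are illustrative, not a standard:

```shell
# Minimal structured (JSON) log line; field names are an assumption,
# not a fixed standard.
log_json() {  # usage: log_json LEVEL MESSAGE
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2"
}

log_json info "service started"
```

One-line-per-event JSON like this is what makes structured logs easy to filter with grep or jq, compared with free-form text.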
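And the deep-check idea can be sketched as a function that reports healthy only when every dependency check passes; the individual checks here are stubs, not real probes:

```shell
# Deep health check sketch: healthy only if every dependency reports ok.
# In a real service each argument would be the result of an actual probe
# (database ping, dependency HTTP call, disk-space check).
deep_health() {
  for status in "$@"; do
    [ "$status" = ok ] || { echo unhealthy; return 1; }
  done
  echo healthy
}

deep_health ok ok ok            # prints "healthy"
deep_health ok fail ok || true  # prints "unhealthy" (non-zero exit for the LB)
```

A shallow check is the degenerate case: skip the probes and always return 200 while the process is up.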

    Common Questions

    What tools should be used for server management?

    Tool selection depends on your application type and tech stack:

  • Node.js applications: PM2 is preferred. It supports multi-core utilization via built-in cluster mode, and provides zero-downtime reloads and automatic restarts.

  • General Linux applications: Use systemd. It is system-native, easy to configure, and highly stable.

  • Containerized applications: Use Docker or Podman to ensure environment consistency and simplify migration.

  • Large-scale orchestration: Use Kubernetes or Docker Swarm. They support automatic scaling in/out and service discovery.
    The key is not to memorize every tool, but to understand each one's applicable scenarios and trade-offs.
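As a concrete sketch of the systemd option, a minimal unit file might look like this; the service name, binary path, and user are placeholders, not part of the skill:

```shell
# Hypothetical unit file for a generic app ("myapp" and paths are placeholders).
sudo tee /etc/systemd/system/myapp.service >/dev/null <<'EOF'
[Unit]
Description=myapp
After=network.target

[Service]
ExecStart=/usr/local/bin/myapp
Restart=always
RestartSec=10
User=myapp

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now myapp   # start now and on every boot
```

The `Restart=always` / `RestartSec=10` pair is the same auto-restart setting discussed in the next question.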

    How can I implement automatic server process restarts?

    Different tools have different configurations:

  • PM2: Auto-restart is supported by default. Start it with pm2 start app.js --name myapp.

  • systemd: In the service file, set Restart=always and RestartSec=10 to enable automatic restarts.

  • Docker: Use the --restart=unless-stopped policy. In Kubernetes, set the Pod's restartPolicy (Always by default).

    Automatic restarts are the foundation of service stability, but more importantly, find and fix the root cause of the crashes to avoid restart loops.
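For the Docker case, both the policy and the restart-loop symptom are visible from the CLI; the container and image names are placeholders:

```shell
# Start with the restart policy the answer recommends.
docker run -d --name myapp --restart=unless-stopped myimage:latest

# A steadily climbing RestartCount is the restart-loop symptom to watch for;
# if it keeps growing, read the logs for the root cause instead of
# relying on restarts.
docker inspect -f '{{.RestartCount}}' myapp
docker logs --tail 50 myapp
```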

    What metrics need to be monitored in production?

    Production monitoring should cover four dimensions:

  • Availability: uptime, health check success rate

  • Performance: response time (P50, P95, P99), request throughput

  • Errors: error rate, error type distribution, exception stack traces

  • Resources: CPU usage, memory usage, disk space, network traffic
    Set response priorities by alert severity: Critical is handled immediately, Warning is investigated as soon as possible, and Info is reviewed daily. Set up monitoring from day one, not only after problems occur.
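A minimal version of the resource dimension, mapped to the Critical/Warning tiers above; the thresholds and the disk metric are illustrative, and `df --output` assumes GNU coreutils:

```shell
# Map a numeric metric to an alert severity (thresholds are examples).
check_threshold() {  # usage: check_threshold VALUE WARN CRIT
  if [ "$1" -ge "$3" ]; then echo Critical
  elif [ "$1" -ge "$2" ]; then echo Warning
  else echo OK
  fi
}

# Example: root-filesystem usage percentage as reported by df (GNU coreutils).
disk_used=$(df --output=pcent / | tail -1 | tr -dc '0-9')
check_threshold "$disk_used" 80 90
```

Real monitoring stacks (Grafana alerts, Datadog monitors) express the same value-versus-threshold rule declaratively.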

    Should server scaling be vertical or horizontal?

    Both scaling approaches have their own use cases:

    Scaling Type       | Suitable Scenarios                                                | Advantages                                          | Limitations
    Vertical scaling   | Single-instance performance bottlenecks, quick emergency response | Simple configuration, no architecture change needed | High cost, hard ceiling, single point of failure
    Horizontal scaling | Long-term growth, distributed architecture                        | Sustainable scaling, high availability              | Requires load balancing and more complex state management

    Practical recommendation: For small projects, start with vertical scaling to ship quickly; for medium to large projects, prefer horizontal scaling to ensure long-term sustainability. For scenarios with significant traffic fluctuations, configure automatic scaling strategies.
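For the traffic-fluctuation case, Kubernetes expresses an autoscaling strategy in one command; the deployment name and limits are examples, not recommendations:

```shell
# HorizontalPodAutoscaler sketch: target ~70% average CPU,
# scaling between 2 and 10 replicas (all values are examples).
kubectl autoscale deployment myapp --cpu-percent=70 --min=2 --max=10
kubectl get hpa myapp   # observe current vs. target utilization and replicas
```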

    How do you implement zero-downtime service restarts?

    Zero-downtime reload methods:

  • PM2: Use the pm2 reload command to restart processes in the cluster one by one.

  • systemd: Configure multiple instances and update them one by one in combination with a load balancer.

  • Kubernetes: Use the Rolling Update strategy to gradually replace Pods.

  • Docker Swarm: Use a rolling update strategy.
    The core idea: always keep some instances running while gradually replacing them all, so the service stays continuously available. This requires at least two instances and a correctly configured load balancer.
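The reload methods above map to short commands; the app name, container name, and image tag are placeholders:

```shell
pm2 reload myapp   # PM2: restart cluster workers one at a time, no downtime

# Kubernetes rolling update: change the image, then wait for the rollout.
kubectl set image deployment/myapp app=myapp:v2
kubectl rollout status deployment/myapp
kubectl rollout undo deployment/myapp   # roll back if the new version misbehaves
```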

    What is the priority order for troubleshooting server failures?

    A systematic troubleshooting order helps you locate issues quickly:

  1. Check process status: is the service still running?

  2. Check log files: what error messages are present?

  3. Check resource usage: is CPU, memory, or disk exhausted?

  4. Check network connectivity: are ports open, and is DNS resolving correctly?

  5. Check dependent services: are the database and external APIs reachable?

    Following this order prevents wasting time on causes that can be ruled out early. Remember: most clues are in the logs.
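On a systemd host, the five steps above condense to a one-screen triage; the service name, port, and database host are assumptions about your setup:

```shell
systemctl is-active myapp                # 1. process status: running or not?
journalctl -u myapp -n 50 --no-pager     # 2. last 50 log lines
uptime; free -h; df -h /                 # 3. load, memory, disk
ss -ltn; dig +short example.com          # 4. listening ports and DNS resolution
nc -zv db.internal 5432                  # 5. is the database port reachable?
```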

    Who is this skill for?

    This skill is suitable for:

  • Operations beginners: Build the right server management mindset framework and avoid blindly running commands.

  • Backend developers: Understand production environment management principles and collaborate more effectively with operations teams.

  • DevOps engineers: Quickly review key decision points to ensure architectural designs are reasonable.

  • Technology decision-makers: Understand the trade-offs of different tools and solutions to make informed technology choices.
    This skill emphasizes "teaching people how to fish": it teaches you how to think rather than asking you to memorize specific commands and configurations.