server-management
Server management principles and decision-making. Process management, monitoring strategy, and scaling decisions. Teaches thinking, not commands.
Server Management: Principles and Operations Decision Guide
Skill Overview
Server Management provides the core principles and decision framework for managing servers in production environments. It helps you build the right operational mindset—learning how to think rather than memorizing commands.
Applicable Scenarios
When you need to deploy applications to production, this skill offers a complete decision framework—from process management to monitoring and alerting. It helps you choose the right tools (PM2, systemd, Docker, Kubernetes) and establish a stable service management plan.
When services behave abnormally, performance drops, or resources become tight in production, it provides a systematic troubleshooting priority order: check process status first, then analyze logs, evaluate resource usage next, and finally check the network and dependent services.
When facing performance bottlenecks such as excessive CPU usage, memory exhaustion, or slow responses, it helps you decide when to scale vertically versus horizontally, and how to design automatic scaling strategies to handle traffic fluctuations.
Core Features
Based on the application type and tech stack, it provides decision guidance for process management: PM2 for Node.js applications (supports clustering and hot reloads), native Linux systemd for general applications, Docker/Podman for containerized environments, and Kubernetes or Docker Swarm for large-scale orchestration. These tools enable automatic restart on crash, zero-downtime reloads, multi-core clustering, and keeping services running persistently, including across reboots.
Guides you to build a complete monitoring system: availability monitoring (uptime, health checks), performance monitoring (response time, throughput), error monitoring (error rates and types), and resource monitoring (CPU, memory, disk). Depending on your needs, choose suitable monitoring setups: for simple scenarios, use PM2 metrics and htop; for full-stack observability, use Grafana or Datadog; for error tracking, use Sentry; and for availability monitoring, use UptimeRobot or Pingdom.
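As a toy illustration of error monitoring, the following shell snippet computes an error rate from an access log whose last field is an HTTP status code (the log contents and format are made up for the example; adapt the pattern to your server's format):

```shell
# Toy example: compute an error rate from an access log whose last field
# is the HTTP status code. Log contents and format are hypothetical.
log=$(mktemp)
cat > "$log" <<'EOF'
GET /api/users 200
GET /api/users 500
GET /api/orders 200
GET /api/orders 200
EOF
total=$(wc -l < "$log")
errors=$(grep -c ' 5[0-9][0-9]$' "$log")
echo "requests=$total errors=$errors error_rate=$((100 * errors / total))%"
rm -f "$log"
```

Here the error rate works out to 25% (1 of 4 requests); real monitoring stacks compute the same ratio continuously over a sliding window.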
Clarifies log categories (application logs for debugging and auditing, access logs for traffic analysis, error logs for issue detection) and core principles: rotate logs so they cannot fill the disk, prefer structured (JSON) logs because they are easier to parse, set appropriate log levels, and never record sensitive data.
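A minimal sketch of structured JSON logging from a shell script (the field names are illustrative; a real service would use its logging library's JSON formatter, which also handles escaping):

```shell
# Minimal structured-logger sketch: one JSON object per line.
# Assumes the message contains no double quotes (a real logger must escape).
log_json() {
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2"
}
log_json info "service started"
log_json error "db connection refused"
```

One object per line keeps the logs both greppable and machine-parseable, which is the point of structured logging.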
Matches symptoms with responses: high CPU usually calls for adding instances (horizontal scaling), high memory calls for adding memory or fixing leaks, slow responses require performance analysis before any scaling, and traffic spikes call for automatic scaling. Understand the use cases and trade-offs of vertical scaling (a quick fix, single instance) versus horizontal scaling (sustainable, distributed).
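The symptom-to-response mapping can be sketched as a tiny shell dispatcher (the symptom labels and responses are illustrative, not a standard):

```shell
# Illustrative symptom -> first-response dispatcher.
# Symptom names are made-up labels for this example.
scaling_action() {
  case "$1" in
    high_cpu)      echo "scale horizontally: add instances" ;;
    high_memory)   echo "scale vertically: add memory, then look for leaks" ;;
    slow_response) echo "profile first: find the bottleneck before scaling" ;;
    traffic_spike) echo "enable automatic scaling" ;;
    *)             echo "unknown symptom: $1" >&2; return 1 ;;
  esac
}
scaling_action high_cpu
```

The point of the mapping is that scaling is not always the answer: slow responses route to profiling, not to more hardware.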
Defines multi-dimensional checks for service health: HTTP 200 responses, normal database connections, reachable dependent services, and no resource exhaustion. Choose simple checks (return 200 only) or deep checks (verify all dependencies) based on the needs of your load balancer.
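A deep health check can be sketched as an aggregation of dependency probes; in this shell sketch all three probes are stubs standing in for real checks such as a database ping, an HTTP call to a dependency, or a disk-usage threshold:

```shell
# Sketch of a deep health check aggregating dependency probes.
# All three probes are stubs; replace them with real checks.
check_db()   { true; }  # stub for e.g. `pg_isready -q`
check_deps() { true; }  # stub for e.g. `curl -fsS -m 2 http://dep/health`
check_disk() { true; }  # stub for e.g. df usage below 95%
health() {
  if check_db && check_deps && check_disk; then
    echo healthy; return 0
  fi
  echo unhealthy; return 1
}
health
```

A simple check would skip the probes and return 200 unconditionally; the deep variant is more truthful but can cause cascading restarts if a shared dependency flaps, which is why the choice depends on your load balancer's behavior.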
Common Questions
What tools should be used for server management?
Tool selection depends on your application type and tech stack:

- Node.js applications: PM2 (clustering, hot reloads)
- General Linux services: native systemd
- Containerized environments: Docker or Podman
- Large-scale orchestration: Kubernetes or Docker Swarm
The key is not to memorize every tool, but to understand the applicable scenarios and trade-offs for each.
How can I implement automatic server process restarts?
Different tools have different configurations:
- PM2: start the app under supervision with `pm2 start app.js --name myapp`; PM2 restarts it automatically on crash.
- systemd: set `Restart=always` and `RestartSec=10` in the unit file to enable automatic restarts.
- Docker: use the `--restart=unless-stopped` policy, or configure a Kubernetes `restartPolicy`.

Automatic restarts are the foundation of service stability, but more importantly, you must find and fix the root cause of the crashes to avoid restart loops.
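For the systemd case, a minimal hypothetical unit file with automatic restarts enabled might look like this (the service name, paths, and user are placeholders):

```ini
# Hypothetical unit file: /etc/systemd/system/myapp.service
[Unit]
Description=My application
After=network.target

[Service]
ExecStart=/usr/bin/node /opt/myapp/app.js
Restart=always
RestartSec=10
User=myapp

[Install]
WantedBy=multi-user.target
```

After `systemctl daemon-reload` and `systemctl enable --now myapp`, systemd restarts the process 10 seconds after any exit.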
What metrics need to be monitored in production?
Production monitoring should cover four dimensions: availability (uptime, health checks), performance (response time, throughput), errors (error rates and types), and resources (CPU, memory, disk usage).
Set response priorities based on alert severity: Critical handled immediately, Warning investigated as soon as possible, and Info reviewed daily. Monitoring should be set up from day one—not added only after problems occur.
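The severity-to-response policy can be expressed as a small lookup, sketched here in shell (the severity names follow the text above; the responses are examples of one possible policy):

```shell
# Severity -> response lookup, sketched in shell.
# The responses are examples of a policy, not a standard.
alert_response() {
  case "$1" in
    critical) echo "page on-call immediately" ;;
    warning)  echo "investigate as soon as possible" ;;
    info)     echo "review in the daily summary" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
alert_response critical
```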
Should server scaling be vertical or horizontal?
Both scaling approaches have their own use cases:
| Scaling Type | Suitable Scenarios | Advantages | Limitations |
|---|---|---|---|
| Vertical scaling | Single-instance performance bottlenecks, quick emergency response | Simple configuration, no architecture change needed | High cost, capped ceiling, single point of failure |
| Horizontal scaling | Long-term growth, distributed architecture | Sustainable scaling, high availability | Requires load balancing and more complex state management |
Practical recommendation: For small projects, start with vertical scaling to ship quickly; for medium to large projects, prefer horizontal scaling to ensure long-term sustainability. For scenarios with significant traffic fluctuations, configure automatic scaling strategies.
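For the autoscaling case, a hypothetical Kubernetes HorizontalPodAutoscaler that keeps between 2 and 10 replicas at a 70% average CPU target might look like this (all names are placeholders):

```yaml
# Hypothetical HPA: scale a Deployment named "myapp" between 2 and 10
# replicas, targeting 70% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The `minReplicas: 2` floor also preserves the redundancy needed for zero-downtime restarts during quiet periods.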
How do you implement zero-downtime service restarts?
Zero-downtime reload methods:
- PM2 cluster mode: the `pm2 reload` command restarts the processes in the cluster one at a time.

The core idea is to always keep some instances running, gradually replace them all, and ensure the service remains continuously available. This requires at least 2 instances and a correctly configured load balancer.
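The "always keep some instances running" idea behind a rolling restart can be illustrated with a toy simulation (the numbers are arbitrary; a real rollout is driven by PM2 or your orchestrator):

```shell
# Toy simulation of a rolling restart across 3 instances: take one out,
# replace it, bring it back, and track the minimum number serving.
up_old=3   # instances still running the old version
up_new=0   # instances already running the new version
min_up=$((up_old + up_new))
i=0
while [ "$i" -lt 3 ]; do
  up_old=$((up_old - 1))            # take one old instance out of rotation
  up=$((up_old + up_new))
  if [ "$up" -lt "$min_up" ]; then min_up=$up; fi
  up_new=$((up_new + 1))            # it comes back on the new version
  i=$((i + 1))
done
echo "minimum instances serving during rollout: $min_up"
```

The minimum never drops below 2 here, which is why the text requires at least 2 instances: with a single instance, the minimum during any restart is 0.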
What is the priority order for troubleshooting server failures?
A systematic troubleshooting order helps you locate issues quickly:

1. Check process status (is the service running at all?)
2. Analyze the logs (what was it doing when it failed?)
3. Evaluate resource usage (CPU, memory, disk)
4. Check the network and dependent services
Following this order prevents wasting time on issues that may have already been ruled out. Remember: most clues can be found in the logs.
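The four-step order can be sketched as a shell triage helper; the service name and health-check URL are placeholders to adapt to your environment:

```shell
# Sketch of the four-step triage; "myapp" and the health URL are
# placeholders to adapt to your environment.
triage() {
  app="${1:-myapp}"
  echo "[1] process status"
  pgrep -af "$app" 2>/dev/null || echo "  -> no matching process found"
  echo "[2] recent logs"
  journalctl -u "$app" -n 20 --no-pager 2>/dev/null \
    || echo "  -> journalctl unavailable; tail the app's log file instead"
  echo "[3] resource usage"
  uptime
  df -h / | tail -n 1
  echo "[4] network and dependencies"
  curl -fsS -m 2 "http://localhost:8080/health" 2>/dev/null \
    || echo "  -> health endpoint unreachable"
}
triage myapp
```

Each step narrows the search space before the next, which is exactly what the ordering above is for.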
Who is this skill for?
This skill is suitable for anyone responsible for deploying and operating services in production.
This skill emphasizes “teaching people how to fish”: it teaches you how to think, not just memorize specific commands and configurations.