2025年12月07日/ 浏览 3
正文:
凌晨2点的告警短信像催命符般震动手机——生产数据库突然拒绝连接。顶着困意打开终端,一场与PostgreSQL的搏斗就此展开。
▌ 阶段一:连接风暴的真相
当psql -h 127.0.0.1 -U postgres反复返回Connection refused时,第一反应不是直接重启服务,而是先检查三个关键点:
bash
ss -tuln | grep 5432
firewall-cmd –list-ports | grep 5432
ps -ef | grep postgres | grep -v grep
发现端口处于LISTEN状态但防火墙未放行,原来同事下午调整安全策略时漏掉了数据库端口。临时解决:
bash
firewall-cmd --add-port=5432/tcp --permanent
firewall-cmd --reload
▌ 阶段二:查询陷入泥潭
刚解决连接问题,监控面板突然显示CPU飙升至90%。用pg_top定位到某个账单生成进程长期占据资源:
sql
SELECT pid, query_start, query
FROM pg_stat_activity
WHERE state = 'active' ORDER BY query_start;
发现一条每月执行的报表查询竟运行了2小时!进一步用EXPLAIN ANALYZE抓取执行计划,揭示出缺失的索引:
sql
CREATE INDEX CONCURRENTLY idx_order_date ON orders(created_at);
避坑指南:在亿级数据表创建索引务必使用
CONCURRENTLY,否则会触发表级锁阻塞业务。
▌ 阶段三:神秘的锁争夺战
凌晨4点,客服系统突然报出订单更新异常。通过锁监控视图精准定位死锁链条:
sql
SELECT
blocked.pid AS blocked_pid,
blocking.pid AS blocking_pid,
blocked.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked ON blocked.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
强制终止阻塞进程后,立即修改应用程序逻辑:将长事务拆解为小批量提交,并增设锁等待超时参数:
postgresql
SET lock_timeout = '2s';
▌ 深度调优:从救火到防火
天亮前最后一步——预防性检查:
1. 日志挖矿:放大/var/lib/pgsql/12/data/pg_log/postgresql-*.log中的WARNING级日志
2. 空间定时炸弹:设置监控告警规则
bash
df -h /var/lib/pgsql # 存储空间
SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10; # 慢查询TOP10
3. 内存暗流:调整shared_buffers为物理内存的25%,并监控vmstat的si/so交换值
当晨光透过机房窗户时,数据库指标终于恢复平稳。这场战役教会我:真正的故障排查不是背命令,而是像侦探一样串联证据链——从网络层到SQL层,从系统指标到事务逻辑,每一步都需要用数据说话。留下最后一道锦囊:定期执行pg_prewarm预热常用数据,让危机化解在黎明之前。