Redis AOF Problem

Background

Several Redis groups in production currently have AOF enabled for data persistence, using the default appendfsync setting, i.e. everysec. A while ago, during a routine operation, a slaveof request was issued to a Redis master A, and both that Redis master A and another Redis master B running on the same machine became unavailable.

Symptoms

Both Redis instances were unavailable for a dozen or so seconds, and both produced the following error log:

Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

The log says "this may slow down Redis", i.e. the Redis service may be slowed down. But the AOF fsync is executed by a background thread, so shouldn't it leave the normal worker thread unaffected?

With this question in mind, I ran the following experiments.

Simulating the scenario

The problem can be abstracted as the effect of a slow disk on a Redis instance. For simulating a slow disk, 甘大神 recommended a Device Mapper target: dm-delay.

Test environment

os: Debian 3.16.36-1+deb8u2
gcc: 4.9.2
redis: 2.8.21
kernel-source: linux-3.16.39

Detailed steps

Create a local file local-device and map it to the local block device /dev/loop0:

dd if=/dev/zero of=/root/redis-perf/local-device bs=4k count=256k
losetup /dev/loop0 /root/redis-perf/local-device

Use dmsetup to add latency to the block device

Set the write delay to 5 s:

cd /root/redis-perf
./block-dev.sh /dev/loop0 0 5000

The content of block-dev.sh is as follows:

└─[$] cat block-dev.sh
#!/bin/bash
if [ $# -lt 3 ] ; then
    echo "usage: $0 \$device \$read_delay_ms \$write_delay_ms"
    exit -1
fi
# Size in 512-byte sectors
size=$(blockdev --getsize $1)
# Create a delayed block device named "delayed"
echo "0 $size delay $1 0 $2 $1 0 $3" | dmsetup create delayed

Mount it at the directory dir-on-delayed-device (the mapped device needs a filesystem first; judging from the mount output below it was formatted as ext2, e.g. with mkfs.ext2, although that step is not shown here):

mount /dev/dm-0 /root/redis-perf/dir-on-delayed-device

Check the result after mounting:

└─[$] mount |grep dir-on-delayed-device
/dev/mapper/delayed on /root/redis-perf/dir-on-delayed-device type ext2 (rw,relatime)
└─[$] df -lh |head -1; df -lh |grep dir-on-delayed-device
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/delayed 1008M 1.3M 956M 1% /root/redis-perf/dir-on-delayed-device

Starting Redis

Use the dir-on-delayed-device directory created above as the working directory (dir).

The Redis configuration:

└─[$] cat redis-7379.conf
port 7379
dir /root/redis-perf/dir-on-delayed-device
daemonize yes
pidfile redis.pid
loglevel notice
logfile redis.log
maxmemory 256m
maxmemory-policy allkeys-lru
appendonly yes
appendfilename appendonly.aof
appendfsync everysec
auto-aof-rewrite-percentage 0
save ""
dbfilename dump.rdb

Start it:

cd /root/redis-2.8.21
./src/redis-server redis-7379.conf

Reproducing the error

Use the routine.sh script to write data to Redis and measure the response time: requests that take more than 300 ms are logged on their own line, while the rest just flash by.

cd /root/redis-perf
./routine.sh localhost 7379

The content of routine.sh:

└─[$] cat routine.sh
#!/bin/bash
if [ $# -lt 2 ] ; then
    echo "usage: $0 \$redis-ip \$redis-port"
    exit -1
fi
while true
do
    st=$(($(date +%s%N)/1000000))
    redis-cli -h $1 -p $2 set foo bar > /dev/null
    et=$(($(date +%s%N)/1000000))
    ((cost=et-st))
    if [ $cost -gt 300 ]; then
        echo ""
        echo "cost: $cost, "`date +%T`
    else
        echo -ne "\033[0K\rcost: $cost"
        # sleep 5
    fi
done

The same error as before appeared in the Redis log:

[9504] 10:27:21.067 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[9504] 10:27:23.075 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[9504] 10:27:25.079 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[9504] 10:27:27.087 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

The message is printed once every 2 s.

The output of routine.sh also shows:

└─[$] ./routine.sh localhost 7379 [10:24:45]
cost: 3
cost: 8790, 10:27:43
cost: 2
cost: 8800, 10:28:08
cost: 3
cost: 8797, 10:28:33
cost: 7
cost: 8844, 10:28:58
cost: 2
cost: 9893, 10:29:23
cost: 2
cost: 8899, 10:29:43
cost: 3
cost: 9119, 10:30:04
cost: 2
cost: 9990, 10:30:29
cost: 3
cost: 8093, 10:30:54
cost: 2
cost: 8087, 10:31:19
cost: 2
cost: 8086, 10:31:44
cost: 3
cost: 4183, 10:31:59
cost: 2
cost: 8189, 10:32:24
cost: 3
cost: 8197, 10:32:49
cost: 4
cost: 8301, 10:33:14
cost: 3
cost: 8302, 10:33:39
cost: 2
cost: 8401, 10:34:04

The slow requests appear with a somewhat regular, but not strictly regular, pattern: the interval between them is mostly around 25 s, occasionally 15 s or 20 s; the response time of a slow request is mostly around 8 s, occasionally 4 s or 10 s.

Both problems, the error log and the impact on the service, were reproduced in the experiment above. The following sections try to explain these observations.

Explaining the observations

There are three questions to answer:

  • Why is the "this may slow down Redis" error logged every 2 s?
  • Why do slow requests appear at all?
  • Why do the slow requests have the response times and intervals observed above?

this may slow down Redis

First, the relevant Redis code in src/aof.c:

if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
    sync_in_progress = bioPendingJobsOfType(REDIS_BIO_AOF_FSYNC) != 0;

if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
    /* With this append fsync policy we do background fsyncing.
     * If the fsync is still in progress we can try to delay
     * the write for a couple of seconds. */
    if (sync_in_progress) {
        if (server.aof_flush_postponed_start == 0) {
            /* No previous write postponing, remember that we are
             * postponing the flush and return. */
            server.aof_flush_postponed_start = server.unixtime;
            return;
        } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
            /* We were already waiting for fsync to finish, but for less
             * than two seconds this is still ok. Postpone again. */
            return;
        }
        /* Otherwise fall trough, and go write since we can't wait
         * over two seconds. */
        server.aof_delayed_fsync++;
        redisLog(REDIS_NOTICE,"Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.");
    }
}
/* We want to perform a single write. This should be guaranteed atomic
 * at least if the filesystem we are writing is a real physical one.
 * While this will save us against the server being killed I don't think
 * there is much to do about the whole server stopping for power problems
 * or alike */
latencyStartMonitor(latency);
nwritten = write(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
latencyEndMonitor(latency);

Let's focus on how the system behaves when appendfsync is set to AOF_FSYNC_EVERYSEC.

Under the AOF_FSYNC_EVERYSEC policy, Redis first checks whether a background thread is still doing an fsync (the result is stored in the variable sync_in_progress) and then:

  • If one is in progress and this is the first postponed call, record server.aof_flush_postponed_start and return.
  • If one is in progress and the flush has been postponed for less than 2 s, return.
  • If one is in progress and more than 2 s have passed without it finishing, increment the counter server.aof_delayed_fsync,
    log "Asynchronous AOF fsync is taking too long (disk is busy?)", and go ahead and perform the write anyway.

For AOF_FSYNC_EVERYSEC, the end of this function also decides whether to call aof_background_fsync; at most one background
thread is doing an fsync at any given time (see the sketch below).

This postponement logic explains why the log message is printed every 2 s.
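
For reference, the tail of flushAppendOnlyFile looks roughly like the following. This is a simplified paraphrase of the Redis 2.8 source, not a verbatim quote; the point is that a background fsync job is queued only when no fsync is already pending, and at most once per second:

/* Simplified paraphrase of the end of flushAppendOnlyFile() in src/aof.c
 * (Redis 2.8); not the verbatim source. */
if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
    /* Synchronous fsync on every flush. */
    aof_fsync(server.aof_fd);
    server.aof_last_fsync = server.unixtime;
} else if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
           server.unixtime > server.aof_last_fsync) {
    /* Queue a background fsync only if the bio thread is idle,
     * and at most once per second. */
    if (!sync_in_progress) aof_background_fsync(server.aof_fd);
    server.aof_last_fsync = server.unixtime;
}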

Slow requests

The cause of the slow requests is the crux of the problem. antirez, the author of Redis, has also written a similar blog post about this issue.
The problem can be reduced to: does an fsync operation block a write operation? Run the following experiment: start two threads, one calling write and the other calling fsync, and see whether the write calls get blocked.

cd /root/redis-perf
gcc -o fsync-test fsync-test.cc -lpthread
./fsync-test /root/redis-perf/dir-on-delayed-device/test

The source of fsync-test.cc can be found here.
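
Since the source is only linked above, here is a minimal sketch of what such a test might look like (my own reconstruction, not the original fsync-test.cc): one thread keeps rewriting the same four bytes at offset 0 and times each write, while the other thread repeatedly calls fsync and times it. If saved as plain C it can be built with something like gcc -o fsync-test fsync-test.c -lpthread and pointed at a file on the delayed device.

/* Minimal sketch of a write-vs-fsync test (my reconstruction, not the original
 * fsync-test.cc). One thread rewrites the same 4 bytes and times write(2);
 * another thread repeatedly calls fsync(2) and times it. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static int fd;

static void *write_thread(void *arg) {
    long n = 0;
    for (;;) {
        time_t st = time(NULL);
        if (pwrite(fd, "abcd", 4, 0) < 0) { perror("pwrite"); exit(1); }
        time_t cost = time(NULL) - st;
        n++;
        if (cost >= 1)  /* only report writes that were visibly blocked */
            printf("%ld, %ld, write cost: %lds\n", (long)time(NULL), n, (long)cost);
    }
    return NULL;
}

static void *fsync_thread(void *arg) {
    long n = 0;
    for (;;) {
        n++;
        printf("%ld, %ld, start to fsync\n", (long)time(NULL), n);
        time_t st = time(NULL);
        if (fsync(fd) < 0) { perror("fsync"); exit(1); }
        printf("%ld, %ld, fsync cost: %lds\n",
               (long)time(NULL), n, (long)(time(NULL) - st));
        sleep(1);  /* roughly mimic the once-per-second AOF fsync */
    }
    return NULL;
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    fd = open(argv[1], O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    pthread_t wt, ft;
    pthread_create(&wt, NULL, write_thread, NULL);
    pthread_create(&ft, NULL, fsync_thread, NULL);
    pthread_join(wt, NULL);
    pthread_join(ft, NULL);
    return 0;
}

The write thread only reports writes that took at least a second, so under normal conditions it stays quiet and only the blocked writes show up, similar to the original log below.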

The log printed by the test:

fsync test start
Thread write_thread created successfully..
Thread fsync_thread created successfully..
2016-12-10 05:44:16, 1, start to fsync
2016-12-10 05:44:21, 1, fsync cost: 5s
2016-12-10 05:44:21, 2, start to fsync
2016-12-10 05:44:31, 17916650, write cost: 5s
2016-12-10 05:44:36, 2, fsync cost: 15s
2016-12-10 05:44:36, 3, start to fsync
2016-12-10 05:44:46, 35689281, write cost: 5s
2016-12-10 05:44:51, 3, fsync cost: 15s
2016-12-10 05:44:51, 4, start to fsync
2016-12-10 05:45:01, 52569000, write cost: 5s
2016-12-10 05:45:06, 4, fsync cost: 15s
2016-12-10 05:45:06, 5, start to fsync
2016-12-10 05:45:16, 69821812, write cost: 5s
2016-12-10 05:45:21, 5, fsync cost: 15s
2016-12-10 05:45:21, 6, start to fsync
2016-12-10 05:45:31, 87693673, write cost: 5s
2016-12-10 05:45:36, 6, fsync cost: 15s
2016-12-10 05:45:36, 7, start to fsync
2016-12-10 05:45:46, 105313877, write cost: 5s
2016-12-10 05:45:51, 7, fsync cost: 15s
2016-12-10 05:45:51, 8, start to fsync
2016-12-10 05:46:01, 122647315, write cost: 5s
2016-12-10 05:46:06, 8, fsync cost: 15s
2016-12-10 05:46:06, 9, start to fsync
2016-12-10 05:46:16, 140379006, write cost: 5s
2016-12-10 05:46:21, 9, fsync cost: 15s
2016-12-10 05:46:21, 10, start to fsync

The log format is: timestamp, sequence number of the write or fsync operation, message.
The result shows that fsync can indeed block write.

At this point I got a bit stuck and asked some colleagues. 浅小奕 said that reading the code beats blind guessing, and 亚小方 from the kernel team pointed out that this behaviour should depend on the specific filesystem implementation.

Check the filesystem type used by the test directory:

└─[$] findmnt -n -o FSTYPE -T /root/redis-perf/dir-on-delayed-device
ext2

First, look at the implementation of the fsync system call, in fs/sync.c:

/**
 * vfs_fsync - perform a fsync or fdatasync on a file
 * @file: file to sync
 * @datasync: only perform a fdatasync operation
 *
 * Write back data and metadata for @file to disk. If @datasync is
 * set only metadata needed to access modified file data is written.
 */
int vfs_fsync(struct file *file, int datasync)
{
    return vfs_fsync_range(file, 0, LLONG_MAX, datasync);
}
EXPORT_SYMBOL(vfs_fsync);

static int do_fsync(unsigned int fd, int datasync)
{
    struct fd f = fdget(fd);
    int ret = -EBADF;

    if (f.file) {
        ret = vfs_fsync(f.file, datasync);
        fdput(f);
    }
    return ret;
}

SYSCALL_DEFINE1(fsync, unsigned int, fd)
{
    return do_fsync(fd, 0);
}

SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
{
    return do_fsync(fd, 1);
}

It eventually calls vfs_fsync_range:

/**
 * vfs_fsync_range - helper to sync a range of data & metadata to disk
 * @file: file to sync
 * @start: offset in bytes of the beginning of data range to sync
 * @end: offset in bytes of the end of data range (inclusive)
 * @datasync: perform only datasync
 *
 * Write back data in range @start..@end and metadata for @file to disk. If
 * @datasync is set only metadata needed to access modified file data is
 * written.
 */
int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
    if (!file->f_op->fsync)
        return -EINVAL;
    return file->f_op->fsync(file, start, end, datasync);
}
EXPORT_SYMBOL(vfs_fsync_range);

Here we can see that file->f_op->fsync is called. The ext2 implementation is in fs/ext2/file.c:

const struct file_operations ext2_file_operations = {
    .llseek         = generic_file_llseek,
    .read           = new_sync_read,
    .write          = new_sync_write,
    .read_iter      = generic_file_read_iter,
    .write_iter     = generic_file_write_iter,
    .unlocked_ioctl = ext2_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl   = ext2_compat_ioctl,
#endif
    .mmap           = generic_file_mmap,
    .open           = dquot_file_open,
    .release        = ext2_release_file,
    .fsync          = ext2_fsync,
    .splice_read    = generic_file_splice_read,
    .splice_write   = iter_file_splice_write,
};
...
int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
{
    int ret;
    struct super_block *sb = file->f_mapping->host->i_sb;
    struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;

    ret = generic_file_fsync(file, start, end, datasync);
    if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
        /* We don't really know where the IO error happened... */
        ext2_error(sb, __func__,
                   "detected IO error when writing metadata buffers");
        ret = -EIO;
    }
    return ret;
}

The implementation of generic_file_fsync is in fs/libfs.c:

/**
 * generic_file_fsync - generic fsync implementation for simple filesystems
 *                      with flush
 * @file:     file to synchronize
 * @start:    start offset in bytes
 * @end:      end offset in bytes (inclusive)
 * @datasync: only synchronize essential metadata if true
 *
 */
int generic_file_fsync(struct file *file, loff_t start, loff_t end,
                       int datasync)
{
    struct inode *inode = file->f_mapping->host;
    int err;

    err = __generic_file_fsync(file, start, end, datasync);
    if (err)
        return err;
    return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
}
EXPORT_SYMBOL(generic_file_fsync);

/**
 * __generic_file_fsync - generic fsync implementation for simple filesystems
 *
 * @file:     file to synchronize
 * @start:    start offset in bytes
 * @end:      end offset in bytes (inclusive)
 * @datasync: only synchronize essential metadata if true
 *
 * This is a generic implementation of the fsync method for simple
 * filesystems which track all non-inode metadata in the buffers list
 * hanging off the address_space structure.
 */
int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
                         int datasync)
{
    struct inode *inode = file->f_mapping->host;
    int err;
    int ret;

    err = filemap_write_and_wait_range(inode->i_mapping, start, end);
    if (err)
        return err;

    mutex_lock(&inode->i_mutex);
    ret = sync_mapping_buffers(inode->i_mapping);
    if (!(inode->i_state & I_DIRTY))
        goto out;
    if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
        goto out;

    err = sync_inode_metadata(inode, 1);
    if (ret == 0)
        ret = err;

out:
    mutex_unlock(&inode->i_mutex);
    return ret;
}
EXPORT_SYMBOL(__generic_file_fsync);

It takes inode->i_mutex. For fdatasync, it may not need to touch the device at all and can jump out of the locked region early:

if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
    goto out;

So, does the write path also take inode->i_mutex?
A write call ends up in generic_file_write_iter, in mm/filemap.c:

/**
 * generic_file_write_iter - write data to a file
 * @iocb: IO state structure
 * @from: iov_iter with data to write
 *
 * This is a wrapper around __generic_file_write_iter() to be used by most
 * filesystems. It takes care of syncing the file in case of O_SYNC file
 * and acquires i_mutex as needed.
 */
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
    struct file *file = iocb->ki_filp;
    struct inode *inode = file->f_mapping->host;
    ssize_t ret;

    mutex_lock(&inode->i_mutex);
    ret = __generic_file_write_iter(iocb, from);
    mutex_unlock(&inode->i_mutex);

    if (ret > 0) {
        ssize_t err;

        err = generic_write_sync(file, iocb->ki_pos - ret, ret);
        if (err < 0)
            ret = err;
    }
    return ret;
}
EXPORT_SYMBOL(generic_file_write_iter);

It locks the inode as well.
At this point the slow requests have been reproduced and the reason for them has been explained at the code level.

In the experiment above, each fsync call takes 15 s, and write starts being blocked 5 s after the fsync begins, remaining blocked for 5 s.
This matches a model in which fsync goes through three phases A, B and C, each taking 5 s (the configured device delay is 5 s), where only phase B blocks write and phases A and C have no effect on it.

From the code above, phase B is this part of __generic_file_fsync:

mutex_lock(&inode->i_mutex);
ret = sync_mapping_buffers(inode->i_mapping);
if (!(inode->i_state & I_DIRTY))
    goto out;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
    goto out;

err = sync_inode_metadata(inode, 1);
if (ret == 0)
    ret = err;

out:
mutex_unlock(&inode->i_mutex);

So what is fsync doing during phases A and C?
A systemtap script was used to measure how long each phase takes; the script looks something like this:

global PROCNAME="fsync-test"
global i=0
global tb=0
global te=0
global func0="generic_file_fsync"
global func1="__generic_file_fsync"
global func2="filemap_write_and_wait_range"
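# Note: only one kernel function is probed per run; to time func0, func1 and func2
# separately, swap the probed function name in the probes below and re-run the
# script (this is the fairly manual process mentioned after the script).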
probe kernel.function("generic_file_fsync") {
    if(execname() == PROCNAME) {
        tb=gettimeofday_us();
        printf("begin: %s\n", execname());
    }
}
probe kernel.function("generic_file_fsync").return {
    if(execname() == PROCNAME) {
        te=gettimeofday_us();
        if (te - tb > 1000000) {
            i++;
            printf("time_us: %d\n", te-tb);
            print_backtrace();
            printf("end: %s\n", execname());
            if(i > 5) {
                exit();
            }
        }
    }
}

Measuring func0/func1/func2 one run at a time like this is fairly manual :(
The conclusion: phase A is filemap_write_and_wait_range and phase C is blkdev_issue_flush.
With that, all the data produced by the fsync-test experiment can be explained.

Slow-request latency and intervals

In the Redis test, the latency of the slow requests and the intervals between them were not very regular:

The slow requests appear with a somewhat regular, but not strictly regular, pattern: the interval between them is mostly around 25 s, occasionally 15 s or 20 s; the response time of a slow request is mostly around 8 s, occasionally 4 s or 10 s.

These numbers cannot be fully explained yet. The difference from fsync-test is that fsync-test always rewrites the same four bytes at the same position, whereas the Redis AOF is append-only; writing the AOF may also trigger the kernel's pdflush to flush dirty pages, which probably plays some part in this.

References

[1] https://www.kernel.org/doc/Documentation/device-mapper/delay.txt
[2] http://www.enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html
[3] https://github.com/axboe/fio
[4] http://www.cnblogs.com/wuchanming/p/4820830.html
[5] http://www.thegeekstuff.com/2012/04/create-threads-in-linux
[6] http://blog.yufeng.info/archives/1283