When a process writes to a file, the data is not written to the storage device right away; it lands in the page cache first. The filesystem periodically writes dirty pages back to the storage device, and a process can also invoke a call such as sync to flush dirty pages back explicitly.
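As a quick orientation, here is a minimal userspace example (the file name is illustrative) showing the two sides of this: write(2) only dirties the page cache, while fsync(2)/sync(2) force the flush to the device:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("data.txt", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, "hello\n", 6) != 6)   /* data lands in the page cache only */
                perror("write");
        fsync(fd);                          /* flush this file's data and metadata */
        close(fd);
        sync();                             /* ask the kernel to flush everything */
        return 0;
}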
Data structures
backing_dev_info
To understand this structure, start from the problem it solves. Early versions of Linux used pdflush threads to write "dirty" pages back to disk. The pdflush threads were "generic": every request queue in the system shared this limited pool of threads, so a system with multiple queues could see contention for them, degrading I/O throughput.
What we want instead is one flush thread per request queue, and that is what backing_dev_info provides: each request queue manages its own writeback threads through this structure. Its main fields are:
include/linux/backing-dev-defs.h
struct backing_dev_info {
........
struct list_head bdi_list;
.......
struct bdi_writeback wb; /* the root writeback info for this bdi */
......
};
bdi_list: links this backing_dev_info into the global bdi_list
wb: the core member that controls writeback behavior, described in detail below
bdi_writeback
bdi_writeback is defined as follows:
struct bdi_writeback {
...
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
...
struct delayed_work dwork; /* work item used for writeback */
struct list_head work_list;
...
};
b_dirty: holds all the dirty inodes of the filesystem
b_io: holds the inodes queued for writeback to the storage device
dwork: the delayed work item that writes dirty pages back to storage; its work function is wb_workfn
work_list: each writeback request is a work item linked onto this list
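The binding between dwork and its work function, and the initialization of these lists, happen when the bdi_writeback is set up. wb_init() in mm/backing-dev.c looks roughly like this (abridged; details vary across kernel versions):

static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, ...)
{
        ...
        INIT_LIST_HEAD(&wb->b_dirty);
        INIT_LIST_HEAD(&wb->b_io);
        ...
        INIT_LIST_HEAD(&wb->work_list);
        INIT_DELAYED_WORK(&wb->dwork, wb_workfn);   /* dwork -> wb_workfn */
        ...
}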
mark inode dirty
When an inode's attributes or data change, the inode must be marked dirty and placed on the b_dirty list of the bdi_writeback. The main function for this is __mark_inode_dirty:
fs/fs-writeback.c
void __mark_inode_dirty(struct inode *inode, int flags)
{
struct super_block *sb = inode->i_sb;
...
/*
* Paired with smp_mb() in __writeback_single_inode() for the
* following lockless i_state test. See there for details.
*/
smp_mb();
if (((inode->i_state & flags) == flags) ||
(dirtytime && (inode->i_state & I_DIRTY_INODE))) /* 1 */
return;
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;
inode_attach_wb(inode, NULL);
....
inode->i_state |= flags; /* 2 */
.....
/*
* If the inode was already on b_dirty/b_io/b_more_io, don't
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
struct bdi_writeback *wb;
struct list_head *dirty_list;
bool wakeup_bdi = false;
wb = locked_inode_to_wb_and_lock_list(inode);
WARN((wb->bdi->capabilities & BDI_CAP_WRITEBACK) &&
!test_bit(WB_registered, &wb->state),
"bdi-%s not registered\n", bdi_dev_name(wb->bdi));
inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;
if (inode->i_state & I_DIRTY)
dirty_list = &wb->b_dirty;
else
dirty_list = &wb->b_dirty_time;
wakeup_bdi = inode_io_list_move_locked(inode, wb,
dirty_list); /* 3 */
spin_unlock(&wb->list_lock);
trace_writeback_dirty_inode_enqueue(inode);
/*
* If this is the first dirty inode for this bdi,
* we have to wake-up the corresponding bdi thread
* to make sure background write-back happens
* later.
*/
if (wakeup_bdi &&
(wb->bdi->capabilities & BDI_CAP_WRITEBACK))
wb_wakeup_delayed(wb); /* 4 */
return;
}
}
out_unlock_inode:
spin_unlock(&inode->i_lock);
}
(1) Check whether the flags are already set; if so, return immediately. Here flags is I_DIRTY or I_DIRTY_SYNC.
(2) Set the inode's i_state.
(3) Link the inode onto the matching list of the bdi_writeback.
(4) If the b_dirty list was empty before, inode_io_list_move_locked in step 3 returns wakeup_bdi == true; in that case the delayed work inside bdi_writeback is woken up and writeback begins.
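Filesystems rarely call __mark_inode_dirty directly; they normally go through the mark_inode_dirty()/mark_inode_dirty_sync() wrappers from include/linux/fs.h. Here is a hypothetical illustration of a filesystem dirtying an inode after a timestamp update (example_touch_mtime is not a real kernel function):

/* Hypothetical example: change in-core state, then mark the inode
 * dirty so it is queued on the bdi_writeback's b_dirty list. */
static void example_touch_mtime(struct inode *inode)
{
        inode->i_mtime = current_time(inode);
        mark_inode_dirty(inode);    /* wraps __mark_inode_dirty(inode, I_DIRTY) */
}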
Periodic writeback
The system periodically writes dirty pages back to storage; this section walks through that flow.
The writeback entry point is wb_workfn: both marking an inode dirty and sync eventually end up here.
fs/fs-writeback.c
void wb_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(to_delayed_work(work),
struct bdi_writeback, dwork);
long pages_written;
set_worker_desc("flush-%s", bdi_dev_name(wb->bdi));
current->flags |= PF_SWAPWRITE;
if (likely(!current_is_workqueue_rescuer() ||
!test_bit(WB_registered, &wb->state))) {
/*
* The normal path. Keep writing back @wb until its
* work_list is empty. Note that this path is also taken
* if @wb is shutting down even when we're running off the
* rescuer as work_list needs to be drained.
*/
do {
pages_written = wb_do_writeback(wb); /* 1 */
trace_writeback_pages_written(pages_written);
} while (!list_empty(&wb->work_list));
} else {
/*
* bdi_wq can't get enough workers and we're running off
* the emergency worker. Don't hog it. Hopefully, 1024 is
* enough for efficient IO.
*/
pages_written = writeback_inodes_wb(wb, 1024,
WB_REASON_FORKER_THREAD);
trace_writeback_pages_written(pages_written);
}
if (!list_empty(&wb->work_list)) /* 2 */
wb_wakeup(wb);
else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
wb_wakeup_delayed(wb);
current->flags &= ~PF_SWAPWRITE;
}
(1) Call wb_do_writeback to perform the actual writeback.
(2) If there is more queued work (wb->work_list is non-empty) or there is still dirty I/O (wb_has_dirty_io(wb)), reschedule itself.
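For reference, wb_wakeup_delayed() (mm/backing-dev.c, abridged; minor details vary by kernel version) is what arms dwork with the periodic delay, both here and in step 4 of __mark_inode_dirty above:

void wb_wakeup_delayed(struct bdi_writeback *wb)
{
        unsigned long timeout;

        timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
        spin_lock_bh(&wb->work_lock);
        if (test_bit(WB_registered, &wb->state))
                queue_delayed_work(bdi_wq, &wb->dwork, timeout);
        spin_unlock_bh(&wb->work_lock);
}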
wb_do_writeback is the central dispatch function for writeback; it is defined as follows:
static long wb_do_writeback(struct bdi_writeback *wb)
{
struct wb_writeback_work *work;
long wrote = 0;
set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(wb)) != NULL) { /* 1 */
trace_writeback_exec(wb, work);
wrote += wb_writeback(wb, work);
finish_writeback_work(wb, work);
}
/*
* Check for a flush-everything request
*/
wrote += wb_check_start_all(wb);
/*
* Check for periodic writeback, kupdated() style
*/
wrote += wb_check_old_data_flush(wb); /* 2 */
wrote += wb_check_background_flush(wb); /* 3 */
clear_bit(WB_writeback_running, &wb->state);
return wrote;
}
(1) Process the currently queued work items. Synchronous callers such as sync add a wb_writeback_work to the wb, and it is executed here (see the abridged get_next_work_item after this list).
(2) The entry point for periodic writeback: check whether any dirty pages have expired.
(3) The entry point for background writeback: if the number of dirty pages exceeds the system limit, write them back.
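Step (1)'s get_next_work_item() (fs/fs-writeback.c, abridged) simply pops the next wb_writeback_work off wb->work_list under the work lock:

static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
{
        struct wb_writeback_work *work = NULL;

        spin_lock_bh(&wb->work_lock);
        if (!list_empty(&wb->work_list)) {
                work = list_entry(wb->work_list.next,
                                  struct wb_writeback_work, list);
                list_del_init(&work->list);
        }
        spin_unlock_bh(&wb->work_lock);
        return work;
}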
Below we focus on step 2, the periodic-writeback logic.
fs/fs-writeback.c
wb_workfn->wb_do_writeback->wb_check_old_data_flush
static long wb_check_old_data_flush(struct bdi_writeback *wb)
{
unsigned long expired;
long nr_pages;
/*
* When set to zero, disable periodic writeback
*/
if (!dirty_writeback_interval)
return 0;
expired = wb->last_old_flush +
msecs_to_jiffies(dirty_writeback_interval * 10);
if (time_before(jiffies, expired)) /* 1 */
return 0;
wb->last_old_flush = jiffies;
nr_pages = get_nr_dirty_pages();
if (nr_pages) {
struct wb_writeback_work work = {
.nr_pages = nr_pages,
.sync_mode = WB_SYNC_NONE,
.for_kupdate = 1,
.range_cyclic = 1,
.reason = WB_REASON_PERIODIC,
};
return wb_writeback(wb, &work); /* 2 */
}
return 0;
}
(1) Check whether the writeback period has elapsed. The period is tunable: dirty_writeback_interval is configured via /proc/sys/vm/dirty_writeback_centisecs, in units of 10 ms (centiseconds); the default is 500, i.e. a 5 s period.
(2) Build a wb_writeback_work and call wb_writeback to do the rest.
wb_writeback
Whether it is periodic writeback, background writeback, or a sync call, each essentially builds a wb_writeback_work and then calls wb_writeback to execute it.
First, the definition of wb_writeback_work:
fs/fs-writeback.c
struct wb_writeback_work {
long nr_pages;
struct super_block *sb;
enum writeback_sync_modes sync_mode;
unsigned int tagged_writepages:1;
unsigned int for_kupdate:1;
unsigned int range_cyclic:1;
unsigned int for_background:1;
unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned int auto_free:1; /* free on completion */
enum wb_reason reason; /* why was writeback initiated? */
struct list_head list; /* pending work list */
...
};
nr_pages: the total number of pages this work item should write back
sync_mode: whether this is synchronous (WB_SYNC_ALL) or best-effort (WB_SYNC_NONE) writeback
for_kupdate: set to 1 for periodic writeback
for_background: set to 1 for background writeback
for_sync: set to 1 when the work comes from a sync call
The flow of wb_writeback is as follows:
fs/fs-writeback.c
wb_workfn->wb_do_writeback->wb_check_old_data_flush->wb_writeback
static long wb_writeback(struct bdi_writeback *wb,
struct wb_writeback_work *work)
{
unsigned long wb_start = jiffies;
long nr_pages = work->nr_pages;
unsigned long dirtied_before = jiffies;
struct inode *inode;
long progress;
struct blk_plug plug;
blk_start_plug(&plug);
spin_lock(&wb->list_lock);
for (;;) {
/*
* Stop writeback when nr_pages has been consumed
*/
if (work->nr_pages <= 0) /* 1 */
break;
/*
* Background writeout and kupdate-style writeback may
* run forever. Stop them if there is other work to do
* so that e.g. sync can proceed. They'll be restarted
* after the other works are all done.
*/
if ((work->for_background || work->for_kupdate) &&
!list_empty(&wb->work_list)) /* 2 */
break;
/*
* For background writeout, stop when we are below the
* background dirty threshold
*/
if (work->for_background && !wb_over_bg_thresh(wb)) /* 3 */
break;
/*
* Kupdate and background works are special and we want to
* include all inodes that need writing. Livelock avoidance is
* handled by these works yielding to any other work so we are
* safe.
*/
if (work->for_kupdate) { /* 4 */
dirtied_before = jiffies -
msecs_to_jiffies(dirty_expire_interval * 10);
} else if (work->for_background)
dirtied_before = jiffies;
trace_writeback_start(wb, work);
if (list_empty(&wb->b_io))
queue_io(wb, work, dirtied_before); /* 5 */
if (work->sb)
progress = writeback_sb_inodes(work->sb, wb, work); /* 6 */
else
progress = __writeback_inodes_wb(wb, work);
trace_writeback_written(wb, work);
}
spin_unlock(&wb->list_lock);
blk_finish_plug(&plug);
return nr_pages - work->nr_pages;
}
(1) If the writeback quota for this work has been consumed, exit the loop.
(2) Periodic and background writeback could run forever; if other work has been queued meanwhile (e.g. a sync), stop so that it is not blocked. They will be restarted once the other work is done.
(3) For background writeback, stop once we drop below the background dirty threshold.
(4) For periodic writeback, compute the expiration cutoff; in all other cases dirtied_before is set to the current time, so every inode counts as expired.
(5) Move the expired inodes onto the b_io list (a simplified sketch of the expiration test follows this list).
(6) Call writeback_sb_inodes or __writeback_inodes_wb to do the actual writing.
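queue_io() in step (5) ultimately calls move_expired_inodes(), which walks b_dirty from its oldest entry and moves expired inodes to b_io. The test it applies is roughly the following (a simplified sketch; the real code also handles the b_dirty_time list and preserves dirtied-time ordering):

/* Simplified sketch: an inode migrates from b_dirty to b_io only if
 * it was dirtied no later than @dirtied_before. time_before_eq()
 * copes with jiffies wrap-around. */
static bool inode_expired(struct inode *inode, unsigned long dirtied_before)
{
        return time_before_eq(inode->dirtied_when, dirtied_before);
}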
writeback_sb_inodes and __writeback_inodes_wb follow similar logic: both walk wb->b_io and call __writeback_single_inode on each inode.
fs/fs-writeback.c
static int
__writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
struct address_space *mapping = inode->i_mapping;
long nr_to_write = wbc->nr_to_write;
unsigned dirty;
int ret;
WARN_ON(!(inode->i_state & I_SYNC));
trace_writeback_single_inode_start(inode, wbc, nr_to_write);
ret = do_writepages(mapping, wbc); /* 1 */
/*
* Make sure to wait on the data before writing out the metadata.
* This is important for filesystems that modify metadata on data
* I/O completion. We don't do it for sync(2) writeback because it has a
* separate, external IO completion path and ->sync_fs for guaranteeing
* inode metadata is written back correctly.
*/
if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) { /* 2 */
int err = filemap_fdatawait(mapping);
if (ret == 0)
ret = err;
}
.....
int err = write_inode(inode, wbc); /* 3 */
if (ret == 0)
ret = err;
trace_writeback_single_inode(inode, wbc, nr_to_write);
return ret;
}
(1) Call do_writepages to write back the file's data pages.
(2) Wait for the data writes to complete.
(3) Call write_inode to write out the file's metadata; note how this guarantees that data is written before metadata. This ends up calling s_op->write_inode.
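To make step (3) concrete, here is a hypothetical s_op->write_inode implementation for an imaginary filesystem; examplefs_read_inode_block and examplefs_fill_raw_inode are made-up helpers, but the buffer-head calls are the real API (compare e.g. ext2_write_inode):

/* Hypothetical example: copy the in-core inode into its on-disk
 * block and submit it; wait only for WB_SYNC_ALL writeback. */
static int examplefs_write_inode(struct inode *inode,
                                 struct writeback_control *wbc)
{
        struct buffer_head *bh;
        int err = 0;

        bh = examplefs_read_inode_block(inode);     /* hypothetical helper */
        if (!bh)
                return -EIO;
        examplefs_fill_raw_inode(inode, bh);        /* hypothetical helper */
        mark_buffer_dirty(bh);
        if (wbc->sync_mode == WB_SYNC_ALL) {
                sync_dirty_buffer(bh);              /* wait for the write */
                if (buffer_req(bh) && !buffer_uptodate(bh))
                        err = -EIO;
        }
        brelse(bh);
        return err;
}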
Background writeback
When the number of dirty pages exceeds the background writeback threshold, the system starts background writeback.
The threshold is set either by /proc/sys/vm/dirty_background_ratio, the share of dirty pages relative to available memory, or by /proc/sys/vm/dirty_background_bytes, an absolute byte limit; the two settings are mutually exclusive.
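As a rough sketch (not the kernel's exact code, which lives in domain_dirty_limits() in mm/page-writeback.c and accounts for more factors), the background threshold in pages is derived like this:

/* Simplified sketch: dirty_background_bytes, when non-zero, takes
 * precedence over dirty_background_ratio. */
static unsigned long bg_thresh_pages(unsigned long available_pages)
{
        if (dirty_background_bytes)
                return DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
        return available_pages * dirty_background_ratio / 100;
}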
Inside wb_do_writeback, background writeback is performed by wb_check_background_flush:
static long wb_check_background_flush(struct bdi_writeback *wb)
{
if (wb_over_bg_thresh(wb)) {
struct wb_writeback_work work = {
.nr_pages = LONG_MAX,
.sync_mode = WB_SYNC_NONE,
.for_background = 1,
.range_cyclic = 1,
.reason = WB_REASON_BACKGROUND,
};
return wb_writeback(wb, &work);
}
return 0;
}
The logic mirrors periodic writeback: build a wb_writeback_work and let wb_writeback execute it. wb_over_bg_thresh decides whether the background threshold is currently exceeded.
The sync system call
Userspace can explicitly call sync to write back all the dirty data in the system. Its entry point is:
fs/sync.c
SYSCALL_DEFINE0(sync)
{
ksys_sync();
return 0;
}
void ksys_sync(void)
{
int nowait = 0, wait = 1;
wakeup_flusher_threads(WB_REASON_SYNC); /* 1 */
iterate_supers(sync_inodes_one_sb, NULL); /* 2 */
iterate_supers(sync_fs_one_sb, &nowait); /* 3 */
iterate_supers(sync_fs_one_sb, &wait); /* 4 */
iterate_bdevs(fdatawrite_one_bdev, NULL); /* 5 */
iterate_bdevs(fdatawait_one_bdev, NULL); /* 6 */
if (unlikely(laptop_mode))
laptop_sync_completion();
}
(1) Wake up all bdis: any bdi that has dirty pages is woken to start writing back.
(2) Iterate over all superblocks and run sync_inodes_one_sb; this builds a wb_writeback_work, queues it on the corresponding wb's work_list, and waits for it to finish.
(3) Iterate over all superblocks and call sync_fs without waiting for completion.
(4) Iterate over all superblocks and call sync_fs again, this time waiting for completion.
(5)(6) Iterate over all block devices, writing out their caches and then waiting for those writes.
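For reference, the per-superblock callbacks used in steps (2)-(4) are thin wrappers (fs/sync.c, abridged; newer kernels add extra skip checks):

static void sync_inodes_one_sb(struct super_block *sb, void *arg)
{
        if (!sb_rdonly(sb))
                sync_inodes_sb(sb);     /* queue a WB_SYNC_ALL work and wait */
}

static void sync_fs_one_sb(struct super_block *sb, void *arg)
{
        if (!sb_rdonly(sb) && sb->s_op->sync_fs)
                sb->s_op->sync_fs(sb, *(int *)arg);     /* arg selects wait/nowait */
}

Note the two-pass pattern: each stage is first started for all filesystems without waiting, and only a second pass waits, so writeback to different devices can proceed in parallel.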