Ext4 Extent

Intro

文件系统中的file index将文件偏移量映射到存储介质中的具体位置。对于块设备文件系统而言,file index以block为最小单位,即负责维护文件逻辑块号LBN到设备物理块号PBN的映射。如下的文件不超过一个块大小,因此只包含一个逻辑块0,它被映射到物理块11645878。

1
2
3
4
5
6
> filefrag -v ~/.vimrc
Filesystem type is: ef53
File size of /home/stopire/.vimrc is 3317 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 11645878.. 11645878: 1: last,eof
/home/stopire/.vimrc: 1 extent found

在ext{2,3}文件系统中,这个映射是通过inode中的多级数据块指针来实现的。这个经典实现对每个块都需要维护一条entry,在文件较大时会产生可观的元数据空间开销。

于是,ext4中引入了extent数据结构,每个extent可以映射一段连续的(最多\(2^{15}\)个)逻辑-物理块。文件拥有的所有物理块可以用若干个extent进行映射,并组织成树的形式,inode中只需要存储extent树的根即可。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
> filefrag -v ~/tmp.mp4
Filesystem type is: ef53
File size of tmp.mp4 is 1303990382 (318358 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 22527: 9213952.. 9236479: 22528:
1: 22528.. 24575: 9238528.. 9240575: 2048: 9236480:
2: 24576.. 26623: 7991296.. 7993343: 2048: 9240576:
3: 26624.. 34815: 22157312.. 22165503: 8192: 7993344:
4: 34816.. 36863: 22179840.. 22181887: 2048: 22165504:
5: 36864.. 40959: 22171648.. 22175743: 4096: 22181888:
6: 40960.. 47103: 14542848.. 14548991: 6144: 22175744:
7: 47104.. 61439: 6703104.. 6717439: 14336: 14548992:
8: 61440.. 86015: 6660096.. 6684671: 24576: 6717440:
9: 86016.. 114687: 6623232.. 6651903: 28672: 6684672:
10: 114688.. 116735: 6590464.. 6592511: 2048: 6651904:
11: 116736.. 120831: 6586368.. 6590463: 4096: 6592512:
12: 120832.. 145407: 6594560.. 6619135: 24576: 6590464:
13: 145408.. 178175: 6553600.. 6586367: 32768: 6619136:
14: 178176.. 180223: 6520832.. 6522879: 2048: 6586368:
15: 180224.. 208895: 6524928.. 6553599: 28672: 6522880:
16: 208896.. 237567: 6492160.. 6520831: 28672: 6553600:
17: 237568.. 270335: 6455296.. 6488063: 32768: 6520832:
18: 270336.. 280575: 6281216.. 6291455: 10240: 6488064:
19: 280576.. 311295: 6227968.. 6258687: 30720: 6291456:
20: 311296.. 317439: 6133760.. 6139903: 6144: 6258688:
21: 317440.. 318357: 6131743.. 6132660: 918: 6139904: last,eof
tmp.mp4: 22 extents found

Data Structure

描述extent的数据结构包括:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
/*
* This is the extent on-disk structure.
* It's used at the bottom of the tree.
*/
struct ext4_extent {
__le32 ee_block; /* first logical block extent covers */
__le16 ee_len; /* number of blocks covered by extent */
__le16 ee_start_hi; /* high 16 bits of physical block */
__le32 ee_start_lo; /* low 32 bits of physical block */
};

/*
* This is index on-disk structure.
* It's used at all the levels except the bottom.
*/
struct ext4_extent_idx {
__le32 ei_block; /* index covers logical blocks from 'block' */
__le32 ei_leaf_lo; /* pointer to the physical block of the next *
* level. leaf or next index could be there */
__le16 ei_leaf_hi; /* high 16 bits of physical block */
__u16 ei_unused;
};

/*
* Each block (leaves and indexes), even inode-stored has header.
*/
struct ext4_extent_header {
__le16 eh_magic; /* probably will support different formats */
__le16 eh_entries; /* number of valid entries */
__le16 eh_max; /* capacity of store in entries */
__le16 eh_depth; /* has tree real underlying blocks? */
__le32 eh_generation; /* generation of the tree */
};

这三个结构体均为12字节,[1]中描述了每个字段的作用。

  • Ext4支持最多32位的逻辑块号和48位的物理块号。
  • Extent被组织成树的形式,树中每个结点(除根结点)占据一个物理块,并以struct ext4_extent_header开始。
  • 中间结点存储若干个struct ext4_extent_idx结构,该结构存储起始LBN供二分查找,同时指向子结点所在的物理块。
  • 叶结点存储struct ext4_extent结构,该结构通过(LBN, PBN, length)描述一个extent。
  • 对于使用extent的文件,inode中的i_data(memory)/i_block(disk)字段(60字节)可以内联存放ext4_extent_header和最多4个ext4_extent,此时不需要使用额外的物理块来存储extent tree。
  • ext4_extent_header中的eh_depth其实类似height,即叶子结点的eh_depth是0,根结点的eh_depth最大。eh_depth的最大值是5,因为\(4*(\frac{1024}{12} - 1)^5 > 2^{32}\),其中1024是block size的最小值,\(2^{32}\)是LBN的最大值。

Code

ext4_ext_map_blocks()是建立extent映射的入口。该函数的语义是查找struct ext4_map_blocks指定的extent映射关系,如果指定了create flag,还需要创建这段extent映射。

1
2
int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags);

struct ext4_map_blocks定义如下。调用者指定需要查询的m_lblkm_lenext4_ext_map_blocks()需要找到对应的m_pblk,并重新设置m_len为该extent的真实长度(可能小于调用者要求的长度),m_flags用于返回extent的类型。

1
2
3
4
5
6
struct ext4_map_blocks {
ext4_fsblk_t m_pblk;
ext4_lblk_t m_lblk;
unsigned int m_len;
unsigned int m_flags;
};

首先,该函数使用ext4_find_extent()搜索当前文件的extent tree,试图寻找一个包含LBN的extent。ext4_find_extent()返回的extent与LBN存在多种可能的位置关系,会在下文说明。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags)
{
struct ext4_ext_path *path = NULL;
struct ext4_extent newex, *ex, ex2;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
ext4_fsblk_t newblock = 0, pblk;
int err = 0, depth, ret;
unsigned int allocated = 0, offset = 0;
unsigned int allocated_clusters = 0;
struct ext4_allocation_request ar;
ext4_lblk_t cluster_offset;

ext_debug(inode, "blocks %u/%u requested\n", map->m_lblk, map->m_len);
trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);

/* find extent for this block */
path = ext4_find_extent(inode, map->m_lblk, NULL, 0);
if (IS_ERR(path)) {
err = PTR_ERR(path);
path = NULL;
goto out;
}

depth = ext_depth(inode);

/*
* consistent leaf must not be empty;
* this situation is possible, though, _during_ tree modification;
* this is why assert can't be put in ext4_find_extent()
*/
if (unlikely(path[depth].p_ext == NULL && depth != 0)) {
EXT4_ERROR_INODE(inode, "bad extent address "
"lblock: %lu, depth: %d pblock %lld",
(unsigned long) map->m_lblk, depth,
path[depth].p_block);
err = -EFSCORRUPTED;
goto out;
}

...
}

如果该文件拥有extent(4行),且lblk属于该extent(19行),说明待查找的LBN确实属于某个extent。此时需要根据extent中保存的PBN和长度信息重新设置map->m_pblkmap->m_len字段。

Unwritten extent是一种特殊类型的extent,是在fallocate()预分配空间时创建的,其目的是在应用读取这些位置时返回0,同时避免在存储介质上真正写入0。这里不关注有关逻辑。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
...

ex = path[depth].p_ext;
if (ex) {
ext4_lblk_t ee_block = le32_to_cpu(ex->ee_block);
ext4_fsblk_t ee_start = ext4_ext_pblock(ex);
unsigned short ee_len;


/*
* unwritten extents are treated as holes, except that
* we split out initialized portions during a write.
*/
ee_len = ext4_ext_get_actual_len(ex);

trace_ext4_ext_show_extent(inode, ee_block, ee_start, ee_len);

/* if found extent covers block, simply return it */
if (in_range(map->m_lblk, ee_block, ee_len)) {
newblock = map->m_lblk - ee_block + ee_start;
/* number of remaining blocks in the extent */
allocated = ee_len - (map->m_lblk - ee_block);
ext_debug(inode, "%u fit into %u:%d -> %llu\n",
map->m_lblk, ee_block, ee_len, newblock);

/*
* If the extent is initialized check whether the
* caller wants to convert it to unwritten.
*/
if ((!ext4_ext_is_unwritten(ex)) &&
(flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) {
err = convert_initialized_extent(handle,
inode, map, &path, &allocated);
goto out;
} else if (!ext4_ext_is_unwritten(ex)) {
map->m_flags |= EXT4_MAP_MAPPED;
map->m_pblk = newblock;
if (allocated > map->m_len)
allocated = map->m_len;
map->m_len = allocated;
ext4_ext_show_leaf(inode, path);
goto out;
}

ret = ext4_ext_handle_unwritten_extents(
handle, inode, map, &path, flags,
allocated, newblock);
if (ret < 0)
err = ret;
else
allocated = ret;
goto out;
}
}

...

如果该文件没有extent,同时flags没有指定要创建新extent,说明文件中此处会存在一个hole。该部分逻辑通过ext4_ext_put_gap_in_cache()将hole信息写入extent status tree,并将map->m_pblk置0返回,表示该LBN没有对应的PBN。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
...

/*
* requested block isn't allocated yet;
* we couldn't try to create block if create flag is zero
*/
if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
ext4_lblk_t hole_start, hole_len;

hole_start = map->m_lblk;
hole_len = ext4_ext_determine_hole(inode, path, &hole_start);
/*
* put just found gap into cache to speed up
* subsequent requests
*/
ext4_ext_put_gap_in_cache(inode, hole_start, hole_len);

/* Update hole_len to reflect hole size after map->m_lblk */
if (hole_start != map->m_lblk)
hole_len -= map->m_lblk - hole_start;
map->m_pblk = 0;
map->m_len = min_t(unsigned int, map->m_len, hole_len);

goto out;
}

...

如果flags中指定了EXT4_GET_BLOCKS_CREATE,就进入创建新extent的流程。

如果使用了bigalloc,通过get_implied_cluster_alloc()传入一个LBN邻近的extent,如果LBA恰好属于extent的头部/尾部从属的cluster,则可以直接在这个cluster中塞进一个新的extent以完成创建任务。

get_implied_cluster_alloc()在ex和ex2上跑了两次,因为这两个可能是不同的extent,分别在LBN的左右两侧。如果LBN在文件第一个extent之前,ex和ex2就是同一个extent了。

此外,我们还需要计算LBN左右两侧最近的已分配数据块的位置(22行,26行),这个信息会在分配器中用到。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
...

/*
* Okay, we need to do block allocation.
*/
newex.ee_block = cpu_to_le32(map->m_lblk);
cluster_offset = EXT4_LBLK_COFF(sbi, map->m_lblk);

/*
* If we are doing bigalloc, check to see if the extent returned
* by ext4_find_extent() implies a cluster we can use.
*/
if (cluster_offset && ex &&
get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
goto got_allocated_blocks;
}

/* find neighbour allocated blocks */
ar.lleft = map->m_lblk;
err = ext4_ext_search_left(inode, path, &ar.lleft, &ar.pleft);
if (err)
goto out;
ar.lright = map->m_lblk;
err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2);
if (err < 0)
goto out;

/* Check if the extent after searching to the right implies a
* cluster we can use. */
if ((sbi->s_cluster_ratio > 1) && err &&
get_implied_cluster_alloc(inode->i_sb, map, &ex2, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
goto got_allocated_blocks;
}

...

调整分配参数,准备struct ext4_allocation_request结构体,调用ext4_mb_new_blocks()执行分配。

  • ext4_ext_find_goal()计算了一个PBN的hint,分配器尽量尝试使用这个hint作为分配的PBN,以增加文件的空间局部性。
  • 35行开始把分配请求对齐到cluster,这也是和bigalloc特性相关的逻辑。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
...
/*
* See if request is beyond maximum number of blocks we can have in
* a single extent. For an initialized extent this limit is
* EXT_INIT_MAX_LEN and for an unwritten extent this limit is
* EXT_UNWRITTEN_MAX_LEN.
*/
if (map->m_len > EXT_INIT_MAX_LEN &&
!(flags & EXT4_GET_BLOCKS_UNWRIT_EXT))
map->m_len = EXT_INIT_MAX_LEN;
else if (map->m_len > EXT_UNWRITTEN_MAX_LEN &&
(flags & EXT4_GET_BLOCKS_UNWRIT_EXT))
map->m_len = EXT_UNWRITTEN_MAX_LEN;

/* Check if we can really insert (m_lblk)::(m_lblk + m_len) extent */
newex.ee_len = cpu_to_le16(map->m_len);
err = ext4_ext_check_overlap(sbi, inode, &newex, path);
if (err)
allocated = ext4_ext_get_actual_len(&newex);
else
allocated = map->m_len;

/* allocate new block */
ar.inode = inode;
ar.goal = ext4_ext_find_goal(inode, path, map->m_lblk);
ar.logical = map->m_lblk;
/*
* We calculate the offset from the beginning of the cluster
* for the logical block number, since when we allocate a
* physical cluster, the physical block should start at the
* same offset from the beginning of the cluster. This is
* needed so that future calls to get_implied_cluster_alloc()
* work correctly.
*/
offset = EXT4_LBLK_COFF(sbi, map->m_lblk);
ar.len = EXT4_NUM_B2C(sbi, offset+allocated);
ar.goal -= offset;
ar.logical -= offset;
if (S_ISREG(inode->i_mode))
ar.flags = EXT4_MB_HINT_DATA;
else
/* disable in-core preallocation for non-regular files */
ar.flags = 0;
if (flags & EXT4_GET_BLOCKS_NO_NORMALIZE)
ar.flags |= EXT4_MB_HINT_NOPREALLOC;
if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
ar.flags |= EXT4_MB_DELALLOC_RESERVED;
if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
ar.flags |= EXT4_MB_USE_RESERVED;
newblock = ext4_mb_new_blocks(handle, &ar, &err);
if (!newblock)
goto out;
allocated_clusters = ar.len;
ar.len = EXT4_C2B(sbi, ar.len) - offset;
ext_debug(inode, "allocate new block: goal %llu, found %llu/%u, requested %u\n",
ar.goal, newblock, ar.len, allocated);
if (ar.len > allocated)
ar.len = allocated;

...

完成分配后,将新创建的extent插入extent tree并返回。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
...

got_allocated_blocks:
/* try to insert new extent into found leaf and return */
pblk = newblock + offset;
ext4_ext_store_pblock(&newex, pblk);
newex.ee_len = cpu_to_le16(ar.len);
/* Mark unwritten */
if (flags & EXT4_GET_BLOCKS_UNWRIT_EXT) {
ext4_ext_mark_unwritten(&newex);
map->m_flags |= EXT4_MAP_UNWRITTEN;
}

err = ext4_ext_insert_extent(handle, inode, &path, &newex, flags);
if (err) {
/* free data we just allocated */
...
}

/*
* Reduce the reserved cluster count to reflect successful deferred
* allocation of delayed allocated clusters or direct allocation of
* clusters discovered to be delayed allocated. Once allocated, a
* cluster is not included in the reserved count.
*/
if (test_opt(inode->i_sb, DELALLOC) && allocated_clusters) {
/* dealloc related logic */
...
}

/*
* Cache the extent and update transaction to commit on fdatasync only
* when it is _not_ an unwritten extent.
*/
if ((flags & EXT4_GET_BLOCKS_UNWRIT_EXT) == 0)
ext4_update_inode_fsync_trans(handle, inode, 1);
else
ext4_update_inode_fsync_trans(handle, inode, 0);

map->m_flags |= (EXT4_MAP_NEW | EXT4_MAP_MAPPED);
map->m_pblk = pblk;
map->m_len = ar.len;
allocated = map->m_len;
ext4_ext_show_leaf(inode, path);
out:
ext4_ext_drop_refs(path);
kfree(path);

trace_ext4_ext_map_blocks_exit(inode, flags, map,
err ? err : allocated);
return err ? err : allocated;
}

More

ext4_find_extent

该函数通过二分查找得到一个邻近LBN的extent,[2]中指出了几种可能的相互位置关系。

bigalloc

该特性允许分配器将多个block组成的cluster当作最小分配单元。


  1. Kernel Doc
  2. Ext4 - extent
  3. ext4块分配机制及源码分析