eulerfs: mounting, initialization, and the allocator

Mounting and initialization

eufs_fill_super()

eufs_fill_super is passed to mount_bdev as its fill_super callback and is invoked when an eulerfs file system is mounted.

Call graph (approximate):

eufs_fill_super()
  ->eufs_get_block_info()
    ->dax_direct_access()
  ->eufs_parse_options()
  ->eufs_get_super()
  ->eufs_init()
  ->eufs_recognize_fs()
    ->eufs_check_super()
  ->eufs_iget()
    ->eufs_read_pinode()

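static int eufs_fill_super(struct super_block *sb, void *data, int silent);

eufs_fill_super involves three superblock structures:

struct super_block: the superblock structure defined by the VFS for all file systems. Its s_fs_info field points to the file-system-specific superblock.
struct eufs_sb_info: the in-memory superblock data. fill_super allocates and initializes this structure.
struct eufs_super_block: the persistent superblock data, located at the very start of the pmem virtual address range (sbi->virt_addr).

The walkthrough below skips some eulerfs-wide initialization functions, such as nv_init, dep_init, and wear_init.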

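eufs_get_block_info()

static int eufs_get_block_info(struct super_block *sb, struct eufs_sb_info *sbi);

eufs_get_block_info mainly uses dax_direct_access to map the pmem block device into the kernel address space, so that it can be accessed with byte addressing. dax_direct_access translates nr_pages pages of the DAX-capable device dax_dev, starting at page pgoff, into directly accessible memory. The virtual address is stored in kaddr, the physical address (page frame number) in pfn, and the return value is the number of pages translated.

Here i_size_read(sb->s_bdev->bd_inode) is the total size of the pmem device in bytes; shifting it right by PAGE_SHIFT converts it to a page count.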

long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn);

// file: super.c
size = dax_direct_access(
		dax_dev, 0, i_size_read(sb->s_bdev->bd_inode) >> PAGE_SHIFT,
		&virt_addr, &pfn);

virt_addr, pfn, size, and related information are stored in sbi as part of the in-memory superblock.

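eufs_parse_options()

static int eufs_parse_options(char *options, struct eufs_sb_info *sbi, bool remount);

This function handles the options passed with -o at mount time. The accepted options are defined in the tokens variable; in practice there is only one, init. The options that were given are recorded in the s_mount_opt field of sbi.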

eufs_get_super()

static __always_inline struct eufs_super_block *eufs_get_super(struct super_block *sb);

This function obtains struct eufs_sb_info *sbi from sb->s_fs_info, then uses sbi->virt_addr to reach the superblock at the start of pmem.

eufs_init()

static struct eufs_inode *eufs_init(struct super_block *sb, unsigned long size);

If the init option was given, this function initializes the file system on pmem, which includes initializing the superblock and the root directory.

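eufs_recognize_fs()

static __always_inline int eufs_recognize_fs(struct super_block *sb);

This function uses eufs_check_super to verify the integrity of the superblock by computing a CRC checksum. The superblock has a primary copy and a backup (secondary) copy. If the primary copy is corrupt but the secondary copy is valid, the primary copy is also restored from the secondary one.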
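The check-and-repair logic described above can be sketched in user space as follows. This is only an illustration: struct demo_super, its fields, and every demo_* function are hypothetical, and the bitwise CRC-32 stands in for whatever checksum helper eulerfs actually uses. The case where only the secondary copy is corrupt is not covered by the text above and is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-media layout; the real struct eufs_super_block differs. */
struct demo_super {
	uint32_t crc;   /* checksum over all fields that follow */
	uint32_t magic;
	uint64_t size;
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), a stand-in checksum. */
static uint32_t demo_crc32(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= p[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

static uint32_t demo_super_crc(const struct demo_super *s)
{
	size_t skip = offsetof(struct demo_super, magic);

	return demo_crc32(&s->magic, sizeof(*s) - skip);
}

/* Store a fresh checksum after the fields have been written. */
static void demo_super_seal(struct demo_super *s)
{
	s->crc = demo_super_crc(s);
}

static int demo_super_ok(const struct demo_super *s)
{
	return s->crc == demo_super_crc(s);
}

/* Returns 0 if *primary is usable afterwards, -1 if both copies are bad. */
static int demo_recognize(struct demo_super *primary,
			  const struct demo_super *secondary)
{
	assert(primary && secondary);
	if (demo_super_ok(primary))
		return 0;
	if (!demo_super_ok(secondary))
		return -1;
	/* Primary is corrupt, secondary is valid: repair the primary copy. */
	memcpy(primary, secondary, sizeof(*primary));
	return 0;
}
```

On real pmem the repaired primary copy would also have to be flushed; the sketch ignores persistence.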

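eufs_iget()

struct inode *eufs_iget(struct super_block *sb, struct eufs_inode *pi);

As with the superblock, there are three kinds of inode structure:

struct inode: the inode structure common to the VFS. It is embedded in struct eufs_inode_info.
struct eufs_inode_info: the in-memory inode data; struct inode is one of its members.
struct eufs_inode: the persistent inode data.

The main purpose of eufs_iget is to obtain a struct inode from a struct eufs_inode. The pi argument is read directly from pmem through super->s_root_pi. The function first calls the VFS helper iget_locked with the inode number derived from pi to get a struct inode. If the result is not a new inode, it has already been initialized and can be returned directly. Otherwise eufs_read_pinode fills in the inode from the information in pi before it is returned.

The struct inode * returned by eufs_iget() is used to fill the s_root field of struct super_block, so this function exists to fit the uniform VFS interface.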
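The "look up, and fill in only if new" pattern can be illustrated with a small user-space mock. Everything here is hypothetical: demo_iget_locked imitates the role of iget_locked with a fixed-size table, the is_new flag plays the role of the kernel's I_NEW state, and the ino field in the persistent inode is an assumption made so that the example has an inode number to use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_SLOTS 8

/* Hypothetical persistent inode; the real struct eufs_inode differs. */
struct demo_pinode {
	uint64_t ino;
	uint64_t size;
};

/* Stand-in for the in-memory inode. */
struct demo_inode {
	uint64_t ino;
	uint64_t size;
	bool in_use;
	bool is_new;
};

static struct demo_inode demo_cache[DEMO_CACHE_SLOTS];

/* Return the cached inode for ino, or a fresh slot marked as new. */
static struct demo_inode *demo_iget_locked(uint64_t ino)
{
	struct demo_inode *free_slot = NULL;

	for (size_t i = 0; i < DEMO_CACHE_SLOTS; i++) {
		if (demo_cache[i].in_use && demo_cache[i].ino == ino)
			return &demo_cache[i];
		if (!demo_cache[i].in_use && !free_slot)
			free_slot = &demo_cache[i];
	}
	if (!free_slot)
		return NULL; /* table full */
	free_slot->in_use = true;
	free_slot->is_new = true;
	free_slot->ino = ino;
	return free_slot;
}

static struct demo_inode *demo_iget(const struct demo_pinode *pi)
{
	struct demo_inode *inode;

	assert(pi != NULL);
	inode = demo_iget_locked(pi->ino);
	if (!inode || !inode->is_new)
		return inode;      /* lookup failed, or already initialized */
	inode->size = pi->size;    /* the step eufs_read_pinode performs */
	inode->is_new = false;     /* initialization finished */
	return inode;
}
```

A second demo_iget for the same inode number returns the cached object without reading the persistent inode again, which is the point of checking the "new" state first.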

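Allocator

Data structures

The eulerfs allocator keeps the following data structures in DRAM:

struct ptr_list_node: the basic unit of allocation bookkeeping. Each node corresponds to one page (4 KB) or one cache line (64 B) on pmem.
sbi->cached_nodes: an array of npages ptr_list_node entries, indexed by page number, one per pmem page. If a page is free, its cached_nodes entry should be on the list of some memory pool.
sbi->line_node_ptrs: an array of npages pointers to ptr_list_node. If a pmem page is allocated at cache-line granularity, its line_node_ptrs entry points to an array of (PAGE_SIZE/CACHELINE_SIZE) ptr_list_node entries, one per cache line in the page.
sbi->line_indicators: a uint8 array of length npages. If a pmem page is allocated at cache-line granularity, its line_indicators entry records how many cache lines in that page are allocated. It exists mainly so that, once all 63 allocatable cache lines of a page have been freed (the first line of the page holds the line_map), they can be merged back into a free page.
page_info_t: an enumeration, the element type of page_map, which records the kind of each pmem page.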
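Since cached_nodes is indexed by page number and each per-page node array is indexed by cache line, locating the bookkeeping entry for an address is shift-and-mask arithmetic. A minimal sketch, assuming 4 KB pages, 64 B cache lines, and an offset measured from the start of pmem (the demo_* names are made up for this example):

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* 4 KB pages */
#define DEMO_LINE_SHIFT 6  /* 64 B cache lines */
#define DEMO_LINES_PER_PAGE (1u << (DEMO_PAGE_SHIFT - DEMO_LINE_SHIFT))

/* Page number of a byte offset: the index into cached_nodes and friends. */
static uint64_t demo_page_no(uint64_t off)
{
	return off >> DEMO_PAGE_SHIFT;
}

/* Index of a cache line within its page; off must be line-aligned. */
static unsigned demo_line_no(uint64_t off)
{
	assert((off & ((1u << DEMO_LINE_SHIFT) - 1)) == 0);
	return (unsigned)((off >> DEMO_LINE_SHIFT) & (DEMO_LINES_PER_PAGE - 1));
}
```

For example, offset 0x3040 lies in page 3, cache line 1.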

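Allocation units

eulerfs has four sizes of pmem allocation unit; each call to nvmalloc returns one of them:

Name	Description	Size
large page	512 pages combined into one large page	2MB
page	a page	4KB
cache line 4	four cache lines	256B
cache line	a cache line	64B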
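The table can be read as a mapping from a request size to an allocation unit. The sketch below assumes that a request must match one of the four sizes exactly; the enum and the function name are invented for this illustration and are not taken from the eulerfs source.

```c
#include <assert.h>
#include <stddef.h>

enum demo_unit {
	DEMO_LINE,    /* 64 B */
	DEMO_LINE4,   /* 256 B */
	DEMO_PAGE,    /* 4 KB */
	DEMO_LARGE,   /* 2 MB = 512 * 4 KB */
	DEMO_INVALID  /* not one of the four supported sizes */
};

/* Map a request size in bytes to its allocation unit. */
static enum demo_unit demo_unit_for(size_t size)
{
	switch (size) {
	case 64:
		return DEMO_LINE;
	case 4 * 64:
		return DEMO_LINE4;
	case 4096:
		return DEMO_PAGE;
	case 512 * 4096:
		return DEMO_LARGE;
	default:
		return DEMO_INVALID;
	}
}
```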

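Memory pools

A memory pool, struct mem_pool, is the DRAM data structure that manages free allocation units. Each pool holds one list of struct ptr_list_node per allocation size, together with an element count for each list. There are three kinds of pool:

Name	Scope	Description
global pool	global	the global memory pool, protected by a lock
local pool	per-CPU	a local memory pool for each CPU; no lock needed
rest pool	global	pages that have been used frequently are placed in the rest pool

fetch_count is only meaningful for a local pool: it is the number of elements the local pool takes from the global pool each time it is refilled.

The global pool has to keep NR_RESERVED_PAGES pages and large pages in reserve.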

struct mem_pool {
	struct list_head page_list; /* points to ptr_lists_node */
	struct list_head line_list; /* points to ptr_lists_node */
	struct list_head line4_list;
	struct list_head large_list;
	u64 npages;
	u64 nlines;
	u64 nline4s;
	u64 nlarges;
	int fetch_count;
};
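To illustrate how fetch_count and the reserve interact, here is a simplified sketch that models each pool as a bare page counter instead of a list. The demo_* names and the exact refill policy (take at most fetch_count pages, never dipping into the reserve) are assumptions made for the example; the real code moves ptr_list_node entries between lists and takes the global pool lock.

```c
#include <assert.h>

/* A pool reduced to a free-page counter. */
struct demo_pool {
	unsigned long npages;
	int fetch_count; /* only meaningful for a local pool */
};

/*
 * Move up to local->fetch_count pages from the global pool to the local
 * pool while leaving `reserved` pages behind. Returns the number moved.
 * The caller is assumed to hold the global pool lock.
 */
static unsigned long demo_reload_from_gpool(struct demo_pool *local,
					    struct demo_pool *global,
					    unsigned long reserved)
{
	unsigned long avail = 0;
	unsigned long n;

	assert(local->fetch_count > 0);
	if (global->npages > reserved)
		avail = global->npages - reserved;
	n = (unsigned long)local->fetch_count;
	if (n > avail)
		n = avail;
	global->npages -= n;
	local->npages += n;
	return n;
}

/* Take one page from the local pool, refilling it first if it is empty. */
static int demo_get_page(struct demo_pool *local, struct demo_pool *global,
			 unsigned long reserved)
{
	if (local->npages == 0 &&
	    demo_reload_from_gpool(local, global, reserved) == 0)
		return -1; /* nothing left outside the reserve */
	local->npages--;
	return 0;
}
```

Fetching several elements per refill amortizes the cost of the global lock over many lock-free local allocations.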

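nv_init()

void nv_init(struct super_block *sb, bool init);

The allocation state of pmem is recorded in DRAM when the file system is mounted or initialized, so that later allocations can work from DRAM. nv_init does this work; it is called once from eufs_fill_super.

nv_init performs some global mem_pool initialization and then calls partition to prepare page_map, cached_nodes, line_node_ptrs, and related state.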

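partition()

static void partition(struct super_block *sb, bool init);

partition initializes or reads the page_map on pmem, then walks all pages and initializes each one according to the page state recorded in page_map:

Large page: the page is part of a 512*4K large page, so 512 cached_nodes entries have to be set together.
Page in use: the page is used for the super block, the page map, file data, and so on. It is enough to mark the corresponding cached_nodes entry as used.
Free page: the page has to be put into a mem_pool so that it can be allocated. If there are many consecutive free pages, they can be combined into a 512*4K large page and put into the mem_pool as one unit.
Page allocated by cache line: in addition to setting the cached_nodes entry, partition_page is called. It follows a similar procedure for each cache line in the page, such as setting line_node_ptrs and putting free cache lines into a mem_pool.

When the mem_pool structures are filled, free allocation units go into each CPU's local pool first; once the threshold EUFS_PRE_PAGES_PERCPU is reached, they go into the global pool.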
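The coalescing of free pages into large pages can be sketched as a scan over a free map. This illustration assumes that a large page must cover a 512-page-aligned group that is entirely free; the text above only says that many consecutive free pages can be combined, so the alignment rule is an assumption, as are all demo_* names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGES_PER_LARGE 512

struct demo_counts {
	size_t larges; /* free large pages found */
	size_t pages;  /* free 4 KB pages found */
};

/* Classify free pages into large pages and single pages. */
static struct demo_counts demo_partition(const bool *is_free, size_t npages)
{
	struct demo_counts c = { 0, 0 };
	size_t i = 0;

	assert(is_free != NULL || npages == 0);
	while (i < npages) {
		size_t end = i + DEMO_PAGES_PER_LARGE;
		size_t nfree = 0;

		if (end > npages)
			end = npages;
		for (size_t j = i; j < end; j++)
			nfree += is_free[j] ? 1 : 0;
		if (nfree == DEMO_PAGES_PER_LARGE)
			c.larges++;       /* the whole group is free */
		else
			c.pages += nfree; /* free pages are added one by one */
		i = end;
	}
	return c;
}
```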

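nvmalloc()

void *nvmalloc(struct super_block *sb, size_t size, u8 tag, bool nonblocking);

nvmalloc picks the try_get_ method that matches one of the four sizes and makes a first allocation attempt. If that attempt fails, nvmalloc calls gather_pages to try to obtain free pages from other CPUs' local pools, or even from the rest pool (cache lines can be carved out of a free page, see below), and then attempts the allocation again.

Note that nvmalloc does not actually write to pmem (that is, it does not modify page_map); the allocation happens only in DRAM.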
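The try, gather, retry sequence can be sketched with per-CPU counters. This is a simplification under stated assumptions: pools are plain counters, the global pool is left out, gathered pages are assumed to land in the caller's local pool, and the rest pool is consulted only when the other CPUs have nothing to give. None of the demo_* names come from the eulerfs source.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NCPU 4

struct demo_state {
	unsigned long local[DEMO_NCPU]; /* free pages per CPU-local pool */
	unsigned long rest;             /* free pages in the rest pool */
};

/* First attempt: only look at this CPU's local pool. */
static bool demo_try_get_page(struct demo_state *s, int cpu)
{
	if (s->local[cpu] == 0)
		return false;
	s->local[cpu]--;
	return true;
}

/* Pull free pages from other CPUs, then from the rest pool as a last resort. */
static unsigned long demo_gather_pages(struct demo_state *s, int cpu)
{
	unsigned long got = 0;

	for (int i = 0; i < DEMO_NCPU; i++) {
		if (i == cpu)
			continue;
		got += s->local[i];
		s->local[i] = 0;
	}
	if (got == 0) {
		got = s->rest;
		s->rest = 0;
	}
	s->local[cpu] += got;
	return got;
}

/* Try once, gather on failure, then try again. */
static bool demo_nvmalloc_page(struct demo_state *s, int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NCPU);
	if (demo_try_get_page(s, cpu))
		return true;
	if (demo_gather_pages(s, cpu) == 0)
		return false;
	return demo_try_get_page(s, cpu);
}
```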

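The four try_get methods

eufs_try_get_page()

static void *eufs_try_get_page(struct eufs_sb_info *sbi, struct mem_pool *ppool, u8 tag, bool use_reserved);

eufs_try_get_page is the try_get_ method used when allocating a page. It allocates on a best-effort basis: it first takes from the free list of the local pool; if the local pool has no free page, it calls reload_page_from_gpool to move free pages from the global pool into the local pool. If that also fails, it tries to obtain a free large page and split it into small pages. If this fails too, it returns a null pointer. eufs_try_get_page does not take free pages from other CPUs' local pools; nvmalloc does that when it retries.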

try_get_large_page()

static void *try_get_large_page(struct eufs_sb_info *sbi, struct mem_pool *ppool, u8 tag, bool nonblocking);

try_get_large_page is the try_get_ method used when allocating a large page. Like eufs_try_get_page, it tries the local pool first and then the global pool.

try_get_line()

static void *try_get_line(struct eufs_sb_info *sbi, struct mem_pool *ppool, u8 tag, bool use_reserved);

try_get_line is the try_get_ method used when allocating a cache line. It is similar to eufs_try_get_page, but it also tries to split a free page into cache lines with split_page_to_lines, or to split a free cache line 4 into single cache lines.

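try_get_line4()

static void *try_get_line4(struct eufs_sb_info *sbi, struct mem_pool *ppool, u8 tag, bool use_reserved);

try_get_line4 is the try_get_ method used when allocating four cache lines. It is similar to eufs_try_get_page, except that the global mem pool does not maintain a list of free line4 units. When the local pool is empty it therefore does not fetch from the global mem pool; instead it takes a free page and splits it with split_page_to_lines.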

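nvfree()

void nvfree(struct super_block *sb, void *ptr, bool rest);

nvfree releases memory allocated by nvmalloc. This includes marking the object as free in page_map, updating the corresponding struct ptr_list_node in cached_nodes or line_node_ptrs, and calling return_{page,line4,cl} to hand the freed object back to a memory pool. For return_cl, if all 63 cache lines of the page have been freed, they are merged into one free page.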
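The merge decision in return_cl depends only on the per-page count of allocated cache lines, which is the role line_indicators plays. A minimal sketch of that counting follows; the demo_* names are hypothetical, and moving the nodes back onto the page list is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 64 lines per page, minus the first line, which holds the line_map. */
#define DEMO_USABLE_LINES 63

/* Per-page counter, playing the role of one line_indicators entry. */
struct demo_line_page {
	uint8_t allocated; /* cache lines currently allocated in this page */
};

/* Returns false when the page has no free cache line left. */
static bool demo_alloc_line(struct demo_line_page *p)
{
	if (p->allocated == DEMO_USABLE_LINES)
		return false;
	p->allocated++;
	return true;
}

/*
 * Returns true when the freed line was the last allocated one, which
 * means the whole page can be merged back into a free page.
 */
static bool demo_free_line(struct demo_line_page *p)
{
	assert(p->allocated > 0);
	p->allocated--;
	return p->allocated == 0;
}
```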

Open question: how are large pages freed?

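_unset_bitmap()

static void _unset_bitmap(struct eufs_sb_info *sbi, u64 addr, bool flush);

_unset_bitmap is a helper function that modifies page_map on pmem; its counterpart _set_bitmap is defined in pbatch.h.

If addr is a page pointer, the function checks the solid field of the page's ptr_list_node. If it is false, this is an nvmalloc -> nvfree flow in which the change to page_map was never flushed to pmem, so the function can return immediately. If it is true, the allocation was persisted to page_map after nvmalloc, so the function has to mark the page as free in page_map and then flush the change to pmem.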

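If addr is a cache line pointer, the function first checks whether the page_map entry of the page containing the cache line is set to EUFS_PAGE_LINE_USED; if it is not (why?), the entry has to be set now. The function then marks the cache line as free in the line_map, which is the first cache line of that page.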

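eufs_alloc_persist()

static __always_inline void eufs_alloc_persist(struct super_block *sb, void *ptr, bool forced);

This function persists the result of an nvmalloc allocation. It first uses _set_bitmap to update page_map. If ptr is a page pointer, only page_map is flushed; if it is a cache line pointer, the line_map of the containing page is flushed as well.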

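_set_bitmap()

static __always_inline void _set_bitmap(struct eufs_sb_info *sbi, u64 addr, bool forced);

This function is the counterpart of _unset_bitmap and modifies page_map on pmem. The difference is that it performs no flush at all; flushing is left to the caller. If addr is a page pointer, it updates the page_map entry of that page. If addr is a cache line pointer, it updates the line_map entry in the containing page and also sets that page's page_map entry to EUFS_PAGE_LINE_USED.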

batch allocator

alloc_batch

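The struct alloc_batch API persists a whole batch of allocation requests to pmem at once, which avoids the large cost of flushing page_map after every single page allocation.

Note: these APIs only modify the page_map and line_map entries (that is, the metadata) for the allocation requests. They do not persist the data stored in the allocated pmem.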

eufs_alloc_batch_* API usage:

struct alloc_batch {
	/* both in slots */
	long size;
	long n_used; // number of elements already added to the batch
	void **batch;
	long n_pending; // number of elements preallocated but not yet added to the batch
	struct list_head list;
} batch;
eufs_alloc_batch_init(&batch, estimated_size);
eufs_alloc_batch_hint(&batch, estimated_size);
eufs_alloc_batch_add(&batch, the_page_pointer);
eufs_alloc_batch_add(&batch, the_page_pointer);
...
eufs_alloc_batch_add(&batch, the_page_pointer);
eufs_alloc_batch_persist_reset(&batch);

eufs_alloc_batch_fini(&batch);

eufs_alloc_batch_init()

static __always_inline void eufs_alloc_batch_init(struct alloc_batch *pb, ssize_t size);

This function initializes the struct alloc_batch and calls eufs_alloc_batch_hint once to allocate space for pb->batch and to set pb->size.

eufs_alloc_batch_hint()

static __always_inline void eufs_alloc_batch_hint(struct alloc_batch *pb, ssize_t size);

This function calls krealloc on pb->batch and sets the size of pb.

eufs_alloc_batch_fini()

static __always_inline void eufs_alloc_batch_fini(struct alloc_batch *pb);

This function cleans up the pb structure.

eufs_alloc_batch_add()

static __always_inline void eufs_alloc_batch_add(struct super_block *sb, struct alloc_batch *pb, void *page);

This function adds one page to pb->batch. If pb is full, it uses eufs_alloc_batch_hint to grow it; if it cannot be grown, eufs_alloc_batch_persist_reset is used to persist pb.

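eufs_alloc_batch_persist_reset()

static __always_inline void eufs_alloc_batch_persist_reset(struct super_block *sb, struct alloc_batch *pb);

This function persists the allocation requests recorded in pb (the pointers in the pb->batch array) to pmem, that is, it writes page_map.

It first sorts the pb->batch array in ascending order and calls _set_bitmap on each element to update its page_map/line_map entry. It then walks pb->batch again: when the page_map entries for two neighbouring elements fall into the same cache line (64 bytes), that cache line only needs to be flushed once. This reduces the number of page_map flushes when consecutive pages are allocated.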
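The effect of sorting before flushing can be sketched by counting how many distinct page_map cache lines a batch touches. This example assumes that each page_map entry occupies one byte, so 64 entries share a 64-byte cache line; that entry size is an assumption, not something the text above states. The sketch also folds the update pass and the flush pass into one loop, and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumes 1-byte page_map entries and 64-byte cache lines. */
#define DEMO_ENTRIES_PER_LINE 64

static int demo_cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the page numbers in place and return how many cache-line flushes
 * are needed to persist their page_map entries.
 */
static size_t demo_persist_batch(uint64_t *pages, size_t n)
{
	size_t flushes = 0;
	uint64_t last_line = 0;

	assert(pages != NULL);
	qsort(pages, n, sizeof(*pages), demo_cmp_u64);
	for (size_t i = 0; i < n; i++) {
		uint64_t line = pages[i] / DEMO_ENTRIES_PER_LINE;

		/* The page_map entry for pages[i] would be updated here. */
		if (i == 0 || line != last_line) {
			flushes++; /* first entry in a new cache line */
			last_line = line;
		}
	}
	return flushes;
}
```

With six pages spread over three page_map cache lines, the sorted walk needs three flushes instead of six.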

preallocation

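To use struct alloc_batch as a pre-allocator, the following APIs are also needed to prepare preallocated pages and add them to the struct alloc_batch. They make it possible to allocate several pages in one go and then serve later requests from the pre-allocator instead of calling nvmalloc each time. Note that "preallocation" here involves no pmem persistence: it only touches DRAM, and persistence is done in eufs_alloc_batch_persist_reset.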

eufs_alloc_batch_pre_allocate_begin()

static __always_inline int
eufs_alloc_batch_pre_allocate_begin(struct super_block *sb,
				     struct alloc_batch *ab, size_t need_blocks);

This function is called before eufs_alloc_batch_allocate. It sets ab->n_pending to need_blocks and calls nvmalloc_pre to preallocate need_blocks pages.

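nvmalloc_pre()

int nvmalloc_pre(struct super_block *sb, struct alloc_batch *ab, size_t count, size_t size);

This function allocates, in DRAM and in one go, count objects of size bytes each (only PAGE_SIZE is supported) and stores them in ab->list.

If the local pool has enough pages (a large page can also be split into 512 pages), the function calls preallocate_page_from_pool to allocate from the local pool; otherwise it calls preallocate_page_from_gpool to allocate from the global pool. Both functions call preallocate_pages_from_larges_and_pages, which first takes free pages from the pool's page_list and adds them to ab->list; if that is not enough, it splits large pages into small pages and adds those to ab->list.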

eufs_alloc_batch_allocate()

static __always_inline void *eufs_alloc_batch_allocate(struct super_block *sb, struct alloc_batch *ab, u8 tag);

This function takes one page from ab->list, adds it to ab->batch with eufs_alloc_batch_add, and decrements ab->n_pending.

eufs_alloc_batch_pre_allocate_end()

static __always_inline void eufs_alloc_batch_pre_allocate_end(struct super_block *sb, struct alloc_batch *ab);

This function makes sure that, when preallocation ends, every preallocated page has actually been handed out, that is, ab->list is empty and ab->n_pending is 0.
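The begin / allocate / end protocol can be summarized with a small user-space model. It is only a sketch: fixed-size arrays replace ab->list and ab->batch, page numbers replace pointers, and the pages handed out by the begin step are simply consecutive numbers. All demo_* names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX 16

struct demo_batch {
	int pending[DEMO_MAX]; /* role of ab->list: preallocated pages */
	size_t n_pending;
	int used[DEMO_MAX];    /* role of ab->batch: pages awaiting persist */
	size_t n_used;
};

/* Preallocate `need` pages (DRAM only); returns -1 if that is too many. */
static int demo_pre_allocate_begin(struct demo_batch *ab, size_t need,
				   int first_page)
{
	if (need > DEMO_MAX)
		return -1;
	for (size_t i = 0; i < need; i++)
		ab->pending[i] = first_page + (int)i;
	ab->n_pending = need;
	return 0;
}

/* Hand out one preallocated page and record it for the later persist. */
static int demo_batch_allocate(struct demo_batch *ab)
{
	int page;

	assert(ab->n_pending > 0 && ab->n_used < DEMO_MAX);
	page = ab->pending[--ab->n_pending];
	ab->used[ab->n_used++] = page;
	return page;
}

/* Every preallocated page must have been handed out by now. */
static void demo_pre_allocate_end(const struct demo_batch *ab)
{
	assert(ab->n_pending == 0);
}
```

Nothing in this model touches persistent state; in eulerfs that happens later, when the batch is persisted with eufs_alloc_batch_persist_reset.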