Memory Safety and Hardware Memory Tagging (4)

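As noted earlier, heap tagging requires changes to the Linux kernel and to the malloc implementation in the C library, while stack tagging requires recompiling code with a compiler option. This part walks through how the software side implements both.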

First, some older news: Adopting the Arm Memory Tagging Extension in Android
https://security.googleblog.c...
https://threatpost.com/google...
Google and Arm are working together on an MTE-enabled LLVM compiler and Linux kernel for Android.

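MTE support in the Android library allocator
Generating and managing tags for malloc'ed memory is the job of the user-space allocator. As described in the previous part, the Linux kernel is mainly responsible for:

Setting a region of memory to the Normal Tagged Memory type at the application's request, so that tags can be stored
Saving and restoring tags during page swap and migration
Handling system calls that are passed tagged addresses

In other words, the Linux kernel is not the manager of tags.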

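In Android 11 the default system allocator is Scudo (compiler-rt/lib/scudo/standalone in llvm/llvm-project on GitHub), which replaces the earlier jemalloc.

Scudo is a dynamic user-mode memory allocator (a heap allocator) designed to be resilient against heap-related vulnerabilities such as heap-based buffer overflows, use-after-free, and double free, while maintaining good performance. It provides the standard C allocation and deallocation primitives (such as malloc and free) as well as the C++ primitives (such as new and delete).

Compared with a mature memory error detector such as AddressSanitizer (ASan), Scudo is more of a mitigation.

It supports Arm MTE. When allocating memory:

It aligns allocations to 16 bytes
When malloc obtains memory through mmap, it picks a random tag and applies it to both the returned pointer and the memory
On free, it picks a new random tag and stores it to the memory being freed
When malloc reuses memory, it reads the tag back from memory and applies it to the pointer
Special cases: memory released to the OS loses its tags, and resizing an allocation requires a fixup of the memory tags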

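Scudo manages memory as chunks and places an 8-byte header in front of each allocated memory block. When assigning tags it reserves tag 0 for chunk headers, and when freeing memory it never reuses the tag the chunk had before. Adjacent chunks get tags of opposite parity, one odd and one even, so that neighbours do not end up with the same tag. As a result:

Overflow into an adjacent chunk is detected 100% of the time
Use-after-free is detected about 87% of the time (down from 15/16, about 93%)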
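
The odd/even scheme can be illustrated with a small, hardware-independent model. This is only a sketch under stated assumptions: the helper names and the choice of which parity goes to which chunk position are hypothetical and are not taken from Scudo; the mask follows the IRG convention that a set bit i excludes tag i.

```c
#include <assert.h>
#include <stdint.h>

/* Tag 0 is reserved for chunk headers, so it is always excluded. */
#define HEADER_TAG_MASK 0x0001u
/* Bit i stands for tag i: these masks select the odd and the even tags. */
#define ODD_TAGS_MASK   0xAAAAu
#define EVEN_TAGS_MASK  0x5555u

/*
 * Hypothetical helper: exclude mask for the chunk at position chunk_index.
 * Assumption: even-positioned chunks use odd tags, odd-positioned chunks
 * use even tags.
 */
uint16_t exclude_mask_for_chunk(unsigned chunk_index)
{
    uint16_t parity_excluded =
        (chunk_index % 2 == 0) ? EVEN_TAGS_MASK : ODD_TAGS_MASK;
    return (uint16_t)(parity_excluded | HEADER_TAG_MASK);
}

/* Count how many of the 16 tag values remain selectable under a mask. */
unsigned candidate_tag_count(uint16_t exclude_mask)
{
    unsigned count = 0;
    for (unsigned tag = 0; tag < 16; tag++)
        if (!(exclude_mask & (1u << tag)))
            count++;
    return count;
}

/* True if the two chunks can never be handed the same tag. */
int tags_are_disjoint(unsigned chunk_a, unsigned chunk_b)
{
    uint16_t allowed_a = (uint16_t)~exclude_mask_for_chunk(chunk_a);
    uint16_t allowed_b = (uint16_t)~exclude_mask_for_chunk(chunk_b);
    return (allowed_a & allowed_b) == 0;
}

/* Self-check of the properties described in the text. */
void check_odd_even_scheme(void)
{
    assert(tags_are_disjoint(0, 1));   /* neighbours never share a tag */
    assert(!tags_are_disjoint(0, 2));  /* same parity class may collide */
    assert(candidate_tag_count(exclude_mask_for_chunk(0)) == 8);
    assert(candidate_tag_count(exclude_mask_for_chunk(1)) == 7);
}
```

In this model a chunk has only 7 or 8 candidate tags instead of 15 or 16. With 8 candidates, a stale pointer's tag differs from a freshly drawn one with probability 7/8 = 87.5%, which may be where the use-after-free figure comes from, although the source does not spell out the derivation.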

For the code, see https://github.com/llvm/llvm-...

inline void setRandomTag(void *Ptr, uptr Size, uptr ExcludeMask,
                         uptr *TaggedBegin, uptr *TaggedEnd) {
  void *End;
  __asm__ __volatile__(
      R"(
    .arch_extension mte
    // Set a random tag for Ptr in TaggedPtr. This needs to happen even if
    // Size = 0 so that TaggedPtr ends up pointing at a valid address.
    irg %[TaggedPtr], %[Ptr], %[ExcludeMask]
    mov %[Cur], %[TaggedPtr]
    // Skip the loop if Size = 0. We don't want to do any tagging in this case.
    cbz %[Size], 2f
    // Set the memory tag of the region
    // [TaggedPtr, TaggedPtr + roundUpTo(Size, 16))
    // to the pointer tag stored in TaggedPtr.
    add %[End], %[TaggedPtr], %[Size]
  1:
    stzg %[Cur], [%[Cur]], #16
    cmp %[Cur], %[End]
    b.lt 1b
  2:
  )"
      :
      [TaggedPtr] "=&r"(*TaggedBegin), [Cur] "=&r"(*TaggedEnd), [End] "=&r"(End)
      : [Ptr] "r"(Ptr), [Size] "r"(Size), [ExcludeMask] "r"(ExcludeMask)
      : "memory");
}

inline void *prepareTaggedChunk(void *Ptr, uptr Size, uptr ExcludeMask,
                                uptr BlockEnd) {
  // Prepare the granule before the chunk to store the chunk header by setting
  // its tag to 0. Normally its tag will already be 0, but in the case where a
  // chunk holding a low alignment allocation is reused for a higher alignment
  // allocation, the chunk may already have a non-zero tag from the previous
  // allocation.
  __asm__ __volatile__(".arch_extension mte; stg %0, [%0, #-16]"
                       :
                       : "r"(Ptr)
                       : "memory");

  uptr TaggedBegin, TaggedEnd;
  setRandomTag(Ptr, Size, ExcludeMask, &TaggedBegin, &TaggedEnd);

  // Finally, set the tag of the granule past the end of the allocation to 0,
  // to catch linear overflows even if a previous larger allocation used the
  // same block and tag. Only do this if the granule past the end is in our
  // block, because this would otherwise lead to a SEGV if the allocation
  // covers the entire block and our block is at the end of a mapping. The tag
  // of the next block's header granule will be set to 0, so it will serve the
  // purpose of catching linear overflows in this case.
  uptr UntaggedEnd = untagPointer(TaggedEnd);
  if (UntaggedEnd != BlockEnd)
    __asm__ __volatile__(".arch_extension mte; stg %0, [%0]"
                         :
                         : "r"(UntaggedEnd)
                         : "memory");
  return reinterpret_cast<void *>(TaggedBegin);
}
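
The pointer arithmetic behind tagged and untagged addresses can be tried out without MTE hardware. The sketch below models the 4-bit logical tag in address bits 59:56 with plain integers. The helper names are hypothetical and are not Scudo's, and Scudo's own untagPointer may clear a wider range of top bits than this model does.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT    56
#define TAG_MASK     (UINT64_C(0xf) << TAG_SHIFT)
#define GRANULE_SIZE 16u

/* Return the 4-bit logical tag held in bits 59:56 of an address. */
unsigned pointer_tag(uint64_t addr)
{
    return (unsigned)((addr & TAG_MASK) >> TAG_SHIFT);
}

/* Clear the tag bits, leaving the rest of the address untouched. */
uint64_t untag_address(uint64_t addr)
{
    return addr & ~TAG_MASK;
}

/* Replace the tag bits of an address with the given tag. */
uint64_t with_tag(uint64_t addr, unsigned tag)
{
    assert(tag < 16);
    return untag_address(addr) | ((uint64_t)tag << TAG_SHIFT);
}

/* Number of 16-byte granules a tagging loop must cover for size bytes. */
uint64_t granules_to_tag(uint64_t size)
{
    return (size + GRANULE_SIZE - 1) / GRANULE_SIZE;
}

void check_pointer_tags(void)
{
    uint64_t plain = UINT64_C(0x0000700000001000); /* illustrative address */
    uint64_t tagged = with_tag(plain, 0xA);

    assert(tagged == UINT64_C(0x0A00700000001000));
    assert(pointer_tag(tagged) == 0xA);
    assert(untag_address(tagged) == plain);
    assert(granules_to_tag(0) == 0);   /* Size = 0: nothing is tagged */
    assert(granules_to_tag(1) == 1);
    assert(granules_to_tag(33) == 3);  /* rounded up to 48 bytes */
}
```

granules_to_tag mirrors the roundUpTo(Size, 16) behaviour described in the setRandomTag comments: a 1-byte allocation still consumes one whole granule, and a zero-size allocation tags nothing.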

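Allocating large heap blocks
Tagging a large heap block that is not used right away, or is only partly used, is expensive. There are two ways to handle this:

Leave it untagged, surround the allocation with guard pages, and never reuse the virtual address
Use a page with a non-zero tag as the copy-on-write reference page https://lwn.net/Articles/828828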

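Stack tagging
Tagging the stack at run time needs both compiler and kernel support. As covered elsewhere in this series, the compiler's strategy is to use an IRG instruction to generate a random tag for the new stack frame on function entry. It then uses ADDG and SUBG to create a tagged address for each stack slot in the function, with each slot's tag being an offset from the initial random tag.

The following example demonstrates this: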

extern int bar(int * p);

void func() {
    int a = 1, b = 2;
    bar(&a);
    bar(&b);
}

Arm clang 11 is used to compile this, first without and then with the memory tagging options. With -fsanitize=memtag -march=armv8+memtag, the generated instructions are:

func():                               // @func()
        sub     sp, sp, #64                     // =64
        stp     x29, x30, [sp, #32]             // 16-byte Folded Spill
        str     x19, [sp, #48]                  // 8-byte Folded Spill
        add     x29, sp, #32                    // =32
        irg     x8, sp
        mov     w9, #1
        mov     w10, #2
        addg    x0, x8, #16, #0
        addg    x19, x8, #0, #1
        stgp    x9, xzr, [x0]
        stgp    x10, xzr, [x19]
        bl      bar(int*)
        mov     x0, x19
        bl      bar(int*)
        st2g    sp, [sp], #32
        ldr     x19, [sp, #16]                  // 8-byte Folded Reload
        ldp     x29, x30, [sp], #32             // 16-byte Folded Reload
        ret

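A brief explanation of these instructions:

IRG Xd, Xn
Copies Xn to Xd and inserts a random 4-bit tag into Xd.

ADDG Xd, Xn, #<immA>, #<immB>
Adds #immA to the address in Xn and offsets its tag by #immB, writing the result to Xd.

STG Xt, [Xn], #<imm>
Stores the tag in Xt to the memory tag of the granule at address Xn.

STGP Xa, Xb, [Xn], #<imm>
Stores the 16 bytes held in Xa and Xb to memory at address Xn, and stores the tag in Xn to the memory tag for that address.

ST2G Xa, [Xn], #<imm>
Stores the tag in Xa to the memory tags of the two granules (2 x 16 bytes) starting at address Xn.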

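In the disassembly above, the tag-related instructions give the local variables a and b different tags on function entry, as follows:

IRG first generates a random tag
The tags for the individual local variables are derived from that random tag by adding an offset (0, 1, 2, ...)

On function exit, ST2G resets the tags that were set back to 0.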
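
The derivation of per-variable tags can be modelled in a few lines of plain C. This is only a sketch of the idea: derive_slot_tag is a hypothetical helper, the wrap-around modulo 16 is a simplification, and the real ADDG instruction also takes the tags excluded through GCR_EL1 into account, which is ignored here.

```c
#include <assert.h>

#define MTE_TAG_COUNT 16u

/*
 * Hypothetical model: derive the tag of the slot_index-th local variable
 * from the frame's random base tag (the value IRG would have produced).
 * Simplification: plain modulo-16 arithmetic, no excluded tags.
 */
unsigned derive_slot_tag(unsigned base_tag, unsigned slot_index)
{
    assert(base_tag < MTE_TAG_COUNT);
    return (base_tag + slot_index) % MTE_TAG_COUNT;
}

/* Self-check mirroring func(): a uses tag offset 0, b uses tag offset 1. */
void check_slot_tags(void)
{
    unsigned base_tag = 0x9; /* arbitrary stand-in for IRG's random result */
    unsigned tag_a = derive_slot_tag(base_tag, 0);
    unsigned tag_b = derive_slot_tag(base_tag, 1);

    assert(tag_a == 0x9);
    assert(tag_b == 0xA);
    assert(tag_a != tag_b);                 /* neighbouring slots differ */
    assert(derive_slot_tag(0xF, 1) == 0x0); /* offsets wrap around */
}
```

Because neighbouring slots differ by a fixed offset, an access that runs from a into b uses a pointer whose tag does not match b's granule, whichever base tag was drawn.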

The stack layout above also shows that, to satisfy the 16-byte tag granule requirement, a and b each occupy 16 bytes on the stack, which adds some memory overhead.
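
The size of that overhead is simple rounding arithmetic. The helpers below are hypothetical names for illustration; they only assume the 16-byte granule size stated in the text.

```c
#include <assert.h>
#include <stddef.h>

#define MTE_GRANULE_SIZE 16u

/* Stack space a tagged object occupies: its size rounded up to granules. */
size_t tagged_slot_size(size_t object_size)
{
    return (object_size + MTE_GRANULE_SIZE - 1)
           / MTE_GRANULE_SIZE * MTE_GRANULE_SIZE;
}

/* Bytes of padding added purely to satisfy the granule requirement. */
size_t padding_overhead(size_t object_size)
{
    return tagged_slot_size(object_size) - object_size;
}

void check_slot_sizes(void)
{
    assert(tagged_slot_size(4) == 16);  /* a 4-byte int takes a full granule */
    assert(padding_overhead(4) == 12);
    assert(tagged_slot_size(16) == 16); /* already granule-sized: no padding */
    assert(tagged_slot_size(17) == 32);
}
```

For the two 4-byte ints in func(), this model gives 32 bytes of tagged stack for 8 bytes of data.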

https://llvm.org/devmtg/2018-...

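MTE support in the Linux kernel
MTE development for arm64 Linux has gone through several revisions. At the time of writing the series is at v5, with a merge planned for v5.9 (user-space MTE support ultimately landed in Linux 5.10).
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux devel/mte-v5
It is already in the linux-next branch: https://kernel.googlesource.c...

https://github.com/sudipm-muk...

The code above contains the main MTE support.
Current support focuses on tagging for user-space applications, which requires an updated kernel. Enabling MTE is simple: set CONFIG_ARM64_MTE=y in the kernel configuration. The main features added to the kernel are:

Detecting whether the CPU supports MTE.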

#ifdef CONFIG_ARM64_MTE
    {
        .desc = "Memory Tagging Extension",
        .capability = ARM64_MTE,
        .type = ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE,
        .matches = has_cpuid_feature,
        .sys_reg = SYS_ID_AA64PFR1_EL1,
        .field_pos = ID_AA64PFR1_MTE_SHIFT,
        .min_field_value = ID_AA64PFR1_MTE,
        .sign = FTR_UNSIGNED,
        .cpu_enable = cpu_enable_mte,
    },
#endif /* CONFIG_ARM64_MTE */

The kernel determines whether the CPU supports MTE by checking the SYS_ID_AA64PFR1_EL1 register.
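
The comparison that has_cpuid_feature performs for this entry amounts to extracting an unsigned 4-bit field and comparing it with a minimum value. The sketch below models that on a plain integer. The numeric values of ID_AA64PFR1_MTE_SHIFT (8) and ID_AA64PFR1_MTE (2) are assumptions based on the architecture's register layout rather than something shown in this article, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: the MTE field sits in bits 11:8, and 2 means full MTE. */
#define ID_AA64PFR1_MTE_SHIFT 8
#define ID_AA64PFR1_MTE       0x2u

/* Extract the unsigned 4-bit MTE field from an ID_AA64PFR1_EL1 value. */
unsigned mte_field(uint64_t pfr1)
{
    return (unsigned)((pfr1 >> ID_AA64PFR1_MTE_SHIFT) & 0xf);
}

/* Unsigned "field >= min_field_value" check, as FTR_UNSIGNED implies. */
int cpu_supports_mte(uint64_t pfr1)
{
    return mte_field(pfr1) >= ID_AA64PFR1_MTE;
}

void check_mte_detection(void)
{
    assert(!cpu_supports_mte(UINT64_C(0x000))); /* field 0: not implemented */
    assert(!cpu_supports_mte(UINT64_C(0x100))); /* field 1: below minimum */
    assert(cpu_supports_mte(UINT64_C(0x200)));  /* field 2: meets minimum */
    assert(cpu_supports_mte(UINT64_C(0x321)));  /* other fields are ignored */
}
```

On real hardware the register can only be read at EL1 or above, so the sample values here are made up for illustration.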

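__cpu_setup initializes the MTE-related configuration in GCR_EL1, SYS_TFSR_EL1, SYS_TFSRE0_EL1, TCR_EL1, and mair_el1. The excerpt starts partway through the #ifdef CONFIG_ARM64_MTE block, which is why a 1: label and an #endif appear without their opening lines.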

    /* Normal Tagged memory type at the corresponding MAIR index */
    mov    x10, #MAIR_ATTR_NORMAL_TAGGED
    bfi    x5, x10, #(8 *  MT_NORMAL_TAGGED), #8

    /* initialize GCR_EL1: all non-zero tags excluded by default */
    mov    x10, #(SYS_GCR_EL1_RRND | SYS_GCR_EL1_EXCL_MASK)
    msr_s    SYS_GCR_EL1, x10

    /* clear any pending tag check faults in TFSR*_EL1 */
    msr_s    SYS_TFSR_EL1, xzr
    msr_s    SYS_TFSRE0_EL1, xzr

    /* set the TCR_EL1 bits */
    mov_q    mte_tcr, TCR_KASAN_HW_FLAGS
1:
#endif
    msr    mair_el1, x5
    /*
     * Set/prepare TCR and TTBR. We use 512GB (39-bit) address range for
     * both user and kernel.
     */
    mov_q    x10, TCR_TxSZ(VA_BITS) | TCR_CACHE_FLAGS | TCR_SMP_FLAGS | \
            TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
            TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS
#ifdef CONFIG_ARM64_MTE
    orr    x10, x10, mte_tcr
    .unreq    mte_tcr
#endif
    tcr_clear_errata_bits x10, x9, x5

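The linear map is set to Normal Tagged Memory so that the kernel can read and write tags.
A new normal memory type, MT_NORMAL_TAGGED, is added. In MAIR_EL1_SET its index starts out as plain MAIR_ATTR_NORMAL; the bfi instruction in the __cpu_setup excerpt above replaces it with MAIR_ATTR_NORMAL_TAGGED.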

#define MAIR_EL1_SET                            \
    (MAIR_ATTRIDX(MAIR_ATTR_DEVICE_nGnRnE, MT_DEVICE_nGnRnE) |    \
     MAIR_ATTRIDX(MAIR_ATTR_DEVICE_nGnRE, MT_DEVICE_nGnRE) |    \
     MAIR_ATTRIDX(MAIR_ATTR_DEVICE_GRE, MT_DEVICE_GRE) |        \
     MAIR_ATTRIDX(MAIR_ATTR_NORMAL_NC, MT_NORMAL_NC) |        \
     MAIR_ATTRIDX(MAIR_ATTR_NORMAL, MT_NORMAL) |            \
     MAIR_ATTRIDX(MAIR_ATTR_NORMAL_WT, MT_NORMAL_WT) |        \
     MAIR_ATTRIDX(MAIR_ATTR_NORMAL, MT_NORMAL_TAGGED))

#define PROT_NORMAL_TAGGED    (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_TAGGED)) 
#define PAGE_KERNEL_TAGGED    __pgprot(PROT_NORMAL_TAGGED)

static void __init map_mem(pgd_t *pgdp)
{
    ..
    for_each_mem_range(i, &start, &end) {
        if (start >= end)
            break;
        /*
         * The linear map must allow allocation tags reading/writing
         * if MTE is present. Otherwise, it has the same attributes as
         * PAGE_KERNEL.
         */
        __map_memblock(pgdp, start, end, PAGE_KERNEL_TAGGED, flags);
    }
    ..
}

Handling tags in clear_page and copy_page, and swapping of tagged pages.
The code is in mte.S:

/*
 * Clear the tags in a page
 *   x0 - address of the page to be cleared
 */
SYM_FUNC_START(mte_clear_page_tags)
    multitag_transfer_size x1, x2
1:    stgm    xzr, [x0]
    add    x0, x0, x1
    tst    x0, #(PAGE_SIZE - 1)
    b.ne    1b
    ret
SYM_FUNC_END(mte_clear_page_tags)

/*
 * Copy the tags from the source page to the destination one
 *   x0 - address of the destination page
 *   x1 - address of the source page
 */
SYM_FUNC_START(mte_copy_page_tags)
    mov    x2, x0
    mov    x3, x1
    multitag_transfer_size x5, x6
1:    ldgm    x4, [x3]
    stgm    x4, [x2]
    add    x2, x2, x5
    add    x3, x3, x5
    tst    x2, #(PAGE_SIZE - 1)
    b.ne    1b
    ret
SYM_FUNC_END(mte_copy_page_tags)

As the code shows, this is implemented mainly with ldgm and stgm.

Swapping of tagged pages is handled in mteswap.c:

int mte_save_tags(struct page *page)
bool mte_restore_tags(swp_entry_t entry, struct page *page)
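
The amount of tag data that has to be saved or restored per page follows directly from the MTE geometry: one 4-bit tag per 16-byte granule. The helpers below are hypothetical and only do that arithmetic; they are not the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MTE_GRANULE_SIZE 16u /* bytes of memory covered by one tag */
#define MTE_TAG_BITS     4u  /* bits per allocation tag */

/* Number of tag granules in a page of the given size. */
size_t granules_per_page(size_t page_size)
{
    assert(page_size % MTE_GRANULE_SIZE == 0);
    return page_size / MTE_GRANULE_SIZE;
}

/* Bytes needed to hold all allocation tags of one page. */
size_t tag_storage_bytes(size_t page_size)
{
    return granules_per_page(page_size) * MTE_TAG_BITS / 8;
}

void check_tag_storage(void)
{
    assert(granules_per_page(4096) == 256);
    assert(tag_storage_bytes(4096) == 128);   /* 4 KiB page */
    assert(tag_storage_bytes(65536) == 2048); /* 64 KiB page */
}
```

So swapping out a tagged 4 KiB page means keeping an extra 128 bytes of tag data alongside the page contents.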

Tag check fault handling and SIGSEGV injection for applications
An entry is added to the fault type table:

static const struct fault_info fault_info[] = {

{ do_tag_check_fault,    SIGSEGV, SEGV_MTESERR,    "synchronous tag check fault"    }

In arch/arm64/mm/fault.c:

static int do_tag_check_fault(unsigned long far, unsigned int esr,
                  struct pt_regs *regs)
{
    /*
     * The architecture specifies that bits 63:60 of FAR_EL1 are UNKNOWN for tag
     * check faults. Mask them out now so that userspace doesn't see them.
     */
    far &= (1UL << 60) - 1;
    do_bad_area(far, esr, regs);
    return 0;
}
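
The effect of that mask is easy to check in isolation. The sketch below repeats the same expression on a made-up fault address: bits 63:60 are cleared, while bits 59:0, including the tag position in 59:56, pass through unchanged. mask_unknown_far_bits is a hypothetical name, and a fixed-width 64-bit type is used in place of unsigned long so the example behaves the same on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Clear bits 63:60, which are UNKNOWN for tag check faults. */
uint64_t mask_unknown_far_bits(uint64_t far)
{
    return far & ((UINT64_C(1) << 60) - 1);
}

void check_far_mask(void)
{
    /* Illustrative value: junk in 63:60, tag 0x5 in 59:56. */
    uint64_t raw_far = UINT64_C(0xF500000000001234);
    uint64_t masked = mask_unknown_far_bits(raw_far);

    assert(masked == UINT64_C(0x0500000000001234));
    assert((masked >> 60) == 0);         /* unknown bits removed */
    assert(((masked >> 56) & 0xf) == 5); /* tag position preserved */
}
```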

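For a tag check error in user space, do_bad_area -> arm64_force_sig_fault raises SIGSEGV with the code set to SEGV_MTESERR.

A PROT_MTE attribute is added so that user space can enable tag checking. Together with arch_calc_vm_flag_bits() and arch_validate_flags(), it lets user space ask the kernel, through the mmap or mprotect system calls, to set the corresponding user-space memory to the Normal Tagged Memory type.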

static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
    unsigned long pkey __always_unused)
{
    unsigned long ret = 0;

    if (system_supports_bti() && (prot & PROT_BTI))
        ret |= VM_ARM64_BTI;

    if (system_supports_mte() && (prot & PROT_MTE))
        ret |= VM_MTE;

    return ret;
}

static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
{
    pteval_t prot = 0;

    if (vm_flags & VM_ARM64_BTI)
        prot |= PTE_GP;

    /*
     * There are two conditions required for returning a Normal Tagged
     * memory type: (1) the user requested it via PROT_MTE passed to
     * mmap() or mprotect() and (2) the corresponding vma supports MTE. We
     * register (1) as VM_MTE in the vma->vm_flags and (2) as
     * VM_MTE_ALLOWED. Note that the latter can only be set during the
     * mmap() call since mprotect() does not accept MAP_* flags.
     * Checking for VM_MTE only is sufficient since arch_validate_flags()
     * does not permit (VM_MTE & !VM_MTE_ALLOWED).
     */
    if (vm_flags & VM_MTE)
        prot |= PTE_ATTRINDX(MT_NORMAL_TAGGED);

    return __pgprot(prot);
}

The code above shows that the PROT_MTE passed to mmap or mprotect is converted to the VM_MTE flag by arch_calc_vm_prot_bits, and that arch_vm_get_page_prot then converts VM_MTE to MT_NORMAL_TAGGED (the Normal Tagged Memory type in MAIR).

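A prctl()-based control is provided so that applications can configure the tag check fault mode and the set of tags that may be generated (expressed as an inclusion mask in the API and converted to an exclusion mask for GCR_EL1). It is built on top of PR_{SET,GET}_TAGGED_ADDR_CTRL.
In include/uapi/linux/prctl.h: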

/* Tagged user address controls for arm64 */
#define PR_SET_TAGGED_ADDR_CTRL        55
#define PR_GET_TAGGED_ADDR_CTRL        56
# define PR_TAGGED_ADDR_ENABLE        (1UL << 0)
/* MTE tag check fault modes */
# define PR_MTE_TCF_SHIFT        1
# define PR_MTE_TCF_NONE        (0UL << PR_MTE_TCF_SHIFT)
# define PR_MTE_TCF_SYNC        (1UL << PR_MTE_TCF_SHIFT)
# define PR_MTE_TCF_ASYNC        (2UL << PR_MTE_TCF_SHIFT)
# define PR_MTE_TCF_MASK        (3UL << PR_MTE_TCF_SHIFT)
/* MTE tag inclusion mask */
# define PR_MTE_TAG_SHIFT        3
# define PR_MTE_TAG_MASK        (0xffffUL << PR_MTE_TAG_SHIFT)

In kernel/sys.c:

    case PR_SET_TAGGED_ADDR_CTRL:
        if (arg3 || arg4 || arg5)
            return -EINVAL;
        error = SET_TAGGED_ADDR_CTRL(arg2);
        break;
    case PR_GET_TAGGED_ADDR_CTRL:
        if (arg2 || arg3 || arg4 || arg5)
            return -EINVAL;
        error = GET_TAGGED_ADDR_CTRL();
        break;
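
The argument that user space passes as arg2 can be assembled from the constants in the prctl.h excerpt. The sketch below only builds the value and does not call prctl(), so it runs on any machine. The constants are copied from the excerpt, while build_tagged_addr_ctrl is a hypothetical helper.

```c
#include <assert.h>

/* Constants as listed in the prctl.h excerpt. */
#define PR_TAGGED_ADDR_ENABLE (1UL << 0)
#define PR_MTE_TCF_SHIFT      1
#define PR_MTE_TCF_NONE       (0UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_SYNC       (1UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_ASYNC      (2UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TCF_MASK       (3UL << PR_MTE_TCF_SHIFT)
#define PR_MTE_TAG_SHIFT      3
#define PR_MTE_TAG_MASK       (0xffffUL << PR_MTE_TAG_SHIFT)

/*
 * Hypothetical helper: combine the tagged address ABI enable bit, a tag
 * check fault mode, and a 16-bit tag inclusion mask (bit i allows tag i).
 */
unsigned long build_tagged_addr_ctrl(unsigned long tcf_mode,
                                     unsigned long include_mask)
{
    assert((tcf_mode & ~PR_MTE_TCF_MASK) == 0);
    assert(include_mask <= 0xffffUL);
    return PR_TAGGED_ADDR_ENABLE | tcf_mode |
           (include_mask << PR_MTE_TAG_SHIFT);
}

void check_ctrl_values(void)
{
    /* Synchronous faults, every tag allowed except 0. */
    assert(build_tagged_addr_ctrl(PR_MTE_TCF_SYNC, 0xfffeUL) == 0x7fff3UL);
    /* Tagged address ABI only: no MTE checking, no tags included. */
    assert(build_tagged_addr_ctrl(PR_MTE_TCF_NONE, 0) == 0x1UL);
    /* Asynchronous faults, all 16 tags allowed. */
    assert(build_tagged_addr_ctrl(PR_MTE_TCF_ASYNC, 0xffffUL) == 0x7fffdUL);
}
```

On an MTE-capable system the result would be passed as the second argument of prctl(PR_SET_TAGGED_ADDR_CTRL, ...) with the remaining three arguments set to zero, since the kernel/sys.c excerpt rejects non-zero arg3 to arg5 with -EINVAL.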

In arch/arm64/kernel/process.c:

long set_tagged_addr_ctrl(struct task_struct *task, unsigned long arg)
{
    unsigned long valid_mask = PR_TAGGED_ADDR_ENABLE;
    struct thread_info *ti = task_thread_info(task);

    if (is_compat_thread(ti))
        return -EINVAL;

    if (system_supports_mte())
        valid_mask |= PR_MTE_TCF_MASK | PR_MTE_TAG_MASK;

    if (arg & ~valid_mask)
        return -EINVAL;

    /*
     * Do not allow the enabling of the tagged address ABI if globally
     * disabled via sysctl abi.tagged_addr_disabled.
     */
    if (arg & PR_TAGGED_ADDR_ENABLE && tagged_addr_disabled)
        return -EINVAL;

    if (set_mte_ctrl(task, arg) != 0)
        return -EINVAL;

    update_ti_thread_flag(ti, TIF_TAGGED_ADDR, arg & PR_TAGGED_ADDR_ENABLE);

    return 0;
}

In arch/arm64/kernel/mte.c:

long set_mte_ctrl(struct task_struct *task, unsigned long arg)
{
    u64 tcf0;
    u64 gcr_excl = ~((arg & PR_MTE_TAG_MASK) >> PR_MTE_TAG_SHIFT) &
               SYS_GCR_EL1_EXCL_MASK;

    if (!system_supports_mte())
        return 0;

    switch (arg & PR_MTE_TCF_MASK) {
    case PR_MTE_TCF_NONE:
        tcf0 = SCTLR_EL1_TCF0_NONE;
        break;
    case PR_MTE_TCF_SYNC:
        tcf0 = SCTLR_EL1_TCF0_SYNC;
        break;
    case PR_MTE_TCF_ASYNC:
        tcf0 = SCTLR_EL1_TCF0_ASYNC;
        break;
    default:
        return -EINVAL;
    }

    if (task != current) {
        task->thread.sctlr_tcf0 = tcf0;
        task->thread.gcr_user_excl = gcr_excl;
    } else {
        set_sctlr_el1_tcf0(tcf0);
        set_gcr_el1_excl(gcr_excl);
    }

    return 0;
}
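
The gcr_excl expression at the top of set_mte_ctrl turns the inclusion mask carried in the prctl() argument into the exclusion mask that GCR_EL1 expects. The sketch below reproduces just that expression so it can be tried with sample values. The value 0xffff for SYS_GCR_EL1_EXCL_MASK is an assumption (one bit per tag), and gcr_excl_from_ctrl is a hypothetical name.

```c
#include <assert.h>

#define PR_MTE_TAG_SHIFT      3
#define PR_MTE_TAG_MASK       (0xffffUL << PR_MTE_TAG_SHIFT)
/* Assumed value: one exclude bit for each of the 16 tags. */
#define SYS_GCR_EL1_EXCL_MASK 0xffffUL

/* Invert the user-supplied inclusion mask into an exclusion mask. */
unsigned long gcr_excl_from_ctrl(unsigned long arg)
{
    return ~((arg & PR_MTE_TAG_MASK) >> PR_MTE_TAG_SHIFT) &
           SYS_GCR_EL1_EXCL_MASK;
}

void check_gcr_excl(void)
{
    /* Include tags 1..15: only tag 0 ends up excluded. */
    assert(gcr_excl_from_ctrl(0xfffeUL << PR_MTE_TAG_SHIFT) == 0x0001UL);
    /* Include nothing: every tag is excluded. */
    assert(gcr_excl_from_ctrl(0) == 0xffffUL);
    /* Include only tags 1 and 2. */
    assert(gcr_excl_from_ctrl(0x0006UL << PR_MTE_TAG_SHIFT) == 0xfff9UL);
}
```

The second case matches the "all non-zero tags excluded by default" comment in the __cpu_setup excerpt in spirit: a task that never sets an inclusion mask has no tags available for random generation.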

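As set_mte_ctrl shows, when the target is the current task, the sync/async tag check mode and the set of tags to exclude are written directly to SCTLR_EL1.TCF0 and GCR_EL1.EXCL. Otherwise they are held in the task structure for the time being, in thread.sctlr_tcf0 and thread.gcr_user_excl.
So when do the values in that structure reach the registers? At context switch.

In arch/arm64/kernel/process.c, __switch_to calls mte_thread_switch(next), which is implemented in arch/arm64/kernel/mte.c: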

void mte_thread_switch(struct task_struct *next)
{
    if (!system_supports_mte())
        return;

    /* avoid expensive SCTLR_EL1 accesses if no change */
    if (current->thread.sctlr_tcf0 != next->thread.sctlr_tcf0)
        update_sctlr_el1_tcf0(next->thread.sctlr_tcf0);
}

Memory Safety and Hardware Memory Tagging (1)
Memory Safety and Hardware Memory Tagging (2)
Memory Safety and Hardware Memory Tagging (3)
Memory Safety and Hardware Memory Tagging (4)
Linux kernel MTE documentation
