

## **Cloud Operating Systems**

**Daniel Gruss** 

2024-03-04

## Moving to the cloud can save up to 87% of IT energy



## **Cloud means Efficiency**

- Processes used to have access to all physical memory $\rightarrow$ that's not efficient!
  - $\rightarrow$ Virtualize memory $\rightarrow$ processes can share the resources of one machine and utilize it better
- Processes all need the same pages $\rightarrow$ that's not efficient!
  - $\rightarrow$ Let them share memory, using copy-on-write (COW), page deduplication, etc.
- Processes often cannot do anything but wait $\rightarrow$ that's not efficient!
  - $\rightarrow$ Let other processes run in between
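
The sharing idea can be illustrated with a toy model. This is only a sketch under stated assumptions: the `DedupMemory` class, its method names, and the use of SHA-256 as the content key are all hypothetical and chosen for illustration; real kernels compare page contents and manipulate page tables instead.

```python
import hashlib

PAGE_SIZE = 4096  # bytes; a common x86 page size


class DedupMemory:
    """Toy model: identical pages are stored once and shared until written."""

    def __init__(self):
        self.frames = {}    # content hash -> [page bytes, reference count]
        self.mappings = {}  # (process id, virtual page number) -> content hash

    def _release(self, key):
        # Drop one reference; free the frame when nobody maps it anymore.
        frame = self.frames[key]
        frame[1] -= 1
        if frame[1] == 0:
            del self.frames[key]

    def map_page(self, pid, vpn, data):
        assert len(data) == PAGE_SIZE
        old = self.mappings.get((pid, vpn))
        key = hashlib.sha256(data).hexdigest()
        if key in self.frames:
            self.frames[key][1] += 1      # deduplication: reuse existing frame
        else:
            self.frames[key] = [data, 1]  # first copy of this content
        self.mappings[(pid, vpn)] = key
        if old is not None:
            self._release(old)

    def write(self, pid, vpn, offset, value):
        # Copy-on-write: the writer gets its own copy, other sharers keep theirs.
        key = self.mappings[(pid, vpn)]
        page = bytearray(self.frames[key][0])
        page[offset] = value
        self.map_page(pid, vpn, bytes(page))

    def read(self, pid, vpn):
        return self.frames[self.mappings[(pid, vpn)]][0]

    def frames_in_use(self):
        return len(self.frames)
```

Two processes mapping the same zero-filled page occupy one frame in this model; a write by one of them splits the mapping into two frames without affecting the other process.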





- Efficiency
- Isolation of tenants (security, reliability, availability)
- Abstraction of hardware

**Virtualization** represents the resources of a computer in a way that lets them be used easily, without the need to know the details of their properties

- Decouple operating system from hardware
  - "computer in computer" implemented in software
  - includes devices (network, keyboard, sound...)
- OS in VM "sees" its hardware, irrespective of the actual hardware in use
- OS does not know if HW is concurrently used by other VMs







- Cheaper hardware: one server for one task was common
- most of these servers were idle 90% of the time
- cost issue:
  - support, maintenance
  - power consumption (operation, cooling)
  - space
- Virtualization allows consolidation
  - multiple servers on one box
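
The consolidation argument can be made concrete with a little arithmetic. The sketch below assumes equally sized machines and a load that simply adds up; the function name `hosts_needed` and the example numbers (20 servers, a 50% target utilization) are illustrative assumptions, only the 10% utilization follows from the 90% idle time above.

```python
def hosts_needed(num_servers, avg_utilization_pct, target_utilization_pct):
    """Hosts required to carry the combined load at the target utilization."""
    total_load_pct = num_servers * avg_utilization_pct
    # Integer ceiling division: a partially filled host still has to be bought.
    return -(-total_load_pct // target_utilization_pct)


# 20 servers that are busy 10% of the time, consolidated onto hosts run at 50%.
print(hosts_needed(20, 10, 50))
```

Real consolidation ratios also depend on memory, I/O, and load peaks, which this sketch ignores.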

## Advantages







- Better hardware utilization
- Lower administration cost
- Long-term operation of older applications
- Less downtime
- Simple migration to more powerful hardware







- Performance cost: slower I/O operation
- single point of failure: requires better hardware reliability
- security gets more complex

- Virtualization no significant role in internet hosting
- often PaaS
- Web hosts (FTP access, HTTP website)
- Isolation on the OS level (tenants as users)
- no hardware support  $\rightarrow$  expensive + many problems

- OS-level Virtualization
- Para-Virtualization
- Full Virtualization
- Hardware-Assisted Virtualization

- integrated into kernel
- all application software intended to run in a virtual environment get strictly separated virtual runtime environments (container, jail)
- no separate kernels, only process-level virtualization
- cannot run other OSes; only usable for applications
- examples: OpenVZ, Docker, (s)chroot



- Cooperation with OS: OS is aware of virtualization
- needs to modify guest
- not usable for closed source OSes

- OS not aware of being virtualized
- no need to adapt guest
- reduced performance
  - up to 25%
- full virtualization of HW required (e.g., emulation via qemu)
  - virtual machines not allowed to access physical components
  - every physical component has to be virtualized and requires drivers in OS

- Guest no longer runs in kernel mode (Ring 0)
  - parts that require kernel privileges won't run
- hypervisor (VMM) changes binaries of guest-OS on the fly
- allows supporting any OS
  - no need to change source
- high performance penalty

- First full x86 virtualization
- hypervisor continuously reads program code before it is executed (pre-scan)
- looking for relevant instructions
  - change of system state
  - instructions depending on CPU state
- sets a breakpoint and lets the OS run
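
The pre-scan idea can be sketched on a toy instruction stream. Everything here is a simplification for illustration: instructions are plain mnemonic strings, the `SENSITIVE` set is a small, non-exhaustive selection, and `prescan` and `run` are hypothetical names, not part of any real hypervisor.

```python
# A few x86 instructions that change or depend on privileged CPU state.
# Illustrative selection only, not a complete list.
SENSITIVE = {"cli", "sti", "popf", "pushf", "sgdt", "sidt", "lgdt", "lidt"}


def prescan(block):
    """Return the indices at which a breakpoint must be placed."""
    return {
        index
        for index, insn in enumerate(block)
        if insn.split()[0].lower() in SENSITIVE
    }


def run(block):
    """Run the block: harmless code executes natively, the rest is emulated."""
    breakpoints = prescan(block)
    trace = []
    for index, insn in enumerate(block):
        if index in breakpoints:
            trace.append(("emulated", insn))  # breakpoint hit: hypervisor takes over
        else:
            trace.append(("native", insn))    # guest code runs directly on the CPU
    return trace


for mode, insn in run(["mov eax, 1", "cli", "add eax, 2", "popf"]):
    print(f"{mode:9} {insn}")
```

In this model only `cli` and `popf` are handed to the hypervisor; the other instructions run untouched, which is why the approach needs no changes to the guest source.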







- Various problems had to be solved when virtualizing on IA-32:
  - Ring Problems
  - Address Space Compression
  - Non-Faulting Access to Privileged State
  - SYSENTER / SYSEXIT
  - Interrupt Virtualization
  - Hidden States

- usually: applications run in ring 3, kernel in ring 0
- guest may not run in ring 0
- ring de-privileging needed: guest must run in ring >0
  - most often 1 or 3

- guest has to run in a ring it has not been developed for
- certain instructions expose the privilege level in their result (e.g., PUSH CS)
- the guest OS can find out which ring it is running in
- may result in various problems
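
Why PUSH CS leaks the ring: the two lowest bits of the CS selector hold the current privilege level. A minimal sketch follows; the function name and the example selector values are chosen for illustration only.

```python
def privilege_level(cs_selector):
    """Extract the privilege level from the low two bits of a CS selector."""
    return cs_selector & 0b11


# Hypothetical selector values as a guest might see them after PUSH CS:
print(privilege_level(0x10))  # selector with low bits 00: ring 0
print(privilege_level(0x19))  # selector with low bits 01: ring 1 (de-privileged guest kernel)
print(privilege_level(0x33))  # selector with low bits 11: ring 3
```

A guest kernel that expects ring 0 but reads back a selector ending in `01` can therefore detect that it has been de-privileged.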

- Guest expects to have full address space available
- hypervisor requires part of address space
  - control structures for switching between guest and hypervisor
- The guest is not allowed to access these areas: an access invokes a switch to the hypervisor, which has to emulate it

- unprivileged software may not access certain elements of the CPU state
- access by guest results in fault: hypervisor can emulate instructions
- IA-32 possesses instructions that do not induce a fault:
  - Registers GDTR, IDTR, LDTR and TR are only modifiable in ring 0
  - but the instructions that read them (SGDT, SIDT, SLDT, STR) can be executed in any ring without a fault

- special instructions for fast system calls
- SYSENTER always switches to ring 0
- SYSEXIT can only be executed in ring 0
- running the guest kernel in ring 1 is thus problematic
  - SYSENTER switches to hypervisor $\rightarrow$ has to emulate
  - SYSEXIT switches to hypervisor $\rightarrow$ has to emulate

- interrupts can be masked (so they do not occur if not welcome)
- controlled by the IF flag in the EFLAGS register
- interrupts are managed by the hypervisor (VMM), though
- a change of IF $\rightarrow$ fault to the hypervisor
- OSes do this quite often $\rightarrow$ performance problem
- forwarding of virtual interrupts must take IF into account
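
The bookkeeping a hypervisor has to do for the guest's interrupt flag can be sketched as follows. This is a simplified model, not real hypervisor code: the `VirtualCPU` class and its methods are hypothetical, and details such as the one-instruction delay after STI are ignored.

```python
EFLAGS_IF = 1 << 9  # the interrupt flag is bit 9 of EFLAGS


class VirtualCPU:
    """Toy model of a hypervisor tracking the guest's view of EFLAGS.IF."""

    def __init__(self):
        self.virtual_eflags = 0  # the guest starts with interrupts masked
        self.pending = []        # virtual interrupts waiting for delivery
        self.delivered = []      # virtual interrupts injected into the guest
        self.exits = 0           # number of traps into the hypervisor

    def guest_cli(self):
        # The guest's CLI faults to the hypervisor, which clears the virtual IF.
        self.exits += 1
        self.virtual_eflags &= ~EFLAGS_IF

    def guest_sti(self):
        # The guest's STI faults too; held-back interrupts may now be delivered.
        self.exits += 1
        self.virtual_eflags |= EFLAGS_IF
        self._deliver_pending()

    def raise_interrupt(self, vector):
        self.pending.append(vector)
        self._deliver_pending()

    def _deliver_pending(self):
        # Forwarding must respect the guest's IF, not the physical one.
        if self.virtual_eflags & EFLAGS_IF:
            self.delivered.extend(self.pending)
            self.pending.clear()
```

Every `guest_cli` or `guest_sti` call costs one trap in this model, which mirrors the performance problem noted above.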







- Not all state information is accessible via registers
- such state cannot be saved and restored when switching between VMs

- Two new operating modes:
  - VMX root operation
    - for hypervisor
  - VMX non-root operation
    - controlled by hypervisor
    - supports VMs
- Both modes have ring 0-3
- guest can run in ring 0
- hypervisor said to be running in "ring -1"







## VMM Transitions



- VM entry: root operation  $\rightarrow$  non-root operation
- VM exit: non-root operation $\rightarrow$ root operation
- VMCS: Virtual Machine Control Structure
  - Guest-state area
  - Host-state area
- Entry/exit loads/saves information using the proper area
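
The two transitions can be modelled in a few lines. This is a sketch with hypothetical names (`Vmcs`, `Cpu`, and plain dictionaries as register files); it assumes that VM entry loads the guest-state area, while VM exit saves into it and then loads the host-state area, which the hypervisor has filled in beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class Vmcs:
    guest_state: dict = field(default_factory=dict)  # guest-state area
    host_state: dict = field(default_factory=dict)   # host-state area


class Cpu:
    def __init__(self, registers):
        self.registers = dict(registers)
        self.root_mode = True  # start in VMX root operation (hypervisor)

    def vm_entry(self, vmcs):
        """Root -> non-root: load the guest's registers from the VMCS."""
        if not self.root_mode:
            raise RuntimeError("VM entry is only possible from root operation")
        self.registers = dict(vmcs.guest_state)
        self.root_mode = False

    def vm_exit(self, vmcs):
        """Non-root -> root: save the guest's registers, load the host's."""
        if self.root_mode:
            raise RuntimeError("VM exit is only possible from non-root operation")
        vmcs.guest_state = dict(self.registers)
        self.registers = dict(vmcs.host_state)
        self.root_mode = True
```

After a `vm_entry` and `vm_exit` pair, the guest's progress (for example an advanced instruction pointer) is preserved in `guest_state`, and the hypervisor continues with exactly the register values stored in `host_state`.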

- Contains the elements that make up the state of the virtual CPU associated with a VMCS
- VM exit requires loading certain registers (like segment registers, CR3, IDTR...)
- GSA contains fields for these registers
- GSA contains fields for other information not readable via registers
  - e.g., interruptibility state

Legend: natural-width fields, 16-bit fields, 32-bit fields, 64-bit fields.

CopyLeft 2017, @Noteworthy (Intel manual of July 2017)

## **GUEST STATE AREA**

| Field group | Contents |
|---|---|
| Control registers | CR0, CR3, CR4 |
| Debug register | DR7 |
| Stack, instruction pointer, flags | RSP, RIP, RFLAGS |
| CS, SS, DS, ES, FS, GS, LDTR, TR | Selector, base address, segment limit, access rights |
| GDTR, IDTR | Base address, limit |
| MSRs | IA32_DEBUGCTL, IA32_SYSENTER_CS, IA32_SYSENTER_ESP, IA32_SYSENTER_EIP, IA32_PERF_GLOBAL_CTRL |
| SMBASE | |
| Non-register state | Activity state, interruptibility state |
| Pending debug exceptions | |
| VMCS link pointer | |
| VMX-preemption timer value | |
| Page-directory-pointer-table entries | PDPTE0, PDPTE1, PDPTE2, PDPTE3 |
| Guest interrupt status | |
| PML index | |

## HOST STATE AREA

| Field group | Contents |
|---|---|
| Control registers | CR0, CR3, CR4 |
| Stack and instruction pointer | RSP, RIP |
| CS, SS, DS, ES | Selector |
| FS, GS, TR | Selector, base address |
| GDTR, IDTR | Base address |
| MSRs | IA32_SYSENTER_CS, IA32_SYSENTER_ESP, IA32_SYSENTER_EIP, IA32_PERF_GLOBAL_CTRL, IA32_PAT, IA32_EFER |

- Addressed using physical addresses
- not part of guest address space
- hypervisor may run in a different address space than the guest (CR3 is part of the state)
- VM-exits leave detailed information on reason for exit in VMCS
  - exit reason
  - exit qualification

## VM-EXIT CONTROL FIELDS

| Field group | Fields |
|---|---|
| VM-Exit Controls | Save debug controls, Host address-space size, Load IA32_PERF_GLOBAL_CTRL, Acknowledge interrupt on exit, Save IA32_PAT, Load IA32_PAT, Save IA32_EFER, Load IA32_EFER, Save VMX-preemption timer value, Clear IA32_BNDCFGS, Conceal VM exits from Intel PT |
| VM-Exit Controls for MSRs | VM-exit MSR-store count, VM-exit MSR-store address, VM-exit MSR-load count, VM-exit MSR-load address |

## VM-EXIT INFORMATION FIELDS

| Field group | Fields |
|---|---|
| Basic VM-Exit Information | Exit reason, Exit qualification, Guest-linear address, Guest-physical address |
| VM Exits Due to Vectored Events | VM-exit interruption information, VM-exit interruption error code |
| VM Exits That Occur During Event Delivery | IDT-vectoring information, IDT-vectoring error code |
| VM Exits Due to Instruction Execution | VM-exit instruction length, VM-exit instruction information, I/O RCX, I/O RSI, I/O RDI, I/O RIP |
| VM-instruction error field | |

- Example: MOV CR
- Exit reason: "control register access"
- Exit qualification:
  - which CR
  - direction (Rx $\rightarrow$ CR or CR $\rightarrow$ Rx)
  - register used
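
A minimal sketch of how a hypervisor might unpack the exit qualification for such a control-register access. The bit positions follow the layout the Intel SDM documents for this exit reason (CR number in bits 3:0, access type in bits 5:4, general-purpose register in bits 11:8); the function and table names are illustrative assumptions, not part of any real API.

```python
# Register encoding used by the exit qualification (0 = RAX ... 15 = R15).
GPR_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + [
    f"r{i}" for i in range(8, 16)
]
ACCESS_TYPES = {0: "mov to cr", 1: "mov from cr", 2: "clts", 3: "lmsw"}


def decode_cr_access(qualification: int) -> dict:
    """Split a control-register-access exit qualification into its fields.

    The "gpr" field is only meaningful for the two MOV CR access types.
    """
    return {
        "cr": qualification & 0xF,                           # which CR
        "access": ACCESS_TYPES[(qualification >> 4) & 0x3],  # direction
        "gpr": GPR_NAMES[(qualification >> 8) & 0xF],        # register used
    }


# Illustrative value: the guest executed MOV CR3, RCX.
info = decode_cr_access(0x103)
print(info)
```

The three decoded fields correspond to the three sub-bullets: which CR, the direction, and the register used.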

| CONTROL FIELDS | |
|---|---|
| Pin-based VM-execution controls | External-interrupt exiting, NMI exiting, Virtual NMIs, Activate VMX-preemption timer, Process posted interrupts |
| Primary processor-based VM-execution controls | Interrupt-window exiting, Use TSC offsetting, HLT exiting, INVLPG exiting, MWAIT exiting, RDPMC exiting, RDTSC exiting, CR3-load exiting, CR3-store exiting, CR8-load exiting, CR8-store exiting, Use TPR shadow, NMI-window exiting, MOV-DR exiting, Unconditional I/O exiting, Use I/O bitmaps, Monitor trap flag, Use MSR bitmaps, MONITOR exiting, PAUSE exiting, Activate secondary controls |
| Secondary processor-based VM-execution controls | Virtualize APIC accesses, Enable EPT, Descriptor-table exiting, Enable RDTSCP, Virtualize x2APIC mode, Enable VPID, WBINVD exiting, Unrestricted guest, APIC-register virtualization, Virtual-interrupt delivery, PAUSE-loop exiting, RDRAND exiting, Enable INVPCID, Enable VM functions, VMCS shadowing, Enable ENCLS exiting, RDSEED exiting, Enable PML, EPT-violation #VE, Conceal VMX non-root operation from Intel PT, Enable XSAVES/XRSTORS, Mode-based execute control for EPT, Use TSC scaling |
| | Exception Bitmap, I/O-Bitmap Addresses, TSC-offset |
| | Guest/Host Masks for CR0, Guest/Host Masks for CR4, Read Shadows for CR0, Read Shadows for CR4 |
| | CR3-target value 0, CR3-target value 1, CR3-target value 2, CR3-target value 3, CR3-target count |
| APIC Virtualization | APIC-access address, Virtual-APIC address, TPR threshold, EOI-exit bitmap 0, EOI-exit bitmap 1, EOI-exit bitmap 2, EOI-exit bitmap 3, Posted-interrupt notification vector, Posted-interrupt descriptor address |
| | Read bitmap for low MSRs, Read bitmap for high MSRs, Write bitmap for low MSRs |
| | Executive-VMCS Pointer, Extended-Page-Table Pointer, Virtual-Processor Identifier |
| | PLE_Gap, PLE_Window, VM-function controls, VMREAD bitmap, VMWRITE bitmap |
| | ENCLS-exiting bitmap, PML address |
| | Virtualization-exception information address, EPTP index, XSS-exiting bitmap |
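
Not every control can be chosen freely: the CPU reports through capability MSRs which bits must be 1 and which may be 1. The sketch below shows the usual adjustment for a 32-bit control word, assuming the layout of the `IA32_VMX_*_CTLS` MSRs (low half: bits that must be 1, high half: bits that may be 1). The three bit positions are the ones I recall from the Intel SDM for the primary processor-based controls; the capability value in the usage example is made up.

```python
# Bit positions in the primary processor-based VM-execution controls.
HLT_EXITING = 1 << 7
USE_MSR_BITMAPS = 1 << 28
ACTIVATE_SECONDARY_CONTROLS = 1 << 31


def adjust_controls(desired: int, capability_msr: int) -> int:
    """Force required bits on and drop bits the CPU does not support."""
    must_be_one = capability_msr & 0xFFFFFFFF  # allowed 0-settings
    may_be_one = capability_msr >> 32          # allowed 1-settings
    return (desired | must_be_one) & may_be_one


# Made-up capability value, for illustration only.
capability = (0xFFF9FFFE << 32) | 0x0401E172
wanted = HLT_EXITING | USE_MSR_BITMAPS | ACTIVATE_SECONDARY_CONTROLS
print(hex(adjust_controls(wanted, capability)))
```

The result is the value a hypervisor would then write into the corresponding VMCS control field.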

- Virtualization Hardware Extensions for Intel and AMD
- $\rightarrow\,$  substantially lower overheads for VMs
- $\rightarrow\,$  better isolation
- $\rightarrow\,$  IaaS VMs become widely used

- Support for interrupt virtualization
  - VM exit on every external interrupt (cannot be masked by the guest)
  - VM exit when the guest OS is ready to accept interrupts (EFLAGS.IF==1)
- Support for CR0 and CR4 virtualization
  - VM exit when the guest changes these registers
  - the guest/host masks select which bits trigger the exit
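
A small sketch of the per-bit selection for CR0 and CR4, assuming the guest/host mask and read shadow semantics described in the Intel SDM: bits set in the mask are owned by the hypervisor, guest reads of those bits return the read shadow, and a guest write exits only if it would change an owned bit relative to the shadow. Function names are illustrative.

```python
CR0_PE = 1 << 0   # protection enable
CR0_TS = 1 << 3   # task switched
CR0_PG = 1 << 31  # paging


def guest_read(real: int, mask: int, shadow: int) -> int:
    """Value the guest sees: host-owned bits come from the read shadow."""
    return (real & ~mask) | (shadow & mask)


def write_causes_exit(new_value: int, mask: int, shadow: int) -> bool:
    """A MOV to CR exits if it changes any host-owned bit."""
    return ((new_value ^ shadow) & mask) != 0


mask = CR0_PE | CR0_PG            # hypervisor owns PE and PG
shadow = CR0_PE                   # guest believes paging is off
real = CR0_PE | CR0_PG | CR0_TS   # hardware actually runs with paging on

print(hex(guest_read(real, mask, shadow)))
print(write_causes_exit(CR0_PE | CR0_PG, mask, shadow))
```

In this example toggling `CR0_TS` stays inside the guest, while enabling paging traps to the hypervisor.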

- Address Space Compression
  - the address space is switched on every transition between guest and hypervisor
  - the guest owns its full virtual address space
- Ring Problems, SYSENTER/SYSEXIT
  - the guest can now run in ring 0

- Non-faulting Access to Privileged State
  - such accesses now trap into the hypervisor (VM exit)
- Hidden State
  - saved into the VMCS

- Hypervisor uses virtual memory
- guest OS uses virtual memory
- hardware supports page tables
- how does this work?
  - shadow page tables
  - hardware support



All problems in computer science can be solved by another level of indirection.

But that usually will create another problem.

David Wheeler





## and in 64 bit...



## Combined Paging









- merges both page tables into one that the HW uses
- when guest changes own page table
  - Hypervisor has to catch access
  - update shadow page table

- when HW changes shadow page table
  - update guest PT
- expensive!
  - page faults caught by hypervisor
  - must run through guest PTs
  - must emulate accessed and modified bits for guest
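
A toy model of the shadow page table idea, using single-level dictionaries instead of real multi-level page tables. The guest page table maps guest-virtual pages to guest-physical frames, the hypervisor's map assigns guest-physical frames to host-physical frames, and the shadow table is the merged mapping the hardware would use. All names and numbers are illustrative.

```python
PAGE_SIZE = 4096


def build_shadow(guest_pt: dict, host_map: dict) -> dict:
    """Merge guest-virtual -> guest-physical with guest-physical -> host-physical."""
    shadow = {}
    for gv_page, gp_frame in guest_pt.items():
        if gp_frame in host_map:  # unbacked guest frames stay unmapped
            shadow[gv_page] = host_map[gp_frame]
    return shadow


def translate(shadow: dict, guest_virtual_address: int) -> int:
    """Translate in one step, as the hardware does with the shadow table."""
    page, offset = divmod(guest_virtual_address, PAGE_SIZE)
    return shadow[page] * PAGE_SIZE + offset


guest_pt = {0: 5, 1: 7}    # maintained by the guest OS
host_map = {5: 42, 7: 13}  # maintained by the hypervisor
shadow = build_shadow(guest_pt, host_map)
print(hex(translate(shadow, 0x1234)))

# The guest edits its own page table: the hypervisor has to catch the
# write and bring the shadow table up to date again.
guest_pt[2] = 5
shadow = build_shadow(guest_pt, host_map)
```

Rebuilding the whole table on every guest update is the simplest possible policy; it only illustrates why each guest page-table write costs a trip through the hypervisor.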



"guest page walk"



## Nested PT (NPT, AMD) / Extended PT (EPT, Intel)

- lots of memory accesses....
- but how many exactly?















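max. number of memory accesses per address translation

- 5 on guest level (4 page-table levels + the final access)
- each induces 5 on host level (4 nested page-table levels + the access itself)
- makes 25! (24 for the translation alone, plus the data access)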
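
The count can be reproduced with a short calculation, assuming 4-level paging on both the guest and the host side and no TLB or paging-structure cache hits. The function name and the return convention are illustrative.

```python
def nested_walk_accesses(guest_levels: int = 4, host_levels: int = 4) -> tuple:
    """Return (accesses for translation only, accesses including the data access)."""
    # Every guest-physical address needs a full nested walk plus the access itself.
    per_guest_physical_address = host_levels + 1
    # One guest-physical address per guest page-table level, plus the final page.
    guest_physical_addresses = guest_levels + 1
    total = guest_physical_addresses * per_guest_physical_address
    return total - 1, total


print(nested_walk_accesses())      # 4-level guest, 4-level host
print(nested_walk_accesses(5, 5))  # 5-level paging on both sides
```

The first element is the number commonly quoted for the two-dimensional walk itself; the second also counts the final load or store.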

**Guest Page Walk** 



- depending on application: 3.9-4.6 times slower
- but: the TLB caches complete translations, so the nested walk only happens on a TLB miss

- EPT only used if VM active
- Translations tagged in TLB with EPT base pointer
  - differentiate TLB-entries of different VMs
  - TLB-flush per guest possible
- VPID: virtual processor ID
  - unique value for each VM
  - translations tagged in TLB using VPID
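
A toy model of a tagged TLB, to show why tagging avoids a full flush on every VM switch. Real TLBs are set-associative hardware structures; the dictionary, the class name and the method names here are purely illustrative.

```python
class TaggedTLB:
    """TLB model whose entries are keyed by (VPID, virtual page)."""

    def __init__(self):
        self.entries = {}

    def insert(self, vpid: int, virtual_page: int, frame: int) -> None:
        self.entries[(vpid, virtual_page)] = frame

    def lookup(self, vpid: int, virtual_page: int):
        """Return the cached frame, or None on a TLB miss."""
        return self.entries.get((vpid, virtual_page))

    def flush_vpid(self, vpid: int) -> None:
        """Drop only the translations of one VM."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != vpid}


tlb = TaggedTLB()
tlb.insert(vpid=1, virtual_page=0x10, frame=0xAA)
tlb.insert(vpid=2, virtual_page=0x10, frame=0xBB)  # same page, other VM
tlb.flush_vpid(1)
print(tlb.lookup(1, 0x10), tlb.lookup(2, 0x10))
```

Both VMs can cache a translation for the same virtual page without conflict, and flushing one VM leaves the other VM's entries in place.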

| Entry | Address field | Low-order bits |
|---|---|---|
| EPTP | Address of EPT PML4 table | EPT PS MT, EPT PWL-1, A/D |
| PML4E: present | Address of EPT page-directory-pointer table | X W R |
| PML4E: not present | ignored | 0 0 0 |
| PDPTE: 1GB page | Physical address of 1GB page | X W R, EPT MT |
| PDPTE: page directory | Address of EPT page directory | X W R |
| PDPTE: not present | ignored | 0 0 0 |
| PDE: 2MB page | Physical address of 2MB page | X W R, EPT MT |
| PDE: page table | Address of EPT page table | X W R |
| PDE: not present | ignored | 0 0 0 |
| PTE: 4KB page | Physical address of 4KB page | X W R, EPT MT |
| PTE: not present | ignored | 0 0 0 |

Figure 28-1. Formats of EPTP and EPT Paging-Structure Entries
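
A minimal sketch of decoding one EPT entry that points to a 4KB page or to the next table, assuming the layout from the figure: R, W and X in bits 0 to 2 and the address in the bits from 12 upwards. The physical-address width of 52 bits is an assumption (the real width is CPU-specific), mode-based execute control is ignored, and the names are illustrative.

```python
EPT_READ = 1 << 0
EPT_WRITE = 1 << 1
EPT_EXECUTE = 1 << 2
MAX_PHYS_ADDR_BITS = 52  # assumption; the real width is reported by CPUID
ADDRESS_MASK = ((1 << MAX_PHYS_ADDR_BITS) - 1) & ~0xFFF


def decode_ept_entry(entry: int) -> dict:
    """Decode permissions and address of a 4KB PTE or a table pointer."""
    return {
        "present": (entry & 0x7) != 0,  # not present iff R = W = X = 0
        "read": bool(entry & EPT_READ),
        "write": bool(entry & EPT_WRITE),
        "execute": bool(entry & EPT_EXECUTE),
        "address": entry & ADDRESS_MASK,
    }


print(decode_ept_entry(0x12345007))
```

An access that is not permitted by these bits leads to an EPT violation, which the hypervisor sees as a VM exit.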

1. Enable VMX via CR4 (set CR4.VMXE)
2. Allocate a VMXON region and use the VMXON instruction
3. Allocate an MSR Bitmap region (we don't want a trap for all MSRs)
4. Allocate a VMCS region and use the VMCLEAR instruction on it
5. Execute VMPTRLD to make this VMCS the "current VMCS"
6. Set up the VMCS (using VMWRITEs)
7. Use the VMLAUNCH instruction
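
The ordering constraints behind these steps can be sketched as a toy state machine. This is a simplified model, not real VMX code: it only tracks the preconditions I am confident about (VMXON needs CR4.VMXE, VMWRITE and VMLAUNCH need a current VMCS, VMLAUNCH needs a VMCS in the "clear" launch state). The class, method and field names are hypothetical.

```python
class VmxModel:
    """Tracks just enough state to check the order of the setup steps."""

    def __init__(self):
        self.cr4_vmxe = False
        self.vmx_on = False
        self.current = None      # name of the current VMCS
        self.launch_state = {}   # VMCS name -> "clear" or "launched"
        self.fields = {}

    def enable_vmx(self):
        self.cr4_vmxe = True

    def vmxon(self):
        if not self.cr4_vmxe:
            raise RuntimeError("#UD: CR4.VMXE is not set")
        self.vmx_on = True

    def vmclear(self, vmcs):
        self._require_vmx_on()
        self.launch_state[vmcs] = "clear"
        if self.current == vmcs:  # VMCLEAR invalidates a current VMCS
            self.current = None

    def vmptrld(self, vmcs):
        self._require_vmx_on()
        self.current = vmcs

    def vmwrite(self, field, value):
        self._require_current()
        self.fields[(self.current, field)] = value

    def vmlaunch(self):
        self._require_current()
        if self.launch_state.get(self.current) != "clear":
            raise RuntimeError("VMfail: launch state is not clear")
        self.launch_state[self.current] = "launched"

    def _require_vmx_on(self):
        if not self.vmx_on:
            raise RuntimeError("#UD: not in VMX operation")

    def _require_current(self):
        self._require_vmx_on()
        if self.current is None:
            raise RuntimeError("VMfail: no current VMCS")


cpu = VmxModel()
cpu.enable_vmx()
cpu.vmxon()
cpu.vmclear("vmcs0")
cpu.vmptrld("vmcs0")
cpu.vmwrite("guest_rip", 0x1000)  # hypothetical field name
cpu.vmlaunch()
print(cpu.launch_state)
```

Skipping a step in the usage sequence raises the corresponding error, which mirrors the kind of failure a hypervisor would see on real hardware.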

1. user needs help for some operations (e.g., HW interaction)
   $\rightarrow\,$  can use a syscall!
2. What about VMs?
3. Same concept, different level:
   $\rightarrow\,$  Hypercalls! via the <code>vmcall</code> instruction
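
A sketch of how the hypervisor side of a hypercall could look, by analogy with a syscall table. The calling convention (hypercall number in RAX, first argument in RBX, result back in RAX) and the hypercall itself are assumptions made for this example; each hypervisor defines its own ABI. The exit reason number and the 3-byte length of `vmcall` are the values I recall from the Intel SDM.

```python
EXIT_REASON_VMCALL = 18
VMCALL_LENGTH = 3  # vmcall is encoded as 0F 01 C1


def hc_putchar(guest: dict, character: int) -> int:
    """Hypothetical hypercall: print one character on the guest console."""
    guest["console"].append(chr(character))
    return 0


HYPERCALLS = {0: hc_putchar}  # hypothetical hypercall numbers


def handle_exit(guest: dict, exit_reason: int, instruction_length: int) -> None:
    if exit_reason != EXIT_REASON_VMCALL:
        raise NotImplementedError(f"exit reason {exit_reason}")
    handler = HYPERCALLS.get(guest["rax"])
    guest["rax"] = handler(guest, guest["rbx"]) if handler else -1
    # Skip the vmcall instruction, otherwise the guest would execute it again.
    guest["rip"] += instruction_length


guest = {"rax": 0, "rbx": ord("A"), "rip": 0x1000, "console": []}
handle_exit(guest, EXIT_REASON_VMCALL, VMCALL_LENGTH)
print(guest["console"], hex(guest["rip"]))
```

On real hardware the instruction length would come from the VM-exit instruction length field rather than from a constant.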

Optimization

- Full virtualization is often not needed
- Serverless / Edge Computing (it's still a form of cloud computing)
- Virtualization is not free  $\rightarrow$  why not skip it and just use OS-level isolation?
- Context switches between processes are expensive  $\rightarrow$  why not skip process isolation and just use language-level isolation?

Cloud Operating Systems  $\rightarrow$  Hardware-assisted virtualization



# cts @gf\_256 · 5. Apr. 2020

Talk to your kids about hypervisors...before someone else does



- Seminar-style
- You code
- You plan
- You present

Fabian Rauscher, Jonas Juffinger, Daniel Gruss

- 100 P. = 100%
- 87.5 P.  $\rightarrow$  1
- 75 P.  $\rightarrow$  2
- 62.5 P.  $\rightarrow$  3
- 50 P.  $\rightarrow$  4

- 15 participants  $\rightarrow$  4 teams of 3-4 participants each (default)
- 5 ECTS = 125h per team member, i.e., 500h for a team of 4
- Team of 3? Same effort but +5 points
- Team of 2? Same effort but +10 points
- $\rightarrow\,$  send us your registration by Monday, March 11

- Deadlines: Friday 23:59
- Grace Period: 48 hours but no support

- 22.3. Structure Setup (Estimated Team Effort: 125h, Points: 5P.)
- 26.4. Executing Guest Code + Video Output (Estimated Team Effort: 125h, Points: 15P.)  $\rightarrow$  AG1
- 3.5. Interrupt + Emulate PIC + Public Feature Bidding (Estimated Team Effort: 100h, Points: 5P.)
- 24.5. Boot Guest SWEB Shell + Virtualize Disk + Private Feature Bidding (Estimated Team Effort: 75h, Points: 35P.)  $\rightarrow$  AG2
- 31.5. Feature PoC in Booted Guest SWEB (Estimated Team Effort: 75h, Points: 10P.)
- 14.6. Feature Implementation Done + Final Presentation and Demo in Booted Guest SWEB (Estimated Team Effort: 75h, Points: 30P.)  $\rightarrow$  AG3

• 14.6. Successful Live Presentation at 21:00, Bonus Points: 5P.