Linux Kernel Internals
Outline
Linux Introduction
Linux Kernel Architecture
Linux Kernel Components
Linux Introduction
History
Features
Resources
Features
Free
Open system
Open source
GNU GPL (General Public License)
POSIX standard
High portability
High performance
Robust
Large development toolset
Large number of device drivers
Large number of application programs
Features (Cont.)
Multi-tasking
Multi-user
Multi-processing
Virtual memory
Monolithic kernel
Loadable kernel modules
Networking
Shared libraries
Support for different file systems
Support for different executable file formats
Support for different networking protocols
Support for different architectures
Resources
Distributions
Books
Magazines
Web sites
FTP sites
BBS
Hardware
System Structure
[Figure: layers of the system, from top to bottom]
Processes
System calls interface
File systems: ext2fs, minix, iso9660, xiafs, nfs, proc, msdos
Central kernel: task management, scheduler, signals, loadable modules, memory management
Network manager: ipv4, ethernet, ...
Machine interface
Machine
Books
Understanding the Linux Kernel, D. P. Bovet and M. Cesati, O'Reilly & Associates, 2000.
Linux Core Kernel Commentary, In-Depth Code Annotation, S. Maxwell, Coriolis Open Press, 1999.
The Linux Kernel, Version 0.8-3, D. A. Rusling, 1998.
Linux Kernel Internals, 2nd edition, M. Beck et al., Addison-Wesley, 1998.
Linux Kernel, R. Card et al., John Wiley & Sons, 1998.
Bootstrap and System Initialization
Events From Power-On to Linux Kernel Running
Booting the PC (Events From Power-On)
Perform POST procedure
Select boot device
Load bootstrap program (bootsect.S) from floppy or HD
Bootstrap program
Hardware initialization (setup.S)
Loads the Linux kernel into memory (head.S)
Initializes the Linux kernel
Ends the bootstrap sequence by starting the first process, init
Runs a shell
Low-level Hardware Resource Handling
Interrupt handling
Trap/Exception handling
System call handling
Memory Management
It provides:
Large address spaces
Protection
Memory mapping
Fair physical memory allocation
Shared virtual memory
x86 Memory Management
Segmentation
Paging
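Segment Translation
[Figure: a logical address consists of a 16-bit selector (bits 15..0) and a 32-bit offset (bits 31..0). The selector picks a segment descriptor from the segment descriptor table; the descriptor's base address plus the offset gives the linear address.]
[Figure: a 32-bit linear address is split into a 10-bit directory index (Dir), a 10-bit page table index (Table), and a 12-bit offset. CR3 (PDBR) points to the page directory; the directory entry points to a page table; the page-table entry plus the offset gives the physical address in physical memory.]
[Figure: summary of the two stages: segment and segment descriptor, then page directory, page table, and page.]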
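The 10/10/12 split of the linear address can be tried out with a few lines of Python. This is only a sketch of the arithmetic: nested dicts stand in for the page directory and page tables, and the names split_linear and translate are made up for this example.

```python
PAGE_SHIFT = 12   # 12-bit offset, i.e. 4 KiB pages
INDEX_BITS = 10   # 10-bit directory index and 10-bit table index


def split_linear(addr):
    """Split a 32-bit linear address into (directory, table, offset)."""
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    table = (addr >> PAGE_SHIFT) & ((1 << INDEX_BITS) - 1)
    directory = (addr >> (PAGE_SHIFT + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    return directory, table, offset


def translate(page_directory, addr):
    """Walk a two-level table.

    page_directory is a hypothetical stand-in for the structure that CR3
    points to: {directory index: {table index: physical page base}}.
    A KeyError models a missing directory or page-table entry.
    """
    directory, table, offset = split_linear(addr)
    page_table = page_directory[directory]
    page_base = page_table[table]
    return page_base + offset
```

For example, translate({0x300: {0x101: 0x5000}}, 0xC0101234) walks directory entry 0x300 and table entry 0x101, then adds the offset 0x234 to the page base.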
[Figure: virtual page frame numbers VPFN0 to VPFN4 of two virtual memories mapped onto physical memory; the page-table entry is shown with the fields "Page Address" and "DA".]
Demand Paging
Loading virtual pages into memory as they are accessed
Page fault handling:
The faulting virtual address is invalid
The faulting virtual address is valid, but the page is not currently in memory
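The two page fault cases can be sketched as a small decision function. This is an illustration only, not kernel code: the names handle_page_fault, valid_ranges, resident, and load_page are hypothetical, and raising an exception stands in for the kernel signalling the process about an invalid access.

```python
PAGE_SIZE = 4096


class InvalidAddress(Exception):
    """Stands in for the kernel reporting an invalid access to the process."""


def handle_page_fault(addr, valid_ranges, resident, load_page):
    """Resolve a fault at virtual address addr.

    valid_ranges: list of (start, end) address ranges the process may use.
    resident:     dict mapping page number -> page contents already in memory.
    load_page:    callable that fetches a page's contents on demand.
    """
    # Case 1: the faulting address is not part of the process's address space.
    if not any(start <= addr < end for start, end in valid_ranges):
        raise InvalidAddress(hex(addr))
    # Case 2: the address is valid but the page is not in memory yet.
    page = addr // PAGE_SIZE
    if page not in resident:
        resident[page] = load_page(page)
    return resident[page]
```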
Swapping
When a process needs to bring a virtual page into physical memory and no free physical pages are available, a page must be removed to make room.
Linux uses a Least Recently Used (LRU) page aging technique to choose pages which might be removed from the system.
The Kernel Swap Daemon (kswapd) frees pages when physical memory runs low.
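One way to picture page aging is a per-page counter that grows when the page is used and shrinks on every pass of the swap daemon; the page with the lowest age is the best candidate for removal. The sketch below assumes this simple scheme; the function names and the constants boost and max_age are invented for the example and are not the kernel's values.

```python
def touch(ages, page, boost=3, max_age=20):
    """Record an access: the page gets younger (its age value rises)."""
    ages[page] = min(ages.get(page, 0) + boost, max_age)


def age_pass(ages):
    """One pass of the (simulated) swap daemon: every page grows older."""
    for page in ages:
        ages[page] = max(ages[page] - 1, 0)


def pick_victim(ages):
    """Choose the page with the lowest age as the candidate to swap out."""
    return min(ages, key=ages.get)
```

A page that is touched often keeps a high age value and survives; a page that is never touched again drifts toward zero and is picked first.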
Caches
To improve performance, Linux uses a number of memory management related caches:
Buffer Cache
Page Cache
Swap Cache
Hardware Caches (Translation Look-aside Buffers)
Page Allocation and Deallocation
Linux uses the Buddy algorithm to allocate and deallocate blocks of pages efficiently.
Pages are allocated in blocks whose sizes are powers of 2.
If the block of pages found is larger than requested, it must be broken down until there is a block of the right size.
The page deallocation code recombines pages into larger blocks of free pages whenever it can.
Whenever a block of pages is freed, the adjacent (buddy) block of the same size is checked to see if it is free; if so, the two are merged.
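The splitting and recombining steps can be sketched in a few lines. This is a toy model under stated assumptions, not the kernel's allocator: the class BuddyAllocator and its free_lists attribute are made up for the example, page numbers are plain integers, and a block of order k covers 2**k pages.

```python
class BuddyAllocator:
    """Toy buddy allocator over 2**max_order pages."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free_lists[k] holds the first page number of each free 2**k block.
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)

    def alloc(self, order):
        """Return the first page of a free 2**order block, or None."""
        k = order
        while k <= self.max_order and not self.free_lists[k]:
            k += 1                      # look for a larger free block
        if k > self.max_order:
            return None
        start = min(self.free_lists[k])
        self.free_lists[k].remove(start)
        while k > order:                # break the block down to size
            k -= 1
            self.free_lists[k].add(start + (1 << k))  # upper half stays free
        return start

    def free(self, start, order):
        """Free a block and merge it with its buddy while the buddy is free."""
        while order < self.max_order:
            buddy = start ^ (1 << order)   # buddy differs in exactly one bit
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].add(start)
```

With BuddyAllocator(3), two single-page allocations split the 8-page block into pieces of 4, 2, 1, and 1 pages; freeing both pages merges everything back into one 8-page block.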
[Figure: the address range from VMALLOC_START to VMALLOC_END, divided into allocated space and unallocated space.]
Process Management
What Is a Process?
A process is a program in execution.
A process includes the program's instructions and data, the program counter and all CPU registers, and the process stacks containing temporary data.
Each process runs in its own virtual address space and cannot interact with another process except through secure, kernel-managed mechanisms.
Linux Processes
Each process is represented by a task_struct data structure, containing:
Process state
Scheduling information
Identifiers
Inter-process communication
Times and timers
File system
Virtual memory
Processor-specific context
Process State
[Figure: process state diagram with the states ready, executing, stopped, and zombie; transitions are labeled creation, scheduling, signal, and termination.]
Process Relationship
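[Figure: family links between processes: a child points to its parent through p_pptr and p_opptr; siblings are linked through p_osptr and p_ysptr, from the oldest child to the youngest child.]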
Managing Tasks
[Figure: struct task_struct entries linked through next_task and prev_task, and reachable through the task array, pidhash, and tarray_freelist.]
Scheduling
As well as normal processes, Linux supports real-time processes.
The scheduler treats real-time processes differently from normal user processes.
Pre-emptive scheduling
Priority-based scheduling algorithm
Time slice: 200 ms
Schedule: select the most deserving process to run
Priority (weight):
Normal: counter
Real time: counter + 1000
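The weight rule above is easy to express in code. The sketch below follows the slide's formula literally; the Task class and the function names weight and pick_next are hypothetical and only illustrate how "most deserving" falls out of the weights.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    counter: int            # remaining time-slice ticks
    real_time: bool = False


def weight(task):
    """Weight as given on the slide: real-time tasks get a 1000 bonus."""
    return task.counter + 1000 if task.real_time else task.counter


def pick_next(run_queue):
    """Select the most deserving task: the one with the highest weight."""
    return max(run_queue, key=weight)
```

Because of the 1000 bonus, any runnable real-time task outweighs every normal task, however large the normal task's counter is within a 200 ms time slice.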
A Process's Files
[Figure: the current process's task_struct has a files entry that leads to its table of open files, whose entries refer to the table of i-nodes.]
Virtual Memory
A process's virtual memory contains executable code and data from many sources.
Processes can allocate (virtual) memory to use during their processing.
Demand paging is used: the virtual memory of a process is brought into physical memory only when the process attempts to use it.
[Figure: the code and data regions of a process, each described by a vm_area_struct with the fields vm_end, vm_start, vm_flags, vm_inode, vm_ops, and vm_next.]
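The chain of vm_area_struct entries can be modeled as a linked list that is searched for the region containing an address. The sketch below keeps only a few of the fields named in the figure; the VmArea class and the find_area function are stand-ins for illustration, and treating vm_end as exclusive is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VmArea:
    vm_start: int                       # first address of the region
    vm_end: int                         # first address after the region
    vm_flags: str                       # e.g. "r-x" for code, "rw-" for data
    vm_next: Optional["VmArea"] = None  # next region of the same process


def find_area(head, addr):
    """Follow vm_next until a region containing addr is found."""
    area = head
    while area is not None:
        if area.vm_start <= addr < area.vm_end:
            return area
        area = area.vm_next
    return None  # addr is outside every region: an invalid address
```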
Executing Programs
Programs and commands are normally executed by a command interpreter.
A command interpreter is a user process like any other and is called a shell, e.g. sh, bash, and tcsh.
Executable object files contain executable code and data, together with the information the OS needs to load and execute them.
The shell clones itself, and the clone's binary image is replaced with the executable image.
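The clone-then-replace step corresponds to the fork and exec system calls, which Python exposes as os.fork and os.execvp. The following is a minimal sketch of what a shell does for one command, assuming a Unix system; the function name run_command and the exit code 127 for a failed exec are choices made for this example.

```python
import os


def run_command(argv):
    """Run argv the way a shell would: clone, then replace the image."""
    pid = os.fork()                  # the process clones itself
    if pid == 0:
        # Child: replace the binary image with the executable's image.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)            # only reached if the exec failed
    # Parent: wait for the child and report its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

For example, run_command(["sh", "-c", "exit 7"]) returns the child's exit status.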
ELF
ELF (Executable and Linkable Format) object file format
Designed by Unix System Laboratories
The most commonly used format in Linux
[Figure: layout of an ELF file: format header, physical header (code), physical header (data), code, data.]
Signals
Signals inform processes of the occurrence of asynchronous events.
Processes may send each other signals with the kill system call, or the kernel may send signals to a process.
The set of signals defined in the system:
 1) SIGHUP     2) SIGINT     3) SIGQUIT    4) SIGILL
 5) SIGTRAP    6) SIGIOT     7) SIGBUS     8) SIGFPE
 9) SIGKILL   10) SIGUSR1   11) SIGSEGV   12) SIGUSR2
13) SIGPIPE   14) SIGALRM   15) SIGTERM   17) SIGCHLD
18) SIGCONT   19) SIGSTOP   20) SIGTSTP   21) SIGTTIN
22) SIGTTOU   23) SIGURG    24) SIGXCPU   25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF   28) SIGWINCH  29) SIGIO
30) SIGPWR
Signals (Cont.)
A process can choose to block a signal, handle it itself, or allow the kernel to handle it.
The kernel handles signals using default actions.
E.g., SIGFPE (floating point exception): core dump and exit
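A process that handles a signal itself installs a handler and is then interrupted when the signal arrives. The sketch below uses Python's signal module and os.kill on a Unix system; the names received, on_signal, and send_and_wait are invented for the example, and it assumes the code runs in the main thread, which is where Python allows handlers to be installed.

```python
import os
import signal
import time

received = []  # signal numbers seen by the handler


def on_signal(signum, frame):
    """User-defined handler: just record which signal arrived."""
    received.append(signum)


def send_and_wait(sig=signal.SIGUSR1, timeout=1.0):
    """Install the handler, send the signal to ourselves, wait for delivery."""
    signal.signal(sig, on_signal)    # handle the signal ourselves
    os.kill(os.getpid(), sig)        # the kill system call, aimed at this process
    deadline = time.monotonic() + timeout
    while sig not in received and time.monotonic() < deadline:
        time.sleep(0.01)
    return list(received)
```

Without the signal.signal call, the default action for SIGUSR1 would apply, which terminates the process.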
Pipes
A pipe is a one-way flow of data.
The writer and the reader communicate using the standard read/write library functions.
[Figure: Task A and Task B connected by a communication pipe.]
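The one-way flow between two tasks can be reproduced with os.pipe and os.fork on a Unix system. This is a small sketch: the function name pipe_demo is made up, the child plays Task A (the writer), and the parent plays Task B (the reader).

```python
import os


def pipe_demo(message):
    """Send message (bytes) from a child process to its parent over a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Task A, the writer: it only needs the write end.
        try:
            os.close(read_fd)
            os.write(write_fd, message)
            os.close(write_fd)
        finally:
            os._exit(0)
    # Task B, the reader: it only needs the read end.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:                # empty read: the writer closed its end
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return b"".join(chunks)
```

Each side closes the end it does not use; the reader sees end-of-file only once every copy of the write end has been closed.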
Signal
The only information transported is a simple number, which renders signals unsuitable for transferring data.
Key Management
Processes may access System V IPC resources only by passing a unique reference identifier to the kernel via system calls.
Senders and receivers must agree on a common key to find the reference identifier for the System V IPC object.
Access to System V IPC objects is checked using access permissions.
Semaphores
A semaphore is a location in memory whose value can be tested and set atomically by more than one process.
Semaphores can be used to implement critical regions.
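A critical region guarded by a semaphore can be sketched as follows. Note the substitution: Python's standard library has no System V semaphore interface, so this example uses threading.Semaphore between threads of one process as an analogy. The function name run_workers and its parameters are invented for the example.

```python
import threading


def run_workers(n_threads=4, increments=1000):
    """Increment a shared counter from several threads inside a critical region."""
    sem = threading.Semaphore(1)     # value 1: at most one holder at a time
    total = 0

    def worker():
        nonlocal total
        for _ in range(increments):
            sem.acquire()            # test and decrement; blocks while zero
            try:
                total += 1           # the critical region
            finally:
                sem.release()        # increment; wakes one waiter

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```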
sys_shmget(): creates a segment and gives back a valid IPC identifier
sys_shmat(): attaches the segment to the process for reading and writing
sys_shmdt()
sys_shmctl()
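The create, attach, detach, and remove life cycle of a shared segment can be tried from Python. Note the substitution: the standard library offers multiprocessing.shared_memory (Python 3.8 or later), which is built on POSIX shared memory rather than on the System V calls listed above, so this is an analogy and not a use of shmget or shmat. The function name shared_memory_round_trip is made up for the example.

```python
from multiprocessing import shared_memory


def shared_memory_round_trip(payload):
    """Write payload (non-empty bytes) into a segment and read it via a second attachment."""
    # Create a segment (the role shmget plays for System V segments).
    seg = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        seg.buf[:len(payload)] = payload
        # Attach to the same segment by name (the role of shmat).
        other = shared_memory.SharedMemory(name=seg.name)
        try:
            data = bytes(other.buf[:len(payload)])
        finally:
            other.close()            # detach (the role of shmdt)
    finally:
        seg.close()
        seg.unlink()                 # remove the segment (as shmctl can)
    return data
```

In real use the second attachment would happen in a different process that has been told the segment's name.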
Semaphores
[Figure: semaphore data structures; labels shown: struct msqid_ds, struct sems, IPC_UNUSED.]
Message Queues
Message queues allow one or more processes to write messages, which will be read by one or more reading processes.
[Figure: message queue data structures; labels shown: struct msqid_ds, IPC_UNUSED.]
File System
The real file systems are separated from the OS by an interface layer: the Virtual File System (VFS).
VFS allows Linux to support many different file systems, each presenting a common software interface to the VFS.
[Figure: directory tree with bin, dev, etc, lib, sbin, and usr below the root, and the files ls and cc at the leaves.]
Mounting of Filesystems
[Figure: the mounting operation. The root filesystem (/) contains bin, dev, etc, lib, sbin, and usr; the /usr filesystem contains bin, include, lib, man, and sbin. After mounting, the contents of the /usr filesystem appear under usr in the root tree.]
[Figure: file system layers: the file system implementations (ext2, msdos, minix, proc), the buffer cache, and the device drivers.]
Directory: a special file which contains pointers to the inodes of its directory entries.
EXT2 divides the logical partition that it occupies into Block Groups.
Layout of Block Group n:
Super block | Group descriptors | Block bitmap | Inode bitmap | Inode table | Data blocks
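Because every Block Group has the same size, the group that holds a given block or inode follows from integer division. The sketch below shows that arithmetic; the function names are made up, the parameters mirror the superblock values blocks per group, inodes per group, and first data block, and the numbers in the example are illustrative rather than taken from a real file system.

```python
def block_group_of(block, blocks_per_group, first_data_block=0):
    """Index of the Block Group that contains the given block number."""
    return (block - first_data_block) // blocks_per_group


def group_block_range(group, blocks_per_group, first_data_block=0):
    """First and last block number belonging to a Block Group."""
    start = first_data_block + group * blocks_per_group
    return start, start + blocks_per_group - 1


def inode_group_of(inode, inodes_per_group):
    """Index of the Block Group whose inode table holds the inode.

    Inode numbers start at 1, hence the subtraction.
    """
    return (inode - 1) // inodes_per_group
```

With 8192 blocks per group and a first data block of 1, block 8193 is the first block of group 1, and group 1 spans blocks 8193 to 16384.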
Directory Format
[Figure: directory entry formats of the minix, ext2fs, and proc file systems.]
To avoid fragmentation, in which a file's blocks are spread all over the file system, the EXT2 file system:
Allocates new blocks for a file physically close to its current data blocks, or at least in the same Block Group as its current data blocks, whenever possible
Preallocates blocks
Speedup Access
VFS Inode Cache
Directory Cache: stores the mapping between full directory names and their inode numbers.
Buffer Cache
All of the Linux file systems use a common buffer cache to cache data buffers from the underlying devices
Networking
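[Figure: the socket() and connect() calls. An entry in the process's fd[255] table refers to an inode that contains a socket (type SOCK_STREAM, protocol, data); the socket's data points to a sock (type, protocol, socket).]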
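From user space, the socket() and connect() calls in the figure look like the sketch below, which uses Python's socket module. It is only an illustration: the function name loopback_transfer is invented, and both ends of the SOCK_STREAM connection live in one process on the loopback address so that the example needs no second program.

```python
import socket


def loopback_transfer(message):
    """Send message (bytes) over a TCP connection on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))        # port 0: let the kernel pick one
        server.listen(1)
        address = server.getsockname()
        # socket() creates the socket; connect() ties it to the listener.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.connect(address)
            conn, _ = server.accept()
            with conn:
                client.sendall(message)
                client.shutdown(socket.SHUT_WR)   # signal end of data
                chunks = []
                while True:
                    chunk = conn.recv(4096)
                    if not chunk:
                        break
                    chunks.append(chunk)
    return b"".join(chunks)
```

Each socket object wraps a file descriptor, which matches the figure's path from the fd table to the socket.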
Loading Modules
[Figure: loading modules into the kernel; labels shown: Kernel, Compiled Kernel.]