sk_buff data structure
This is where a packet is stored. The structure is used by all the network layers to store their headers, information about the user data (the payload), and other information needed internally for coordinating their work.
unsigned char pkt_type: specifies the frame (packet) type. Possible values:
PACKET_HOST
PACKET_MULTICAST
PACKET_BROADCAST
PACKET_OTHERHOST
PACKET_OUTGOING
PACKET_LOOPBACK
Packet type
/* Packet types */

#define PACKET_HOST        0    /* To us */
#define PACKET_BROADCAST   1    /* To all */
#define PACKET_MULTICAST   2    /* To group */
#define PACKET_OTHERHOST   3    /* To someone else */
#define PACKET_OUTGOING    4    /* Outgoing of any type */
#define PACKET_LOOPBACK    5    /* MC/BRD frame looped back */
#define PACKET_USER        6    /* To user space */
#define PACKET_KERNEL      7    /* To kernel space */
/* Unused, PACKET_FASTROUTE and PACKET_LOOPBACK are invisible to user space */
#define PACKET_FASTROUTE   6    /* Fastrouted frame */
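As a quick illustration (a sketch, not code from the book), a receive routine might test skb->pkt_type, which the link layer set via eth_type_trans, to ignore frames that were not addressed to this host; my_rcv is an illustrative name:

static int my_rcv(struct sk_buff *skb)
{
        if (skb->pkt_type == PACKET_OTHERHOST) {
                kfree_skb(skb);   /* addressed to another host; typically seen only in promiscuous mode */
                return 0;
        }
        /* ... process the frame ... */
        return 0;
}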
net_device data structure
Each network device is represented in the Linux kernel by the net_device data structure, which contains information about both its hardware and its software configuration.
Network devices can be classified into types such as Ethernet cards and Token Ring cards.
The fields of the net_device structure can be classified into the following categories:
Configuration
Statistics
Device status
List management
Traffic management
Feature specific
Generic
Function pointers (or VFT)
Promiscuous mode
The net_device structure contains a counter named promiscuity that indicates whether the device is in promiscuous mode. The reason it is a counter rather than a simple flag is that several clients may ask for promiscuous mode; each one increments the counter when entering the mode and decrements it when leaving. The device does not leave promiscuous mode until the counter reaches zero. The field is usually manipulated by calling the function dev_set_promiscuity.
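A minimal sketch of how a kernel component (say, a packet capture module) might use dev_set_promiscuity; the rtnl lock is normally held around device configuration changes, and dev stands for the target net_device:

rtnl_lock();
dev_set_promiscuity(dev, 1);    /* promiscuity++; device enters promiscuous mode */
rtnl_unlock();

/* ... capture traffic ... */

rtnl_lock();
dev_set_promiscuity(dev, -1);   /* promiscuity--; the mode is left only when the counter hits 0 */
rtnl_unlock();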
Function Pointers
Such functions are used mainly to:
Transmit and receive a frame
Add or parse the link layer header on a buffer
Change a part of the configuration
Retrieve statistics
Interact with a specific feature
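In current kernels most of these function pointers are grouped in struct net_device_ops, which the driver fills in and attaches to the net_device (in the 2.6 kernels the book describes, the pointers lived directly in net_device). A sketch, with the my_ names being illustrative:

static const struct net_device_ops my_netdev_ops = {
        .ndo_open       = my_open,         /* bring the interface up */
        .ndo_stop       = my_stop,         /* bring the interface down */
        .ndo_start_xmit = my_start_xmit,   /* transmit a frame */
        .ndo_get_stats  = my_get_stats,    /* retrieve statistics */
};

dev->netdev_ops = &my_netdev_ops;          /* done before registering the device */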
Different tools can be used to configure or dump the current status of media and hardware parameters for network devices. Among them are:
ifconfig and mii-tool, from the net-tools package
ethtool, from the ethtool package
ip link, from the IPROUTE2 package
When allocation and deallocation are expected to happen often, the associated kernel component initialization routine (for example, fib_hash_init for the routing table) usually allocates a special memory cache that will be used for the allocations.
Some examples of network data structures for which the kernel maintains dedicated memory caches include:
Socket buffer descriptors
Routing tables
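For instance, a subsystem's initialization routine might create its own slab cache with the kernel's kmem_cache API; a sketch, where the cache and structure names are illustrative:

static struct kmem_cache *my_cachep;

my_cachep = kmem_cache_create("my_entry_cache",
                              sizeof(struct my_entry), 0,
                              SLAB_HWCACHE_ALIGN, NULL);

entry = kmem_cache_alloc(my_cachep, GFP_KERNEL);   /* fast allocation from the cache */
/* ... */
kmem_cache_free(my_cachep, entry);                 /* return the object to the cache */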
Depending on the needs of any given feature, you will find two main kinds of garbage collection:
Asynchronous
A timer that expires regularly invokes a routine that scans a set of data structures and frees the ones considered eligible for deletion.
Synchronous
There are cases where a shortage of memory, which cannot wait for the asynchronous garbage collection timer to kick in, triggers immediate garbage collection.
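A sketch of the asynchronous case, using the current kernel timer API (older kernels used init_timer/setup_timer instead); the names are illustrative:

static struct timer_list gc_timer;

static void my_gc(struct timer_list *t)
{
        /* scan the data structures and free entries eligible for deletion (not shown) */
        mod_timer(&gc_timer, jiffies + 10 * HZ);   /* rearm: run again in roughly 10 seconds */
}

/* at initialization time */
timer_setup(&gc_timer, my_gc, 0);
mod_timer(&gc_timer, jiffies + 10 * HZ);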
Function Pointers
A key advantage to using function pointers is that they can be initialized differently depending on various criteria and the role played by the object.
As an example, when a device driver registers a network device with the kernel, it goes through a series of steps that are needed regardless of the device type. At some point, it invokes a function pointer on the net_device data structure to let the device driver do something extra if needed. The device driver could either initialize that function pointer to a function of its own, or leave the pointer NULL because the default steps performed by the kernel are sufficient.
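A sketch of that pattern; the field name setup_extra is purely illustrative, not an actual net_device member:

/* during registration, after the common steps */
if (dev->setup_extra)            /* optional hook, left NULL when the defaults suffice */
        dev->setup_extra(dev);   /* driver-specific extra initialization */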
Compile-Time Optimization for Condition Checks
The kernel uses the likely and unlikely macros, respectively, to wrap comparisons that are likely to return a true (1) or false (0) result. Those macros take advantage of a feature of the gcc compiler that can optimize the compilation of the code based on that information.
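A sketch of typical usage (process() is an illustrative placeholder):

if (unlikely(skb == NULL))
        return -EINVAL;          /* error path: hint that this branch is rarely taken */

if (likely(len > 0))
        process(skb, len);       /* hot path: hint that this branch is usually taken */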
Mutual Exclusion
Spin locks
Read-write spin locks
Read-Copy-Update (RCU)
An example where RCU is used in the networking code is the routing subsystem. Lookups are more frequent than updates on the cache, and the routine that implements the routing cache lookup does not block in the middle of the search.
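A sketch of the RCU read-side pattern such a lookup follows; the list and field names here are illustrative:

rcu_read_lock();                           /* readers do not block writers */
entry = rcu_dereference(cache_head);       /* safely fetch the shared pointer */
while (entry) {
        if (match(entry, key))             /* illustrative comparison */
                break;
        entry = rcu_dereference(entry->next);
}
/* use 'entry' only inside the read-side critical section */
rcu_read_unlock();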
The kernel exports internal information to user space via different interfaces. There are three special interfaces, two of which are virtual filesystems:
procfs (/proc filesystem)
This is a virtual filesystem, usually mounted in /proc, that allows the kernel to export internal information to user space in the form of files. The files don't actually exist on disk, but they can be read through cat or more and written to with the > shell redirector; they can even be assigned permissions like real files. The components of the kernel that create these files can therefore control who can read from or write to any file.
sysctl (/proc/sys directory)
This interface allows user space to read and modify the value of kernel variables. What the user sees as a file somewhere under /proc/sys is actually a kernel variable.
sysfs (/sys filesystem)
sysfs exports plenty of information in a very clean and organized way.
ioctl system call
The ioctl (input/output control) system call operates on a file and is usually used to implement operations needed by special devices that are not provided by the standard filesystem calls.
The ifconfig command uses ioctl to communicate with the kernel. For example, when the system administrator types a command like ifconfig eth0 mtu 1250 to change the MTU of the interface eth0, ifconfig opens a socket, initializes a local data structure (data in the example below) with the information received from the system administrator, and passes it to the kernel with an ioctl call. SIOCSIFMTU is the command identifier.
int fd, err;
struct ifreq data;

fd = socket(PF_INET, SOCK_DGRAM, 0);
strncpy(data.ifr_name, "eth0", IFNAMSIZ);   /* interface to configure */
data.ifr_mtu = 1250;                        /* new MTU value */
err = ioctl(fd, SIOCSIFMTU, &data);
Netlink socket
It is used by networking applications to communicate with the kernel. Most commands in the IPROUTE2 package use it. Netlink represents for Linux what the routing socket represents in the BSD world.
The Netlink socket, well described in RFC 3549, represents the preferred interface between user space and kernel for IP networking configuration.
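A minimal user-space sketch of opening a netlink socket of the NETLINK_ROUTE family, the family used by IPROUTE2-style configuration commands:

#include <sys/socket.h>
#include <linux/netlink.h>

int main(void)
{
        /* NETLINK_ROUTE is the netlink family used for routing and link configuration */
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        struct sockaddr_nl addr = { .nl_family = AF_NETLINK };

        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        /* from here, requests and replies are exchanged as struct nlmsghdr messages */
        return 0;
}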
A notification chain is simply a list of functions to execute when a given event occurs. Each function lets one other subsystem know about an event that occurred within, or was detected by, the subsystem calling the function.
Suppose that interface eth3 of a router went down, due to a break in the network, an administrative command (such as ifconfig eth3 down), or a hardware failure. The networks reachable through eth3 would become unreachable to the router (and to the systems relying on it for their connections) and should be removed from the routing table. Who is going to tell the routing subsystem about that interface failure? A notification chain.
For each notification chain there is a passive side (the notified) and an active side (the notifier), as in the so-called publish-and-subscribe model:
The notified are the subsystems that ask to be notified about the event and that provide a callback function to invoke.
The notifier is the subsystem that experiences an event and calls the callback function.
The functions executed are chosen by the notified subsystems. It is never up to the owner of the chain (the subsystem that generates the notifications) to decide what functions to execute. The owner simply defines the list; any kernel subsystem can register a callback function with that chain to receive the notification.
The kernel defines at least 10 different notification chains. Here we are interested in the ones that are used to signal events of particular importance to the networking code. The main ones are:
inetaddr_chain
Sends notifications about the insertion, removal, and change of an Internet Protocol Version 4 (IPv4) address on a local interface.
netdev_chain
Sends notifications about the registration status of network devices.
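A sketch of how a kernel component subscribes to netdev_chain (the my_ names are illustrative):

static int my_netdev_event(struct notifier_block *nb,
                           unsigned long event, void *ptr)
{
        /* react to events such as NETDEV_REGISTER, NETDEV_UP, NETDEV_DOWN, ... */
        return NOTIFY_DONE;
}

static struct notifier_block my_notifier = {
        .notifier_call = my_netdev_event,
};

register_netdevice_notifier(&my_notifier);      /* subscribe to netdev_chain */
unregister_netdevice_notifier(&my_notifier);    /* unsubscribe when no longer needed */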
A device driver can be loaded as either a module or a static component of the kernel. Furthermore, devices can be present at boot time or inserted (and removed) at runtime: the latter type of device, called a hot-pluggable device, includes USB, PCI CardBus, IEEE 1394 (also called FireWire by Apple), and others.
When the kernel boots up, it executes start_kernel, which initializes a bunch of subsystems, as partially shown in Figure 5-1. run_init_process determines the first process run on the system, the parent of all other processes; it has a PID of 1 and does not halt until the system is shut down. Normally the program run is init, part of the SysVinit package. However, the administrator can specify a different program through the init= boot-time option.
Interaction Between Devices and Kernel
Nearly all devices (including NICs) interact with the kernel in one of two ways:
Polling
Driven on the kernel side. The kernel checks the device status at regular intervals to see if it has anything to say.
Interrupt
Driven on the device side. The device sends a hardware signal (by generating an interrupt) to the kernel when it needs the kernel’s attention.
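A sketch of the interrupt-driven model from the driver's side; my_interrupt is an illustrative name, and dev stands for the driver's net_device:

static irqreturn_t my_interrupt(int irq, void *dev_id)
{
        /* acknowledge the NIC and schedule the processing of received frames (not shown) */
        return IRQ_HANDLED;
}

/* typically done when the interface is opened */
request_irq(dev->irq, my_interrupt, IRQF_SHARED, dev->name, dev);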
Virtual Devices
A virtual device is an abstraction built on top of one or more real devices. The association between virtual devices and real devices can be many-to-many, as shown by the three models in Figure 5-4. It is also possible to build virtual devices on top of other virtual devices.
Linux allows you to define different kinds of virtual devices.
Bonding
802.1Q
Bridging
Aliasing interfaces
True equalizer (TEQL)
Tunnel interfaces
Interaction with the Kernel Network Stack
Virtual devices and real devices interact with the kernel in slightly different ways.
Virtual devices sometimes call register_netdevice and unregister_netdevice rather than their wrappers, and take care of locking by themselves, because they may need to hold the lock a little longer than a real device would. With this approach, though, the lock can also be misused and held longer than needed, by making it protect additional pieces of code (besides register_netdevice) that could be protected in other ways.
Real devices cannot be unregistered (i.e., destroyed) with user commands; they can only be disabled. Real devices are unregistered at the time their drivers are unloaded (when loaded as modules, of course). Virtual devices, in contrast, may be created and unregistered with user commands, too. Whether this is possible depends on the virtual device driver’s design.
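For comparison, a sketch of the common path taken by a real device's driver, using the register_netdev wrapper (which takes care of the locking itself); struct my_priv is illustrative:

struct net_device *dev;
int err;

dev = alloc_etherdev(sizeof(struct my_priv));   /* allocate an Ethernet net_device */
if (!dev)
        return -ENOMEM;

err = register_netdev(dev);     /* wrapper around register_netdevice plus locking */
if (err)
        free_netdev(dev);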
PCI devices are uniquely identified by a combination of parameters, including ven- dor, model, etc.
PCI device drivers register and unregister with the kernel with pci_register_driver and pci_unregister_driver, respectively.
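A sketch of that registration; the vendor/device IDs and the my_ names are illustrative:

static const struct pci_device_id my_ids[] = {
        { PCI_DEVICE(0x8086, 0x100e) },   /* match on vendor and device ID (illustrative values) */
        { 0, }
};

static struct pci_driver my_pci_driver = {
        .name     = "mydrv",
        .id_table = my_ids,
        .probe    = my_probe,     /* called when a matching device is found */
        .remove   = my_remove,    /* called on hot removal or driver unload */
};

pci_register_driver(&my_pci_driver);      /* at module init */
pci_unregister_driver(&my_pci_driver);    /* at module exit */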
Linux allows users to pass kernel configuration options to their boot loaders, which then pass the options to the kernel; experienced users can use this mechanism to fine-tune the kernel at boot time.
The registration of a network device takes place in the following situations:
Loading an NIC’s device driver
An NIC’s device driver is initialized at boot time if it is built into the kernel, and at runtime if it is loaded as a module. Whenever initialization occurs, all the NICs controlled by that driver are registered.
Inserting a hot-pluggable network device
When a user inserts a hot-pluggable NIC, the kernel notifies its driver, which then registers the device. (For the sake of simplicity, we’ll assume the device driver is already loaded.)
Link State Change Detection
When an NIC device driver detects the presence or absence of a carrier or signal, either because it was notified by the NIC or via an explicit check by reading a configuration register on the NIC, it can notify the kernel with netif_carrier_on and netif_carrier_off, respectively (a sketch follows the list below).
Here are a few common cases that may lead to a link state change:
A cable is plugged into or unplugged from an NIC.
The device at the other end of the cable is powered down or disabled. Examples of devices include hubs, bridges, routers, and PC NICs.
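A sketch of that notification from the driver's side; link_is_up() stands in for however the driver learns the link state (for example, by reading a PHY register):

if (link_is_up(priv))                /* illustrative helper: query the hardware */
        netif_carrier_on(dev);       /* tell the kernel the carrier is present */
else
        netif_carrier_off(dev);      /* tell the kernel the carrier was lost */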
Destination NAT (DNAT)
Destination NAT, also called Route NAT in IPROUTE2 terminology, allows a host to define dummy (NAT) addresses: ingress packets addressed to them are detected by the host and forwarded to another address.
Useful link: https://www.quora.com/What-happens-once-a-system-call-is-made-by-a-process-in-user-space
A system call is a call to a function that is not part of the application but is inside the kernel.
For example, you always ultimately use write() to write anything to a peripheral, whatever kind of device you are writing to. write() is designed to write only a sequence of bytes, that's all and nothing more. But because write() is considered too basic (you may want to write an integer in base ten, or a floating-point number in scientific notation, etc.), different programming environments provide libraries to make this easier. For example, C provides the printf library function.
Useful link: https://stackoverflow.com/questions/21084218/difference-between-write-and-printf
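A small user-space example of the difference: write() emits raw bytes through the system call, while printf() formats the data in the C library and ultimately calls write() itself:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* raw byte sequence, straight to the write() system call */
        write(STDOUT_FILENO, "hello\n", 6);

        /* the C library formats the integer, then calls write() underneath */
        printf("value = %d\n", 42);
        return 0;
}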
Separation of privileges and security. For one, userspace programs can make their stack (pointer) anything they want, and there is usually no architectural requirement to even have a valid one. The kernel therefore cannot trust the userspace stack pointer to be valid or usable, and so requires one set under its own control. Different CPU architectures implement this in different ways; x86 CPUs automatically switch stack pointers when privilege mode switches occur, and the values to be used for the different privilege levels are configurable by privileged code (i.e., only the kernel).
Useful link: https://stackoverflow.com/questions/12911841/kernel-stack-and-user-space-stack
Not just each process - each thread has its own kernel stack (and, in fact, its own user stack as well). As threads are created (and each process must have at least one thread), the kernel creates kernel stacks for them.
Useful link: https://stackoverflow.com/questions/12911841/kernel-stack-and-user-space-stack
A signal is generated either by the kernel internally (for example, SIGSEGV when an invalid address is accessed, or SIGQUIT when you hit Ctrl+\), or by a program using the kill syscall (or several related ones).
If it's sent by one of the syscalls, the kernel confirms that the calling process has sufficient privileges to send the signal. If not, an error is returned (and the signal doesn't happen).
If it’s one of two special signals, the kernel unconditionally acts on it, without any input from the target process. The two special signals are SIGKILL and SIGSTOP. All the stuff below about default actions, blocking signals, etc., are irrelevant for these two.
You can also trigger an exception from software. For example, when the user presses Ctrl+C, the request goes to the kernel, which calls its own exception handler.
Next, the kernel figures out what do with the signal:
For each process, there is an action associated with each signal. There are a bunch of defaults, and programs can set different ones using sigaction, signal, etc. These include things like "ignore it completely", "kill the process", "kill the process with a core dump", "stop the process", etc.
Programs can also turn off the delivery of signals ("blocked"), on a signal-by-signal basis. Then the signal stays pending until unblocked.
Programs can request that, instead of the kernel taking some action itself, it deliver the signal to the process either synchronously (with sigwait et al., or signalfd) or asynchronously (by interrupting whatever the process is doing and calling a specified function).
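A small user-space example of those mechanisms: installing a handler with sigaction() so the kernel delivers SIGINT asynchronously, and blocking SIGTERM so it stays pending until unblocked:

#include <signal.h>
#include <string.h>
#include <unistd.h>

static void on_sigint(int sig)
{
        /* asynchronous delivery: the kernel interrupts the process and calls this */
        (void)sig;
}

int main(void)
{
        struct sigaction sa;
        sigset_t set;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigint;
        sigaction(SIGINT, &sa, NULL);          /* replace the default action for SIGINT */

        sigemptyset(&set);
        sigaddset(&set, SIGTERM);
        sigprocmask(SIG_BLOCK, &set, NULL);    /* SIGTERM now stays pending while blocked */

        pause();                               /* wait until a signal is delivered */
        return 0;
}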
Signal and threads
The exception handler decides what thread should receive the signal. If something like division-by-zero occurred, then it is easy: the thread that caused the exception gets the signal, but for other types of signals, the decision can be very complex and in some unusual cases a more or less random thread might get the signal.
Useful link: https://unix.stackexchange.com/questions/80044/how-signals-work-internally
http://pdf.th7.cn/down/files/1312/understanding_linux_network_internals.pdf