Minix Networking Documentation

This documentation was written by Charlie Tsz-Hong Wong, who holds the copyright. Do not use or copy it without his permission. That said, since I wish this documentation to be useful for educational purposes, I would gladly consider any request to use it for such purposes.

This is documentation for the source code of Minix networking. It describes in detail the basic parts of how the networking system works. The Minix networking source code is in the inet/ directory. The documentation assumes that the reader has a basic understanding of the ip, tcp, and udp protocols. It also assumes the reader has an understanding of how Minix works and basic understanding of the C programming language. We shall begin with the main function at inet/inet.c.

Minix networking runs as a process called ‘inet.’ Inet works just like every other operating system process, such as the file system process (fs) and the memory management process (mm). The basic structure of an operating system process can be described by the following c-like program:

#include <minix/type.h>

#define TRUE 1

int main()
{
        init();                         /* one-time initialization */

        while (TRUE)
        {
                message m;

                receive(&m);            /* wait for the next message */
                processmessage(&m);     /* handle it */
        }
}

init() initializes the inet process. After inet is initialized inet goes into an endless loop as indicated by the ‘while (TRUE)’ statement. Inet does 2 things while in the endless loop. First it waits for a message to be sent to it (usually by the file system process fs) in receive(&m). Second, after it receives the message, it processes the message in processmessage(&m).

The main function (which starts the inet process) is at inet/inet.c.

User processes do not directly call the inet process when they want to do something. Instead they call the file system process fs. An example of calling fs in order to use Minix networking is given in the ping program for Minix (commands/simple/ping.c).

        int fd, i;

        int result, result1;

        nwio_ipopt_t ipopt;

        fd= open ("/dev/ip", O_RDWR);

        ipopt.nwio_flags= NWIO_COPY | NWIO_PROTOSPEC;

        ipopt.nwio_proto= 1;

        result= ioctl (fd, NWIOSIPOPT, &ipopt);

First, ping calls the system function open in order to get a file descriptor fd (ie an ip channel) to the ip protocol in the inet process. Afterwards ping calls the system function ioctl to configure the device referred to by the file descriptor fd. ioctl is a standard unix system call which one uses on a file descriptor. The specifics of how Minix implements ioctl in inet shall be described later.

The file “/dev/ip” is a device file; specifically, it is a character device file. The file type of a file is specified in the i_mode field of its inode structure, which is defined in fs/inode.h. In Minix, opening a device file is implemented as follows.

1) Every device file has a major number and a minor number attached to it.

2) The major number is an index in to the array dmap defined in fs/table.c. For the inet process in standard Minix 2.0.0, the major device number for inet is 7 = /dev/ip (look at fs/table.c).

3) The minor number is passed as a parameter to the dmap_open function of the selected dmap entry.

The dmap structure is defined in the file fs/dev.h.

extern struct dmap {

  dmap_t dmap_open;

  dmap_t dmap_rw;

  dmap_t dmap_close;

  int dmap_task;

} dmap[];
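The lookup described in steps 1) to 3) can be sketched as follows. This is only an illustration and not the fs source: the handler signature dmap_fn, the table dmap_sketch, the helper dev_open, the stub inet_open, and the task number are all assumptions made for the example. Only the idea of indexing the table by major number and passing the minor number on to dmap_open comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler signature; the real dmap_t in fs/dev.h differs. */
typedef int (*dmap_fn)(int task, int minor);

struct dmap_entry {
    dmap_fn dmap_open;   /* called on open() */
    dmap_fn dmap_rw;     /* called on read()/write() */
    dmap_fn dmap_close;  /* called on close() */
    int     dmap_task;   /* process that services this major number */
};

#define INET_MAJOR 7     /* major number of inet given in the text */
#define INET_TASK  42    /* made-up task number for the example */

static int last_minor = -1;   /* records what the stub handler was given */

/* Stub handler: a real one would send a DEV_OPEN message to inet. */
static int inet_open(int task, int minor)
{
    (void)task;
    last_minor = minor;
    return 0;
}

/* Table indexed by major number; only slot 7 is filled in here. */
static struct dmap_entry dmap_sketch[8] = {
    [INET_MAJOR] = { inet_open, NULL, NULL, INET_TASK },
};

/* Open a device: the major number selects the entry, the minor is passed on. */
static int dev_open(int major, int minor)
{
    assert(major >= 0 && major < 8);
    if (dmap_sketch[major].dmap_open == NULL)
        return -1;            /* no driver for this major number */
    return dmap_sketch[major].dmap_open(dmap_sketch[major].dmap_task, minor);
}
```

Calling dev_open(INET_MAJOR, 0x02) in this sketch reaches inet_open with minor 0x02, while a major number with no entry is rejected.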

For a major device number which corresponds to a driver for a real hardware device (say a hard disk drive) the minor device number typically corresponds to the specific device. This is because the same driver can be the driver of more than one hardware device (say more than one hard disk). For example a hard disk driver can handle more than one hard disk drive ie hard disk drive 0 and hard disk drive 1.

The major device number for inet does not correspond to a real device driver. This is because even though it’s typical to do networking via an Ethernet card, one doesn’t have to. One could do networking through, say, a modem. Or, say, an internal cable modem. The inet process was written to be versatile enough so that one doesn’t have to change the inet source code in order to use the ip protocol if one decides to use a modem instead of a network card ie inet can send datagrams on different network interfaces.

Since the major device number does not correspond to a real device driver, the minor device number generally does not correspond to a real device either. Instead (generally) it corresponds to a layer in a networking protocol.

The minor device numbers for standard Minix 2.0.0 are given in inet/generic/sr.h

#define ETH_DEV         ETH_DEV0

#define IP_DEV          IP_DEV0

#define ETH_DEV0        0x01

#define IP_DEV0         0x02

#define TCP_DEV0        0x03

#define UDP_DEV0        0x04

#define ETH_DEV1        0x11

#define IP_DEV1         0x12

#define TCP_DEV1        0x13

#define UDP_DEV1        0x14

#define PSIP_DEV0        0x21

#define IP_DEV2         0x22

#define TCP_DEV2        0x23

#define UDP_DEV2        0x24

#define PSIP_DEV1        0x31

#define IP_DEV3         0x32

#define TCP_DEV3        0x33

#define UDP_DEV3        0x34

The defines in inet/generic/sr.h which begin with ETH correspond to the Ethernet layer ie a network card. For example ETH_DEV0 is the minor device number for a network card. The 3 other minor device numbers which are defined immediately afterwards correspond to protocols which use that network card. Hence, writing to the inet device with the minor number IP_DEV0 means one is sending an ip datagram through the network card referred to by minor device number ETH_DEV0. Similarly, writing to the inet device with the minor number TCP_DEV0 means one is sending tcp/ip data through that same network card, and writing to the inet device with the minor number UDP_DEV0 means one is sending a udp/ip datagram through it.

Similar comments are also true concerning the minor device number ETH_DEV1.

The minor device number can therefore be understood as uniquely identifying an ordered pair:

(network interface, protocol).
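The pairing can be made concrete with a small sketch. In the values listed above, the high hexadecimal digit counts the interface (0 and 1 for the Ethernet cards, 2 and 3 for the pseudo-ip devices) and the low digit selects the layer. This is an observation about the listed values only; the enum and the helper functions below are hypothetical and are not part of inet.

```c
#include <assert.h>

/* Layer codes as they appear in the low digit of the values listed above. */
enum layer { LAYER_LINK = 1, LAYER_IP = 2, LAYER_TCP = 3, LAYER_UDP = 4 };

/* Interface index: 0 and 1 are the Ethernet cards, 2 and 3 the psip devices. */
static int minor_interface(int minor) { return (minor >> 4) & 0x0F; }

/* Layer on top of that interface (LAYER_LINK means ETH_DEVn or PSIP_DEVn). */
static int minor_layer(int minor) { return minor & 0x0F; }

/* Build a minor number from an (interface, layer) pair. */
static int make_minor(int interface, int layer)
{
    assert(interface >= 0 && interface <= 3);
    assert(layer >= LAYER_LINK && layer <= LAYER_UDP);
    return (interface << 4) | layer;
}
```

For example make_minor(1, LAYER_UDP) gives 0x14, which is UDP_DEV1 in the list above, and minor_interface(0x23) gives 2, the interface of TCP_DEV2.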

PSIP_DEV0 is the minor device number for a pseudo-ip device. A pseudo-ip device can be understood in terms of another well known pseudo device in unix (and minix) ie pseudo-terminals.

In minix a terminal device can be called /dev/ttyp0 while the corresponding pseudo terminal device would be called /dev/ptyp0. The pseudo-terminal device and the corresponding terminal device form a bi-directional (2 way) pipe. Hence when one writes to ttyp0, one can read what was just written by reading ptyp0. Similarly if one were to write to ptyp0, one can read what was just written by reading ttyp0.

PSIP_DEV0, the pseudo-ip device, plays a similar role to the pseudo-terminal device while IP_DEV2 plays a similar role to the terminal device. When one writes to IP_DEV2, one can read what was just written by reading PSIP_DEV0. Similarly if one were to write to PSIP_DEV0, one can read what was just written by reading IP_DEV2.
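The bi-directional behaviour can be tried out on any POSIX system with a pair of connected endpoints. The sketch below uses socketpair as a stand-in; it does not open /dev/psip or /dev/ip and is not how inet implements the pseudo-ip device. The function names and the roles assigned to the two ends are assumptions made for the illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write msg on one end, read it on the other; return 1 if it arrives intact. */
static int send_and_check(int from, int to, const char *msg)
{
    char buf[64];
    size_t len = strlen(msg);

    if (len > sizeof(buf) || write(from, msg, len) != (ssize_t)len)
        return 0;
    if (read(to, buf, sizeof(buf)) != (ssize_t)len)
        return 0;
    return memcmp(buf, msg, len) == 0;
}

/* Stand-in for the (IP_DEV2, PSIP_DEV0) pair: two connected endpoints. */
static int bidirectional_pipe_demo(void)
{
    int ends[2];   /* ends[0] plays IP_DEV2, ends[1] plays PSIP_DEV0 */
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, ends) == -1)
        return 0;
    ok = send_and_check(ends[0], ends[1], "ip datagram out") &&
         send_and_check(ends[1], ends[0], "ip datagram in");
    close(ends[0]);
    close(ends[1]);
    return ok;
}
```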

The pseudo-ip devices however offer more options than the pseudo-terminal devices.  If one wants to read from the ip device IP_DEV2 without reading the ip header, one can call

ioctl(fd, NWIOSIPOPT, &ipopt)

where ipopt is a nwio_ipopt_t structure with the option NWIO_RWDATONLY set in its nwio_flags field.  One can then read from and write to the ip device IP_DEV2 without retrieving or adding the ip header.  For details see the ip.4 man page.

Using pseudo-ip devices one can easily write a user process program which would allow one to connect to the internet using a serial modem even though there is no serial modem driver in the inet process.  More generally, one can send ip messages using different network interfaces.

The outline for using a modem on com1 and ppp as a network interface would be as follows.

int main(void)
{
        int fd;
        char buffer[512];
        pid_t pid;

        fd = open("/dev/psip", O_RDWR);

        if ((pid = fork()) == -1) {
                exit(1);                        /* handle error */
        }

        if (pid != 0) {                         /* parent process: modem -> psip */
                while (1) {
                        ReadCom1(buffer);
                        RemovePPPHeaders(buffer);
                        write(fd, buffer, sizeof(buffer));
                }
        } else {                                /* child process: psip -> modem */
                while (1) {
                        read(fd, buffer, sizeof(buffer));
                        AddPPPHeaders(buffer);
                        WriteCom1(buffer);
                }
        }
}

We make the assumption that the ppp protocol is being used as network interface.

The open statement returns a file descriptor to the psip (pseudo-ip) device.  The program then executes the fork system call so that there can be a pseudo-ip reader (which shall write to the modem) and a pseudo-ip writer (which shall read from the modem).

The parent process is the pseudo-ip writer (which shall read from the modem).  The parent process goes in an endless loop.  First it reads from the modem.  Next it removes the ppp protocol header.  Finally it writes to the pseudo-ip device.

The child process is the pseudo-ip reader (which shall write to the modem).  The child process goes in an endless loop.  First it reads from the pseudo-ip device.  Next it adds the ppp protocol header.  Finally it writes to the modem.

fd tables

Every protocol (tcp, ip, udp) has an fd table.  Each entry of the fd table holds the protocol specific information for one open file descriptor.

port tables

Every protocol (tcp, ip, udp) has a port table.  The port table corresponds to the minor device numbers for the protocol.  There’s a 1-1 correspondence between the elements of the port table and the minor device numbers for the protocol, and hence the network interfaces those minor device numbers refer to.

How inet handles messages

Inet starts in inet/inet.c in the main function.

When inet receives a message in the main function it puts the message in the message queue.

r= receive (ANY, &mq->mq_mess);

But first the message queue must be initialized.

As previously stated inet runs in an endless loop receiving messages and processing messages – just like the file system process (fs) and the memory management process (mm).  But before it goes in to the endless loop, it first does initialization.

Inet calls nw_init() which does the initialization.  nw_init() calls mq_init().  We shall discuss the other functions it calls later.

mq.c

mq_init is defined in inet/mq.c.

The message queue structure is defined in inet/mq.h.

typedef struct mq

{

        message mq_mess;

        struct mq *mq_next;

        int mq_allocated;

} mq_t;

and inet/mq.c.

#define MQ_SIZE         128

PRIVATE mq_t mq_list[MQ_SIZE];

mq_mess is the message structure.  The message queue is both an array and a linked list when it is first initialized.

After mq_init() is called the mq_next field of every member of the mq_list array points to the preceding member in the array. Also mq_init() makes mq_freelist point to the end of the array.  mq_freelist points to the head of the list of free (unused) elements in the mq_list array.

mq_get returns an item of the mq_list array, taken from the mq_freelist linked list.  The item is marked as allocated by setting mq_allocated = 1 and is removed from the mq_freelist linked list.

A message item is freed by calling mq_free(mq) where mq is the message item.  The message item is marked as freed by setting mq_allocated = 0 and adding it back to the mq_freelist linked list.
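The free list handling described above can be sketched as follows. The structure and names follow the text, but this is a simplified illustration rather than the code in inet/mq.c: the message type is a stand-in, and returning NULL from mq_get when the pool is exhausted is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int m_type; } message;   /* stand-in for the Minix message */

typedef struct mq {
    message mq_mess;
    struct mq *mq_next;
    int mq_allocated;
} mq_t;

#define MQ_SIZE 128

static mq_t mq_list[MQ_SIZE];
static mq_t *mq_freelist;

/* Link every element to its predecessor; the free list starts at the end. */
static void mq_init(void)
{
    int i;

    mq_freelist = NULL;
    for (i = 0; i < MQ_SIZE; i++) {
        mq_list[i].mq_next = mq_freelist;
        mq_list[i].mq_allocated = 0;
        mq_freelist = &mq_list[i];
    }
}

/* Take the head of the free list; NULL when the pool is exhausted
 * (an assumption of this sketch). */
static mq_t *mq_get(void)
{
    mq_t *mq = mq_freelist;

    if (mq == NULL)
        return NULL;
    mq_freelist = mq->mq_next;
    mq->mq_next = NULL;
    mq->mq_allocated = 1;
    return mq;
}

/* Put an item back at the head of the free list. */
static void mq_free(mq_t *mq)
{
    assert(mq != NULL && mq->mq_allocated);
    mq->mq_allocated = 0;
    mq->mq_next = mq_freelist;
    mq_freelist = mq;
}
```

Because items are taken from and returned to the head of the list, an item that has just been freed is the next one handed out by mq_get.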

Back to main in inet.c

The message loop for inet begins with

while (TRUE)

in the main function.

inet gets a free message item from the message queue to place a new message in by calling

mq= mq_get();

It then waits for the message by calling

r= receive (ANY, &mq->mq_mess);

Typically the message comes from the file system which a user process calls in order to use inet.  Such a message is processed by calling

sr_rec(mq);

How minor devices relate to sr.c

The minor number of the device is an index in to the array sr_fd_table.  An internet protocol (eg tcp, udp, ip) is started in inet by calling the corresponding initialization routine (eg tcp_init(), udp_init(), ip_init()).  All of the initialization routines are called from nw_init() in inet.c.  Each initialization routine calls sr_add_minor in sr.c, passing the minor device number in the parameter minor.  sr_add_minor uses minor as an index in to the array sr_fd_table: it gets the element with that index and sets its various properties.  It then sets the flags of the element so that the element is marked as corresponding to a minor device (by setting the flag SFF_MINOR) and as being used (by setting the flag SFF_INUSE).
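A reduced sketch of this registration step is given below. It keeps only two fields of the table entry, and the table size, the flag values, the function pointer type, and the error returns are all assumptions made for the illustration; the real sr_add_minor takes more parameters.

```c
#include <assert.h>
#include <stddef.h>

#define FD_NR      64      /* table size: value assumed */
#define SFF_INUSE  0x01    /* flag values assumed */
#define SFF_MINOR  0x02

typedef int (*sr_open_fn)(int port, int srfd);

typedef struct sr_fd {
    int srf_flags;
    sr_open_fn srf_open;   /* the other function pointers are omitted */
} sr_fd_t;

static sr_fd_t sr_fd_table[FD_NR];

/* Register a protocol/layer under its minor device number.
 * Returns 0 on success, -1 if the slot is out of range or already taken. */
static int sr_add_minor(int minor, sr_open_fn openf)
{
    sr_fd_t *sr_fd;

    assert(openf != NULL);
    if (minor < 0 || minor >= FD_NR)
        return -1;
    sr_fd = &sr_fd_table[minor];
    if (sr_fd->srf_flags & SFF_INUSE)
        return -1;
    sr_fd->srf_open = openf;
    sr_fd->srf_flags = SFF_INUSE | SFF_MINOR;
    return 0;
}

/* Stub open function standing in for, say, the udp open routine. */
static int demo_open(int port, int srfd) { (void)srfd; return port; }
```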

sr.c continued

The repl_queue is used to handle deadlocks

[Expand here]

sr_rec handles the different types of messages passed to inet.  We shall go through each of the types of messages it handles.

A)                mq_t *m;

                    case DEV_OPEN:

                result= sr_open(&m->mq_mess);

                send_reply= 1;

                free_mess= 1;

                break;

sr_open is called when a message to open is received.  m is a pointer to a message item.  mq_mess is the message.  sr_rec passes a pointer to the message to sr_open.

sr_open returns an index to the first unused entry in the sr_fd_table array.  The sr_fd_table array is an array of sr_fd structures.  The file descriptor returned to the user process corresponds to an entry in the sr_fd_table array.  The entry in the sr_fd_table array is marked as used (by setting the flag SFF_INUSE ).

typedef struct sr_fd

{

        int srf_flags;

        int srf_fd;

        int srf_port;

        sr_open_t srf_open;

        sr_close_t srf_close;

        sr_write_t srf_write;

        sr_read_t srf_read;

        sr_ioctl_t srf_ioctl;

        sr_cancel_t srf_cancel;

        mq_t *srf_ioctl_q, *srf_ioctl_q_tail;

        mq_t *srf_read_q, *srf_read_q_tail;

        mq_t *srf_write_q, *srf_write_q_tail;

} sr_fd_t;

srf_flags: flags which describe state of sr_fd.

srf_fd:  is an index in to the corresponding fd table for the protocol or layer (eg tcp, udp, ip, psip, etc.)

srf_port:  is an index in to the corresponding port table for the protocol or layer (eg tcp, udp, ip, psip, etc.)

srf_open:  is a pointer to the open function of the corresponding port table for the protocol or layer (eg tcp, udp, ip, psip, etc.)

srf_close:  is a pointer to the close function of the corresponding port table for the protocol or layer (eg tcp, udp, ip, psip, etc.)

srf_write:  is a pointer to the write function of the corresponding port table for the protocol or layer (eg tcp, udp, ip, psip, etc.)

srf_read:  is a pointer to the read function of the corresponding port table for the protocol or layer (eg tcp, udp, ip, psip, etc.)

srf_ioctl:  is a pointer to the ioctl function of the corresponding port table for the protocol or layer (eg tcp, udp, ip, psip, etc.)

srf_cancel:  is a pointer to the cancel function of the corresponding port table for the protocol or layer (eg tcp, udp, ip, psip, etc.)

srf_ioctl_q: is a pointer to the head of the linked list of ioctl messages which are waiting to be completed (and hence the process is suspended)

srf_ioctl_q_tail: is a pointer to the end of the linked list of ioctl messages which are waiting to be completed (and hence the process is suspended)

srf_read_q: is a pointer to the head of the linked list of read messages which are waiting to be completed (and hence the process is suspended)

srf_read_q_tail: is a pointer to the end of the linked list of read messages which are waiting to be completed (and hence the process is suspended)

srf_write_q: is a pointer to the head of the linked list of write messages which are waiting to be completed (and hence the process is suspended)

srf_write_q_tail: is a pointer to the end of the linked list of write messages which are waiting to be completed (and hence the process is suspended)

The values of certain elements of the sr_fd_table array are set in the initialization routines (eg tcp_init(), udp_init(), ip_init()) as described above in How minor devices relate to sr.c.  In particular the initialization routines set the values of the elements whose index is equal to the minor device number of a protocol/layer (eg ip, tcp, udp, ethernet, psip, etc.).  The initialization is done by calling sr_add_minor.

sr_add_minor sets the fields of the sr_fd structure to the protocol specific values except for srf_flags and srf_fd.  srf_flags is used to store the state of the sr_fd structure.  srf_fd is not set till sr_open is called.

Before sr_open marks the unused element in the sr_fd_table array as used, it copies the entry in the sr_fd_table array whose index equals the requested minor number to the unused element.  The entry with the index equal to the requested minor number should have been initialized by sr_add_minor during the initialization routine.  After marking the element as used, sr_open calls the srf_open function of the element.  The srf_open call tells the protocol/layer which functions to call when it needs to get/put data from/to the user process.  It also tells the protocol/layer the index in to the sr_fd_table array.  It returns an index in to the fd table for the protocol/layer.  The srf_fd field of the element in the sr_fd_table array is set to that index.  sr_open then returns the index of the element in the sr_fd_table array.
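The sequence of steps in sr_open can be sketched as follows. This is a simplified illustration under several assumptions: the table size and flag values are made up, the entry is reduced to four fields, srf_open takes only a port index and the sr_fd_table index (the real call also passes the get/put data functions), and the stub proto_open simply invents an fd number.

```c
#include <assert.h>
#include <stddef.h>

#define FD_NR      16      /* table size: value assumed */
#define SFF_INUSE  0x01    /* flag values assumed */
#define SFF_MINOR  0x02

typedef int (*sr_open_fn)(int port, int srfd);

typedef struct sr_fd {
    int srf_flags;
    int srf_fd;      /* index in to the protocol's fd table */
    int srf_port;    /* index in to the protocol's port table */
    sr_open_fn srf_open;
} sr_fd_t;

static sr_fd_t sr_fd_table[FD_NR];

/* Stub protocol open: pretends the protocol handed out fd number port + 100. */
static int proto_open(int port, int srfd) { (void)srfd; return port + 100; }

/* Open a channel for the given minor device; returns the new table index,
 * or -1 if the minor is not registered or the table is full. */
static int sr_open(int minor)
{
    int i;

    if (minor < 0 || minor >= FD_NR ||
        !(sr_fd_table[minor].srf_flags & SFF_MINOR))
        return -1;
    for (i = 0; i < FD_NR && (sr_fd_table[i].srf_flags & SFF_INUSE); i++)
        ;                                   /* find the first unused entry */
    if (i == FD_NR)
        return -1;
    sr_fd_table[i] = sr_fd_table[minor];    /* copy the minor's entry */
    sr_fd_table[i].srf_flags = SFF_INUSE;   /* in use, but not a minor slot */
    assert(sr_fd_table[i].srf_open != NULL);
    sr_fd_table[i].srf_fd =
        sr_fd_table[i].srf_open(sr_fd_table[i].srf_port, i);
    return i;
}
```

The value returned by sr_open is the index that fs later uses in place of the original minor device number.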

The index of the element in the sr_fd_table array is ultimately returned to the user process by the fs process as follows.  When the user process does an open system call to inet, fs calls net_open in fs/device.c.  When the index of the element in the sr_fd_table array is returned to fs, fs (in the function net_open) replaces the minor device number of the protocol/layer with the index of the element in the sr_fd_table array.

inet returns the index (which fs then uses as the minor device number) to fs by calling sr_reply.  More generally, all replies to fs are done by calling sr_reply.

sr_reply sends a reply to the user process which originally sent the message to inet.  It attempts to revive the user process.

B)      mq_t *m;

        case DEV_CLOSE:

                sr_close(&m->mq_mess);

                result= OK;

                send_reply= 1;

                free_mess= 1;

                break;

sr_close closes a file descriptor.  The user process passes the index of the element in the sr_fd_table array which it wants to release and mark as unused.  sr_close calls the protocol/layer specific close function when it calls

        (*sr_fd->srf_close)(sr_fd->srf_fd);

Then it marks the element in the sr_fd_table array as free.

C)      mq_t *m;

        case DEV_READ:

        case DEV_WRITE:

        case DEV_IOCTL:

                result= sr_rwio(m);

                assert(result == OK || result == SUSPEND);

                send_reply= (result == SUSPEND);

                free_mess= 0;

                break;

Each element of the sr_fd_table is called a channel.

sr_rwio(m) is called for reads, writes, and ioctls.  sr_rwio calls sr_getchannel to retrieve the element in the sr_fd_table (ie channel) with the index passed as a parameter to inet.  The index is passed in the minor device field of the message to inet.  sr_rwio puts the message on the respective read/write/ioctl queue.  Although inet requires that a user process make only one read/write/ioctl request at a time, inet allows more than one user process to make such a request for any one channel.  If such a request is made while inet is busy processing another request for the same channel, sr_rwio puts the message (which performs the request) at the end of the queue and suspends the user process.  If no other request is being processed, sr_rwio empties the queue and the message is put in to the queue.  Whether or not inet is busy processing another request for the same channel is determined by checking the SFF_READ_IP, SFF_WRITE_IP, SFF_IOCTL_IP flags.  If inet is not busy processing another request for the same channel, inet calls the protocol/layer specific functions ie

        r= (*sr_fd->srf_read)(sr_fd->srf_fd, m->mq_mess.COUNT);

or

        r= (*sr_fd->srf_write)(sr_fd->srf_fd, m->mq_mess.COUNT);

or

        r= (*sr_fd->srf_ioctl)(sr_fd->srf_fd, request);

If the request cannot be completed SUSPEND is returned and the srf_flags field for the channel is marked as suspended by setting the suspend flag SFF_READ_SUSP, SFF_WRITE_SUSP, or SFF_IOCTL_SUSP.  sr_rwio then returns a code to sr_rec telling sr_rec whether or not to suspend the user process.  If sr_rec is told to suspend the user process, sr_rec calls sr_reply to inform fs to suspend the user process.  If sr_rec is not told to suspend the user process, sr_rec does not call sr_reply because the protocol/layer specific functions should have already replied to the user process by calling sr_put_userdata.
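The queueing decision for one kind of request (read) can be sketched as follows. This is a simplified model, not the code in sr.c: the channel structure is cut down, the flag and SUSPEND values are assumptions, the protocol read function is a stub without parameters, and the in-progress flag is cleared directly when the stub completes at once, whereas the text describes the reply being sent from the protocol/layer functions.

```c
#include <assert.h>
#include <stddef.h>

#define OK            0
#define SUSPEND     (-998)   /* value assumed for the sketch */
#define SFF_READ_IP   0x04   /* flag values assumed */
#define SFF_READ_SUSP 0x08

typedef struct mq { struct mq *mq_next; } mq_t;

typedef struct channel {
    int srf_flags;
    mq_t *srf_read_q, *srf_read_q_tail;
    int (*srf_read)(void);       /* protocol specific read, stubbed below */
} channel_t;

/* Queue a read request on a channel and start it if the channel is idle. */
static int rwio_read(channel_t *ch, mq_t *m)
{
    int r;

    m->mq_next = NULL;
    if (ch->srf_flags & SFF_READ_IP) {
        /* Another read is in progress: append and suspend the caller. */
        assert(ch->srf_read_q_tail != NULL);
        ch->srf_read_q_tail->mq_next = m;
        ch->srf_read_q_tail = m;
        return SUSPEND;
    }
    /* Channel idle: the queue is reset to hold just this request. */
    ch->srf_read_q = ch->srf_read_q_tail = m;
    ch->srf_flags |= SFF_READ_IP;
    r = ch->srf_read();
    if (r == SUSPEND)
        ch->srf_flags |= SFF_READ_SUSP;   /* request could not complete yet */
    else
        ch->srf_flags &= ~SFF_READ_IP;    /* finished at once (simplified) */
    return r;
}

static int read_blocks(void) { return SUSPEND; }   /* stub: no data available */
static int read_ready(void)  { return OK; }        /* stub: data available */
```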

For an ioctl call, one can return an integer result to the user by calling either sr_put_userdata(x, y, 0, 1) or sr_get_userdata(x, y, 0, 1).

buf.c

We shall now discuss how inet allocates memory to store the data which is sent and received.  This is handled in buf.c. bf_init() is called in inet.c when the inet process starts running and runs the initialization procedures.

bf_init initializes the buffer used to store data.  We shall only handle the case when storage is allocated without malloc - by using arrays (this is done by not defining BUF_USEMALLOC) - and where each buffer stores 512 bytes of data (this is done by defining BUF512_NR).

The buffer structure is declared as follows:

#define DECLARE_TYPE(Tag, Type, Size)                           \

        typedef struct Tag                                     \

        {                                               \

                buf_t buf_header;                             \

                char buf_data[Size];                             \

        } Type

DECLARE_TYPE(buf512, buf512_t, 512);

buf_data is the array which stores the data.  buf512_t is the structure that holds buf_data.

We shall now go over the buf_t structure.  buf_t  is defined in buf.h.  Ignoring certain defines buf_t is defined as follows:

typedef struct buf

{

        int buf_linkC;

        buffree_t buf_free;

        size_t buf_size;

        char *buf_data_p;

} buf_t;

buf_linkC: holds the number of variables which point to this structure.  If buf_linkC = 0 the buffer goes back to one of the free lists.

buf_free: pointer to the function which is called to free the acc_t structure which holds the buffer.

buf_size: amount (in bytes) of data held in buf_data.

buf_data_p: pointer to the buf_data array which holds the data.

We shall now go over the acc_t structure.  acc_t is defined in buf.h.  In general the specific implementations of the protocols (as in udp.c) access the buf_t structure indirectly through the acc_t structure.  Ignoring certain defines acc_t is defined as follows:

typedef struct acc

{

        int acc_linkC;

        int acc_offset, acc_length;

        buf_t *acc_buffer;

        struct acc *acc_next, *acc_ext_link;

} acc_t;

acc_linkC:  holds the number of variables which point to this structure.  If acc_linkC = 0 the acc_t structure goes back to one of the free lists.

acc_offset:  holds an offset in to the buffer data.

acc_length:  length of buffer data which acc_t structure is storing.

acc_buffer:  points to buf_t structure which stores the data.

acc_next:  points to the next acc_t structure in a linked list.  When the acc_t structure is not being used it points to the next free item in a free list.  When it’s not free, the linked list is used in the following way for example:  one acc_t structure points to the ip header, its acc_next points to the acc_t structure for the udp header, the acc_next of that acc_t structure points to the data, etc.  In the case the acc_t structure is not free, the different buffers which belong to the different acc_t structures in the linked list are considered to store the same chunk of data.  Therefore if the size of the array pointed to by buf_data_p is not large enough to hold the chunk of data desired, new acc_t structures (with their own buf_data_p arrays) are used until there are enough buf_data_p buffers to hold all of the data.  The acc_t structures are linked together by the acc_next pointer.

acc_ext_link:  points to next acc_t structure in linked list.  It’s used by the implementation of a protocol.

One define which is used quite often is the following from buf.h

#define ptr2acc_data(/* acc_t * */ a) (bf_temporary_acc=(a), \

        (&bf_temporary_acc->acc_buffer->buf_data_p[bf_temporary_acc-> \

                acc_offset]))

ptr2acc_data returns a pointer to the buffer data which is stored in the buf_data array in the buf512_t  structure.  The returned buffer is used to read data which has been received or to send data which shall be sent.
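The effect of the macro can be seen in a small sketch. The structures below are cut-down versions of buf_t and acc_t (the buf_free and acc_ext_link fields are left out), and acc_copy_out is a hypothetical helper added only for the illustration.

```c
#include <string.h>

typedef struct buf {
    int buf_linkC;
    size_t buf_size;
    char *buf_data_p;
} buf_t;

typedef struct acc {
    int acc_linkC;
    int acc_offset, acc_length;
    buf_t *acc_buffer;
    struct acc *acc_next;
} acc_t;

static acc_t *bf_temporary_acc;

/* Same shape as the define quoted above: address of the first data byte. */
#define ptr2acc_data(a) (bf_temporary_acc = (a), \
    (&bf_temporary_acc->acc_buffer->buf_data_p[bf_temporary_acc->acc_offset]))

/* Hypothetical helper: copy the bytes an acc_t refers to in to out and
 * return how many bytes were copied. */
static int acc_copy_out(acc_t *acc, char *out)
{
    memcpy(out, ptr2acc_data(acc), (size_t)acc->acc_length);
    return acc->acc_length;
}
```

With acc_offset = 5 and acc_length = 7 the acc_t structure describes 7 bytes starting at the sixth byte of the underlying buffer, so two acc_t structures can describe different parts of the same buf_t.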

We begin with bf_init().  bf_init() declares an array of buffer structures buf_t called buffers512.  bf_init() next declares an array of acc_t structures called accessors.

acc_freelist points to the head of the free list of the acc_t structures which do not have an acc_buffer buf_t structure allocated to them.  It is used by bf_dupacc to return a copy of the original acc_t structure.  Since the acc_t structure stores a pointer to the buf_t structure, bf_dupacc merely sets the pointer in the copy to the buf_t structure of the original, and no new buf_t structure need be allocated to it.

buf512_freelist points to the head of the free list of the acc_t structures which do have an acc_buffer buf_t structure allocated to them. It is used by bf_memreq.

bf_logon() defines the freereq array which is called when a freelist runs out of free acc_t  structures.

bf_bufsize(acc_ptr) returns the data size of the acc_t structure pointed to by acc_ptr.  As mentioned before, the different buffers which belong to the different acc_t structures in the linked list linked together by the acc_next pointer are considered to store the same chunk of data.  Accordingly in order to return the buffer size of acc_ptr it sums the buffer size (as determined by acc_length) of the linked list of acc_t structures where acc_ptr points to the head of list.
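The summation performed by bf_bufsize can be sketched as follows, using a cut-down acc_t that keeps only the fields needed here. Treating a NULL pointer as an empty chain is an assumption of the sketch.

```c
#include <stddef.h>

typedef struct acc {
    int acc_offset, acc_length;
    struct acc *acc_next;
} acc_t;

/* Total size of the data held by a chain of acc_t structures. */
static size_t bf_bufsize(const acc_t *acc_ptr)
{
    size_t size = 0;

    for (; acc_ptr != NULL; acc_ptr = acc_ptr->acc_next)
        size += (size_t)acc_ptr->acc_length;
    return size;
}
```

For a chain made of a 20 byte ip header, an 8 byte udp header, and 300 bytes of data, the function returns 328 when called on the head of the chain and 300 when called on the last element.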

bf_memreq(size) returns an acc_t structure whose bf_bufsize is size.  It calls freereq if there are not any free acc_t structures in the buf512_freelist free list.

bf_afree(acc) frees an acc_t structure.  If the buf_t structure acc_buffer is still being used by another acc_t structure, acc is put on the acc_freelist; else acc is put on the buf512_freelist.

bf_dupacc(acc_ptr) returns a copy of the original acc_t structure pointed to by acc_ptr.
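How duplicating and freeing interact through the link counts can be sketched as follows. The sketch is deliberately reduced: the caller supplies the storage for the copy instead of taking it from acc_freelist, the free lists themselves are left out, the rest of the chain is not touched, and the names dupacc_into and afree_sketch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct buf { int buf_linkC; char buf_data[512]; } buf_t;

typedef struct acc {
    int acc_linkC;
    int acc_offset, acc_length;
    buf_t *acc_buffer;
    struct acc *acc_next;
} acc_t;

/* Make a second acc_t for the same data; no bytes are copied. */
static void dupacc_into(acc_t *copy, acc_t *orig)
{
    *copy = *orig;                 /* same offset, length, buffer and next */
    copy->acc_linkC = 1;
    copy->acc_buffer->buf_linkC++; /* one more acc_t points at the buffer */
}

/* Drop one reference; returns 1 when the buffer itself became free. */
static int afree_sketch(acc_t *acc)
{
    assert(acc->acc_linkC > 0);
    if (--acc->acc_linkC > 0)
        return 0;                  /* acc_t still referenced elsewhere */
    if (--acc->acc_buffer->buf_linkC > 0)
        return 0;                  /* buffer still used: only acc is free */
    return 1;                      /* acc_t and buffer are both free */
}
```

In terms of the text, a return value of 0 with acc_linkC at 0 corresponds to putting acc on the acc_freelist, and a return value of 1 corresponds to putting it on the free list of acc_t structures which keep their buffer.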

bf_packIffLess(pack, min_len).

bf_pack(old_acc)

bf_cut(data, offset, length) returns a copy of  the chunk of data starting with offset and ending at offset + length - 1.

bf_delhead (data, offset) returns a copy of  the chunk of data starting with offset.  It frees the original acc_t.

bf_align(acc, size, alignment)

udp.c

udp is initialized by calling udp_init in nw_init in inet.c.  First udp_init initializes the fd table udp_fd_table.  Next udp_init initializes the port table udp_port_table.

As mentioned before, there’s a 1-1 correspondence between the elements of the port table and the minor device numbers for the protocol, and hence the network interfaces those minor device numbers refer to.  The structure of the elements of the port table udp_port_table is defined in udp.c as

typedef struct udp_port

{

        int up_flags;

        int up_state;

        int up_ipfd;

        int up_minor;

        int up_ipdev;

        acc_t *up_wr_pack;

        ipaddr_t up_ipaddr;

        struct udp_fd *up_next_fd;

        struct udp_fd *up_write_fd;

        struct udp_fd *up_port_any;

        struct udp_fd *up_port_hash[UDP_PORT_HASH_NR];

} udp_port_t;

up_flags, up_state: state of port table element.

up_ipfd: index in to fd table of ip protocol.

up_minor: corresponding minor number.

up_ipdev:  index in to port table of ip protocol.

up_wr_pack:  packet to write.

up_ipaddr: ip address if the ip address is set for the associated ip device/interface.

up_next_fd:  first element in the fd table to start with when inet tries to send all unsent packets in restart_write_port.

up_write_fd:  current element in the fd table which inet is trying to write for udp.

up_port_any:  head of linked list of fds which receive messages on any local port.

up_port_hash:  hash table of fds which receive messages on a particular local port.
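The way up_port_any and up_port_hash organize the fds can be sketched as follows. The bucket count, the hash function, the local_port field, and the helper names are assumptions made for the illustration; only the idea of hashing on the local port and chaining fds through uf_port_next is taken from the structures described in this section.

```c
#include <stddef.h>

#define UDP_PORT_HASH_NR 16   /* bucket count: value assumed */

typedef struct udp_fd {
    unsigned short local_port;      /* hypothetical field for the example */
    struct udp_fd *uf_port_next;    /* next fd in the same bucket */
} udp_fd_t;

typedef struct udp_port {
    udp_fd_t *up_port_any;                      /* fds bound to any port */
    udp_fd_t *up_port_hash[UDP_PORT_HASH_NR];   /* fds bound to one port */
} udp_port_t;

/* Hypothetical hash: fold the high byte in to the low byte, then mask. */
static unsigned port_hash(unsigned short port)
{
    unsigned h = port;

    h ^= h >> 8;
    return h & (UDP_PORT_HASH_NR - 1);
}

/* Add fd to the bucket for its local port. */
static void port_hash_add(udp_port_t *up, udp_fd_t *fd)
{
    unsigned h = port_hash(fd->local_port);

    fd->uf_port_next = up->up_port_hash[h];
    up->up_port_hash[h] = fd;
}

/* Find the first fd listening on the given local port, or NULL. */
static udp_fd_t *port_hash_find(const udp_port_t *up, unsigned short port)
{
    udp_fd_t *fd;

    for (fd = up->up_port_hash[port_hash(port)]; fd != NULL;
         fd = fd->uf_port_next)
        if (fd->local_port == port)
            return fd;
    return NULL;
}
```

With this hash, ports 53 and 69 land in the same bucket, so the lookup has to walk the uf_port_next chain and compare the port of each fd.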

The structure of the elements of the fd table udp_fd_table is defined in udp.c as

typedef struct udp_fd

{

        int uf_flags;

        udp_port_t *uf_port;

        ioreq_t uf_ioreq;

        int uf_srfd;

        nwio_udpopt_t uf_udpopt;

        get_userdata_t uf_get_userdata;

        put_userdata_t uf_put_userdata;

        acc_t *uf_rdbuf_head;

        acc_t *uf_rdbuf_tail;

        size_t uf_rd_count;

        size_t uf_wr_count;

        time_t uf_exp_tim;

        struct udp_fd *uf_port_next;

} udp_fd_t;

uf_flags:  state of fd table element.

uf_port:  points to the port table entry in udp_port_table used by fd.

uf_ioreq:  current ioctl request.

uf_srfd:  index to sr fd table in sr.c.

uf_udpopt:  nwio_udpopt_t structure which stores configuration information.  For more details look at ip.4 manual.

uf_get_userdata:  pointer to sr_get_userdata in sr.c.

uf_put_userdata:  pointer to sr_put_userdata in sr.c.

uf_rdbuf_head:  head of the queue of received packets which are waiting to be read.

uf_rdbuf_tail:  tail of the queue of received packets which are waiting to be read.

uf_rd_count:  set in the read system call.  Declares how many bytes to be read to buffer.

uf_wr_count:  set in the write system call.  Declares how many bytes to be written from buffer.

uf_exp_tim:  expiration time of packet.  If the expiration time is before time that packet is read, the packet is dumped and not read.

uf_port_next:  next pointer in linked list or hashtable of currently used port table entries.

udp_init():  does initialization.  Initializes the udp port table so that up_minor is the appropriate minor device number and up_ipdev is the index to the appropriate ip port table entry (ie network interface).  It calls sr_add_minor to add itself with the appropriate minor device number in to the sr fd table.  sr_add_minor sets the fields of the sr_fd structure to the udp specific values and functions except for the fields srf_flags and srf_fd.  srf_flags is used to store the state of the sr_fd structure.  udp_init also calls udp_main to do some initialization.  When udp_main is called in udp_init, udp_port->up_state is UPS_EMPTY.

udp_main():  When the up_state is UPS_EMPTY, udp_main does some initialization.  First ip_open is called (with the appropriate ip port table index) to initialize the ip fd table entry.  ip_open sets the function pointers of the ip_fd_t structure to the udp specific functions.  ip_open also sets the if_srfd field of the structure to the index of the udp fd table element.  ip then checks ...  ip_open then returns the index of the ip_fd_t structure in the ip fd array.  When ip_open returns, the up_ipfd field is set to the index returned by ip_open.  udp_main then does some initializing by calling ip_ioctl.

udp_main is called with the up_state = UPS_GETCONF from udp_get_data.  udp_get_data is called by the ip protocol during the just mentioned call to ip_ioctl, while the port is in UPS_SETPROTO mode.  If the ip_ioctl cannot be completed for some reason it returns a result of NW_SUSPEND and the udp port is placed in suspend mode until the ip_ioctl call can be completed.  udp_get_data is called twice by ip_ioctl.  The first time it is called with count != 0 in order to get the nwio_ipopt_t structure (the structure acts as a parameter to the ioctl call).  When the ip_ioctl call finishes it calls udp_get_data again but this time with count = 0 and offset != 0.  In this second call offset actually carries the result value of the ip_ioctl call.  If the ip_ioctl call can be completed (ip_ioctl returns without a NW_SUSPEND result) then the up_state should have been set to UPS_GETCONF and hence it falls through to the next case statement.

Calling udp_main with up_state = UPS_GETCONF first unsuspends the udp port.  Next it calls ip_ioctl to get the configuration (NWIOGIPCONF).  If the ip_ioctl cannot be completed for some reason it returns a result of NW_SUSPEND and the udp port is placed in suspend mode until the ip_ioctl call can be completed.  While completing the call, ip_ioctl calls udp_put_data with the port in UPS_GETCONF mode.  udp_put_data is called twice by ip_ioctl.  The first time it is called with count != 0 in order to return the nwio_ipconf structure (the structure acts as a parameter to the ioctl call) to udp.  When the ip_ioctl call finishes it calls udp_put_data again, but this time with count = 0 and offset != 0.  In this second call offset actually carries the result value of the ip_ioctl call.  If the ip_ioctl call can be completed (ip_ioctl returns without a NW_SUSPEND result) then up_state should have been set to UPS_MAIN and hence execution falls through to the next case statement.

Calling udp_main with up_state = UPS_MAIN occurs after udp_put_data or udp_get_data is called in order to return the result of a previous read/write/ioctl call.  Calling udp_main with up_state = UPS_MAIN first unsuspends the udp port.  Next it restarts the calls of all udp fds which have had their calls suspended.  Finally it attempts to restart reading ip packets by calling read_ip_packets.
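The progression through the port states described above can be pictured with a small sketch.  Only the state names are taken from the text; sk_port, sk_ioctl, sk_ioctl_done and the SK_ return codes are hypothetical stand-ins for the udp port, ip_ioctl, the count = 0 callback and NW_SUSPEND.  It is an illustration of the fall-through idea under those assumptions, not the code in udp.c.

```c
#include <assert.h>
#include <stddef.h>

/* State names follow the text; the numeric values are arbitrary. */
enum ups_state { UPS_EMPTY, UPS_SETPROTO, UPS_GETCONF, UPS_MAIN };

#define SK_OK        0    /* hypothetical: the ioctl completed       */
#define SK_SUSPEND (-1)   /* hypothetical stand-in for NW_SUSPEND    */

struct sk_port {
    enum ups_state state;
    int suspended;        /* stand-in for the port's suspend flag    */
    int ioctl_ready;      /* knob: can the next ioctl complete now?  */
};

void sk_main(struct sk_port *p);

/* Mimics the count == 0 callback: record the new state and, if the
 * port was suspended, re-enter sk_main to continue from there. */
void sk_ioctl_done(struct sk_port *p, enum ups_state next)
{
    assert(p != NULL);
    p->state = next;
    if (p->suspended)
        sk_main(p);
}

/* Stand-in for ip_ioctl: either completes at once or suspends. */
static int sk_ioctl(struct sk_port *p, enum ups_state next)
{
    if (!p->ioctl_ready)
        return SK_SUSPEND;
    sk_ioctl_done(p, next);
    return SK_OK;
}

void sk_main(struct sk_port *p)
{
    assert(p != NULL);
    switch (p->state) {
    case UPS_EMPTY:
        p->state = UPS_SETPROTO;
        if (sk_ioctl(p, UPS_GETCONF) == SK_SUSPEND) {
            p->suspended = 1;
            return;
        }
        /* fall through */
    case UPS_GETCONF:
        p->suspended = 0;
        if (sk_ioctl(p, UPS_MAIN) == SK_SUSPEND) {
            p->suspended = 1;
            return;
        }
        /* fall through */
    case UPS_MAIN:
        p->suspended = 0;
        /* suspended fds would be restarted and reading begun here */
        return;
    case UPS_SETPROTO:
        /* still waiting for the first ioctl; nothing to do */
        return;
    }
}
```

When both ioctls complete immediately, a single call to sk_main carries the port from UPS_EMPTY to UPS_MAIN.  When the first one suspends, the port stays in UPS_SETPROTO until sk_ioctl_done is called, which then resumes the switch at the UPS_GETCONF case.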

udp_open is called in sr.c when an open system call is performed on a udp device.  udp_open marks a udp fd as used, does some initialization, and sets the appropriate functions to get and put messages from and to a sr fd.  It returns the index in the udp fd table of the udp fd element used for the udp/ip channel.

udp_get_data is called by the ip protocol.  It is called with count != 0 to get the data (usually passed by the user, eg in a write system call) to the inet process.  It is called with count = 0 to return an integer result (eg the integer result of a write system call).
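The two-phase calling convention (data when count != 0, result when count = 0) can be sketched as follows.  The names sk_chan and sk_get_data and the signature are invented for illustration; the real udp_get_data works on acc_t buffers and has a different prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical channel: the bytes a user wants written, plus a slot
 * for the integer result delivered by the final call. */
struct sk_chan {
    const char *user_data;
    size_t      user_len;
    int         last_result;
    int         done;
};

/* count != 0: copy up to `count` bytes starting at `offset` into `out`
 *             and return the number of bytes copied.
 * count == 0: no data is wanted; `offset` carries the integer result
 *             of the whole operation. */
long sk_get_data(struct sk_chan *c, long offset, size_t count, char *out)
{
    assert(c != NULL);
    if (count == 0) {
        c->last_result = (int)offset;
        c->done = 1;
        return 0;
    }
    assert(out != NULL && offset >= 0);
    if ((size_t)offset >= c->user_len)
        return 0;
    if (count > c->user_len - (size_t)offset)
        count = c->user_len - (size_t)offset;
    memcpy(out, c->user_data + offset, count);
    return (long)count;
}
```

A caller in the role of ip would first call sk_get_data(&c, 0, n, buf) to fetch the data and later sk_get_data(&c, result, 0, NULL) to report the outcome.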

udp_get_data is called in mode UPS_SETPROTO when udp_init is called.  udp_init configures the corresponding ip fd element by calling ip_ioctl with NWIOSIPOPT as a parameter.  udp_get_data returns a nwio_ipopt structure to ip_ioctl with the protocol type field (nwio_proto) of the structure set to the udp protocol (IPPROTO_UDP).  nwio_proto needs to be set to IPPROTO_UDP because ip determines to which protocol it should deliver arrived messages by looking at the nwio_proto field.

udp_get_data is called in mode UPS_MAIN after udp is configured, so that ip should be performing a write/ioctl.  If called with count != 0 it returns data (usually passed by the user).  If called with count = 0 it returns an integer result (eg the integer result of a write system call).  After the integer result is passed, udp_get_data checks if the port is suspended on a write call.  If so it attempts to rewrite the message stored in up_write_fd.  up_write_fd stores the first fd which has a suspended write call.  If there are still suspended write calls it attempts to restart them by calling udp_restart_write_port.

udp_put_data is called by the ip protocol.  It is called with count != 0 to send the data (usually requested by the user, eg in a read system call) to udp and often on to the user process.  It is called with count = 0 to return an integer result (eg the integer result of a read system call).

udp_put_data is called in mode UPS_GETCONF when udp_init is called.  udp_init gets the configuration of the corresponding ip fd element by calling ip_ioctl with NWIOGIPCONF as a parameter.  udp_put_data returns a nwio_ipconf structure and udp sets the up_ipaddr field to the ip address of the associated network interface.

udp_put_data is called in mode UPS_MAIN after udp is configured, so that ip should be performing a read/ioctl.  If called with count != 0 it returns data back to udp and often also to the user (ie in a read system call).  If called with count = 0 it returns an integer result (eg the integer result of a read system call).  After the integer result is passed, udp_put_data checks if the port is suspended on a read call.  If so it attempts to restart reading the messages by calling read_ip_packets.

udp_ioctl implements the ioctl system call on udp devices.  udp_ioctl saves the ioctl call in uf_ioreq in case the call cannot be completed and the fd is suspended.

udp_setopt implements the ioctl system call with parameter NWIOSUDPOPT on udp devices.  Details of NWIOSUDPOPT are given in the ip.4 manual.  If NWUO_LP_SEL is chosen as the method of choosing the local port, inet is requested to pick a local port.  A local port is picked by calling find_unused_port.  If two different udp fds have the exact same local port then the two fds must also have the exact same access modes.  Access modes are checked with the mask NWUO_ACC_MASK.

find_unused_port returns an unused port.

reply_thr_put returns the integer result for a call to uf_put_userdata.

reply_thr_get returns the integer result for a call to uf_get_userdata.

is_unused_port determines whether a port is being used by looking through the udp fd table.

read_ip_packets reads all packets which are in the ip queue waiting to be read.  read_ip_packets needs to be called at the beginning and after udp_put_data has been called in order to ensure that packets in the queue are read.

udp_read checks to see if there is a received packet in the uf_rdbuf_head linked list.  If there is it returns the packet to the user process by calling udp_packet2user.  If there is not it returns a NW_SUSPEND which suspends the user process.

udp_packet2user sends the packet uf_rdbuf_head to the user process.  If NWUO_RWDATONLY is set then the udp_io_hdr_t header and any ip options following the header are not returned at the start of the udp packet.  The new packet is returned adjusted for the size stored in uf_rd_count, which was set by the read system call.  The packet is then sent back to the user by calling uf_put_userdata, and then an integer result is passed back to the user by calling uf_put_userdata again.  If the size requested in the read system call is greater than or equal to the size of the packet, the size of the returned packet is returned as the integer result; else (ie the size requested in the read system call is less than the size of the packet) the integer result returned is EPACKSIZE.
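The rule for the integer result can be written down in a few lines.  This is only a sketch: sk_read_result is a made-up name, SK_EPACKSIZE is a placeholder because the numeric value of EPACKSIZE is not given here, and it assumes the size comparison is made after the header has been stripped in the NWUO_RWDATONLY case.

```c
#include <assert.h>
#include <stddef.h>

#define SK_EPACKSIZE (-1)   /* placeholder value, not the real errno */

/* data_only   : nonzero if the header and ip options are stripped
 * hdr_len     : size of the udp_io_hdr_t-like header
 * ip_opt_len  : size of the ip options following the header
 * packet_size : size of the queued packet, header included
 * requested   : byte count given in the read call               */
long sk_read_result(int data_only, size_t hdr_len, size_t ip_opt_len,
                    size_t packet_size, size_t requested)
{
    size_t skip = data_only ? hdr_len + ip_opt_len : 0;
    size_t visible;

    assert(skip <= packet_size);
    visible = packet_size - skip;   /* what the user would receive */
    if (requested >= visible)
        return (long)visible;
    return SK_EPACKSIZE;            /* user buffer too small */
}
```

For example, a 112-byte queued packet with an 8-byte header and 4 bytes of ip options leaves 100 bytes of data, so a read of 100 bytes or more succeeds while a smaller read yields the error.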

udp_ip_arrived is called by the ip protocol and by udp_put_data to choose which fd to send the packet to.  After udp_ip_arrived has chosen, udp_ip_arrived then sends the packet to the user by calling udp_packet2user.  The ip header is automatically deleted from the return packet by calling bf_delhead.  Next there is a checksum check.  Next it checks over the ip options from the ip header.  If there are ip options the ip options are placed in the packet after the udp_io_hdr and the length of the ip options is stored in the uih_ip_opt_len field of the udp_io_hdr header.

udp_ip_arrived next chooses which fd(s) to send the packet to.  It does this by looking at the local port number and the remote port number of the fd.  It first looks at the linked list whose head is pointed to by the up_port_any field of the port table entry.  up_port_any is the head of a linked list of udp fds which receive messages on any local port.  Then it looks at up_port_hash, which is a hash table of fds which receive messages on a specific local port.  If the udp fd has the proper credentials (proper local port, remote port, etc.), udp_rd_enqueue is called to put the packet in the queue of the fd.  If the user process with the file descriptor fd is suspended on a read at the time, udp_packet2user is called to pass the packet to the user.
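The two-step lookup (the "any port" list first, then the hash bucket for the destination port) might look roughly like the sketch below.  All names are hypothetical, the hash function and table size are arbitrary, and treating remote_port = 0 as "accept any remote port" is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define SK_HASH_NR 16             /* hypothetical table size */

struct sk_fd {
    unsigned short local_port;    /* ignored for fds on the "any" list */
    unsigned short remote_port;   /* 0 = any remote port (assumption)  */
    int queued;                   /* packets queued for this fd        */
    struct sk_fd *next;
};

struct sk_port {
    struct sk_fd *port_any;               /* like up_port_any  */
    struct sk_fd *port_hash[SK_HASH_NR];  /* like up_port_hash */
};

static unsigned sk_hash(unsigned short port)
{
    return port % SK_HASH_NR;     /* arbitrary hash for the sketch */
}

/* Put an fd on the "any" list or in the bucket for its local port. */
void sk_hash_fd(struct sk_port *p, struct sk_fd *fd, int has_local_port)
{
    struct sk_fd **head;

    assert(p != NULL && fd != NULL);
    head = has_local_port ? &p->port_hash[sk_hash(fd->local_port)]
                          : &p->port_any;
    fd->next = *head;
    *head = fd;
}

/* Queue the packet on every matching fd of one list. */
static int sk_deliver(struct sk_fd *fd, int check_local,
                      unsigned short dst, unsigned short src)
{
    int n = 0;
    for (; fd != NULL; fd = fd->next) {
        if (check_local && fd->local_port != dst)
            continue;             /* same bucket, different port */
        if (fd->remote_port != 0 && fd->remote_port != src)
            continue;
        fd->queued++;
        n++;
    }
    return n;
}

/* Returns the number of fds the packet was queued on. */
int sk_arrived(struct sk_port *p, unsigned short dst, unsigned short src)
{
    assert(p != NULL);
    return sk_deliver(p->port_any, 0, dst, src)
         + sk_deliver(p->port_hash[sk_hash(dst)], 1, dst, src);
}
```

Because several local ports can share a bucket, the local port still has to be compared inside the bucket; the hash only narrows the search.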

udp_close implements the close system call on a udp file descriptor fd.

udp_write implements the write system call on a udp file descriptor fd.  It calls restart_write_fd to attempt to write the packet.  If it does not succeed it suspends the fd.

restart_write_fd attempts to write a packet on a file descriptor fd.  If the udp port is currently busy trying to write another packet, restart_write_fd marks the port as having more to write and returns; else it gets the data it needs to write (by calling uf_get_userdata).  If NWUO_RWDATALL is set, the udp_io_hdr_t header is part of the data packet sent to the inet process.  restart_write_fd then calls ip_write to write the packet.  If ip_write returns NW_SUSPEND, udp suspends the file descriptor; else udp returns the result (usually of the write system call) by calling reply_thr_get.

pack_oneCsum is used to compute the checksum in the udp header.

udp_restart_write_port is called to restart writing all suspended packets on a specific port.  UPF_MORE2WRITE is set when udp attempts to write a packet but fails because the port is currently attempting to write another packet.

udp_cancel is called to cancel a read/write/ioctl.

udp_buffree is used by buf.c to release buffers.

udp_rd_enqueue is called by udp_ip_arrived to queue a received message.

hash_fd puts the udp_fd in the hash table up_port_hash (if a local port has been selected) or in the linked list up_port_any (if a local port has not been selected) of the port table entry (uf_port).  up_port_hash points to the hash table of fds which have selected local ports.  up_port_any points to the head of the linked list of fds which have not selected local ports.

unhash_fd removes the udp_fd from the hash table up_port_hash or the linked list up_port_any of the port table entry (uf_port).

ip.c

ip_init() is called in inet.c to perform initialization.  ip_init initializes the ip_ass_table.  The ip_ass_table is used to reassemble fragmented packets which are parts of the same datagram.  ip_init calls either the function ipeth_init or ipps_init to initialize the functions for the array of ip_port_t  structures (the table is called ip_port) which it uses to send messages and set ip addresses ie ip_dev_set_ipaddr, ip_dev_send, ip_dev_main.  ip_port_t is defined in ip_int.h.

The ip_port_table is the data structure which stores information for each network interface.  There is a 1-1 correspondence between the network interfaces and the elements of the ip_port_table.

ip_init initializes the ip_frame_id field with the current time.  The ip_frame_id is used as the datagram id number for the outgoing datagram.  ip_init then calls sr_add_minor so that the user can call ip directly in a system call by using ip’s minor device number.  Finally, ip_init calls ip_dev_main to perform some initialization.  ip_dev_main is the network interface specific initialization.

ip_open() is used in 2 different ways: 1) by sr.c to get protocol specific information for the ip channel if the ip device was opened directly by the user (eg by calling Open(“/dev/ip”, whatever_mode_you_want)); 2) by another protocol which uses the ip protocol (eg udp), as when ip_open is called in udp_main by the udp protocol.  ip_open marks an ip fd as used, does some initialization, and sets the appropriate functions to get and put messages from and to a sr fd.  It returns the index in the ip fd table of the ip_fd_t element used for the ip channel.

ip_close implements the close system call on an ip file descriptor fd.

ip_buffree is used by buf.c to release buffers.

ip_read.c

ip_read() is called in 2 ways.  1) ip_read is called by a layer in order to read a packet.  For example read_ip_packets attempts to read a packet by calling ip_read.  2) If a user process directly opens the ip device (eg by calling Open(“/dev/ip”, whatever_mode_you_want)) then ip_read is called through the corresponding file descriptor in the sr_fd_table in sr.c.  ip_read() is very similar to udp_read().

ip_read checks to see if there is a received packet in the if_rdbuf_head linked list.  If there is it returns the packet to the user process by calling packet2user.  If there is not it returns a NW_SUSPEND which suspends the user process.

ip_read first checks to see if the ip_fd_table element has been properly configured.  This is checked by testing if the IFF_OPTSET flag is set.

ip_arrived is called by the ethernet code or the psip code to inform ip that a new packet has just arrived.

ip_arrived calls ip_frag_chk to check the checksum and the header lengths.  Next it checks the destination ip address of the packet.

If the destination ip address of the packet is the same as the ip address of the port ip_port then ip_arrived calls ip_port_arrive to notify the port that a packet has arrived for the port.  Next ip_arrived checks to see if the ip address is a broadcast address by calling broadcast_dst.  If it is, it calls ip_port_arrive.

Next ip_arrived decrements the time to live (ih_ttl) field of the ip packet.  Next ip_arrived recalculates the checksum field (ih_hdr_chk) of the ip header by calling ip_hdr_chksum.  ip_nettype returns the class type of the ip address, eg Class A, Class B, etc.  The possible class types of an ip address are defined in RFC 791.  If the class type is not A, B, or C, then the packet is discarded.
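Because the checksum covers the time to live field, the header checksum has to be refreshed whenever ih_ttl is decremented.  The sketch below shows the standard Internet (one's complement) header checksum on a raw byte array.  The function names are made up and the code works on plain bytes rather than on the ip_hdr_t structure, so it should be read as an illustration of the arithmetic, not as ip_hdr_chksum itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement sum of 16-bit big-endian words, complemented.
 * The checksum bytes must be zero when computing a fresh value. */
uint16_t sk_hdr_chksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(hdr != NULL && len % 2 == 0);
    for (i = 0; i < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                       /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrement the TTL (byte 8 of an IPv4 header) and store a fresh
 * checksum in bytes 10 and 11. */
void sk_forward_fixup(uint8_t *hdr, size_t len)
{
    uint16_t c;

    assert(hdr != NULL && len >= 20 && hdr[8] > 0);
    hdr[8]--;
    hdr[10] = 0;
    hdr[11] = 0;
    c = sk_hdr_chksum(hdr, len);
    hdr[10] = (uint8_t)(c >> 8);
    hdr[11] = (uint8_t)(c & 0xff);
}
```

A useful property for validation: running sk_hdr_chksum over a header that already contains a correct checksum gives 0.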

Next it looks up the destination in the route table of the port, by calling iroute_frag, to find out where ip_arrived should route the packet.  Suppose iroute_frag tells ip_arrived to route it to a different port, ie network interface.  If the route table entry has a gateway for that port then ip_arrived sends the packet to the gateway.  If the destination is the same as the ip address of the other port then ip_arrived informs the other port that a new packet has arrived by calling ip_port_arrive.  Next ip_arrived checks to see if the ip address is a broadcast address.  Finally ip_arrived sends it to the other port.

Next ip_arrived checks to see if there’s a gateway on the port.  If not it throws an error.

broadcast_dst returns 1 if the destination ip address dest is a broadcast address with respect to the port ip_port; else broadcast_dst returns 0.  broadcast_dst looks at the net mask and subnet mask of ip_port to determine if dest is a broadcast address.
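A broadcast test based on the two masks could look like the following sketch.  It assumes addresses are held in host byte order and treats a destination as broadcast when it is all ones, or when it lies on the port's own network (or subnet) and its host part is all ones.  The exact set of cases handled by the real broadcast_dst may differ; sk_broadcast_dst and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* All addresses and masks are in host byte order (an assumption). */
int sk_broadcast_dst(uint32_t my_addr, uint32_t netmask,
                     uint32_t subnetmask, uint32_t dest)
{
    /* the subnet mask must extend the net mask */
    assert((netmask & subnetmask) == netmask);

    if (dest == 0xffffffffu)
        return 1;                           /* limited broadcast        */
    if (((dest ^ my_addr) & netmask) != 0)
        return 0;                           /* some other network       */
    if ((dest & ~netmask) == ~netmask)
        return 1;                           /* net-directed broadcast   */
    if (((dest ^ my_addr) & subnetmask) != 0)
        return 0;                           /* some other subnet        */
    return (dest & ~subnetmask) == ~subnetmask; /* subnet broadcast     */
}
```

With my_addr 192.168.1.10, a class C net mask and a subnet mask of 255.255.255.240, both 192.168.1.255 and 192.168.1.15 count as broadcast while 192.168.1.11 does not.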

ip_port_arrive() notifies a port that a packet has arrived on the port (ie network interface).  First it checks to see if the received packet is a fragment, so that it is merely one packet of a fragmented datagram.  It does so by testing the flags:

if (ntohs(ip_hdr->ih_flags_fragoff) & (IH_FRAGOFF_MASK|IH_MORE_FRAGS))

If it is then ip_port_arrive calls reassemble to merge the packet with any other packet from the same datagram.  If  some of the packets of the datagram still haven’t arrived then reassemble() returns a null and the function ip_port_arrive is exited by calling return.
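The flag test above can be tried out in isolation.  The sketch assumes the usual RFC 791 layout of the flags/fragment offset word (more-fragments bit 0x2000, 13-bit offset counted in units of 8 bytes) and takes the field already converted to host byte order.  The SK_ constants mirror IH_MORE_FRAGS and IH_FRAGOFF_MASK, but their values here are an assumption rather than a quotation from the Minix headers.

```c
#include <assert.h>
#include <stdint.h>

#define SK_FRAGOFF_MASK 0x1fff   /* assumed value of IH_FRAGOFF_MASK */
#define SK_MORE_FRAGS   0x2000   /* assumed value of IH_MORE_FRAGS   */

/* Nonzero if the packet is one piece of a fragmented datagram:
 * either more fragments follow, or it does not start at offset 0. */
int sk_is_fragment(uint16_t flags_fragoff)
{
    return (flags_fragoff & (SK_FRAGOFF_MASK | SK_MORE_FRAGS)) != 0;
}

/* Byte offset of this fragment within the original datagram. */
unsigned sk_frag_byte_offset(uint16_t flags_fragoff)
{
    return (unsigned)(flags_fragoff & SK_FRAGOFF_MASK) * 8u;
}

/* Is this the final fragment of a fragmented datagram? */
int sk_is_last_fragment(uint16_t flags_fragoff)
{
    assert(sk_is_fragment(flags_fragoff));
    return (flags_fragoff & SK_MORE_FRAGS) == 0;
}
```

Note that a packet with only the don't-fragment bit (0x4000) set is not a fragment, and that the last fragment is recognised only by its nonzero offset, since its more-fragments bit is clear.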

ip_port_arrive is very much like udp_ip_arrived.  ip_port_arrive chooses which fd(s) to send the packet to.  It does this by looking at the protocol field of the fd, ie if_ipopt.nwio_proto.  It first looks at the linked list whose head is pointed to by the ip_proto_any field of the port table entry.  ip_proto_any is the head of a linked list of ip fds which receive messages for any protocol.  Then it looks at ip_proto, which is a hash table of fds which receive messages for a specific protocol.  If the ip fd is using the correct protocol, packet2user is called to put the packet in the queue of the fd.

packet2user tries to send a packet to a layer or a user.  If nothing is trying to read on the file descriptor fd at the time, then the packet is placed in the queue of the ip fd, where the head of the queue is pointed to by if_rdbuf_head.  If something is trying to read on the file descriptor fd, packet2user sends it to the user or layer trying to read on the file descriptor fd.  If NWIO_RWDATONLY is set then the ip header is removed from the returned packet.  The new packet is returned adjusted for the size stored in if_rd_count, which was set by the read system call.  The packet is then sent back to the user by calling either if_put_pkt or if_put_userdata.  if_put_pkt (when defined) is used to return received packets to the user, as in a read system call.  if_put_userdata is used to return data from ioctl system calls.  If if_put_pkt is defined the data is returned with the transfer size as a parameter; else, if if_put_pkt is not defined, if_put_userdata is called twice, in a way very similar to how udp_packet2user calls uf_put_userdata twice:  the packet is first sent back to the user by calling if_put_userdata, and then an integer result is passed back to the user by calling if_put_userdata again.  If the size requested in the read system call is greater than or equal to the size of the packet, the size of the returned packet is returned as the integer result; else (ie the size requested in the read system call is less than the size of the packet) the integer result returned is EPACKSIZE.

ip_frag_chk checks the checksum and the header lengths and does other validations of the ip header.

ip_arrived_broadcast is called by the network interface (psip or ethernet code) when a broadcast message has arrived.  ip_arrived_broadcast calls ip_frag_chk to check the checksum and the header lengths.  Next it checks the destination ip address of the packet to see if it is a broadcast address by calling broadcast_dst.  If it is not, it reports an error and exits the function; else ip_arrived_broadcast calls ip_port_arrive.

Fragmentation is defined in the ip protocol; see RFC 791 for the details.  The data structure which handles a fragmented datagram is the ip_ass_t structure.

typedef struct ip_ass

{

        acc_t *ia_frags;

        int ia_min_ttl;

        ip_port_t *ia_port;

        time_t ia_first_time;

        ipaddr_t ia_srcaddr, ia_dstaddr;

        int ia_proto, ia_id;

} ip_ass_t;

ia_frags:  points to a packet which is the head of linked list of packets.  The linked list is a list of packets which are part of the same datagram.

ia_min_ttl:  time to live for the datagram.  If (ia_first_time + ia_min_ttl) < (time when every packet for the datagram has arrived, and hence the datagram can be reassembled) the datagram is dumped.

ia_port:  pointer to port in ip_port_table.

ia_first_time:  initialized to 0.  Time when first packet for the datagram arrived.

ia_srcaddr:  source ip address for packet(s) of datagram.

ia_dstaddr:  destination ip address for packet(s) of datagram.

ia_proto:  protocol of packet(s) of datagram.

ia_id:  datagram id number for datagram.  Every packet for the same datagram should have the same datagram id number.

reassemble():  reassemble attempts to reassemble a fragmented datagram which has been received in several different packets.  First reassemble calls find_ass_ent to return an ip_ass_t structure.  If no packet for the datagram had previously arrived, a new ip_ass_t structure is returned by find_ass_ent; else, if a packet for the datagram has already arrived, the previously used ip_ass_t structure is returned by find_ass_ent.  Next, if the packet just received is the first packet received for the datagram, reassemble returns null.  Otherwise reassemble merges all the packets received for the datagram by calling merge_frags.  If all of the packets for the datagram have been received the completed datagram is returned by reassemble; else reassemble returns null.

find_ass_ent():  find_ass_ent returns a ip_ass_t structure for a fragmented datagram.  The ip_ass_t structure is held in the ip_ass_table array.  First find_ass_ent looks through the ip_ass_table array.  If a packet of the datagram has already been received find_ass_ent returns the ip_ass_t structure which is already being used to reassemble the datagram; else it uses a new ip_ass_t structure for it.

merge_frags(first, second):  merge_frags merges two acc_t structures first and second so that the data in the first structure comes before the data in the second structure.
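The bookkeeping behind reassembly can be illustrated without any buffer handling.  The sketch below only tracks fragment offsets and lengths: it keeps the fragments ordered by offset and reports the datagram as complete once there is no hole and the last fragment has its more-fragments flag clear.  The real code merges acc_t buffers through merge_frags; sk_ass, sk_frag and sk_reassemble are hypothetical names, and the fixed-size array is a simplification.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_FRAGS 16          /* arbitrary limit for the sketch */

struct sk_frag {
    unsigned off;                /* byte offset within the datagram */
    unsigned len;                /* payload bytes in this fragment  */
    int more;                    /* more-fragments flag             */
};

struct sk_ass {
    struct sk_frag frags[SK_MAX_FRAGS];
    size_t n;
};

/* Insert a fragment, keeping the list ordered by offset.  Returns 1
 * when the datagram is complete, 0 while fragments are missing. */
int sk_reassemble(struct sk_ass *a, struct sk_frag f)
{
    size_t i, pos = 0;
    unsigned next = 0;           /* first byte not yet covered */

    assert(a != NULL && a->n < SK_MAX_FRAGS);

    while (pos < a->n && a->frags[pos].off < f.off)
        pos++;
    for (i = a->n; i > pos; i--)
        a->frags[i] = a->frags[i - 1];
    a->frags[pos] = f;
    a->n++;

    for (i = 0; i < a->n; i++) {
        if (a->frags[i].off > next)
            return 0;            /* hole before this fragment */
        if (a->frags[i].off + a->frags[i].len > next)
            next = a->frags[i].off + a->frags[i].len;
    }
    return a->frags[a->n - 1].more == 0;
}
```

Fragments may arrive in any order; the function simply answers "not yet" until the final piece closes the last gap.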

ip_process_loopb():

ip_ioctl.c

ip_ioctl implements the ioctl system call on an ip device.  A file descriptor to an ip device is obtained by using the open system call, eg Open(“/dev/ip”, whatever_mode_you_want).  The general description of what ioctl does, depending on the request parameter, is given in the ip.4 manual and therefore will not be covered here.  The request parameter is the req parameter in the ip_ioctl function.

reply_thr_get returns an integer result back to the user.  It does this by calling if_get_userdata.

As mentioned before, for an ioctl call, you can return an integer result to the user either by calling sr_put_userdata(x, y, 0, 1) or sr_get_userdata(x, y, 0, 1).  Since the ip_ioctl function is only called by sr.c and not by a layer/protocol (eg udp), if_get_userdata/if_put_userdata is always a pointer to the functions sr_get_userdata/sr_put_userdata in sr.c.  Hence it doesn’t matter if we return an integer result by calling if_get_userdata or if_put_userdata.  In ip_ioctl.c an integer result is returned to the user through sr_get_userdata, by calling reply_thr_get.

ip_ioctl implements the system call in a way very similar to how udp_ioctl implements the ioctl system call.  ip_ioctl calls if_get_userdata to get the structure passed in the ioctl system call.

ip_ioctl is called with a request of NWIOSIPOPT to configure an ip channel, ie a file descriptor which uses the ip protocol.  In that case, ip_checkopt is called by ip_ioctl.  ip_checkopt checks to make sure that the configuration settings passed by the user are valid.  If NWIO_HDR_O_SPEC is passed as a flag in the ioctl call the ip header options are checked by calling ip_chk_hdropt.

ip_checkopt calls ip_hash_proto if the proposed configuration is valid.

ip_hash_proto() puts the ip_fd in the hash table ip_proto (if a protocol has been selected) or in the linked list ip_proto_any (if a protocol has not been selected) of the port table (if_port).  ip_proto points to the hash table of fds which have selected a protocol.  ip_proto_any points to the head of the linked list which have not selected a protocol.

ip_unhash_proto removes the ip_fd from the hash table ip_proto or linked list ip_proto_any of the port table (if_port).

typedef struct ip_port

{

        int ip_flags, ip_dl_type;

        union

        {

                struct

                {

                    int de_state;

                    int de_flags;

                    int de_port;

                    int de_fd;

                    acc_t *de_frame;

                    acc_t *de_q_head;

                    acc_t *de_q_tail;

                    acc_t *de_arp_head;

                    acc_t *de_arp_tail;

                } dl_eth;

                struct

                {

                    int ps_port;

                    acc_t *ps_send_head;

                    acc_t *ps_send_tail;

                } dl_ps;

        } ip_dl;

        int ip_minor;

        ipaddr_t ip_ipaddr;

        ipaddr_t ip_netmask;

        ipaddr_t ip_subnetmask;

        u16_t ip_frame_id;

        u16_t ip_mss;

        ip_dev_t ip_dev_main;

        ip_dev_t ip_dev_set_ipaddr;

        ip_dev_send_t ip_dev_send;

        acc_t *ip_loopb_head;

        acc_t *ip_loopb_tail;

        event_t ip_loopb_event;

        struct ip_fd *ip_proto_any;

        struct ip_fd *ip_proto[IP_PROTO_HASH_NR];

} ip_port_t;

ip_minor:  minor device number of ip device.

ip_ipaddr:  ip address of the associated network interface.

ip_netmask:  net mask of the ip address.  The net mask of the ip address is determined from the class type (class A, B, C, D, E) of the ip address.  Which class type determines which net mask is defined in the proper RFC.

ip_subnetmask:  subnet mask of the ip address.  The subnet mask defines the subnet to which the network interface belongs.

ip_frame_id:  used as the datagram id number for an outgoing datagram.

ip_mss:  maximum segment size, ie the maximum size of a packet.

ip_dev_main:  set by network interface to the appropriate function which does some initialization and other things.

ip_dev_set_ipaddr:  pointer to a function which is called when the ip address is set.  If any calls are waiting for an ip address the calls are returned.

ip_dev_send:  pointer to a function which is called when writing a packet to be sent.

ip_loopb_head:

ip_loopb_event:

ip_proto_any:  head of linked list of fds which receive messages on any protocol.

ip_proto:  hash table of fds which receive messages for a particular protocol.

ip_write.c

ip_write():  ip_write is called to write a packet to a network interface.  ip_write() is called in 2 ways.  1) ip_write is called by a layer in order to write a packet.  For example restart_write_fd in udp attempts to write a packet by calling ip_write.  2) If a user process directly opens the ip device (eg by calling Open(“/dev/ip”, whatever_mode_you_want)) then ip_write is called through the corresponding file descriptor in the sr_fd_table in sr.c.  ip_write() is very similar to udp_write().

ip_write() first checks if the number of bytes requested to be written is greater than IP_MAX_PACKSIZE.  If it is, then an EPACKSIZE error is returned.  Next ip_write() gets the packet to write by calling if_get_userdata.  Finally ip_write() calls ip_send to send the packet.

ip_send() first checks if the ip address has been set by checking if the IPF_ADDRSET flag has been set.  Next ip_send checks if NWIO_RWDATONLY is set.  NWIO_RWDATONLY is set if the packet sent to ip does not include the ip header.  If NWIO_RWDATONLY is set, the ip header is added by ip.

Next ip_send checks if NWIO_HDR_O_SPEC is set.  NWIO_HDR_O_SPEC specifies all IP header options in advance.  IP header options are passed in the iho_data field of the ip_hdropt structure.  The nwio_ipopt structure contains a field which is an ip_hdropt structure.  The nwio_ipopt structure is passed in the ioctl system call when one tries to configure an ip channel with a request of NWIOSIPOPT.  The length of the ip header options is given in the iho_opt_size field of the ip_hdropt structure.  If NWIO_HDR_O_SPEC is set, the ip header options are added to the packet.

Bug in ip_send?  It looks like if NWIO_HDR_O_SPEC is not set and NWIO_RWDATONLY is set then ip_send fails, because ip_hdr->ih_ttl = 0 and ip_send returns a value of EINVAL.

ip_send then sets the ip version, the length of the packet, the fragmentation flags, the datagram id number, the ip source address, and the ip destination address.  If the remote address has not already been specified by the ioctl call with request NWIOSIPOPT, then the destination address provided by the caller is checked for validation by calling chk_destaddr.  Next ip_send calculates the checksum field (ih_hdr_chk) of the ip header by calling ip_hdr_chksum.

ip_send then checks the destination ip address.  If the destination ip address begins with 127 then the packet is addressed to the sender's own machine, ie the packet is being sent to the local loopback.  Local loopback packets are put in the loopback queue.  Each ip port has its own loopback queue.  The head of the loopback queue is pointed to by ip_loopb_head while the tail of the loopback queue is pointed to by ip_loopb_tail.  The loopback queue is processed by ev_process.  ev_process is called in the main function in inet.c.

If the destination ip address is the broadcast address 255.255.255.255, then ip_send sends it by calling ip_dev_send.  If the destination ip address is the same as the ip address of the network interface, the packet is also put in the loopback queue.

Next ip_send checks to see if the destination ip address is on the same subnet as the network interface.  If it is then it sends it to its destination by calling ip_dev_send.
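The order of the destination checks in ip_send can be summarised as a small classifier.  This is a sketch under assumptions: addresses are in host byte order, the names are invented, and the final SK_ROUTED case (consulting the outgoing route table when nothing else matches) is a guess at what follows, since the description above stops at the same-subnet case.

```c
#include <assert.h>
#include <stdint.h>

enum sk_path {
    SK_LOOPBACK,    /* goes on the port's loopback queue          */
    SK_BROADCAST,   /* handed to the device send function         */
    SK_DIRECT,      /* same subnet: handed to the device directly */
    SK_ROUTED       /* assumption: look up a route and a gateway  */
};

/* my_addr, subnetmask and dest are in host byte order (assumption). */
enum sk_path sk_classify(uint32_t my_addr, uint32_t subnetmask,
                         uint32_t dest)
{
    assert(my_addr != 0);               /* address must have been set */

    if ((dest >> 24) == 127)
        return SK_LOOPBACK;             /* 127.x.x.x */
    if (dest == 0xffffffffu)
        return SK_BROADCAST;            /* 255.255.255.255 */
    if (dest == my_addr)
        return SK_LOOPBACK;             /* our own interface address */
    if (((dest ^ my_addr) & subnetmask) == 0)
        return SK_DIRECT;               /* neighbour on our subnet */
    return SK_ROUTED;
}
```

The XOR-and-mask test is the same idiom used by the routing code: two addresses are on the same subnet exactly when they agree in every bit selected by the mask.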

ipr.c

ipr.c is the code for routing of packets.  Packets are routed in ip_read.c and ip_write.c.  A packet needs to be routed when the inet process has to find out to which address it should send the packet in order for the packet to arrive at its destination.  The route tables store the information which is used to determine where a packet should be routed.  There are 2 route tables used in inet:  1) oroute_table and 2) iroute_table.  oroute_table handles the routing for outgoing packets.  iroute_table handles the routing for incoming packets.  oroute_table is an array of oroute_t structures while iroute_table is an array of iroute_t structures.  The oroute_t structure and the iroute_t structure are both defined in ipr.h.  The iroute_t and oroute_t structures are very similar.  Each element in the arrays, whether it is an iroute_t or an oroute_t structure, represents a route in the route table.  Recently used routes are stored in the iroute_hash_table and the oroute_hash_table.  iroute_hash_table handles the routing for incoming packets while the oroute_hash_table handles the routing for outgoing packets.

typedef struct iroute

{

        ipaddr_t irt_dest;

        ipaddr_t irt_gateway;

        ipaddr_t irt_subnetmask;

        int irt_dist;

        int irt_port;

        int irt_flags;

} iroute_t;

irt_dest:  the destination ip address for the route.  This ip address typically represents a subnet or network.  Therefore the last several bits of the ip address would be zero.  This route is used only if the destination address of the packet belongs to the subnet or network represented by the ip address irt_dest.

irt_gateway:  ip address to route all packets which use this route.

irt_subnetmask:  subnet mask of route.  Used with irt_dest to determine whether or not a packet should use this route.

irt_dist:  route distance.  Used to determine which route is best.

irt_port:  index in to ip_port table.  This route is used only for the ip channel which corresponds to the element in the ip_port_table.

irt_flags:  flags to store information about route.

typedef struct oroute

{

        int ort_port;

        ipaddr_t ort_dest;

        ipaddr_t ort_subnetmask;

        int ort_dist;

        i32_t ort_pref;

        ipaddr_t ort_gateway;

        time_t ort_exp_tim;

        time_t ort_timestamp;

        int ort_flags;

        struct oroute *ort_nextnw;

        struct oroute *ort_nextgw;

        struct oroute *ort_nextdist;

} oroute_t;

ort_dest:  the destination ip address for the route.  This ip address typically represents a subnet or network.  Therefore the last several bits of the ip address would be zero.  This route is used only if the destination address of the packet belongs to the subnet or network represented by the ip address ort_dest.

ort_gateway:  ip address to route all packets which use this route.

ort_subnetmask:  subnet mask of route.  Used with ort_dest to determine whether or not a packet should use this route.

ort_dist:  route distance.  Used to determine which route is best.

ort_pref:  used to determine which route is best(???).  Not certain.

ort_port:  index in to ip_port table.  This route is used only for the ip channel which corresponds to the element in the ip_port_table.

ort_flags:  flags to store information about route.

ort_exp_tim:  expiration time for route.  No longer used if current time is later than ort_exp_tim.  Set to 0 if route has no expiration time.

ort_timestamp:  time that the route was added.

ort_nextnw:  head of linked list of route entries which are being used.

ort_nextgw:  head of linked list of route entries which are being used and have the same ort_port, ort_dest, and ort_subnetmask.

ort_nextdist:  head of linked list of route entries which are being used and have the same ort_port, ort_dest, ort_subnetmask, and ort_gateway.
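The expiry rule described for ort_exp_tim (0 means the route never expires; otherwise the route stops being used once the current time has passed it) is small enough to state directly.  sk_route_usable is a hypothetical helper written for this illustration and is not a function in ipr.c.

```c
#include <assert.h>
#include <time.h>

/* Returns 1 if a route with expiration time exp_tim may still be used
 * at time now, 0 if it has expired.  exp_tim == 0 means "no expiry". */
int sk_route_usable(time_t exp_tim, time_t now)
{
    assert(now != (time_t)-1);   /* time() reports failure as -1 */

    if (exp_tim == 0)
        return 1;
    return now <= exp_tim;
}
```

A caller would typically pass time(NULL) as now when walking the list of routes.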

ipr_init():  initialization called in main function of inet.c.  Initializes the oroute_table which handles the routing for outgoing packets and the iroute_table which handles the routing for incoming packets.

iroute_frag():  returns a pointer to the route table entry (iroute_t) for an incoming packet, given the ip channel and the destination of the packet.  First iroute_frag checks if the address was stored in iroute_hash_table, which stores recently used routes.  If the address was stored in iroute_hash_table it returns the corresponding route table entry; else iroute_frag checks the array iroute_table for a route which can be used for the destination ip address on the ip channel.  A route can be used only if the route is being used on the same ip channel as requested and the route belongs to the same subnet as the destination ip address.  This is tested by the call

if (((dest ^ iroute->irt_dest) & iroute->irt_subnetmask) != 0)

which returns 0 if it does belong on the same subnet.  Finally if a route is found, iroute_frag adds the route to the hash table iroute_hash_table and returns it.
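
The subnet test can be tried in isolation.  The sketch below wraps the same xor-and-mask expression in a helper; the helper name same_subnet, the ipaddr_t typedef, and the example addresses are assumptions for illustration, and the addresses are written as plain 32-bit numbers without regard to the byte order inet uses internally.

```c
#include <stdint.h>

typedef uint32_t ipaddr_t;  /* assumption: a 32-bit ip address type */

/* Returns 1 if dest lies in the subnet described by route_dest and
 * subnetmask, otherwise 0.  dest ^ route_dest has a 1 bit wherever
 * the two addresses differ; the mask keeps only the differences that
 * fall inside the network part. */
static int same_subnet(ipaddr_t dest, ipaddr_t route_dest,
	ipaddr_t subnetmask)
{
	return ((dest ^ route_dest) & subnetmask) == 0;
}
```

For example, with route_dest 0xC0A80100 (192.168.1.0) and subnetmask 0xFFFFFF00, the address 0xC0A80142 (192.168.1.66) matches, while 0xC0A80242 (192.168.2.66) does not.  A route with a subnet mask of 0 matches every destination.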

oroute_frag():  oroute_frag does the same thing as iroute_frag, but for outgoing packets.  It does this by calling oroute_find_ent.

ipr_add_iroute():  ipr_add_iroute adds a route to the iroute_table which handles incoming packets.  Static routes are always added as new routes.  Non-static routes modify old routes if an old route has the same ip channel, destination address, subnet mask, and gateway.

ipr_add_oroute():  ipr_add_oroute adds a route to the oroute_table which handles outgoing packets.  First it checks if the gateway is on the same subnet as the destination address.  Next it checks if a static route is being added.  Static routes are always added as new routes.  Non-static routes modify old routes only if an old route has the same ip channel, destination address, subnet mask, gateway, distance, and preference.

Routes for outgoing packets which are being used are put in linked lists.  ort_nextnw, ort_nextgw, and ort_nextdist are used to point to the next member of the linked lists.  oroute_head is the head of the main linked list.  Each member of this list has a distinct (ort_port, ort_dest, ort_subnetmask) triplet.  Each member of the list headed by oroute_head is itself the head of a linked list whose members all have the same ort_port, ort_dest, and ort_subnetmask.  Each member of that list is in turn the head of a linked list whose members all have the same ort_port, ort_dest, ort_subnetmask, and ort_gateway.

If no matching route is found in oroute_table, a new route is added.  The new route is added by placing it in the correct linked list.
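
The three levels of linked lists can be sketched as follows.  This is not the code from the inet source: the structure rt_t, its shortened field names, and the functions rt_insert and rt_count are hypothetical, and the sketch simply links a new entry behind the existing head of the list it belongs to, without the sorting that keeps the best route at the head.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ipaddr_t;  /* assumption: a 32-bit ip address type */

/* Hypothetical, reduced route entry: only the fields needed to show
 * the list structure. */
typedef struct rt
{
	int port;
	ipaddr_t dest, mask, gateway;
	struct rt *nextnw;    /* next (port, dest, mask) triplet */
	struct rt *nextgw;    /* same triplet, next gateway */
	struct rt *nextdist;  /* same triplet and gateway */
} rt_t;

/* Links n into the correct list and returns the (possibly new) head
 * of the main list. */
static rt_t *rt_insert(rt_t *head, rt_t *n)
{
	rt_t *nw, *gw;

	n->nextnw = n->nextgw = n->nextdist = NULL;

	/* Level 1: look for the same port, dest, and mask. */
	for (nw = head; nw != NULL; nw = nw->nextnw)
	{
		if (nw->port == n->port && nw->dest == n->dest &&
			nw->mask == n->mask)
			break;
	}
	if (nw == NULL)
	{
		n->nextnw = head;  /* new triplet: push on the main list */
		return n;
	}

	/* Level 2: look for the same gateway within that triplet. */
	for (gw = nw; gw != NULL; gw = gw->nextgw)
	{
		if (gw->gateway == n->gateway)
			break;
	}
	if (gw == NULL)
	{
		n->nextgw = nw->nextgw;  /* new gateway for this triplet */
		nw->nextgw = n;
		return head;
	}

	/* Level 3: same triplet and gateway. */
	n->nextdist = gw->nextdist;
	gw->nextdist = n;
	return head;
}

/* Counts every entry reachable from head through all three levels. */
static int rt_count(const rt_t *head)
{
	const rt_t *nw, *gw, *d;
	int count = 0;

	for (nw = head; nw != NULL; nw = nw->nextnw)
		for (gw = nw; gw != NULL; gw = gw->nextgw)
			for (d = gw; d != NULL; d = d->nextdist)
				count++;
	return count;
}
```

Walking the structure as rt_count does visits every route exactly once: the outer loop moves between networks, the middle loop between gateways of one network, and the inner loop between entries that share a gateway.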

ipr_gateway_down(): ipr_gateway_down is called to notify the routing table that a gateway is down.

ipr_destunrch(): ipr_destunrch is called to notify the routing table that a destination is unreachable.

ipr_ttl_exc():  ipr_ttl_exc is called to notify the routing table that a ttl exceeded error has occurred for a destination.

ipr_get_oroute():  ipr_get_oroute gets the route table entry from the oroute_table array.

oroute_find_ent():  returns a pointer to the route table entry (oroute_t) for an outgoing packet, given the ip channel and the destination of the packet.  First oroute_find_ent checks whether the address is stored in oroute_hash_table, which caches recently used routes.  If it is, oroute_find_ent returns the corresponding route table entry; otherwise it searches the array oroute_table for a route which can be used for the destination ip address on the ip channel.  A route can be used only if it is in use on the requested ip channel and the destination ip address belongs to the route's subnet.  This is tested by the condition

if (((dest ^ oroute->ort_dest) & oroute->ort_subnetmask) != 0)

The masked expression evaluates to 0 if the destination does belong to the route's subnet, so the condition is true only for routes that do not match.  Finally, if a route is found, oroute_find_ent adds the route to the hash table oroute_hash_table and returns it.
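
The "check the cache first, then scan the table" pattern can be sketched as below.  Everything in this sketch is an assumption made for illustration: the table sizes, the structures rt_t and cache_t, the hash function, and the counter table_scans are invented, and the sketch returns the first matching route rather than choosing the best one.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ipaddr_t;  /* assumption: a 32-bit ip address type */

#define ROUTE_NR 8   /* hypothetical table sizes */
#define HASH_NR  16

typedef struct { int in_use; int port; ipaddr_t dest, mask; } rt_t;
typedef struct { int valid; int port; ipaddr_t addr; rt_t *route; } cache_t;

static rt_t route_table[ROUTE_NR];
static cache_t route_cache[HASH_NR];
static int table_scans;  /* counts how often the slow path runs */

/* Hypothetical hash: any function of (port, addr) would do. */
static unsigned hash_addr(int port, ipaddr_t addr)
{
	return (unsigned)((addr ^ (addr >> 16) ^ (ipaddr_t)port) % HASH_NR);
}

static rt_t *find_route(int port, ipaddr_t dest)
{
	cache_t *c = &route_cache[hash_addr(port, dest)];
	int i;

	/* Fast path: this exact (port, dest) pair was resolved before. */
	if (c->valid && c->port == port && c->addr == dest)
		return c->route;

	/* Slow path: scan the table for a usable route. */
	table_scans++;
	for (i = 0; i < ROUTE_NR; i++)
	{
		rt_t *r = &route_table[i];

		if (!r->in_use || r->port != port)
			continue;
		if (((dest ^ r->dest) & r->mask) != 0)
			continue;  /* dest is not in this route's subnet */

		/* Remember the result for the next lookup. */
		c->valid = 1;
		c->port = port;
		c->addr = dest;
		c->route = r;
		return r;
	}
	return NULL;
}
```

After one lookup for a given port and destination, a second lookup for the same pair is answered from route_cache and does not increase table_scans.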

oroute_del():  oroute_del deletes a route for outgoing packets.  First it finds the route in the linked list of routes.  Next it removes it.  Finally it sorts the remaining routes in the linked list.

sort_dists():  sort_dists sorts a linked list on the ort_dist and ort_pref fields, where ort_nextdist points to the next member of the list.  The sort simply moves the best element to the head of the list and returns that element.

sort_gws():  sort_gws sorts a linked list on the ort_dist and ort_pref fields, where ort_nextgw points to the next member of the list.  The sort simply moves the best element to the head of the list and returns that element.
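
Both functions perform a "move the best element to the head" pass rather than a full sort.  The sketch below shows one way to write such a pass.  It assumes that a smaller distance is better and that a larger value of a second key, called pref here, breaks ties; the text does not state the ordering, so this rule, the structure rt_t, and the function best_to_head are all assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical, reduced route entry. */
typedef struct rt
{
	int dist;
	int pref;
	struct rt *next;
} rt_t;

/* Finds the best element, unlinks it, and places it in front of the
 * old head.  Returns the new head (NULL for an empty list). */
static rt_t *best_to_head(rt_t *head)
{
	rt_t *r, *prev, *best = NULL, *best_prev = NULL;

	for (prev = NULL, r = head; r != NULL; prev = r, r = r->next)
	{
		if (best != NULL)
		{
			if (r->dist > best->dist)
				continue;  /* farther away: worse */
			if (r->dist == best->dist && r->pref <= best->pref)
				continue;  /* same distance, not preferred */
		}
		best = r;
		best_prev = prev;
	}

	if (best == NULL)
		return NULL;
	if (best_prev != NULL)
	{
		/* best is not already the head: unlink and move it. */
		best_prev->next = best->next;
		best->next = head;
	}
	return best;
}
```

Only the head of the list is guaranteed to be in the right place afterwards; the remaining elements keep their relative order.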

oroute_uncache_nw():  oroute_uncache_nw zeros all of the entries in the oroute_hash_table for the subnet represented by the dest ip address and the netmsk subnet mask.  oroute_uncache_nw is called to remove a route from the cache, i.e. the hash table.

ipr_get_iroute():  ipr_get_iroute gets the route table entry from the iroute_table array.

ipr_del_iroute():  ipr_del_iroute deletes a route for incoming packets.  First it finds the route in the iroute_table.  Next it removes it.  Finally it uncaches it from the hash table.

iroute_uncache_nw():  iroute_uncache_nw zeros all of the entries in the iroute_hash_table for the subnet represented by the dest ip address and the netmsk subnet mask.  iroute_uncache_nw is called to remove a route from the cache, i.e. the hash table.
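
The uncache operation described for oroute_uncache_nw and iroute_uncache_nw can be sketched as a single pass over the cache.  The cache entry layout (cache_t), the function name uncache_nw, and the returned count are assumptions for illustration; only the idea of zeroing every cached entry whose address falls in the given subnet comes from the text.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ipaddr_t;  /* assumption: a 32-bit ip address type */

/* Hypothetical cache entry: a cached address and the route found for
 * it.  An entry with route == NULL is empty. */
typedef struct
{
	ipaddr_t addr;
	void *route;
} cache_t;

/* Zeros every non-empty entry whose address belongs to the subnet
 * (dest, netmask).  Returns the number of entries cleared. */
static int uncache_nw(cache_t *cache, size_t n, ipaddr_t dest,
	ipaddr_t netmask)
{
	size_t i;
	int cleared = 0;

	for (i = 0; i < n; i++)
	{
		if (cache[i].route == NULL)
			continue;  /* already empty */
		if (((cache[i].addr ^ dest) & netmask) != 0)
			continue;  /* cached address is in another subnet */
		cache[i].addr = 0;
		cache[i].route = NULL;
		cleared++;
	}
	return cleared;
}
```

Clearing by subnet rather than by single address means that every destination which might have been resolved through the removed route is looked up again in the route table on its next use.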