If the control and forwarding functions in a router are separated, the router can keep its data forwarding capability intact while its control software is restarted or reloaded. This capability is called “graceful restart” or “nonstop forwarding”. The router’s control software (the routing protocols and the signalling protocols) can stop and restart for many reasons: a software error that crashes the protocol task, a switchover to the redundant control card, or a planned shutdown as part of operational maintenance. The idea behind graceful restart is to continue forwarding packets based on the snapshot of the FIB taken just before the router restarted. The restarting router is assumed to be capable of preserving the FIB and some control information (such as the cryptographic sequence number in the case of OSPF).
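The preserved-FIB idea is often realized as a mark-and-sweep pass over the forwarding table. The following is a minimal sketch of that pattern, not any vendor's implementation: the `Fib` class, its method names, and the exact-match lookup (a real FIB does longest-prefix matching) are all assumptions made for illustration.

```python
class Fib:
    """Toy forwarding table keyed by prefix string (hypothetical, exact-match only)."""

    def __init__(self):
        self._routes = {}  # prefix -> {"next_hop": str, "stale": bool}

    def install(self, prefix, next_hop):
        # Installing (or re-installing) a route clears its stale mark.
        self._routes[prefix] = {"next_hop": next_hop, "stale": False}

    def lookup(self, prefix):
        # Stale routes are still used for forwarding during the restart.
        entry = self._routes.get(prefix)
        return entry["next_hop"] if entry else None

    def mark_all_stale(self):
        # Done once, when the graceful restart begins.
        for entry in self._routes.values():
            entry["stale"] = True

    def sweep_stale(self):
        # Done once, after routes have been recalculated and re-installed.
        removed = sorted(p for p, e in self._routes.items() if e["stale"])
        for prefix in removed:
            del self._routes[prefix]
        return removed


fib = Fib()
fib.install("10.0.0.0/8", "192.0.2.1")
fib.install("172.16.0.0/12", "192.0.2.2")

fib.mark_all_stale()                          # control plane goes down
during_restart = fib.lookup("172.16.0.0/12")  # forwarding continues

fib.install("10.0.0.0/8", "192.0.2.9")        # recalculated after restart
swept = fib.sweep_stale()                     # routes not refreshed are removed
```

The point of the two-phase design is that forwarding never sees an empty table: old entries stay usable until the recalculated ones have replaced them.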
o The restarting OSPF router originates a Grace LSA (a link-local Opaque LSA) specifying the ‘grace period’, the time in seconds for which its neighbors (the helpers) should continue to consider this router fully adjacent. The helping neighbors enter a state known as helper mode during this period. The onus is on the helpers to detect a topological change during the grace period and to act accordingly.
o In case of a planned restart, the router sends a Grace LSA to its neighbors on each interface and sets the Graceful Restart Reason TLV to 1 (software restart). In case of an unplanned outage, the router issues a Grace LSA once it comes back up, before sending out any Hellos. Most implementations transmit the Grace LSAs multiple times, until the neighboring routers acknowledge them.
o The helping routers continue advertising the restarting router in their LSAs, so the other routers in the network never learn of the restart.
o Using standard OSPF procedures, the helping routers establish adjacencies with the restarting router and synchronize their LSDBs. During the grace period, the restarting router receives its own self-originated pre-restart LSAs. It accepts them as valid, and does not originate type 1 through 5 or type 7 LSAs, even after it transitions to the FULL state. The restarting router can run SPF, but it is not yet allowed to update the FIB.
o Once the restarting router and its helpers have synchronized their databases within the grace period, the restarting router flushes its Grace LSAs to signal successful completion of the graceful restart procedure. It then reoriginates its router LSAs in all attached areas and the network LSAs on the segments where it is the DR. It then schedules a full SPF, calculates the routes, and updates the FIB.
o The restarting router had marked all the routes in the FIB as stale before sending out the Grace LSAs. Once graceful restart is over and it has recalculated the routes, it deletes all the routes still marked as stale in the FIB. It can now reoriginate summary LSAs, type 7 LSAs and AS External LSAs as appropriate.
o When the helpers receive the flushed Grace LSAs, they exit helper mode and revert to normal OSPF procedures.
o OSPF automatically falls back from graceful restart to a standard OSPF restart if topological changes are detected or if one or more of the restarting router’s neighbors do not support graceful restart.
o More details here.
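To make the Grace LSA contents concrete, here is a minimal sketch of how its body could be encoded. It assumes the TLV layout from RFC 3623 as I understand it (2-byte type, 2-byte length, value padded to a 4-byte boundary; Grace Period is type 1, Graceful Restart Reason is type 2, IP Interface Address is type 3). The function and constant names are hypothetical, and the LSA header itself is not built here.

```python
import ipaddress
import struct

# TLV types carried in the Grace LSA body (per RFC 3623, as assumed above).
GRACE_PERIOD_TLV = 1
RESTART_REASON_TLV = 2
IP_INTERFACE_ADDRESS_TLV = 3

# Graceful Restart Reason values.
REASON_UNKNOWN = 0
REASON_SOFTWARE_RESTART = 1
REASON_SOFTWARE_RELOAD_OR_UPGRADE = 2
REASON_SWITCH_TO_REDUNDANT_PROCESSOR = 3


def encode_tlv(tlv_type, value):
    """Encode one TLV; the length field excludes the zero padding."""
    padding = (-len(value)) % 4
    return struct.pack("!HH", tlv_type, len(value)) + value + b"\x00" * padding


def build_grace_lsa_body(grace_period_s, reason, interface_ip=None):
    """Return the TLV portion of a Grace LSA (hypothetical helper)."""
    body = encode_tlv(GRACE_PERIOD_TLV, struct.pack("!I", grace_period_s))
    body += encode_tlv(RESTART_REASON_TLV, struct.pack("!B", reason))
    if interface_ip is not None:
        # Included on broadcast, NBMA and point-to-multipoint segments.
        addr = ipaddress.IPv4Address(interface_ip).packed
        body += encode_tlv(IP_INTERFACE_ADDRESS_TLV, addr)
    return body


body = build_grace_lsa_body(120, REASON_SOFTWARE_RESTART, "192.0.2.1")
```

With a 120-second grace period, the reason set to software restart, and an interface address, the body is three 8-byte TLVs; the one-byte reason value is followed by three padding bytes.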
o The restarting router does not recompute its own routes until it has achieved database synchronization with its neighbors.
o The restarting router includes the restart TLV (type 211) in its IIHs to request graceful restart. The grace period is the minimum of the Remaining Time values in the received IIHs that contain a restart TLV with the RA bit set. On receiving the restarting router’s IIH with the restart request (RR) bit set, the helping routers send a complete set of CSNPs onto the link and set the SRM flag on all their LSPs towards the restarting router.
o The restarting router can now clear its restart timer, since it has some helping routers that can provide a full set of link state information through the normal transmission and retransmission process.
o During the grace period, the restarting router does not transmit self-originated LSPs, and its own LSPs are not purged or modified. These restrictions are necessary to prevent premature removal of one of its own LSPs, and hence churn in other routers.
o The restart mechanism in IS-IS allows an adjacency to be re-established without cycling through the normal operation of the adjacency state machine.
o If a timer on the restarting router expires before it receives a full set of CSNPs from its helpers, the adjacency is reset, and a normal neighbor restart is attempted.
o If the neighbors of the restarting router do not support the restart TLV, the usual database synchronization takes place.
o More details here.
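The grace period rule for IS-IS above can be sketched as follows. This is an illustration only: the restart TLV value is assumed to be one flags octet followed by an optional 2-octet Remaining Time, with RR as the least significant flag bit and RA the next one (my reading of RFC 5306), and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

RESTART_TLV_TYPE = 211
FLAG_RR = 0x01  # Restart Request (assumed bit position)
FLAG_RA = 0x02  # Restart Acknowledgement (assumed bit position)


@dataclass
class RestartTlv:
    flags: int
    remaining_time: Optional[int]  # seconds; None if the field is absent


def parse_restart_tlv(value: bytes) -> RestartTlv:
    """Parse the value part of a restart TLV taken from a received IIH."""
    if len(value) < 1:
        raise ValueError("restart TLV needs at least the flags octet")
    remaining = int.from_bytes(value[1:3], "big") if len(value) >= 3 else None
    return RestartTlv(flags=value[0], remaining_time=remaining)


def grace_period(tlvs: Iterable[RestartTlv]) -> Optional[int]:
    """Minimum Remaining Time over TLVs that have the RA bit set."""
    times = [
        t.remaining_time
        for t in tlvs
        if t.flags & FLAG_RA and t.remaining_time is not None
    ]
    return min(times) if times else None


received = [
    parse_restart_tlv(bytes([FLAG_RA, 0x00, 0x1E])),  # helper A: 30 s left
    parse_restart_tlv(bytes([FLAG_RA, 0x00, 0x5A])),  # helper B: 90 s left
    parse_restart_tlv(bytes([FLAG_RR, 0x00, 0x05])),  # no RA bit: ignored
]
period = grace_period(received)
```

Taking the minimum means the restarting router finishes (or gives up) before the first helper would time out the adjacency. When no acknowledging helper has been heard, `grace_period` returns `None`, and the caller would then keep whatever default it uses.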