4        A comparative analysis of OAM mechanisms

 

4.1    Introduction

To achieve a greater understanding of the different OAM mechanisms on MPLS, it is necessary to compare the MPLS OAM mechanisms to the existing mechanisms at the IP layer and discuss their properties. This chapter has been divided into subchapters to emphasize the various advantages and disadvantages of the OAM mechanisms.

 

While MPLS might be a supplement to the IP layer, one must not forget that the Internet Protocol was not designed for predefined paths or the like. It was developed for the most extreme environments, such as those one might meet in a war situation. It could be hard to set up an MPLS network under such extreme conditions, whereas setting up IP would be much easier. This may explain why the OAM mechanisms on IP are relatively limited compared to the MPLS OAM mechanisms.

4.2    Failure detection

When a router detects a link failure or a node failure, it has to report the failure to other affected routers or hosts. This is done in both IP and MPLS, but there are significant distinctions between failure handling in MPLS and IP:

 

IP by itself does not have any failure message distribution; it uses the ICMP protocol to tell the source host about failures. In this way, the source host receives information about both link layer and node failures. These failures concern properties that are important for how IP packets are forwarded towards their destination. When the host gets the failure message, it can do nothing other than register the failure and inform the layers above, unless it is connected to two different routers. The messages that routers send to each other when failures occur, and how they alter their routing tables to avoid failed areas in their network, are outside the scope of this thesis.

 

In IP, the source hosts are the ones that are informed about failures. The reason for this is that the node responsible for the traffic needs to know that its packets are lost somewhere on the way to their destination. In MPLS, it is also necessary to inform nodes about failures, but this information does not reach the sending source host. This is because MPLS is, for the time being, mainly used in large backbone networks, which have no hosts connected directly to them. Therefore, it is only necessary to inform the sink points of the failed LSP in the MPLS network. These sink points will then perform the necessary tasks, such as rerouting around the affected area.

 

MPLS also has failure detection mechanisms and different messages to send to routers that need, and are able to handle, LSP failure information. Several solutions have been proposed, but none of them were fully specified and approved at the time this thesis was written. Some of the solutions are not fully incorporated into MPLS, but most of them are. Hence, MPLS is not dependent on a protocol outside itself, as IP is. The proposal from ITU-T (Y.1711) detects both MPLS layer and node failures, and so far contains six different message types and six different types of failure detection. It is up to the operator to decide which of the message types to use. MPLS Ping also has mechanisms to detect link and node failures, and will be discussed in chapter 4.6. When the RSVP Hello message extension is used, only node failures are detected. Failures detected by layer 2 are outside the scope of this thesis.

 

When a failure is detected by the mechanisms proposed by ITU-T, both the upstream and downstream sink points can be informed about the failure. The transit LSRs will not be informed. Both BDI and FDI messages are sent upwards in the label hierarchy to suppress unnecessary failure messages on client layers. This is a sensible design, especially considering that the client layer LSP’s sink point could be in another management domain.

 

In addition to BDI and FDI, which are the two types of messages sent when failures occur in an LSP, a method for determining packet and octet loss on an LSP has also been proposed. This mechanism sends Performance packets in order to aid troubleshooting, among other things. This mechanism is for further study. As we can see, the intention of this method is to give the MPLS network a mechanism similar to the ad hoc mechanisms Ping and Traceroute on IP. The way this is done on IP is explained in chapter 4.6.

 

The failures that can be detected are both MPLS dependent, meaning failures with respect to LSPs, and failures caused by the underlying layer. It seems possible to detect all failures that occur in connection with LSPs and packet forwarding, or at least the ones of greatest importance. We cannot see any other failures that would be necessary to detect.

Loss of LSP connectivity, wrong merging of LSPs, and swapped LSPs are all detected. An aspect that is not covered by ITU-T’s proposal is how the transit LSRs get to know about failures on LSPs. This is probably not of interest for MPLS in itself. Transit LSRs must at least be aware of link and node failures from lower layers or from routing mechanisms in MPLS, such as the RSVP Hello Request message extension.

 

To perform OAM operations on a network, it is necessary to send test packets, OAM packets, on it. This will necessarily take some of the bandwidth from the work traffic. Therefore it is important to find a suitable balance between the demand for failure detection and bandwidth usage. We assume that the ITU-T proposal of sending one CV packet per second gives a tolerable load on the network, but there are probably many opinions about this. In some circumstances, such as for protection switching, this frequency gives too slow failure detection. A suggestion for solving this problem is explained in chapter 5.3.
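
The trade-off between CV frequency and bandwidth usage can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the ITU-T proposal: the function names, the 64-byte packet size, the number of LSPs and the link capacity are all assumptions chosen only for illustration.

```python
def cv_load_bps(num_lsps: int, cv_packet_bytes: int, cv_per_second: float = 1.0) -> float:
    """Bandwidth in bit/s consumed by CV packets for num_lsps LSPs on one link."""
    return num_lsps * cv_packet_bytes * 8 * cv_per_second


def cv_load_fraction(num_lsps: int, cv_packet_bytes: int, link_bps: float,
                     cv_per_second: float = 1.0) -> float:
    """Share of the link capacity (between 0 and 1) spent on CV packets."""
    return cv_load_bps(num_lsps, cv_packet_bytes, cv_per_second) / link_bps


# Hypothetical figures: 1000 LSPs, 64-byte CV packets (assumed size), 1 Gbit/s link.
print(cv_load_bps(1000, 64))                      # one CV packet per second per LSP
print(cv_load_bps(1000, 64, cv_per_second=10.0))  # ten CV packets per second: ten times the load
print(cv_load_fraction(1000, 64, 1_000_000_000))  # share of the link used by CV packets
```

The load grows linearly with both the number of monitored LSPs and the CV frequency, which is why a higher frequency for faster detection has to be weighed against the bandwidth taken from the work traffic.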

 

An aspect to take into account for the transport of BDI packets upstream is that they might have to be transported as ordinary IP datagrams, outside the MPLS layer. This may cause security problems: malicious users could send BDI packets when no failure actually exists and cause, for example, denial of service.

 

MPLS OAM functionality, as well as common MPLS forwarding, is very dependent on the ingress and egress LSRs. If the ingress LSR fails, the work traffic will not be sent; but even worse, if the egress LSR fails, it will be able neither to forward packets nor to inform the LSP’s source point about the failure, and packets will still be sent.

4.3    Reachability features

The reachability features are the properties of a technology that ensure that sent packets reach their destination. Neither the IP nor the MPLS standards provide any retransmission of lost packets; they just make their best effort in transmitting the packets. Functionality that improves transmission quality, like retransmission, must be performed by protocols at higher layers. But it is not always sufficient that packets reach their destination. Some applications require the packets to reach the destination within a certain time and in the correct order, or else the packets are useless and will be discarded. In such situations it is not enough to depend on higher layer protocols; one depends on reducing the packet loss or packet delay in the layer three protocols. In this area there are some distinctions between the two technologies.

 

Firstly, the more hops there are between a source and its packets’ destination, the greater the chance that packets will not reach the destination. IP networks forward packets towards their destination on a best effort basis. Each router decides the next hop of a packet individually by a lookup in its forwarding table. This method gives quite good reliability with respect to delivering the packet to its destination, but it does not guarantee delivery within the desired time. This is consistent with the nature of the Internet, where packets are sent at the network’s best effort, making the network suitable for its purpose: to send data at any cost, hoping that they reach their destination.

 

Secondly, MPLS works very differently. In MPLS the routers that the packets traverse are predefined along the entire path, which gives more reliable packet delivery. With help from LSP failure detection and rerouting mechanisms, packet loss will be reduced compared to IP.

4.4    Avoidance of congested routers

MPLS makes it possible to route around congested routers by letting an ingress LSR create a new LSP tunnel and make the network send traffic over this new LSP instead of the congested path. This is not possible using IP by itself; it depends on other services discovering the congestion and finding an alternative route around the bottleneck. ICMP does send messages about congested routers back to the host, but the host can do nothing to route around the congestion unless the host itself has an alternative next hop.

 

MPLS handles these problems differently. The failure code dServer is used when failures occur on layers below MPLS. There is no specification of which failures this code represents, but we will assume that it can be used for signaling a congested router. The egress LSR will receive an FDI packet containing this failure code from, let's say, a transit LSR, and inform its ingress LSR by sending a BDI. The egress router would simultaneously have detected the loss of Connectivity Verification (CV) packets due to this congestion.

 

When the ingress LSR receives a BDI with the dServer codepoint, it will try to set up an LSP that avoids the congested route. How this is done is not covered by this thesis. But as we have learned from ongoing discussions on mailing lists, LSRs will use routing protocols like OSPF when they create LSPs. Routing around congested areas is then alike in MPLS and IP, except for the instantiation of the forwarding tables. In MPLS this instantiation is performed by the technology itself, while in IP it is not.

4.5    SNMP features

Simple Network Management Protocol (SNMP) features exist for both MPLS and IP. The number of MPLS-specific MIBs is not yet as high as for IP, but more MIBs supporting MPLS-specific features will probably appear in the future. To access the MIBs on the routers, we have to use SNMP for both IP and MPLS. The MIBs for MPLS and IP do not compete with each other; they rather supplement each other.

 

The MPLS MIBs proposed by IETF are bound to MPLS functionality. They give us the possibility to configure and monitor different parameters concerning MPLS LSRs and MPLS LSPs. For monitoring features that are not MPLS specific, like the uptime of an LSR, there is no need to create MPLS-specific MIBs. The existing SNMP MIBs perform this job well, and would therefore be preferred for such features. In other words, the existing SNMP over IP provides a good solution when it comes to getting information from routers.

 

More advanced MIBs that can read or modify the configuration of LSRs might show up in the future. These would need to be MPLS-specific, unless they are designed generically for all kinds of routers or are vendor specific.

 

It is perhaps not surprising that the existing SNMP solutions provide a good OAM function for a network. This is both because the existing solution has had a long time to evolve and because many features wanted on routers do not need MPLS for transporting information. As an example, when one needs to monitor the network, it might not be suitable to send information using MPLS, as the LSPs might have errors, or the packets might not reach their destinations because of other MPLS-specific errors.

4.6    Ping and traceroute

On both MPLS and IP it is possible to use the functions called Ping and Traceroute, but MPLS Ping and MPLS Traceroute are currently at the draft stage. These mechanisms are helpful for verifying whether a node is functioning, whether it is possible to reach it, and where a possible failure has occurred.

 

MPLS Ping is closely tied to the MPLS architecture. This is not the case for IP Ping. While IP Ping uses ICMP, MPLS Ping packets are constrained to follow LSPs. Unlike IP Ping, which sends both request and response packets using ICMP, MPLS Ping uses different transport mechanisms in the two directions. While the request messages follow the LSP, the response messages must be sent by other means on their way back to the requesting host. These packets must be sent either by IP or by the control plane, since it would not be convenient to create a special LSP just for sending responses.

 

MPLS Ping gives the requester the one-way delay, in contrast to IP Ping, which gives a two-way delay. The one-way delay has a limitation when it comes to obtaining useful latency results: the LSRs in an MPLS network have to be synchronized in time, and this is difficult when the Ping messages are sent between routers in different management domains.
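
Why the one-way delay needs synchronized clocks, while the two-way delay does not, can be shown with a small worked example. This is an illustrative sketch only: the function names, timestamps and the clock offset are hypothetical, and no actual MPLS Ping message format is modeled.

```python
def one_way_delay(sent_sender_clock_ms: int, received_receiver_clock_ms: int) -> int:
    """Delay computed from timestamps taken on two different clocks."""
    return received_receiver_clock_ms - sent_sender_clock_ms


def two_way_delay(sent_sender_clock_ms: int, reply_received_sender_clock_ms: int) -> int:
    """Delay computed from two timestamps taken on the sender's own clock."""
    return reply_received_sender_clock_ms - sent_sender_clock_ms


# Hypothetical scenario (all values in milliseconds).
TRUE_FORWARD_MS = 5
TRUE_RETURN_MS = 7
RECEIVER_CLOCK_OFFSET_MS = 3  # the receiver's clock runs 3 ms ahead of the sender's

sent = 100                                                     # sender clock
received = sent + TRUE_FORWARD_MS + RECEIVER_CLOCK_OFFSET_MS   # receiver clock
reply_received = sent + TRUE_FORWARD_MS + TRUE_RETURN_MS       # sender clock again

print(one_way_delay(sent, received))        # 8: true delay 5 plus clock offset 3
print(two_way_delay(sent, reply_received))  # 12: the clock offset cancels out
```

The one-way result is off by exactly the clock offset between the two routers, whereas the two-way result uses a single clock and is unaffected by it.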

 

For both Ping mechanisms it is impossible to differentiate between failures in the forward direction and the return direction. Therefore these mechanisms depend on reliable IP forwarding in the return direction. For MPLS Ping it is also possible to send the response through the MPLS control plane, in which case the MPLS control plane must be reliable throughout the network.

 

There is also another proposal with the same intention as MPLS Ping. This proposal has two message types, Loopback Request and Loopback Response, and is an ad hoc mechanism for LSP endpoint verification and delay measurement. For the time being, these mechanisms are for further study and hence not yet specified.

 

For both technologies, there is also a mechanism for tracing routes. As for Ping, the difference is that MPLS Traceroute is constrained to follow the LSP downstream, but has to use another way back. This difference makes MPLS Traceroute much more helpful for finding the location where a failure has occurred. This is because IP Traceroute could take a different path than the one with the failure, which would make it harder to find the location of the failure. To avoid this problem in IP, one has to save a traceroute in advance, and use IP Ping on each hop in the previously saved traceroute output to locate the unreachable router.
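
The IP workaround described above, pinging each hop of a previously saved traceroute, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the addresses are documentation addresses, and the reachability check is passed in as a callable (here a stand-in) rather than sending real ICMP echo requests.

```python
from typing import Callable, Optional, Sequence


def first_unreachable_hop(saved_path: Sequence[str],
                          is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Walk a previously saved traceroute path from the source outwards and
    return the first hop that does not answer, or None if all hops answer."""
    for hop in saved_path:
        if not is_reachable(hop):
            return hop
    return None


# Hypothetical saved traceroute output (documentation addresses).
saved_path = ["192.0.2.1", "192.0.2.17", "192.0.2.33", "192.0.2.49"]

# Stand-in for an IP Ping: in this example the third router is down,
# so it and everything behind it is unreachable.
answering = {"192.0.2.1", "192.0.2.17"}

print(first_unreachable_hop(saved_path, lambda hop: hop in answering))  # 192.0.2.33
```

The first hop that fails to answer marks the likely location of the failure, provided the saved path still reflects the route the traffic used before the failure.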

4.7    Fast rerouting and Protection switching

Several different types of mechanisms have been proposed for MPLS to enhance the availability and reliability of the MPLS network. Both ITU-T and IETF have developed mechanisms for this. There also exists a proposal for a mechanism from Lucent Technologies. ITU-T calls its mechanism protection switching, while IETF calls its solution fast rerouting. The independent mechanism can be called packet protection. On the whole these mechanisms are quite different, but there are some similarities, such as the calculation and allocation of backup entities before a failure. Such a protection mechanism does not exist in IP itself.

 

Both protection switching and fast rerouting are limited to point-to-point LSP tunnels for the time being. The protection LSP and backup segments, and the resources for them, are computed before a failure has occurred. These backup entities take over all working traffic when a failure at the node or link layer is detected on the original tunnels. With protection switching, MPLS layer failures may also be detected, and the packet protection mechanism goes further: with it, MPLS layer failures will always be detected.

 

The mechanisms can be divided into two main techniques, which we can call LSP protection switching and link protection switching. LSP protection switching consists of the mechanism proposed by ITU-T and the packet protection mechanism proposed by Lucent Technologies and drafted at IETF. Both mechanisms switch between the working LSP and the protection LSP. The distinction between link protection switching and LSP protection switching is that the former describes switching from a failed link to a backup segment, whilst the latter describes switching working traffic to a backup LSP. To simplify the discussion later on, the protection LSP and the backup segment will both be called backup entities.

 

Link protection switching means switching working traffic onto a backup segment when a failure occurs on a link or node somewhere between the ingress and egress LSR. The backup segment is near the failed link or node and is merged onto the protected LSP downstream of the failure. The ingress LSR will simultaneously learn about the failure and construct a backup LSP onto which the work traffic will be switched. This is in contrast to ITU-T’s protection switching, where the switching occurs at the ingress or egress and the whole backup LSP has been constructed in advance of the failure.

 

The different types of rerouting have different properties. When an LSP has a protection LSP or backup segments, there will be duplicated paths and segments in the network, resulting in increased redundancy. In most fast rerouting mechanisms these backup entities stay unused, transporting no extra traffic as long as the original LSP is functioning well. But one of the mechanisms is different from the others: ITU-T’s 1:1 architecture type makes it possible to forward extra traffic on the backup entity when it is not used for working traffic. There are also other rerouting mechanisms that lie somewhere between full LSP duplication and ITU-T’s 1:1 architecture, namely facility backup and shared bandwidth protection from IETF. These are more redundant than the 1:1 architecture type, but less redundant than the solution with one backup entity for each protected entity. Here, we at least have the possibility of using the same backup entity for several working entities.

 

Fast rerouting should occur almost immediately when a link or node goes down. Link failure detection depends on layer 2 mechanisms, and to detect node failures one can use IGP loss of adjacency or the RSVP Hello message extension. Because these detection mechanisms operate directly between nodes, switching to a backup segment takes very little time, often only a few milliseconds.

 

For protection switching it is different. It can use the same failure detection as fast rerouting, but the failure alert must be sent to the ingress or the egress, depending on whether the 1:1 or the 1+1 mechanism is used, respectively. Protection switching may therefore take longer than fast rerouting, which delays the switch to the backup entity. Another aspect is that failures on the MPLS layer itself will not be detected by the detection mechanisms mentioned. To detect these, it is necessary to use LSP connectivity verification (CV) packets. The LSP CV mechanism takes three seconds to detect an LSP failure, and in addition comes the time to alert the ingress and egress LSRs about the failure. This period may be too long for protection switching with respect to its intention. A possible solution for this will be presented in 5.3.
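
The delay budget described above can be written out as a simple sum. In this sketch the three-second detection time is modeled as three consecutive missing CV packets at one packet per second; that decomposition, the function names and the 50 ms notification time are our assumptions for illustration, not figures taken from the proposal.

```python
def cv_detection_time_s(cv_interval_s: float, missed_cvs: int) -> float:
    """Time until loss of connectivity is declared, assuming a defect is
    raised after missed_cvs consecutive CV packets fail to arrive."""
    return cv_interval_s * missed_cvs


def switchover_delay_s(cv_interval_s: float, missed_cvs: int,
                       notification_s: float) -> float:
    """Detection time plus the time needed to alert the switching LSR."""
    return cv_detection_time_s(cv_interval_s, missed_cvs) + notification_s


# One CV packet per second, three missed packets (assumed), and a
# hypothetical 50 ms to carry the alert to the ingress or egress LSR.
print(cv_detection_time_s(1.0, 3))        # 3.0
print(switchover_delay_s(1.0, 3, 0.05))   # detection plus notification
```

Under these assumptions the detection term dominates the total, so shortening the CV interval is the obvious lever, at the cost of more OAM traffic on the network.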

 

Most of the fast rerouting mechanisms mentioned so far depend on receiving a failure message before they can switch the traffic over to, or retrieve traffic from, the backup entity. This gives these mechanisms a delay problem, since it takes some time to detect the failure and inform the LSRs. The delay problem is avoided when using packet level 1+1 path protection. This mechanism uses two disjoint LSPs and transports the work traffic simultaneously over both. In this way there is no need for failure detection to switch between the protected LSP and the backup LSP. The egress LSR chooses the traffic from one of the two disjoint LSPs. A packet may take a different amount of time to reach the egress LSR on each of the two disjoint LSPs. This could cause problems when the work traffic is taken from the LSP with the shortest transit time, because the two packet streams will be unsynchronized at arrival. To solve this, a sequence number is attached to each packet to avoid forwarding previously received packets. The drawback of this packet protection mechanism is the duplication of LSPs.
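
The role of the sequence number at the egress LSR can be illustrated with a very small selector. This is a simplified sketch and not the mechanism as specified in the draft: the class and function names are hypothetical, the rule "forward only sequence numbers higher than the last one forwarded" is one possible policy, and sequence number wraparound is ignored.

```python
from typing import Iterable, List, Tuple


class SequenceSelector:
    """Egress-side filter: forward a packet only if its sequence number is
    higher than every sequence number forwarded so far."""

    def __init__(self) -> None:
        self.highest_forwarded = -1

    def accept(self, seq: int) -> bool:
        if seq > self.highest_forwarded:
            self.highest_forwarded = seq
            return True
        return False  # duplicate (or late) copy from the other LSP


def select(arrivals: Iterable[Tuple[str, int]]) -> List[int]:
    """Return the sequence numbers that the egress LSR forwards."""
    selector = SequenceSelector()
    return [seq for _lsp, seq in arrivals if selector.accept(seq)]


# Hypothetical arrival order at the egress: LSP "A" is faster but loses
# packet 4; LSP "B" is slower but delivers every packet.
arrivals = [("A", 1), ("B", 1), ("A", 2), ("A", 3), ("B", 2),
            ("B", 3), ("B", 4), ("A", 5), ("B", 5)]

print(select(arrivals))  # [1, 2, 3, 4, 5]
```

Each packet is forwarded once, and the packet lost on one LSP is taken from the other without any failure notification. A limitation of this simple policy is that a copy arriving after a higher sequence number has already been forwarded is discarded, so a real selector would need some form of window to tolerate larger delay differences.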

 

ITU-T has mentioned [40] that they may study protection switching in IP networks. Still, this area has not been looked into, or ITU-T has not released any documents describing in detail how protection switching should be done. MPLS fast rerouting for point-to-multipoint is for further study.

4.8    Traffic engineering

ISPs understand that traffic engineering can be leveraged to significantly enhance the operation and performance of their networks. MPLS provides many new possibilities for operators to control their network traffic. Most of these are difficult or impossible to achieve in IP networks that do not use external policy-based solutions.

 

MPLS gives new functionality for the domain administrator. The capabilities are for instance:

 

To achieve these TE functions, LSPs are used in MPLS networks. An aspect of traffic engineering is that the domain operator is able to control the working traffic in the network in a better way. But it is important to limit the number of LSPs, both to utilize the network bandwidth better and because fewer LSPs need less monitoring. Fewer LSPs also reduce the configuration effort for the operator.

 

While IP forwarding is independent of any particular router, MPLS opens the possibility of taking control of the traffic from edge to edge of the backbone network. This may break with the current nature of the Internet, where every node on the network may be independent of every other node. Still, backbone technologies often need the possibility to control traffic because of the many concurrent users affected by the backbone.


MECHANISMS FOR OAM ON MPLS IN LARGE IP BACKBONE NETWORKS (c) 2002 Hallstein Lohne, Johannes Vea, a graduate thesis written for AUC/ERICSSON