Marc Goodman explains why multi-homing, or WAN/ISP link load balancing, is an effective approach to dealing with WAN reliability issues.
Once a disaster occurs, it is too late to implement a business continuity plan. Rather, business continuity represents the activities performed daily to maintain service, consistency, and recoverability. Organizations must ensure that critical business functions will be available to customers, suppliers and regulators as needed. Disasters may range from local events such as building fires and regional events such as earthquakes, hurricanes, storms and floods to national events. Business continuity may also be affected by network connectivity failure, or a WAN link being congested or bottlenecked, which may limit business functions.
The foundation of business continuity is the policies, guidelines, standards, and procedures implemented by an organization. All system design, implementation, support, and maintenance must be based on this foundation in order to have any hope of achieving business continuity and disaster recovery.
IP-based wide area networks (WANs) that utilize Internet connectivity have become the default communication method for organizations in all vertical markets to conduct business through transaction-based applications and for communicating with customers, vendors, partners and remote employees. WAN infrastructure presents many challenges associated with ensuring business continuity and, although Internet Service Providers (ISPs) continue to improve upon their ability to deliver consistent service, as long as natural disasters, equipment failures, human error and security threats remain, ISP outages will continue to be an issue. Organizations that rely upon ISP connectivity need to take proactive measures to ensure the resiliency of their business, including their WAN infrastructure. Addressing secure and reliable WAN (Internet) access, both going out from the LAN and coming into the LAN, is critical for today’s business continuity and disaster recovery planning.
One of the easiest and most cost-effective ways of dealing with WAN reliability issues is multi-homing, using a specialized WAN Optimization Controller (WOC) capable of WAN link load balancing and failover. An organization will use a multi-homed network to bundle two or more WAN links and/or service providers to connect their LAN to the Internet. If they have multiple sites they can also use this technique to interconnect between sites to ensure reliability and optimum performance for critical applications.
This paper describes how small-to-medium sized enterprises can use advanced, yet affordable WAN link optimization solutions to leverage the benefits of multiple sites, while maintaining high performance and reliability for applications delivered across all sites. It will also discuss how to use WAN link controllers to manage, load balance, and deliver failover of multiple WAN and ISP links to ensure site-to-site network connectivity and application reliability. Creating a network that supports disaster recovery plans to keep users connected to critical applications is the key to leveraging the full power of the Internet for business continuity and transaction completion in the face of a disaster.
WAN infrastructure is vital for ensuring business continuity and disaster recovery
A major emphasis for business continuity is the protection of IT systems that enable and support critical business processes, applications and data. For any organization that uses the network to conduct business and communicate with customers, partners and external employees beyond its LAN, the WAN has assumed an ever-increasing role in supporting the automation of business applications such as order fulfillment and communications using email and VoIP.
Many organizations with appropriate budgets are adding fault-tolerant sites (also called disaster recovery sites), and assigning the back-up hosting of their business applications to these sites with greater protection against disasters. Disaster recovery sites deploy replication technologies such as clustering to provide continuous service delivery to users, despite a major incident forcing downtime at the main facility.
Organizations that have increasing dependence upon WAN networks understand the importance of addressing business continuity and disaster recovery planning for IT infrastructure. Much of an organization’s IT infrastructure may be internally owned and operated, giving the organization control over its business continuity; the external WAN, however, may be outsourced to an ISP or telco. Utilizing an ISP for WAN connectivity places business continuity in the hands of a third party, leaving the organization vulnerable, with much less control over business continuity issues.
It is very common for companies to use back-up applications to ensure that business transactions are completed smoothly. For example, when an order-placement system goes down, email can be used as a back-up to ensure that the order is fulfilled. When email goes down, the telephone can be used to place the orders. When a WAN link or ISP link has an outage, however, there is nothing inherent in the WAN to back up the link. Consequently, when the WAN link goes down, all of the applications and network services may become unavailable, which can result in a major disruption to business and therefore loss of revenue.
Below are examples of how businesses rely on critical applications being delivered over the WAN (Internet) for their ongoing operations:
- Employee productivity - Remote employees using web browsers to access corporate applications over a VPN, or using handheld devices such as BlackBerrys on a wireless network to access email
- Product sales - Resellers and agents purchasing, quoting pricing and placing orders
- Order tracking - Shipping fulfillment and tracking through online services from FedEx, UPS, and others.
Removing risk, while reducing the cost of WAN failures
Before making a decision on which solutions are appropriate, and how much budget should be allocated to mitigate the risk of an ISP or WAN link outage, it is important to understand what the likelihood of an ISP or WAN link outage would be and the damaging impact such an outage would have upon a business.
ISP outages are very common, and will continue to occur as long as natural disasters, system failures, human error, security threats and service provider disputes continue. On average, organizations incur 19 hours of ISP outages, and another 9 hours of WAN service degradation. Other events illustrate this point, and demonstrate the wide variety of causes for WAN service disruptions:
- In December of 2004, Global Access Points, an ISP for Notre Dame University, accidentally disconnected cables in Chicago, which left the university without service for four hours
- In September of 2005, during the Katrina hurricane disaster, most of the telecom facilities in the Gulf areas were wiped out
- In October of 2005, Cogent Communications and Level 3 Communications had a business dispute they failed to solve. As a result, Level 3 disconnected its peering connection with Cogent. This dispute left customers of one provider unable to connect to customers of the other provider, and caused severe Internet slowdowns around the globe.
Recently, a survey was undertaken to understand the frequency, severity, and cause of email outages in North American corporations using Microsoft Exchange, Lotus Notes, and Novell Groupwise. The survey results showed that enterprise email systems are prone to a variety of potential failures including storage area network (SAN) failures, incorrect configuration, losses in network access, database corruption, and viruses.
Survey results showed that within a 12-month period, there is a 75-percent likelihood of an unplanned email outage within any given company. The length of email outages in the companies surveyed ranged from 2 minutes to 120 hours, with the average email outage being just over 32 hours long. The largest concentration of outages (29 percent) was between four and 24 hours in duration. More than 43 percent of the outages lasted longer than 24 hours, a length of time that can lead to significant business disruption.
The majority of email outages were caused by unplanned events, most of which were technological failures. Failures such as server hardware accounted for 35 percent of outages; network connectivity losses, which averaged 27.4 hours, accounted for 19 percent; SAN failures accounted for 16 percent; and database corruption accounted for another 16 percent. While natural disasters accounted for only 14 percent of unplanned email outages, the average downtime due to such disasters was over 60 hours.
Service level agreements
ISPs’ service level agreements (SLAs) only guarantee nominal financial reimbursement and do not guarantee that a company’s sites will remain up and running during a disaster. Therefore, it remains in the hands of the organizations themselves to protect their vital assets and to ensure the resiliency of their WAN infrastructure. To ensure business continuity the risk and cost from ISP link outages must be addressed. Offering SLAs to appease organizations will not suffice for the following reasons:
- SLAs do not guarantee up-time (they only stipulate reimbursement if up-time requirements are not met)
- SLAs usually only reimburse customers for the cost of the prorated connectivity that was lost - typically a fraction of the cost of the total damages
Dealing with the risk of WAN outages
One of the easiest and most cost-effective ways of dealing with WAN reliability issues is multi-homing (or link load balancing) by using a specialized WAN Optimization Controller (WOC) that is capable of doing link aggregation, load balancing and failover. We will refer to these devices as WAN link controllers.
The value proposition that these products offer is that while little control can be maintained over the continuity of service from a single ISP over a single WAN link, diversifying ISPs and WAN links provisioned over varied physical and logical paths can greatly reduce downtime.
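The effect of diversifying links can be illustrated with a rough calculation. The sketch below assumes that link failures are statistically independent (which only approximates links provisioned over truly separate physical paths), and the availability figures are hypothetical, not taken from the survey data above. The function names are illustrative only.

```python
def combined_availability(link_availabilities):
    """Probability that at least one link is up.

    Assumes link failures are independent, which is an idealization:
    links sharing a conduit or an upstream provider can fail together.
    """
    all_down = 1.0
    for availability in link_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        all_down *= 1.0 - availability
    return 1.0 - all_down


def expected_downtime_hours(link_availabilities, period_hours=8760):
    """Expected hours with no working link over the period (default: one year)."""
    return (1.0 - combined_availability(link_availabilities)) * period_hours


# Hypothetical availabilities for two links from different providers.
single_link = expected_downtime_hours([0.998])
dual_link = expected_downtime_hours([0.998, 0.995])
print(f"single link: {single_link:.2f} h/year, two links: {dual_link:.2f} h/year")
```

Under these assumptions the expected time with no connectivity shrinks by orders of magnitude, which is the argument for multi-homing in numerical form. Correlated failures would reduce the benefit.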
Establishing back-up links is not a new idea. However, what is new is the ability for an organization to deploy affordable WAN link controllers that provide organizations with the flexibility to easily provision low-cost bandwidth links to meet their specific needs, while providing WAN link redundancy with automatic failover. WAN link controllers can:
- Immediately detect ISP and WAN link failures and automatically fail over to an available link – with the transition being virtually transparent to users
- Utilize multiple and diverse link types such as frame relay, T1 combined with ADSL broadband, wireless, and others to create a cost-effective yet resilient network
- Provide simultaneous utilization of all available links and available bandwidth (via link load-balancing), so that connectivity costs are not wasted on an underutilized back-up line
- Work independently of ISP peering relationships. ISP peering relationships and cooperation are not needed, and there are no problems with supporting disparate IP address spaces issued by different ISPs
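The failure detection and automatic failover described in the list above can be sketched as a small link monitor. This is a minimal illustration rather than any vendor's implementation: the class names, the probe mechanism (left abstract here) and the consecutive-failure threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    failures: int = 0  # consecutive failed health probes
    up: bool = True


class LinkMonitor:
    """Tracks probe results and reports which links may carry traffic."""

    def __init__(self, names, fail_threshold=3):
        self.links = {name: Link(name) for name in names}
        # Hypothetical policy: mark a link down after this many
        # consecutive failed probes, to avoid flapping on one lost packet.
        self.fail_threshold = fail_threshold

    def record_probe(self, name, ok):
        link = self.links[name]
        if ok:
            link.failures = 0
            link.up = True  # link recovered: eligible for traffic again
        else:
            link.failures += 1
            if link.failures >= self.fail_threshold:
                link.up = False

    def usable_links(self):
        """Links that new sessions may be routed over, in configured order."""
        return [link.name for link in self.links.values() if link.up]


monitor = LinkMonitor(["isp_a", "isp_b"], fail_threshold=2)
monitor.record_probe("isp_a", ok=False)
monitor.record_probe("isp_a", ok=False)
print(monitor.usable_links())  # isp_a is excluded until a probe succeeds
```

A lower threshold detects failures faster but reacts to transient loss; the right value depends on the probe interval and on how disruptive a false failover would be.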
WAN link controllers compared to BGP
As discussed earlier, reducing costs is a critical part of planning a solid business continuity solution, which should be in line with the expected risks and resulting impact. However, the WAN is getting more complex, and additional critical applications are running over the WAN every day. Organizations need solutions that address their specific business continuity needs, while at the same time, not compromising the reliability and performance of critical applications delivered over the WAN.
ISPs and large enterprises have multi-homed for years using Border Gateway Protocol (BGP) to connect to multiple Internet backbones, but BGP has many restrictions. For one, it requires that ISPs cooperate with each other and set up ‘peering’ agreements between routers, but because of the performance impact to their networks, many are not willing to do so. BGP also requires expensive routers, designated address blocks and an Autonomous System Number (ASN), which are sometimes not available to small businesses. Finally, BGP requires that gateway hosts exchange dynamic routing tables that must be constantly synchronized, which can lead to delays of up to 30 minutes in changing the traffic direction.
WAN link controllers use Network Address Translation (NAT) to unify traffic coming from and going to different destination IP addresses on the Internet. They can be configured with at least one routable IP address for each router/WAN link that is connected to the network.
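To make the role of NAT concrete, the following sketch shows how an outbound session might be rewritten to use the routable address of whichever link was chosen, so that replies return over the same link. The class and field names are hypothetical, the port allocation is deliberately naive, and the addresses come from the ranges reserved for documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class SourceNat:
    """Maps LAN sessions to the public address of the selected WAN link."""

    def __init__(self, link_public_ips):
        self.link_public_ips = dict(link_public_ips)  # link name -> routable IP
        self.table = {}  # (public_ip, public_port) -> original LAN session
        self.next_port = 40000  # naive allocator, for illustration only

    def translate(self, session, link):
        """Rewrite the session's source to the chosen link's public address."""
        public_ip = self.link_public_ips[link]
        public_port = self.next_port
        self.next_port += 1
        self.table[(public_ip, public_port)] = session
        return public_ip, public_port

    def reverse(self, public_ip, public_port):
        """Find the LAN session a returning packet belongs to, if any."""
        return self.table.get((public_ip, public_port))


nat = SourceNat({"isp_a": "198.51.100.10", "isp_b": "203.0.113.10"})
lan_session = Session("10.0.0.5", 51000, "192.0.2.80", 443)
print(nat.translate(lan_session, "isp_b"))
```

Because each link has its own address from its own ISP, no cooperation between the providers is needed: the remote host simply replies to the address it saw.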
The biggest benefit of WAN link controllers resides in their ability to conduct outgoing and incoming load balancing and failover without defining BGP routing tables or utilizing any of the underlying complicated routing techniques. The ability to offer this functionality without the expensive or complicated networks and equipment necessary to use BGP is what makes them affordable, especially for small and medium-sized enterprises.
Get it right – choosing the WAN link controller
* Outbound bandwidth aggregation - WAN link controllers should provide outgoing bandwidth aggregation at the TCP/UDP session layer on a per-session basis. The user defines weights (bandwidth capacity) based on the bandwidth of each WAN link. When a session is generated from the LAN, the device computes which link has the most available bandwidth and routes traffic from that session over that particular WAN link. The device typically allows the selection of two link load balancing algorithms:
- Symmetrical round robin - routes sessions to all links in a round robin manner.
- Intelligent (weighted) load balancing - computes a ratio between the weights (bandwidth capacity) of the different WAN links, and then routes sessions accordingly. That is, the faster the link, the more sessions that will be sent over that link in order to make the most efficient use of all the bandwidth available.
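The two algorithms can be sketched as follows. This is a simplified illustration under stated assumptions: it balances by session count in proportion to the configured weights, whereas a real device may also measure actual link utilization. The names `round_robin` and `WeightedBalancer` are invented for the example.

```python
from itertools import cycle


def round_robin(links):
    """Symmetrical round robin: hand out links in turn, ignoring capacity."""
    return cycle(links)


class WeightedBalancer:
    """Assigns sessions so that per-link counts follow the weight ratio."""

    def __init__(self, weights):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("each link needs a positive weight")
        self.weights = dict(weights)  # link name -> capacity, e.g. in Kbps
        self.assigned = {name: 0 for name in weights}

    def pick(self):
        # Choose the link whose load relative to its weight would be
        # lowest after taking one more session; ties go to the first link.
        name = min(
            self.weights,
            key=lambda n: (self.assigned[n] + 1) / self.weights[n],
        )
        self.assigned[name] += 1
        return name


# Hypothetical links: a T1 and a DSL line with half its capacity.
balancer = WeightedBalancer({"t1": 1536, "dsl": 768})
for _ in range(300):
    balancer.pick()
print(balancer.assigned)  # the T1 receives about twice as many sessions
```

Balancing per session rather than per packet keeps all packets of a connection on one link, which avoids reordering and keeps the NAT source address stable for the life of the session.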
* Inbound bandwidth aggregation - incoming bandwidth aggregation is accomplished by the WAN link controller acting as the authoritative DNS server for the domain. The device advertises all available WAN links to the DNS cache servers, which in turn answer queries for the domain names in a round robin fashion. In this manner, all externally initiated sessions are load balanced over all available links. Since the device is resident at the domain site and is able to directly monitor the link status, failed links are removed from the DNS tables immediately... By setting the host name record Time-to-Live (TTL) to a short period, the DNS caching servers will flush their address tables and will update them from the device regularly, and thus be informed when a link fails.
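The DNS-based inbound mechanism can be modelled in a few lines. The sketch below is not a DNS server; it only shows the decision an authoritative responder would make for each query: which addresses to return, in what order, and with what TTL. The class name, the 30-second TTL and the addresses (taken from documentation ranges) are assumptions for illustration.

```python
class InboundDnsBalancer:
    """Chooses the address records to return for the domain on each query."""

    def __init__(self, link_ips, ttl_seconds=30):
        self.link_ips = dict(link_ips)  # link name -> public IP on that link
        self.up = {name: True for name in link_ips}
        # A short TTL makes caching resolvers re-query soon, so they
        # learn about a failed link quickly (at the cost of more queries).
        self.ttl_seconds = ttl_seconds
        self._offset = 0

    def set_link_status(self, name, is_up):
        self.up[name] = is_up

    def answer(self):
        """Return (ttl, addresses), rotating the order on every query."""
        healthy = [ip for name, ip in self.link_ips.items() if self.up[name]]
        if not healthy:
            return self.ttl_seconds, []
        start = self._offset % len(healthy)
        self._offset += 1
        return self.ttl_seconds, healthy[start:] + healthy[:start]


dns = InboundDnsBalancer({"isp_a": "198.51.100.10", "isp_b": "203.0.113.10"})
print(dns.answer())
dns.set_link_status("isp_a", False)  # link failure detected locally
print(dns.answer())                  # only the surviving link is advertised
```

Clients that already cached the failed address keep using it until the TTL expires, so the TTL bounds how long inbound failover can take.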
* Cost-effective - a quality WAN link controller should deliver easy and affordable WAN/ISP link aggregation, inbound and outbound load-balancing, failover and optionally, point-to-point channel bonding. A WAN link controller enables SMEs to leverage low-cost links, eliminate link congestion and bottlenecks, and use the device’s QoS traffic management features to guarantee minimum bandwidth to specific applications. Moreover, these companies can take advantage of the cost of a consumer ADSL link, but receive business connectivity at that price. Not only is there flexible capacity – there are also cost-effective links from multiple ISPs, so that if one link goes down, there is automatic switchover to the other links.
By bundling (aggregating) multiple, diverse Internet links from one or more ISPs, the WAN link controller reduces the need to purchase multiple expensive high-speed links. This makes it possible to increase bandwidth using cost-effective links without compromising up-time. In addition to managing scalability and redundancy, the device cost-effectively utilizes all available WAN bandwidth through intelligent link load balancing, with features such as capacity-based and other quality of service routing. WAN link controllers provide controls for how bandwidth is used to support applications and connectivity. This allows SMEs to take advantage of the most cost-effective ISP rates, while ensuring appropriate levels of bandwidth are available for specific applications.
WAN link controllers should allow the choice of WAN link performance/cost ratio that best fits a company’s needs; along with complete service provider independence; and the elimination of the complexity of network protocols such as border gateway protocol (BGP). The device’s inbound and outbound bandwidth aggregation capability combines two or more Internet connections and provides Internet-based applications with access to the total available combined bandwidth. Bandwidth aggregation supports link load balancing to route Internet sessions from congested links, to links with more available bandwidth. It also provides automatic failover of Internet sessions from failed links to functional connections to eliminate a single-point-of-failure. For example, if a company has a T1 line (1.5 Mbps), and needs additional bandwidth, it would typically have to upgrade to a T3 line (45 Mbps). However, this may be significantly more bandwidth than required, and will be a significant increase in cost.
With a WAN link controller, this same scenario can be accomplished with two 768 Kbps DSL links that can be combined for a total aggregated bandwidth equivalent to a T1 - at a fraction of the cost. The company can also add additional lower speed links such as xDSL, cable, wireless, and others, with a relatively small increase in cost that can more closely match needs. In addition to receiving more cost-effective bandwidth, the reliability of the WAN network is dramatically increased due to the new levels of redundancy through the aggregation of multiple Internet links.
* Easy-to-use - WAN link controllers should have an easy-to-use web user interface with an intuitive administration capability. The web user interface allows IT managers to define, manage and control multiple, diverse WAN links, bandwidth and QoS settings across the WAN network.
* Redundant Internet access - redundant Internet access is the ability to switch traffic among multiple Internet connections through multi-homing, which more and more small and medium-sized companies are finding they need. When one link goes down, WAN and ISP failover automatically switches Internet traffic to an appropriately functioning link.
Additionally, bandwidth management enables guaranteed bandwidth to the most important application needs.
* WAN and ISP failover - when a WAN link controller detects a link failure it should automatically update the DNS record for a domain so that the server requests are sent to the IP address of an alternate server or server cluster. WAN link controllers should also provide for device failover through their active/passive failover capability. This eliminates the chance of the WAN link controller being the single-point-of-failure.
* Site redundancy - many businesses need to redirect Internet traffic to a disaster recovery site should a catastrophe disrupt a main site. WAN link controllers have, in effect, reduced the cost to ensure that site failover and fallback occur automatically, and reliably, making this functionality practical and affordable even for the smallest businesses.
* Quality of Service (QoS) - QoS is the ability to prioritize network traffic to ensure that adequate bandwidth is always available to specific bandwidth-intensive applications, especially during periods of congestion. QoS rules determine bandwidth minimums and maximums for specific types of traffic and use load balancing and automatic failover to direct this traffic to links with sufficient bandwidth. Based on user-defined QoS policies, WAN link controllers control the link bandwidth allocations to support applications that are prioritized, and which ISP links they are going over.
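One way to picture QoS minimums and maximums is as an allocation problem over a link's capacity. The sketch below is an assumed policy, not a description of any particular product: every traffic class first receives its guaranteed minimum, and spare capacity is then handed out in priority order up to each class's maximum. The class names and figures are hypothetical.

```python
def allocate_bandwidth(capacity_kbps, classes):
    """Split link capacity among traffic classes.

    classes: list of (name, min_kbps, max_kbps), highest priority first.
    """
    for name, low, high in classes:
        if low < 0 or high < low:
            raise ValueError(f"invalid bounds for {name}")
    total_min = sum(low for _, low, _ in classes)
    if total_min > capacity_kbps:
        raise ValueError("guaranteed minimums exceed link capacity")

    # Step 1: every class gets its guaranteed minimum.
    allocation = {name: low for name, low, _ in classes}
    spare = capacity_kbps - total_min

    # Step 2: spare capacity goes to classes in priority order, up to max.
    for name, low, high in classes:
        extra = min(spare, high - low)
        allocation[name] += extra
        spare -= extra
    return allocation


# Hypothetical policy on a 1536 Kbps link.
policy = [("voip", 256, 512), ("orders", 512, 1024), ("web", 128, 1536)]
print(allocate_bandwidth(1536, policy))
```

Rejecting a policy whose minimums exceed capacity is deliberate: a guarantee that cannot be met on a given link is better surfaced at configuration time than during congestion.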
* Availability - WAN link controllers should also be configured in a high availability mode with one WAN link controller acting as the primary, and a second WAN link controller as a standby.
* Performance - performance of applications over the WAN directly affects response time. This includes not only total average transaction time, but also ensuring that users located at performance-challenged sites (such as branch offices) still receive an acceptable level of performance. The WAN link controller should support extremely high volumes of traffic transmitted to and from sites. A simple definition of performance is how many bits-per-second the device can support. While this is extremely important, in the case of a WAN optimization controller, other key measures of performance include how many WAN links, concurrent sessions, domain names and users can be supported simultaneously.
* Security - WAN link controllers should address issues specific to applications crossing the network, such as encryption, authentication, and detecting DoS attacks, intrusions, and malicious behavior.
* Firewall and VPN load balancing and redundancy - organizations with multiple sites use multiple WAN links with VPN load balancing to provide the security, failover protection and cost-based link flexibility. A key benefit to VPN load balancing is the ability to support VPN connections using multiple and diverse WAN links that work together. By using multiple ISPs, data and applications can securely travel through the aggregated VPN network from any of the links on the transmitting side, to any of the links on the receiving side. WAN link controller VPN load balancing should be able to switch the order of the IPSec packets as they come from the VPN server, and based on load balancing decisions, send the packets to multiple sites with WAN link controllers located at both ends. This capability makes it difficult for intruders to assemble IPSec packets, organize them in the right order, and decrypt them.
* Scalability - It is important to understand how many users can have access to available network resources without having to spend large amounts of money to upgrade the network. Scalability of a WAN link controller implies the availability of a range of products that span the performance and cost requirements of a variety of datacenter environments. Performance requirements for accessing datacenter applications and data resources are usually characterized in terms of both the aggregate throughput of the WAN link controller, and the number of simultaneous application sessions that can be supported.
* Site-to-site channel bonding - WAN link controllers should support site-to-site channel bonding to provide the ability to bond multiple Internet links into a single high-bandwidth channel for uninterrupted availability for applications. Channel bonding is a form of load balancing which allows for stateful failover of traffic to the best performing links to ensure critical applications avoid problems that occur when they are stopped on one link and restarted over another link. Site-to-site channel bonding ensures that critical applications avoid failures, and are never adversely affected, even after brief disruptions.
For IT organizations, a solid business continuity plan should focus not only on data and application protection, but also on WAN and Internet resiliency that are necessary to ensure continuous access to data and applications delivered over the Internet in the event of a disruption or disaster. ISP outages are a reality today, and they will certainly continue into the future, and the integrity of a business cannot be left only in the hands of third-party ISPs.
Multi-homing, or WAN/ISP link load balancing, is an effective approach to dealing with WAN reliability and performance issues. WAN link controllers provide a quick return on investment, compared to multi-homing approaches such as BGP. Selecting the appropriate multi-homing solution that addresses your business continuity needs and relevant business applications, without compromising their performance when delivered over the network is critical.
About the author
Marc Goodman is the director of marketing at Ecessa, a manufacturer of advanced WAN Optimization products that provide WAN and ISP link aggregation, intelligent load balancing, failover, QoS and VPN load balancing and failover within a single device.
• Date: 8th April 2009 • Region: World • Type: Article • Topic: IT continuity