Using all IPv6 addresses on Elastic Network Interfaces in EC2 instances

ENIs come with a limited number of IPv4 and IPv6 addresses (current numbers are here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI), but what happens if we need more than a single ENI can support? The answer is to attach multiple ENIs to the same instance. Whilst this works out of the box for IPv4, IPv6 requires some further setup.

The problem

AWS networking is unlike "regular" network infrastructure. For example, in its default scenario multiple ENIs on the same EC2 instance can be connected to the same subnet (for both IPv4 and IPv6). A sample instance might have the following configuration (with DHCPv6 and RA enabled):

# ip -6 a sh dev ens5
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    inet6 2406:da1c:150:2e01::a0/128 scope global dynamic noprefixroute
       valid_lft 383sec preferred_lft 83sec
    inet6 2406:da1c:150:2e01::a3/128 scope global dynamic noprefixroute
       valid_lft 383sec preferred_lft 83sec
    inet6 2406:da1c:150:2e01::a2/128 scope global dynamic noprefixroute
       valid_lft 383sec preferred_lft 83sec
    inet6 2406:da1c:150:2e01::a1/128 scope global dynamic noprefixroute
       valid_lft 383sec preferred_lft 83sec
    inet6 fe80::57:76ff:fec9:bd8c/64 scope link
       valid_lft forever preferred_lft forever

# ip -6 a sh dev ens6       
3: ens6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    inet6 2406:da1c:150:2e01::a7/128 scope global dynamic noprefixroute
       valid_lft 392sec preferred_lft 92sec
    inet6 2406:da1c:150:2e01::a6/128 scope global dynamic noprefixroute
       valid_lft 392sec preferred_lft 92sec
    inet6 2406:da1c:150:2e01::a5/128 scope global dynamic noprefixroute
       valid_lft 392sec preferred_lft 92sec
    inet6 2406:da1c:150:2e01::a4/128 scope global dynamic noprefixroute
       valid_lft 392sec preferred_lft 92sec
    inet6 fe80::8:b3ff:fefa:f10e/64 scope link
       valid_lft forever preferred_lft forever

The first obvious thing to notice is that the global IPv6 addresses are all /128s; only the link-local ones are part of a network (a /64).

Let's have a look at routing:

# ip -6 r sh
::1 dev lo proto kernel metric 256 pref medium
2406:da1c:150:2e01::/64 dev ens5 proto ra metric 100 pref medium
2406:da1c:150:2e01::/64 dev ens6 proto ra metric 200 pref medium
blackhole fd00::7:7d80/122 dev lo proto bird metric 1024 pref medium
fe80::/64 dev ens6 proto kernel metric 256 pref medium
fe80::/64 dev ens5 proto kernel metric 256 pref medium
default via fe80::b:f3ff:fea0:fbe dev ens5 proto ra metric 100 expires 1799sec pref medium
default via fe80::b:f3ff:fea0:fbe dev ens6 proto ra metric 200 expires 1799sec pref medium

The odd thing here is that we have two default routes with different metrics. In practical terms, that means all outbound traffic leaves via ens5, the route with the lower metric.
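
The metric-based choice can be checked mechanically. Below is a minimal sketch, assuming the `ip -6 route show` output format seen above; the function name preferred_default_dev is made up for illustration and is not part of iproute2.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -6 route show` output on stdin and print
# the device of the default route with the lowest metric.
preferred_default_dev() {
    awk '
        $1 == "default" {
            dev = ""; has_metric = 0
            for (i = 2; i < NF; i++) {
                if ($i == "dev")    dev = $(i + 1)
                if ($i == "metric") { metric = $(i + 1) + 0; has_metric = 1 }
            }
            if (dev != "" && has_metric && (!found || metric < best)) {
                best = metric; best_dev = dev; found = 1
            }
        }
        END { if (found) print best_dev }
    '
}

# Sample lines copied from the routing table above. On a live instance:
#   ip -6 route show | preferred_default_dev
preferred_default_dev <<'EOF'
default via fe80::b:f3ff:fea0:fbe dev ens5 proto ra metric 100 expires 1799sec pref medium
default via fe80::b:f3ff:fea0:fbe dev ens6 proto ra metric 200 expires 1799sec pref medium
EOF
# prints: ens5
```

On the instance itself, `ip -6 route get 2406:e001:3:270e::2 from 2406:da1c:150:2e01::a7` should report the same egress device directly.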

So let's capture some packets to see how they flow, first on the ens5 interface (pinging from a host outside of AWS):

# tcpdump -ni ens5 icmp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens5, link-type EN10MB (Ethernet), capture size 262144 bytes

12:11:48.850639 IP6 2406:e001:3:270e::2 > 2406:da1c:150:2e01::a1: ICMP6, echo request, seq 0, length 16
12:11:48.850692 IP6 2406:da1c:150:2e01::a1 > 2406:e001:3:270e::2: ICMP6, echo reply, seq 0, length 16

Nothing unusual to see here - packets enter and leave as expected.

What about an IP assigned to the ens6 interface?

# tcpdump -ni ens6 icmp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens6, link-type EN10MB (Ethernet), capture size 262144 bytes
12:16:03.061318 IP6 2406:e001:3:270e::2 > 2406:da1c:150:2e01::a7: ICMP6, echo request, seq 11, length 16

We can only see the ICMPv6 request, but not the response. That's because the response is leaving via ens5:

# tcpdump -ni ens5 icmp6
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens5, link-type EN10MB (Ethernet), capture size 262144 bytes
12:17:13.304446 IP6 2406:da1c:150:2e01::a7 > 2406:e001:3:270e::2: ICMP6, echo reply, seq 81, length 16

but the host we're pinging from is not getting the response:

ping6 2406:da1c:150:2e01::a7
PING6(56=40+8+8 bytes) 2406:e001:3:270e::2 --> 2406:da1c:150:2e01::a7
^C
--- 2406:da1c:150:2e01::a7 ping6 statistics ---
100 packets transmitted, 0 packets received, 100.0% packet loss

What if we delete the default route via ens5?

ip -6 r d default via fe80::b:f3ff:fea0:fbe dev ens5

Now we can ping the second IP:

ping6 2406:da1c:150:2e01::a7
PING6(56=40+8+8 bytes) 2406:e001:3:270e::2 --> 2406:da1c:150:2e01::a7
16 bytes from 2406:da1c:150:2e01::a7, icmp_seq=0 hlim=47 time=33.753 ms
16 bytes from 2406:da1c:150:2e01::a7, icmp_seq=1 hlim=47 time=34.124 ms
^C
--- 2406:da1c:150:2e01::a7 ping6 statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/std-dev = 33.753/33.939/34.124/0.185 ms

but not the first one. The default route also comes back quite quickly, as soon as the next RA packet arrives, which breaks the connectivity again. That happens even with the source/destination check disabled on the ENI.

The solution

So the problem is that response traffic is supposed to use a different egress interface depending on which IPv6 address it originates from. Luckily, this can be easily solved using the standard Linux iproute2 package.

Linux has a concept of multiple routing tables, but by default only a few of them are used. In Ubuntu their names are stored in /etc/iproute2/rt_tables; that file provides a mapping between human-readable names and numeric table IDs:

#
# reserved values
#
255     local
254     main
253     default
0       unspec
#
# local
#
#1      inr.ruhep

(the mapping doesn't have to be used, you can use just a number instead)

To determine which table should be used for each packet, the Linux kernel inspects the routing policy rules. Again, by default they're fairly simple:

# ip -6 rule sh
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

Routes for each of those tables can be inspected:

# ip -6 r sh table local
local ::1 dev lo proto kernel metric 0 pref medium
local 2406:da1c:150:2e01::a0 dev ens5 proto kernel metric 0 pref medium
local 2406:da1c:150:2e01::a1 dev ens5 proto kernel metric 0 pref medium
local 2406:da1c:150:2e01::a2 dev ens5 proto kernel metric 0 pref medium
local 2406:da1c:150:2e01::a3 dev ens5 proto kernel metric 0 pref medium
local 2406:da1c:150:2e01::a4 dev ens6 proto kernel metric 0 pref medium
local 2406:da1c:150:2e01::a5 dev ens6 proto kernel metric 0 pref medium
local 2406:da1c:150:2e01::a6 dev ens6 proto kernel metric 0 pref medium
local 2406:da1c:150:2e01::a7 dev ens6 proto kernel metric 0 pref medium
anycast fe80:: dev ens5 proto kernel metric 0 pref medium
anycast fe80:: dev ens6 proto kernel metric 0 pref medium
local fe80::8:b3ff:fefa:f10e dev ens6 proto kernel metric 0 pref medium
local fe80::57:76ff:fec9:bd8c dev ens5 proto kernel metric 0 pref medium
multicast ff00::/8 dev ens6 proto kernel metric 256 pref medium
multicast ff00::/8 dev ens5 proto kernel metric 256 pref medium

So, to put it all together, we need:

  • Two new route tables (let's call them ens5 and ens6), each with a default route pointing out of its own interface
  • A number of rules to tell the kernel which packets should be subject to that special routing

Creating route tables

The next hop for IPv6 traffic always seems to be the same fe80::b:f3ff:fea0:fbe address, regardless of the actual network.

So first let's create entries in /etc/iproute2/rt_tables:

5  ens5
6  ens6
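
If this step is scripted, it helps to make it safe to run more than once. Below is a minimal sketch; add_rt_table is a hypothetical helper, not an iproute2 command. It only checks whether the name is already mapped and does not detect an ID that is already taken by another name.

```shell
#!/bin/sh
# Hypothetical helper: add_rt_table FILE ID NAME
# Append "ID NAME" to an rt_tables-style file unless NAME is already mapped.
add_rt_table() {
    if ! awk -v name="$3" '$1 !~ /^#/ && $2 == name { found = 1 } END { exit !found }' "$1"; then
        printf '%s\t%s\n' "$2" "$3" >> "$1"
    fi
}

# Demonstrated on a scratch file. On the instance the target would be
# /etc/iproute2/rt_tables, which requires root to modify.
tmp=$(mktemp)
printf '255\tlocal\n254\tmain\n253\tdefault\n0\tunspec\n' > "$tmp"
add_rt_table "$tmp" 5 ens5
add_rt_table "$tmp" 6 ens6
add_rt_table "$tmp" 5 ens5   # repeated call leaves the file unchanged
cat "$tmp"
rm -f "$tmp"
```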

Next, populate the route tables:

ip -6 route add default via fe80::b:f3ff:fea0:fbe dev ens6 table ens6
ip -6 route add default via fe80::b:f3ff:fea0:fbe dev ens5 table ens5
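
Rather than hard-coding the next hop, the commands can be derived from the RA-learned default routes already present in the main table. This is a sketch under assumptions: emit_table_routes is a made-up name, the input is expected in the `ip -6 route show` format seen earlier, and each table is assumed to be named after its interface as in the rt_tables entries above. It prints the commands instead of running them, and uses `ip -6 route replace` so that re-running does not fail on an existing route.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -6 route show` output on stdin and print
# one command per default route, copying it into the table named after
# the route's device.
emit_table_routes() {
    awk '
        $1 == "default" && $2 == "via" {
            gw = $3; dev = ""
            for (i = 4; i < NF; i++) if ($i == "dev") dev = $(i + 1)
            if (dev != "")
                printf "ip -6 route replace default via %s dev %s table %s\n", gw, dev, dev
        }
    '
}

# Sample lines copied from the routing table above. On a live instance:
#   ip -6 route show | emit_table_routes
emit_table_routes <<'EOF'
2406:da1c:150:2e01::/64 dev ens5 proto ra metric 100 pref medium
default via fe80::b:f3ff:fea0:fbe dev ens5 proto ra metric 100 expires 1799sec pref medium
default via fe80::b:f3ff:fea0:fbe dev ens6 proto ra metric 200 expires 1799sec pref medium
EOF
```

Once the printed commands look right, the output can be piped to `sh` as root.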

Finally, add the source-based rules:

ip -6 rule add from 2406:da1c:150:2e01::a7/128 table ens6
ip -6 rule add from 2406:da1c:150:2e01::a6/128 table ens6
ip -6 rule add from 2406:da1c:150:2e01::a5/128 table ens6
ip -6 rule add from 2406:da1c:150:2e01::a4/128 table ens6
ip -6 rule add from 2406:da1c:150:2e01::a0/128 table ens5
ip -6 rule add from 2406:da1c:150:2e01::a3/128 table ens5
ip -6 rule add from 2406:da1c:150:2e01::a2/128 table ens5
ip -6 rule add from 2406:da1c:150:2e01::a1/128 table ens5
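
Typing one rule per address gets tedious as the address count grows, so the list can be generated from the addresses actually assigned. The following is a sketch, assuming the one-line-per-address format printed by `ip -6 -o addr show scope global`; the function name emit_source_rules is invented for this example, and tables are again assumed to be named after their interfaces. It prints the commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -6 -o addr show scope global` output on
# stdin and print one source-based rule per global address.
emit_source_rules() {
    awk '
        $3 == "inet6" && $5 == "scope" && $6 == "global" {
            split($4, a, "/")            # drop the prefix length
            printf "ip -6 rule add from %s/128 table %s\n", a[1], $2
        }
    '
}

# Sample input in the assumed -o format. On a live instance:
#   ip -6 -o addr show scope global | emit_source_rules
emit_source_rules <<'EOF'
2: ens5    inet6 2406:da1c:150:2e01::a0/128 scope global dynamic noprefixroute \       valid_lft 383sec preferred_lft 83sec
3: ens6    inet6 2406:da1c:150:2e01::a7/128 scope global dynamic noprefixroute \       valid_lft 392sec preferred_lft 92sec
EOF
```

Depending on the kernel version, running the same `ip -6 rule add` twice may either create a duplicate rule or fail, so it is worth checking `ip -6 rule sh` before re-applying.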

Now, we can reach IPs on both interfaces from outside:

# ping6 -c 3 2406:da1c:150:2e01::a7
PING6(56=40+8+8 bytes) 2406:e001:3:270e::2 --> 2406:da1c:150:2e01::a7
16 bytes from 2406:da1c:150:2e01::a7, icmp_seq=0 hlim=47 time=37.032 ms
16 bytes from 2406:da1c:150:2e01::a7, icmp_seq=1 hlim=47 time=34.345 ms
16 bytes from 2406:da1c:150:2e01::a7, icmp_seq=2 hlim=47 time=34.597 ms

--- 2406:da1c:150:2e01::a7 ping6 statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 34.345/35.325/37.032/1.212 ms
 
# ping6 -c 3 2406:da1c:150:2e01::a1
PING6(56=40+8+8 bytes) 2406:e001:3:270e::2 --> 2406:da1c:150:2e01::a1
16 bytes from 2406:da1c:150:2e01::a1, icmp_seq=0 hlim=47 time=33.583 ms
16 bytes from 2406:da1c:150:2e01::a1, icmp_seq=1 hlim=47 time=35.032 ms
16 bytes from 2406:da1c:150:2e01::a1, icmp_seq=2 hlim=47 time=35.264 ms

--- 2406:da1c:150:2e01::a1 ping6 statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 33.583/34.626/35.264/0.744 ms

This configuration can be added, for example, to rc.local to make sure it's executed every time the instance starts.

Credits

Photo by Onur K on Unsplash