[RPKI] Filtering of Unsafe VRPs

Thu Sep 24 13:30:53 UTC 2020

Hi Jay, all,

So, first off: for hosted CAs under RIRs / NIRs, resources are delegated to one member CA only, and the RIR and NIR parents do not create ROAs. So, normally this filtering should be a no-op.

That said, there are situations where delegated CAs create ROAs and sub-delegate prefixes to subsidiary CAs. At least in theory - and with the ascent of non-hosted CAs, we may see this more in practice. In these cases a complete rejection of the subsidiary CA as per -bis can actually lead to creating RPKI invalids, contrary to its intent to avoid them.

The filtering of unsafe VRPs is an attempt to avoid this. But it adds complexity, and it adds its own corner cases. So, this could really benefit from constructive discussion. Discussion on this in the IETF has proven to be somewhat complicated, but an interim meeting has been planned so it would be nice to get some feedback here as well prior to that. My apologies for the wall of text but this is a delicate issue.

Also, *yes*, I too prefer simple solutions, least surprise and soft landings. But I hope that I can demonstrate here that by applying the reject-all in -bis we are already introducing complex behaviour, and introducing surprises and hard landings. The filtering may make this better, or worse. I am not sure yet.

Let me explore a scenario where we have intermittent overclaims. I believe that this will be the most likely and common reason why objects are found to be invalid. Simply because it cannot, presently, be automatically avoided:

Phase 1:
========

A parent holds a prefix, announces it, has a ROA for it. But they delegate and allow a customer to announce some more specifics.

Parent:
-------
holds:                192.168.0.0/16
roas & announcements: 192.168.0.0/16 from AS65000

Child:
------
holds:                192.168.0.0/24
                      192.168.21.0/24
roas & announcements: 192.168.0.0/24 from AS65001
                      192.168.21.0/24 from AS65001

All announcements above are RPKI valid. Good.
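
To make the outcome easy to check, here is a minimal sketch of route origin validation in the style of RFC 6811, applied to the phase 1 data. The function name `rov` and the tuple layout are my own illustration, and I am assuming each ROA uses a max length equal to its prefix length, which the example above does not state.

```python
from ipaddress import ip_network

def rov(prefix, origin_as, vrps):
    """Classify a route against VRPs given as (prefix, max_length, asn)."""
    route = ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, asn in vrps:
        if route.subnet_of(ip_network(vrp_prefix)):
            covered = True  # at least one VRP covers this route
            if asn == origin_as and route.prefixlen <= max_length:
                return "valid"
    # Covered but never matched means invalid; not covered means not-found.
    return "invalid" if covered else "not-found"

# Phase 1 ROAs. Assumption: max length equals the prefix length.
PHASE1_VRPS = [
    ("192.168.0.0/16", 16, 65000),
    ("192.168.0.0/24", 24, 65001),
    ("192.168.21.0/24", 24, 65001),
]

PHASE1_ROUTES = [
    ("192.168.0.0/16", 65000),
    ("192.168.0.0/24", 65001),
    ("192.168.21.0/24", 65001),
]

for prefix, asn in PHASE1_ROUTES:
    print(prefix, asn, rov(prefix, asn, PHASE1_VRPS))
```

Each of the three announcements is matched by its own ROA, so the covering /16 does no harm at this stage.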

Phase 2:
========

Now consider that the parent decides that the child can no longer announce one of the prefixes. They shrink the certificate issued to the child. Unfortunately, there is nothing in the RFC6492 protocol which allows the parent to inform the child that resources will be removed in the future. So, unless a manual action is taken by the child CA operator in advance to remove the offending ROA, their CA will still happily publish a ROA for the prefix they no longer hold:

Parent:
-------
holds:                192.168.0.0/16
roas & announcements: 192.168.0.0/16 from AS65000

Child:
------
holds:                192.168.0.0/24
roas & announcements: 192.168.0.0/24 from AS65001
                      192.168.21.0/24 from AS65001 (ROA object invalid)

Under the -bis rules we should now reject the publication point of the child in its entirety. And because the parent has a covering ROA and announcement, both more-specific announcements by the child are now RPKI Invalid. The announcement by the parent is RPKI Valid.
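
A rough sketch of this reject-all behaviour on the phase 2 data follows. It is only a model of the rule as I describe it here, not code from any RP implementation. The dictionary layout and the names `overclaims`, `vrps_reject_all` and `rov` are hypothetical, and max length equal to the prefix length is again an assumption.

```python
from ipaddress import ip_network

def rov(prefix, origin_as, vrps):
    """Classify a route against VRPs given as (prefix, max_length, asn)."""
    route = ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, asn in vrps:
        if route.subnet_of(ip_network(vrp_prefix)):
            covered = True
            if asn == origin_as and route.prefixlen <= max_length:
                return "valid"
    return "invalid" if covered else "not-found"

def overclaims(roa_prefix, held):
    """True if the ROA prefix is not contained in any resource the CA holds."""
    net = ip_network(roa_prefix)
    return not any(net.subnet_of(ip_network(h)) for h in held)

def vrps_reject_all(cas):
    """Drop every ROA of a CA as soon as one of its ROAs overclaims."""
    vrps = []
    for ca in cas:
        if any(overclaims(roa[0], ca["holds"]) for roa in ca["roas"]):
            continue  # whole publication point rejected
        vrps.extend(ca["roas"])
    return vrps

# Phase 2: the child's certificate was shrunk, its ROAs were not.
PHASE2_CAS = [
    {"holds": ["192.168.0.0/16"],
     "roas": [("192.168.0.0/16", 16, 65000)]},
    {"holds": ["192.168.0.0/24"],
     "roas": [("192.168.0.0/24", 24, 65001),
              ("192.168.21.0/24", 24, 65001)]},
]

PHASE2_VRPS = vrps_reject_all(PHASE2_CAS)
for prefix, asn in [("192.168.0.0/16", 65000),
                    ("192.168.0.0/24", 65001),
                    ("192.168.21.0/24", 65001)]:
    print(prefix, asn, rov(prefix, asn, PHASE2_VRPS))
```

Only the parent's VRP survives, so both child announcements are covered by the /16 but matched by nothing.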

Phase 3:
========

Some time later the child's CA finds out that it has lost resources. In the case of Krill, the child will query the parent for its entitlements every 10 minutes. Some might say this is aggressive, but it's really because of the situation above. I can't speak for other CA implementations, but in the case of Krill, it would then stop publishing the offending ROA (I suspect though that rpkid by DRL does the same). Resulting:

Parent:
-------
holds:                192.168.0.0/16
roas & announcements: 192.168.0.0/16 from AS65000

Child:
------
holds:                192.168.0.0/24
roas & announcements: 192.168.0.0/24 from AS65001 (RPKI Valid)
                      192.168.21.0/24 from AS65001 (no ROA, covering exists, so RPKI Invalid)

So, the big question here is: are we okay with the child's remaining announcements being considered RPKI Invalid in phase 2? Because if we are not, then the -bis needs work.

Filtering of Unsafe VRPs
=========================

Note that the filtering is done on the *remaining* resources for the child only, i.e. 192.168.0.0/24. Since the covering ROA for 192.168.0.0/16 includes this prefix, it would be ignored. As a result all announcements for 192.168.0.0/16 and more specifics would become RPKI Not Found. In this particular example that includes the announcement for 192.168.21.0/24 as well.
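
As a sketch of how I understand the filter, assuming it drops every VRP whose prefix overlaps a resource of a rejected CA: the names `filter_unsafe` and `rov` are hypothetical and this is not Routinator's actual code. The input is the phase 2 VRP set that remains after the child has been rejected.

```python
from ipaddress import ip_network

def rov(prefix, origin_as, vrps):
    """Classify a route against VRPs given as (prefix, max_length, asn)."""
    route = ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, asn in vrps:
        if route.subnet_of(ip_network(vrp_prefix)):
            covered = True
            if asn == origin_as and route.prefixlen <= max_length:
                return "valid"
    return "invalid" if covered else "not-found"

def filter_unsafe(vrps, rejected_resources):
    """Drop every VRP whose prefix overlaps a resource of a rejected CA."""
    rejected = [ip_network(r) for r in rejected_resources]
    return [v for v in vrps
            if not any(ip_network(v[0]).overlaps(r) for r in rejected)]

# VRPs left in phase 2 after the child was rejected (max length assumed).
REMAINING_VRPS = [("192.168.0.0/16", 16, 65000)]
# Resources still on the rejected child's certificate.
REJECTED_RESOURCES = ["192.168.0.0/24"]

SAFE_VRPS = filter_unsafe(REMAINING_VRPS, REJECTED_RESOURCES)
for prefix, asn in [("192.168.0.0/16", 65000),
                    ("192.168.0.0/24", 65001),
                    ("192.168.21.0/24", 65001)]:
    print(prefix, asn, rov(prefix, asn, SAFE_VRPS))
```

The parent's /16 overlaps the child's /24, so it is filtered too and nothing is left to cover any of the three announcements.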

This landing could be considered softer. But, unfortunately it is not without problems either.

Theoretically a child could be malicious, and issue a ROA for a resource that it knows it does not have in order to force filtering out of VRPs for all the resources delegated to it. I.e. it could just produce the extra ROA on purpose.

I say theoretically, because at least Krill simply does not allow an operator to do this. They would have to go to some lengths to make this work. But still with enough effort and knowledge it can be hacked. If it should happen then the parent can mitigate this by revoking the delegation to this child - but this would be after the fact of course.

Question then is whether this is worse than considering the child's remaining announcement as RPKI Invalid for the duration of phase 2. I honestly don't know. But I think it's part of the discussion that we should have.

Ignore overclaims?
==================

Personally I would prefer a different approach where validation software can make an exception in the -bis behaviour for overclaiming objects. The ROA for 192.168.21.0/24 in phase 2 is overclaiming, but it's (presumably) not corrupt, or badly signed etc. An RP tool can detect this easily. I would prefer then that this object is considered invalid, but the remaining objects are not rejected. With that approach 192.168.0.0/24 would remain RPKI Valid in phase 2.

Note that the rejection of a ROA object which contains resources not held by the CA cannot lead to RPKI invalids due to accepting other ROAs for this CA for other prefixes.
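
Sketched on the phase 2 data, this per-object exception could look as follows. This is only an illustration of the idea, with hypothetical names (`vrps_drop_overclaiming`, `overclaims`, `rov`) and the assumption that max length equals the prefix length.

```python
from ipaddress import ip_network

def rov(prefix, origin_as, vrps):
    """Classify a route against VRPs given as (prefix, max_length, asn)."""
    route = ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, asn in vrps:
        if route.subnet_of(ip_network(vrp_prefix)):
            covered = True
            if asn == origin_as and route.prefixlen <= max_length:
                return "valid"
    return "invalid" if covered else "not-found"

def overclaims(roa_prefix, held):
    """True if the ROA prefix is not contained in any resource the CA holds."""
    net = ip_network(roa_prefix)
    return not any(net.subnet_of(ip_network(h)) for h in held)

def vrps_drop_overclaiming(cas):
    """Reject only the overclaiming ROAs; keep the CA's other ROAs."""
    return [roa for ca in cas for roa in ca["roas"]
            if not overclaims(roa[0], ca["holds"])]

PHASE2_CAS = [
    {"holds": ["192.168.0.0/16"],
     "roas": [("192.168.0.0/16", 16, 65000)]},
    {"holds": ["192.168.0.0/24"],
     "roas": [("192.168.0.0/24", 24, 65001),
              ("192.168.21.0/24", 24, 65001)]},
]

LENIENT_VRPS = vrps_drop_overclaiming(PHASE2_CAS)
for prefix, asn in [("192.168.0.0/16", 65000),
                    ("192.168.0.0/24", 65001),
                    ("192.168.21.0/24", 65001)]:
    print(prefix, asn, rov(prefix, asn, LENIENT_VRPS))
```

The announcement for 192.168.21.0/24 still ends up RPKI Invalid here, but only because of the parent's covering ROA, which is the same outcome as in phase 3.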

But a complication exists if CAs combine multiple prefixes on a single ROA. Then rejecting the complete ROA could lead to invalids in case there are prefixes on it that the CA still holds, and they issued other non-overclaiming ROAs for those prefixes. This is why I suggested on sidrops that in these cases one could add the *intersection* of resources on the rejected objects and the issuing certificate to a filter.
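
A small sketch of that intersection idea, under assumptions of my own: the rejected ROA lists both child prefixes, the child also has a second, non-overclaiming ROA for a made-up AS65002, and the filter drops any VRP overlapping the intersection. The unrelated 192.0.2.0/24 VRP for AS64500 is there only to show that other space is untouched. All function names are hypothetical.

```python
from ipaddress import ip_network

def intersection(roa_prefixes, held):
    """Resources on a rejected ROA that the issuing certificate still holds."""
    result = set()
    for p in map(ip_network, roa_prefixes):
        for h in map(ip_network, held):
            if p.overlaps(h):
                # Overlapping CIDR prefixes nest; keep the more specific one.
                result.add(p if p.prefixlen >= h.prefixlen else h)
    return result

def filter_vrps(vrps, filter_nets):
    """Drop every VRP (prefix, max_length, asn) overlapping a filtered net."""
    return [v for v in vrps
            if not any(ip_network(v[0]).overlaps(n) for n in filter_nets)]

CHILD_HOLDS = ["192.168.0.0/24"]
# One ROA for AS65001 listing two prefixes; 192.168.21.0/24 is overclaimed.
REJECTED_ROA = ["192.168.0.0/24", "192.168.21.0/24"]
# VRPs from accepted objects (max length assumed equal to prefix length).
ACCEPTED_VRPS = [
    ("192.168.0.0/16", 16, 65000),   # parent
    ("192.168.0.0/24", 24, 65002),   # child's other ROA, hypothetical AS
    ("192.0.2.0/24", 24, 64500),     # unrelated, hypothetical
]

UNSAFE = intersection(REJECTED_ROA, CHILD_HOLDS)
FILTERED_VRPS = filter_vrps(ACCEPTED_VRPS, UNSAFE)
print(sorted(str(n) for n in UNSAFE))
print(FILTERED_VRPS)
```

Without the filter, the AS65001 announcement for 192.168.0.0/24 would be RPKI Invalid because of the AS65002 ROA. With it, only VRPs touching 192.168.0.0/24 disappear, and 192.168.21.0/24 is not added to the filter at all.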

A simpler, less forgiving approach could be to say that it's okay to treat individual overclaiming ROAs as invalid as long as they do NOT include resources still held by the CA.

Prevent overclaims?
===================

I intend to propose changes to the RFC6492 protocol so that a parent can indicate to a child CA:
 1. when a certificate with *new* resources will be published, and
 2. that resources will be removed soon (before the certificate is shrunk)

This should help with the overclaim case - and this is the case I am most afraid of here. However, this is very early days. I still need to start on documenting this, try to get co-authors, and try to raise it in the IETF. And even then it may be a while before CAs can implement it. So, don't hold your breath ;)

Tim

> On 23 Sep 2020, at 21:55, Jay Borkenhagen via RPKI <rpki at lists.nlnetlabs.nl> wrote:
> 
> Hi Martin,
> 
> Apologies for not replying sooner.
> 
> I appreciate the goal behind this: to reduce the chances that
> something broken within the RPKI system will cause ISP networks to
> mis-categorize legit BGP announcements as 'rpki-invalid', instead
> falling back to 'rpki-not-found'.
> 
> But in thinking this through, I keep coming back to a feeling that
> overall it would be best to keep things simple.  For RPs like
> Routinator that would mean looking at the entire RPKI (I admit there
> are devilish details buried in "entire"), applying the standardized
> set of rules to vet the retrieved objects, then publishing VRPs for
> everything that checks out OK.
> 
> I am a big believer in the Principle of Least Astonishment (POLA).  In
> this context POLA would seem to say that a problem in generating one
> VRP somewhere should not cause a wide swath of perfectly acceptable
> VRPs to disappear.
> 
> In addition, if "Filtering of Unsafe VRPs" were to go forward, I think
> that should happen only as a new standard that all RP implementations
> should be expected to adopt.  Many networks do run different flavors
> of RPs, and there's no value in one RP filtering away the dangers if
> the others do not do so as well.
> 
> 
> I would certainly like to hear more discussion of this topic.
> 
> Thanks!!
> 
> 					Jay B.
> 
> 
> On 11-September-2020, Martin Hoffmann via RPKI writes:
>> Dear mailing list,
>> 
>> we are currently working on improving the filtering of potentially
>> unsafe VRPs in Routinator. With ‘unsafe’ we mean VRPs whose presence
>> may accidentally make legitimate announcements RPKI invalid. Before we
>> decide on the concrete strategy to implement we would very much like to
>> hear feedback from users.
>> 
>> The reasoning on the strategy we are currently preferring started with
>> the observation that filtering individual ROAs may lead to legitimate
>> route announcements being dropped because ROAs only contain a single
>> originating AS. As a consequence, a prefix possibly being announced by
>> multiple ASs needs to be authorized via multiple ROAs. If the ROA for
>> the AS currently announcing the prefix gets dropped for whatever
>> reason, the prefix becomes invalid and its route gets dropped.
>> 
>> As an immediate measure, it was proposed to drop all objects published
>> by a CA that has any invalid objects whatsoever. This will make all
>> ROAs published by this CA disappear but also all child CAs. While this
>> will solve the above issue for most practical cases, there is still a
>> possibility that a prefix is delegated to multiple CAs which
>> independently publish ROAs authorising different ASs. If only the ROAs
>> published by one of those CAs is dropped, the result may again be a
>> falsely dropped route. The only way to avoid this is to drop all
>> authorisations (i.e., VRPs) for all address resources delegated to the
>> invalid CA.
>> 
>> However, this is still not quite enough. If the VRP for a legitimate
>> route is dropped because of above filtering and there is a less
>> specific VRP with a max-length not covering the route, then the route
>> suddenly becomes invalid too. To avoid that, we need to filter all VRPs
>> that overlap with any of the resources of invalid CAs.
>> 
>> We have implemented an initial version of exactly that: It creates a
>> list of all the resources from all the CAs that had to be dropped
>> because of issues. When creating the final set of VRPs, it filters out
>> all VRPs whose address prefixes overlap with any of these resources,
>> making these resources guaranteed to be RPKI unknown.
>> 
>> There is a new metric routinator_vrps_unsafe in Prometheus output or
>> unsafe-vrps in status output that counts the VRPs filtered that way. At
>> the time of writing, around 3300 VRPs or 1.9 % were filtered this way.
>> 
>> You can follow this work in a draft PR on Routinator’s Github
>> repository:
>> 
>>   https://github.com/NLnetLabs/routinator/pull/377
>> 
>> The question here is, whether this aggressive filtering will improve
>> the overall security of the system. This kind of filtering can -- at
>> least in theory -- be used by an attacker to actively push their target
>> space to RPKI unknown as an initial step of a route hijack. I think
>> this is relatively difficult to achieve and the risk subsequently is
>> very low, but I’d love to hear opinions -- particularly arguments
>> against this filtering strategy.
>> 
>> Kind regards,
>> Martin
>> -- 
>> RPKI mailing list
>> RPKI at lists.nlnetlabs.nl
>> https://lists.nlnetlabs.nl/mailman/listinfo/rpki