[RPKI] deep dive on manifest handling (Was: APNIC had an unexpected drop in VRP 00:00 - 02:00)
Job Snijders
job at ntt.net
Wed Dec 2 21:33:01 UTC 2020
Hi all,
First of all, to be very clear: there was no 'APNIC outage'; APNIC did
nothing wrong. This was a 'validator outage', and outages like this can
recur locally at any moment until fixed versions are released and
deployed. Note: network operators who run FORT
or OpenBSD rpki-client side-by-side with routinator/octorpki will have
seen a stable VRP merged set item count on their EBGP routers. In this
situation RPKI validator software diversity helped the Internet remain
more stable.
APNIC staff are commendable for having seen an opportunity to implement
a workaround for this routinator 0.8.1 quirk, but APNIC is just one of
the tens of thousands of Certificate Authorities in the RPKI ecosystem.
In short: the observed state of December 1st, 2020 00:00 UTC is an
expected and normal state in the RPKI ecosystem.
I appreciate George for reaching out to the community to draw more
attention to the situation, as it seems we can learn from exploring this
situation in great detail. For many in the community RPKI is a new
technology. Also it appears a similar issue exists in Cloudflare's
OctoRPKI, so I notified their developers too about the problem &
solution. Since there are implementations with a bug in the same
equivalence class, this case is best handed over to the IETF.
While keeping in mind our human perception of the concept of time
generally is somewhat incompatible with how time works in the X.509 /
RPKI crypto world... here are my lengthy debug notes. :-)
TL;DR:
- The VRP drop is an implementation issue in some RPKI validators and
  can happen again.
- Solution: wait for a fixed version, or run multiple different RPKI
  validator implementations side by side.
- There is a bit of time pressure: this bug potentially interacts
  negatively with Juniper PR1483097.
Every 20 minutes I copy all RPKI data from the Internet, run rpki-client
[1], and store the original RPKI data files, the program's execution
log, and the resulting VRP list as individual ZFS snapshots for
post-mortem analysis. A copy of my data can be downloaded: it is an
exact snapshot of all input data from that moment, to replay the event
in various implementations.
http://sobornost.net/~job/rpki-20201201-0001-adrian.sobornost.net.tar.gz
Searching the log of the run that started at midnight on December 1st,
2020 for the string 'apnic':
root@adrian:/tank/rpkirepositories/.zfs/snapshot/20201201-0001# fgrep apnic output/log
Dec 01 00:00:01 rpki-client: https://tal.apnic.net/apnic.cer: https schema ignored
Dec 01 00:00:01 rpki.apnic.net/repository: pulling from network
Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository: loaded from cache
Dec 01 00:00:03 rpki-client: rpki.apnic.net/member_repository: pulling from network
Dec 01 00:00:03 rpki-client: rpki.sub.apnic.net/repository: pulling from network
Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer: certificate has expired
Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/9lv88f3YSSS6iXQmzBvPX6hvnQM.cer: certificate has expired
Dec 01 00:00:03 rpki-client: rpki.rand.apnic.net/repo: pulling from network
Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/pBp2e-TKxusbiXQjNgwrQ1OsH_s.cer: certificate has expired
Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZnMLuaQLNc_lmxGF9iLb0JAMbZA.cer: certificate has expired
Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/yZYCtJIcaINWT0smUVwdY-TPNkQ.cer: certificate has expired
Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/WFBPIARWFTaBikTQvkFutQVej0g.cer: certificate has expired
Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/QmfPXQMASo_v3yE5XQ_oJFSLE8E.cer: certificate has expired
Dec 01 00:00:05 rpki-client: rpki.sub.apnic.net/repository: loaded from cache
Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/maB2Nu64AHCDMDGWpYxBvsxoj4A.cer: certificate has expired
Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/d0JlIBzwsNjMdvAm-Ir2i1XpkO4.cer: certificate has expired
Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B3A24F201D6611E28AC8837C72FD1FF2/0I2GgcK-TUfCopBV9m5olVhGF_c.cer: certificate has expired
Dec 01 00:00:06 rpki-client: rpki.rand.apnic.net/repo: loaded from cache
Dec 01 00:00:12 rpki-client: rpki.apnic.net/member_repository: loaded from cache
(At the end of the process's run it had observed 62,154 VRPs under the
APNIC TAL. A CSV & JSON file of the validation process output with all
VRPs from that moment is also included in the tar.gz file.)
In the above log we see that a number of certificates have expired.
According to Tom's message [2], these certificates represent APNIC
members whose membership has been closed (for example: companies going
out of business, or mergers & acquisitions). It is expected for
organizations issuing cryptographic products to tie business events to
validity periods in certificates.
For the purpose of these notes I'll focus only on following the
validation process towards 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer' in a manual
fashion using command line utilities.
After having pulled the RPKI data from the web (which, operationally
speaking, is a multi-hour end-to-end process to get the data from signer
to validator), a number of processing steps have to be performed in
order to produce a list of Validated ROA Payloads (VRPs). None of these
steps can be skipped, and the order is important too.
A single manifest file (https://tools.ietf.org/html/rfc6486) actually is
a bundle of a few things: a start & end date of the file listing, a list
of filenames and sha256 hashes, an EE certificate (which also has its
own embedded start & end date!), a serial number, and references to
other things such as which entity signed it.
The first step is to figure out whether a given manifest file is 'valid'
(are the signatures right?), 'current' (the timestamp on the validator's
wall clock is between the manifest's embedded start & end dates AND
within the EE certificate's validity dates), and the 'latest' (should
the validator have to choose between two versions of the file, both
valid and current, it picks the one with the highest manifest number).
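The three checks above can be sketched in Python. This is only an
illustration under stated assumptions: the ManifestCandidate class and
its field names are hypothetical, the signature_ok flag stands in for
real CMS/X.509 signature verification, and 'highest number' is read as
the manifest number from RFC 6486.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class ManifestCandidate:
    """Hypothetical, simplified view of one version of a manifest."""
    manifest_number: int
    this_update: datetime    # start of the file listing's validity window
    next_update: datetime    # end of the file listing's validity window
    ee_not_before: datetime  # EE certificate validity start
    ee_not_after: datetime   # EE certificate validity end
    signature_ok: bool       # assumed result of a real signature check


def is_current(m: ManifestCandidate, now: datetime) -> bool:
    """True when 'now' falls inside BOTH validity windows."""
    return (m.this_update <= now <= m.next_update
            and m.ee_not_before <= now <= m.ee_not_after)


def pick_manifest(candidates: Iterable[ManifestCandidate],
                  now: datetime) -> Optional[ManifestCandidate]:
    """Return the valid, current candidate with the highest number."""
    usable = [m for m in candidates
              if m.signature_ok and is_current(m, now)]
    return max(usable, key=lambda m: m.manifest_number, default=None)
```

If no candidate passes both checks, pick_manifest() returns None, which
in this sketch means there is no usable manifest for the publication
point.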
So at December 1st 00:00:03 UTC, the manifest's start & end date, and
the EE certificate's start and end date were:
$ tar fxz rpki-20201201-0001-adrian.sobornost.net.tar.gz
$ cd 20201201-0001/data/rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2
$ ls -lahtr DmWk9f02tb1o6zySNAiXjJB6p58.mft
-rw-r--r-- 1 job wheel 214K Nov 30 23:01 DmWk9f02tb1o6zySNAiXjJB6p58.mft
This file's modification time appears to be November 30th, 23:01
# check manifest's econtent start & end date
$ strings DmWk9f02tb1o6zySNAiXjJB6p58.mft | head -2
20201130230107Z
20201202230107Z
December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd
23:01:07: check!
# check the manifest's embedded EE certificate start & end date:
$ test-mft -vp DmWk9f02tb1o6zySNAiXjJB6p58.mft | openssl x509 -text | grep -A2 Validity
Validity
Not Before: Nov 30 23:01:07 2020 GMT
Not After : Dec 2 23:01:07 2020 GMT
December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd
23:01:07: check!
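The two date comparisons above can be replayed in Python. A minimal
sketch, assuming the timestamps are copied by hand from the 'strings'
and 'openssl' output shown above; the helper names are made up for
illustration, and parsing the month name with %b assumes an English (C)
locale.

```python
from datetime import datetime, timezone

UTC = timezone.utc


def parse_generalized_time(text: str) -> datetime:
    """Parse an ASN.1 GeneralizedTime such as '20201130230107Z'."""
    return datetime.strptime(text, "%Y%m%d%H%M%SZ").replace(tzinfo=UTC)


def parse_openssl_time(text: str) -> datetime:
    """Parse an 'openssl x509 -text' date, e.g. 'Dec 2 23:01:07 2020 GMT'."""
    return datetime.strptime(text, "%b %d %H:%M:%S %Y GMT").replace(tzinfo=UTC)


def in_window(now: datetime, start: datetime, end: datetime) -> bool:
    """True when 'now' lies between start and end (inclusive)."""
    return start <= now <= end


# The validator's wall clock when it reached this publication point.
wall_clock = datetime(2020, 12, 1, 0, 0, 3, tzinfo=UTC)

listing_ok = in_window(wall_clock,
                       parse_generalized_time("20201130230107Z"),
                       parse_generalized_time("20201202230107Z"))
ee_cert_ok = in_window(wall_clock,
                       parse_openssl_time("Nov 30 23:01:07 2020 GMT"),
                       parse_openssl_time("Dec 2 23:01:07 2020 GMT"))
print(listing_ok, ee_cert_ok)
```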
With the dates and signatures of the manifest file checking out as 'all
lights green', the next step is to process the manifest's file listing.
A manifest 'file listing' is checked through two steps:
- is the listed file present?
- is the sha256 hash (in base64 format) listed on the manifest the
same as the sha256 hash computed by the validator using a copy of
the listed file?
# looking at manifest file listing:
$ test-mft -v DmWk9f02tb1o6zySNAiXjJB6p58.mft | grep -A1 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
95: ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
hash YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=
# checking whether file is present:
$ ls -alhtr ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
-rw-r--r-- 1 job wheel 1.5K Nov 30 23:01 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
# compute sha256 hash of the file
$ sha256 -b ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
SHA256 (ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer) = YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=
Indeed, the 'YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=' hash computed
from the referenced certificate file is the same one as listed in the
manifest file (which we inspected with test-mft)! Note that at this
stage of the validation process the 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer'
file has not been processed in any way other than the equivalent
of that OpenBSD 'sha256' utility.
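For readers without the OpenBSD utility at hand, the presence and hash
check for a single listing entry can be sketched in Python. The function
names are hypothetical; the only assumption is that the listed hash is
given as base64 text, the way test-mft displays it above.

```python
import base64
import hashlib
from pathlib import Path


def sha256_base64(path: Path) -> str:
    """Base64-encoded SHA-256 of a file, comparable to 'sha256 -b'."""
    digest = hashlib.sha256(path.read_bytes()).digest()
    return base64.b64encode(digest).decode("ascii")


def check_listing_entry(directory: Path, filename: str,
                        listed_hash: str) -> str:
    """Check one manifest entry without parsing the file's contents."""
    path = directory / filename
    if not path.is_file():
        return "missing"
    if sha256_base64(path) != listed_hash:
        return "hash mismatch"
    return "ok"
```

Given the hash shown above, calling check_listing_entry() from inside
the snapshot directory with 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer' and
'YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=' should report "ok". Note
that the file is only read as opaque bytes here, never parsed.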
These 'jumps' from certificate to manifest to certificate using hashes &
signatures serve multiple purposes. First, by confirming that a hash
matches, the validator does not (yet) need to attempt any file content
parsing (which would be a potentially sensitive computing operation on
what is, at that point in time, an unknown and potentially dangerous
file). Second, by checking the presence and hash of each file, the
publication point's completeness and integrity are confirmed. Missing
.roa files can result in network outages [3].
At this point the manifest file has been completely processed, and the
next step in the validation process can commence. Each and every
referenced file is opened by the validator, embedded certificates and
signatures are verified, and then the file contents are processed in
turn (these could be manifests, certificates, CRLs, or ROA files).
Let's inspect ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer:
$ openssl x509 -in ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer -inform DER -text | grep -A2 Validity
Validity
Not Before: Oct 23 10:14:32 2019 GMT
Not After : Dec 1 00:00:00 2020 GMT
As the validator's wall clock was December 1st 00:00:03, we can see that
ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer expired '3 seconds ago'. Note that, as
we observed earlier, the manifest file which references this .cer file
was created on November 30th; at that time this
ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer certificate was valid, present, and
current!
One could say that ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer is a child of
DmWk9f02tb1o6zySNAiXjJB6p58.mft. The ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
file might not even be under control of the entity which generated
DmWk9f02tb1o6zySNAiXjJB6p58.mft. A child's expiry does not result in the
death of the parent. If a validator considers all referenced files on a
manifest to be invalid, solely because *upon further inspection* a file
contained an expired EE certificate, I'd say it is an
'overreaction', a simple software defect. After all, there was a valid
current manifest which listed a hash and that hash matched the file, so
the file became eligible for X.509 certificate validation in the first place!
It appears that Routinator conflates two distinct steps in the
validation process:
step 1) checking the validity of a RPKI manifest
step 2) checking the validity of a file referenced from the manifest validated in step 1
A valid manifest referencing a (now expired) certificate is a legitimate
state of being. What is not valid is for the manifest listing itself to
be expired, or the manifest's EE certificate to be expired, or its CRL
to be expired, or its parent certificate to be expired, or for any files
listed on the manifest to be missing, or for any sha256 hashes to be
different than listed on the manifest. Phew.... that's a mouthful of
conditions! We're gonna have to work in the IETF to capture this in
simpler English.
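To make the separation between the two steps concrete, here is a
deliberately simplified Python sketch. It is not taken from any
validator: the ListedFile class, the function name, and the
'sibling.roa' entry are all hypothetical, and the single
manifest_current flag stands in for the manifest listing, its EE
certificate, its CRL, and its parent certificate all being current.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple

UTC = timezone.utc


@dataclass(frozen=True)
class ListedFile:
    """Hypothetical summary of one file listed on a manifest."""
    name: str
    present: bool
    hash_matches: bool
    not_after: datetime  # expiry of the certificate inside the file


def validate_publication_point(
        manifest_current: bool, files: List[ListedFile],
        now: datetime) -> Tuple[bool, Dict[str, str]]:
    # Step 1: manifest-level checks. A failure here rejects everything.
    if not manifest_current:
        return False, {}
    if any(not f.present or not f.hash_matches for f in files):
        return False, {}
    # Step 2: per-file checks. A failure here rejects only that file.
    verdicts = {f.name: "expired" if now > f.not_after else "accepted"
                for f in files}
    return True, verdicts


now = datetime(2020, 12, 1, 0, 0, 3, tzinfo=UTC)
files = [
    ListedFile("ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer", True, True,
               datetime(2020, 12, 1, 0, 0, 0, tzinfo=UTC)),
    # Hypothetical sibling object with a later expiry date.
    ListedFile("sibling.roa", True, True,
               datetime(2021, 6, 1, 0, 0, 0, tzinfo=UTC)),
]
manifest_ok, verdicts = validate_publication_point(True, files, now)
print(manifest_ok, verdicts)
```

In this sketch the expired .cer is dropped while the manifest and the
sibling survive. A real validator performs many more checks per file
(signatures, CRLs, resource containment) which are left out here.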
Conclusion
==========
I'm not saying validators should accept expired data, they shouldn't!
But it is *expected* that Certificate Authorities (like LIRs, NIRs, or
even RIRs) set the expiration dates on cryptographic objects to be
aligned with the reality of business contracts. This is a *critical*
feature of the RPKI and makes RPKI superior to IRR data: finally there
/are/ expiration dates on the equivalent of 'route:' objects.
A repeat of the 'December 1st' VRP drop situation can come into
existence at any future moment under any Trust Anchor, under any
Certificate Authority. Simply put: networks solely relying on current
versions of octorpki or routinator are somewhat at risk when billing
cycles end. Also, I do not recommend downgrading to older versions
because of https://www.nlnetlabs.nl/projects/rpki/security-advisories/
(which perversely is a bug that *is not* resolved with rpki software
diversity).
I suspect it is OK for network operators to choose to sit this one out
and just wait for a fixed version, provided it can be released in a
matter of weeks. Because of Juniper PR1483097 (which probably still
affects many currently deployed internet routers) the complete
disappearance of VRPs can negatively impact internet traffic forwarding
in the default-free zone, but as mentioned before, impact is avoided
through multi-instance validator deployment combined with validator
software diversity.
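Why diversity helps can be shown with a toy Python example. It assumes
that the router takes the union of the VRPs it receives from all of its
validators (which is how the 'merged set' mentioned at the top of this
message reads); the prefixes and AS numbers are documentation values,
not real data.

```python
from typing import FrozenSet, Iterable, Tuple

# A VRP reduced to (prefix, max length, origin AS) for illustration.
Vrp = Tuple[str, int, int]


def merged_vrps(*validator_outputs: Iterable[Vrp]) -> FrozenSet[Vrp]:
    """Union of the VRP sets produced by several validators."""
    merged = set()
    for output in validator_outputs:
        merged.update(output)
    return frozenset(merged)


# Validator A hits the bug and loses one VRP; validator B does not.
validator_a = {("192.0.2.0/24", 24, 64496)}
validator_b = {("192.0.2.0/24", 24, 64496),
               ("198.51.100.0/24", 24, 64497)}

print(len(merged_vrps(validator_a, validator_b)))
```

The merged count stays at what the unaffected validator reports. The
flip side of a union is that an erroneous VRP from any single validator
is also included, so this is a trade-off rather than a free lunch.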
There is a silver lining in all this: the most likely next occurrence
of this type of situation is January 1st, 2021, as then all kinds of
LIR, NIR, or RIR business contracts are likely to start or stop. This
gives NLnet Labs and Cloudflare almost a full month to figure out a fix,
release it, and for operators to deploy it in their networks during the
holidays. The perfect excuse to escape any unwanted Christmas dinner. ;-)
I propose some of us continue the discussion at sidrops at ietf.org,
where through wordsmithing in the draft-ietf-sidrops-6486bis effort we
can help keep any future RPKI implementers from walking into the same
problem.
Kind regards,
Job
[1]: https://pkgs.org/search/?q=rpki-client
[2]: https://lists.nlnetlabs.nl/pipermail/rpki/2020-December/000238.html
[3]: https://blog.apnic.net/2020/11/10/rpki-manifests-securely-declare-contents/