[RPKI] deep dive on manifest handling (Was: APNIC had an unexpected drop in VRP 00:00 - 02:00)
Chris Caputo
ccaputo at alt.net
Wed Dec 2 22:44:39 UTC 2020
Agreed - fine work and detail by Job!
At the SeattleIX, I'll be happy to test Routinator updates, since that is
part of our toolchain for our route servers.
Thanks,
Chris
On Wed, 2 Dec 2020, Tony Tauber via RPKI wrote:
> Thank you, Job for your excellent and detailed and constructive analysis!
>
> Now get some rest. :-)
>
> Definitely explains why we saw what we saw (Routinator affected, RIPE
> Validator not).
> At this point we are planning to pivot to FORT and perhaps rpki-client
> as well given the recent experiences.
>
> Thanks again!
>
>
> Tony
>
> On Wed, Dec 2, 2020 at 4:33 PM Job Snijders <job at ntt.net> wrote:
>
> > Hi all,
> >
> > First of all to be very clear: there was no 'APNIC outage', APNIC did
> > nothing wrong. This was a 'validator outage', and locally outages like
> > these can continue to be experienced at any future moment until fixed
> > versions are released and deployed. Note: network operators who run FORT
> > or OpenBSD rpki-client side-by-side with routinator/octorpki will have
> > seen a stable VRP merged set item count on their EBGP routers. In this
> > situation RPKI validator software diversity helped the Internet remain
> > more stable.
> >
> > APNIC staff are commendable for having seen an opportunity to implement
> > a workaround for this routinator 0.8.1 quirk, but APNIC is just one of
> > the tens of thousands of Certificate Authorities in the RPKI ecosystem.
> > In short: the observed state of December 1st, 2020 00:00 UTC is an
> > expected and normal state in the RPKI ecosystem.
> >
> > I appreciate George for reaching out to the community to draw more
> > attention to the situation, as it seems we can learn from exploring this
> > situation in great detail. For many in the community RPKI is a new
> > technology. Also it appears a similar issue exists in Cloudflare's
> > OctoRPKI, so I notified their developers too about the problem &
> > solution. Since there are implementations with a bug in the same
> > equivalence class, this case is best handed over to the IETF.
> >
> > While keeping in mind our human perception of the concept of time
> > generally is somewhat incompatible with how time works in the X.509 /
> > RPKI crypto world... here are my lengthy debug notes. :-)
> >
> > TL;DR: the VRP drop is an implementation issue in some RPKI validators,
> > and it can happen again.
> > Solution: wait for a fixed version, or run multiple different RPKI
> > validator implementations side by side.
> > There is a bit of time pressure: this bug potentially interacts
> > negatively with Juniper PR1483097.
> >
> > Every 20 minutes I copy all RPKI data from the Internet, run rpki-client
> > [1], and store the original RPKI data files, the program's execution
> > log, and the resulting VRP list as individual ZFS snapshots for
> > post-mortem analysis. A copy of my data can be downloaded: it is an
> > exact snapshot of all input data from that moment, to replay the event
> > in various implementations.
> > http://sobornost.net/~job/rpki-20201201-0001-adrian.sobornost.net.tar.gz
> >
> > Looking at the process' log of December 1st, 2020 run starting at
> > midnight for the string 'apnic':
> >
> > root@adrian:/tank/rpkirepositories/.zfs/snapshot/20201201-0001# fgrep apnic output/log
> > Dec 01 00:00:01 rpki-client: https://tal.apnic.net/apnic.cer: https schema ignored
> > Dec 01 00:00:01 rpki.apnic.net/repository: pulling from network
> > Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository: loaded from cache
> > Dec 01 00:00:03 rpki-client: rpki.apnic.net/member_repository: pulling from network
> > Dec 01 00:00:03 rpki-client: rpki.sub.apnic.net/repository: pulling from network
> > Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer: certificate has expired
> > Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/9lv88f3YSSS6iXQmzBvPX6hvnQM.cer: certificate has expired
> > Dec 01 00:00:03 rpki-client: rpki.rand.apnic.net/repo: pulling from network
> > Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/pBp2e-TKxusbiXQjNgwrQ1OsH_s.cer: certificate has expired
> > Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZnMLuaQLNc_lmxGF9iLb0JAMbZA.cer: certificate has expired
> > Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/yZYCtJIcaINWT0smUVwdY-TPNkQ.cer: certificate has expired
> > Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/WFBPIARWFTaBikTQvkFutQVej0g.cer: certificate has expired
> > Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/QmfPXQMASo_v3yE5XQ_oJFSLE8E.cer: certificate has expired
> > Dec 01 00:00:05 rpki-client: rpki.sub.apnic.net/repository: loaded from cache
> > Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/maB2Nu64AHCDMDGWpYxBvsxoj4A.cer: certificate has expired
> > Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/d0JlIBzwsNjMdvAm-Ir2i1XpkO4.cer: certificate has expired
> > Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B3A24F201D6611E28AC8837C72FD1FF2/0I2GgcK-TUfCopBV9m5olVhGF_c.cer: certificate has expired
> > Dec 01 00:00:06 rpki-client: rpki.rand.apnic.net/repo: loaded from cache
> > Dec 01 00:00:12 rpki-client: rpki.apnic.net/member_repository: loaded from cache
> >
> > (At the end of the process's run it had observed 62,154 VRPs under the
> > APNIC TAL. A CSV & JSON file of the validation process output with all
> > VRPs from that moment is also included in the tar.gz file.)
> >
> > In the above log we see that a number of certificates have expired.
> > According to Tom's message [2], these certificates represent APNIC
> > members whose membership has been closed (for example: companies going
> > out of business, or mergers & acquisitions). It is expected for
> > organizations issuing cryptographic products to tie business events to
> > validity periods in certificates.
> >
> > For the purpose of these notes I'll focus only on following the
> > validation process towards 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer' in a manual
> > fashion using command line utilities.
> >
> > After having pulled the RPKI data from the web (which operationally
> > speaking is a multi-hour end-to-end process to get the data from signer
> > to validator), a number of process steps have to be performed in order
> > to produce a list of Validated ROA Payloads (VRPs). None of these steps
> > can be skipped, and the order is important too.
> >
> > A single manifest file (https://tools.ietf.org/html/rfc6486) actually is
> > a bundle of a few things: a start & end date of the file listing, a list
> > of filenames and sha256 hashes, an EE certificate (which also has its
> > own embedded start & end date!), a serial number, and references to
> > other things such as which entity signed it.
> >
> > The first step is to figure out whether a given manifest file is 'valid'
> > (the signatures are right), 'current' (the timestamp on the validator's
> > wall clock is between the manifest's embedded start & end date AND
> > within the EE certificate's validity dates), and the 'latest' (should
> > the validator have to choose between two versions of the file, both
> > valid and current, pick the one with the highest manifest number).
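
The valid / current / latest selection described above can be sketched in Python. This is a minimal illustration, not code from any validator: the `Manifest` class, its field names, and `pick_manifest` are hypothetical, and the signature checks are assumed to have been performed elsewhere and reduced to a boolean.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Manifest:
    # Hypothetical container; field names are illustrative only.
    manifest_number: int
    this_update: datetime    # start of the eContent validity window
    next_update: datetime    # end of the eContent validity window
    ee_not_before: datetime  # embedded EE certificate validity start
    ee_not_after: datetime   # embedded EE certificate validity end
    signature_ok: bool       # outcome of signature checks done elsewhere


def is_current(m: Manifest, now: datetime) -> bool:
    """True if 'now' lies inside both the eContent and EE cert windows."""
    return (m.this_update <= now <= m.next_update
            and m.ee_not_before <= now <= m.ee_not_after)


def pick_manifest(candidates: List[Manifest], now: datetime) -> Optional[Manifest]:
    """Return the valid and current manifest with the highest number."""
    usable = [m for m in candidates if m.signature_ok and is_current(m, now)]
    return max(usable, key=lambda m: m.manifest_number, default=None)
```

Keeping the three questions separate (valid, current, latest) makes it explicit that nothing about the files listed on the manifest is consulted at this stage.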
> >
> > So at December 1st 00:00:03 UTC, the manifest's start & end date, and
> > the EE certificate's start and end date were:
> >
> > $ tar fxz rpki-20201201-0001-adrian.sobornost.net.tar.gz
> > $ cd 20201201-0001/data/rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2
> >
> > $ ls -lahtr DmWk9f02tb1o6zySNAiXjJB6p58.mft
> > -rw-r--r-- 1 job wheel 214K Nov 30 23:01 DmWk9f02tb1o6zySNAiXjJB6p58.mft
> >
> > This file's modification time (mtime) appears to be November 30th, 23:01.
> >
> > # check manifest's econtent start & end date
> > $ strings DmWk9f02tb1o6zySNAiXjJB6p58.mft | head -2
> > 20201130230107Z
> > 20201202230107Z
> >
> > December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd
> > 23:01:07: check!
> >
> > # check the manifest's embedded EE certificate start & end date:
> > $ test-mft -vp DmWk9f02tb1o6zySNAiXjJB6p58.mft | openssl x509 -text | grep -A2 Validity
> > Validity
> > Not Before: Nov 30 23:01:07 2020 GMT
> > Not After : Dec 2 23:01:07 2020 GMT
> >
> > December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd
> > 23:01:07: check!
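
The two date checks above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the helper names are made up for illustration, the timestamps are copied by hand from the command output above, and the validator clock is taken from the log line timestamp.

```python
from datetime import datetime, timezone

_MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_generalized_time(text: str) -> datetime:
    """Parse a timestamp such as '20201130230107Z' (as shown by strings)."""
    return datetime.strptime(text, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


def parse_openssl_time(text: str) -> datetime:
    """Parse a timestamp such as 'Dec 2 23:01:07 2020 GMT' (openssl x509 -text)."""
    month, day, clock, year, _zone = text.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), _MONTHS[month], int(day),
                    hour, minute, second, tzinfo=timezone.utc)


def within(window, moment: datetime) -> bool:
    """True if 'moment' is inside the inclusive (start, end) window."""
    start, end = window
    return start <= moment <= end


# Validator wall clock at the time of the log entry.
now = datetime(2020, 12, 1, 0, 0, 3, tzinfo=timezone.utc)

econtent_window = (parse_generalized_time("20201130230107Z"),
                   parse_generalized_time("20201202230107Z"))
ee_cert_window = (parse_openssl_time("Nov 30 23:01:07 2020 GMT"),
                  parse_openssl_time("Dec 2 23:01:07 2020 GMT"))

print("eContent window covers now:", within(econtent_window, now))
print("EE certificate covers now: ", within(ee_cert_window, now))
```

Both windows must cover the validator's clock; a manifest that passes only one of the two checks is not 'current'.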
> >
> > With the dates and signatures of the manifest file checking out as 'all
> > lights green', the next step is to process the manifest's file listing.
> > A manifest 'file listing' is checked through two steps:
> >
> > - is the listed file present?
> > - is the sha256 hash (in base64 format) listed on the manifest the
> > same as the sha256 hash computed by the validator using a copy of
> > the listed file?
> >
> > # looking at manifest file listing:
> > $ test-mft -v DmWk9f02tb1o6zySNAiXjJB6p58.mft | grep -A1 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
> > 95: ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
> > hash YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=
> >
> > # checking whether file is present:
> > $ ls -alhtr ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
> > -rw-r--r-- 1 job wheel 1.5K Nov 30 23:01 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
> >
> > # compute sha256 hash of the file
> > $ sha256 -b ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
> > SHA256 (ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer) = YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=
> >
> > Indeed, the 'YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=' hash computed
> > from the referenced certificate file is the same one as listed in the
> > manifest file (which we inspected with test-mft)! Note that at this
> > stage of the validation process the 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer'
> > file has not been processed in any way other than the equivalent of
> > that 'sha256' OpenBSD utility.
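
For readers without the OpenBSD 'sha256' utility at hand, the presence and hash check can be sketched in Python. The function names and the dictionary representation of the file listing are assumptions made for this illustration; the files are only read as raw bytes and never parsed, matching the point made above.

```python
import base64
import hashlib
from pathlib import Path
from typing import Dict


def sha256_b64(path) -> str:
    """Base64 of the SHA-256 digest of the raw file bytes (like 'sha256 -b')."""
    digest = hashlib.sha256(Path(path).read_bytes()).digest()
    return base64.b64encode(digest).decode("ascii")


def check_listing(directory, listing: Dict[str, str]) -> Dict[str, str]:
    """Map each listed filename to 'ok', 'missing', or 'hash mismatch'."""
    results = {}
    for name, expected_hash in listing.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_b64(path) != expected_hash:
            results[name] = "hash mismatch"
        else:
            results[name] = "ok"
    return results


if __name__ == "__main__":
    # Entry 95 of the manifest, as printed by test-mft above.
    listing = {
        "ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer":
            "YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=",
    }
    print(check_listing(".", listing))
```

Run from inside the extracted publication point directory, the last lines report the status of the one entry inspected by hand above.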
> >
> > These 'jumps' from certificate to manifest to certificate using hashes &
> > signatures serve multiple purposes. First, by confirming that a hash
> > matches, the validator does not (yet) need to attempt any file content
> > parsing (which would be a potentially sensitive computing operation on
> > what is, at that point in time, an unknown and potentially dangerous
> > file). Second, by checking the presence and hash of each file, the
> > publication point's completeness and integrity are confirmed. Missing
> > .roa files can result in network outages [3].
> >
> > At this point the manifest file has been completely processed, and the
> > next step in the validation process can commence. Each and every
> > referenced file is opened by the validator, embedded certificates and
> > signatures are verified, and then the file contents are processed
> > (these could be manifests, certificates, CRLs, or ROA files).
> >
> > Let's inspect ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer:
> >
> > $ openssl x509 -in ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer -inform DER -text | grep -A2 Validity
> > Validity
> > Not Before: Oct 23 10:14:32 2019 GMT
> > Not After : Dec 1 00:00:00 2020 GMT
> >
> > As the validator's wall clock was December 1st 00:00:03, we can see that
> > ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer expired '3 seconds ago'. Note that we
> > observed earlier that the timestamp on the manifest file which
> > references this .cer file was November 30th; at that time this
> > ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer certificate was valid, present, and
> > current!
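
The timeline can be checked with simple date arithmetic. In this sketch the three timestamps are copied from the outputs shown earlier in this message (the manifest's start date, the certificate's Not After, and the log line time); the variable names are illustrative.

```python
from datetime import datetime, timezone

UTC = timezone.utc

manifest_issued = datetime(2020, 11, 30, 23, 1, 7, tzinfo=UTC)  # manifest start date
cert_not_after = datetime(2020, 12, 1, 0, 0, 0, tzinfo=UTC)     # .cer Not After
validator_clock = datetime(2020, 12, 1, 0, 0, 3, tzinfo=UTC)    # time of the log line


def is_expired(not_after: datetime, moment: datetime) -> bool:
    """A certificate is expired once the clock has passed its Not After."""
    return moment > not_after


# How much validity the certificate had left when the manifest was issued,
# and how long it had been expired when the validator looked at it.
validity_left_at_issuance = cert_not_after - manifest_issued
expired_for = validator_clock - cert_not_after

print("expired when manifest was issued:", is_expired(cert_not_after, manifest_issued))
print("expired at validation time:      ", is_expired(cert_not_after, validator_clock))
print("validity left at issuance:       ", validity_left_at_issuance)
print("expired for:                     ", expired_for)
```

The certificate still had almost an hour of validity left when the manifest listing it was issued, so the manifest was accurate at the moment of signing.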
> >
> > One could say that ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer is a child of
> > DmWk9f02tb1o6zySNAiXjJB6p58.mft. The ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
> > file might not even be under control of the entity which generated
> > DmWk9f02tb1o6zySNAiXjJB6p58.mft. A child's expiry does not result in the
> > death of the parent. If a validator considers all referenced files on a
> > manifest to be invalid, solely because *upon further inspection* a file
> > contained an expired EE certificate, I'd say it is an
> > 'overreaction', a simple software defect. After all, there was a valid
> > current manifest which listed a hash and that hash matched the file, so
> > the file became eligible for X.509 certificate validation in the first
> > place!
> >
> > It appears that Routinator conflates two distinct steps in the
> > validation process:
> >
> > step 1) checking the validity of an RPKI manifest
> > step 2) checking the validity of a file referenced from the manifest
> > validated in step 1
> >
> > A valid manifest referencing a (now expired) certificate is a legitimate
> > state of being. What is not valid is for the manifest listing itself to
> > be expired, or the manifest's EE certificate to be expired, or its CRL
> > to be expired, or its parent certificate to be expired, or for any files
> > listed on the manifest to be missing, or for any sha256 hashes to be
> > different than listed on the manifest. Phew.... that's a mouthful of
> > conditions! We're gonna have to work in IETF to capture this in simpler
> > English.
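
The separation of the two steps can be sketched as follows. This is a simplified model for illustration, not Routinator, OctoRPKI, or rpki-client code: the classes and field names are hypothetical, the manifest-level conditions listed above are reduced to booleans, and step 2 is narrowed to an expiry check only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class ListedObject:
    name: str
    present: bool        # file exists in the publication point
    hash_matches: bool   # sha256 equals the hash listed on the manifest
    not_after: datetime  # expiry of the object's own certificate


@dataclass
class PublicationPoint:
    manifest_current: bool     # manifest listing not expired
    manifest_ee_current: bool  # manifest's EE certificate not expired
    crl_current: bool          # CRL not expired
    parent_cert_current: bool  # parent certificate not expired
    objects: List[ListedObject] = field(default_factory=list)


def manifest_is_valid(pp: PublicationPoint) -> bool:
    """Step 1: conditions that concern the manifest as a whole."""
    return (pp.manifest_current and pp.manifest_ee_current
            and pp.crl_current and pp.parent_cert_current
            and all(o.present and o.hash_matches for o in pp.objects))


def validate(pp: PublicationPoint, now: datetime) -> Tuple[List[str], List[str]]:
    """Return (accepted, rejected) object names."""
    if not manifest_is_valid(pp):
        return [], [o.name for o in pp.objects]
    accepted, rejected = [], []
    for obj in pp.objects:
        # Step 2: each object is judged on its own; an expired object is
        # dropped without touching its siblings or the manifest.
        (accepted if now <= obj.not_after else rejected).append(obj.name)
    return accepted, rejected


if __name__ == "__main__":
    utc = timezone.utc
    now = datetime(2020, 12, 1, 0, 0, 3, tzinfo=utc)
    pp = PublicationPoint(True, True, True, True, [
        ListedObject("ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer", True, True,
                     datetime(2020, 12, 1, 0, 0, 0, tzinfo=utc)),
        # Hypothetical sibling object that is still within its validity.
        ListedObject("example-sibling.cer", True, True,
                     datetime(2021, 6, 1, 0, 0, 0, tzinfo=utc)),
    ])
    print(validate(pp, now))
```

In this model the expired certificate ends up in the rejected list while its sibling stays accepted; only a failure in step 1 discards the whole listing.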
> >
> > Conclusion
> > ==========
> >
> > I'm not saying validators should accept expired data, they shouldn't!
> > But it is *expected* that Certificate Authorities (like LIRs, NIRs, or
> > even RIRs) set the expiration dates on cryptographic objects to be
> > aligned with the reality of business contracts. This is a *critical*
> > feature of the RPKI and makes RPKI superior to IRR data: finally there
> > /are/ expiration dates on the equivalent of 'route:' objects.
> >
> > A repeat of the 'December 1st' VRP drop situation can come into
> > existence at any future moment under any Trust Anchor, under any
> > Certificate Authority. Simply put: networks solely relying on current
> > versions of octorpki or routinator are somewhat at risk when billing
> > cycles end. Also, I do not recommend downgrading to older versions
> > because of https://www.nlnetlabs.nl/projects/rpki/security-advisories/
> > (which perversely is a bug that *is not* resolved with rpki software
> > diversity).
> >
> > I suspect it is OK for network operators to choose to sit this one out
> > and just wait for a fixed version, provided it can be released in a
> > matter of weeks. Because of Juniper PR1483097 (which probably still
> > affects many currently deployed internet routers) the complete
> > disappearance of VRPs can negatively impact internet traffic forwarding
> > in the default-free zone, but as mentioned before, impact is avoided
> > through multi-instance validator deployment combined with validator
> > software diversity.
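
Why validator diversity helps can be illustrated with a small sketch. It assumes the router treats the VRPs received from several validator instances as a union, which is what the 'VRP merged set' wording earlier in this message suggests; the `VRP` tuple, the function name, and the documentation prefixes and AS numbers are all made up for the example.

```python
from typing import Iterable, NamedTuple, Set


class VRP(NamedTuple):
    prefix: str
    max_length: int
    asn: int


def merged_vrps(*validator_outputs: Iterable[VRP]) -> Set[VRP]:
    """Union of the VRP sets produced by independent validator instances."""
    merged: Set[VRP] = set()
    for output in validator_outputs:
        merged.update(output)
    return merged


if __name__ == "__main__":
    a = VRP("192.0.2.0/24", 24, 64496)
    b = VRP("198.51.100.0/24", 24, 64497)

    affected_validator = {a}       # dropped 'b' because of the defect
    unaffected_validator = {a, b}  # different implementation, kept both

    merged = merged_vrps(affected_validator, unaffected_validator)
    print(len(merged), "VRPs in the merged set")
```

As long as one implementation keeps producing the full set, the merged view on the router does not shrink when another implementation drops VRPs.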
> >
> > There is a silver lining in all this: the most likely next occurrence
> > of this type of situation is January 1st, 2021, as then all kinds of
> > LIR, NIR, or RIR business contracts are likely to start or stop. This
> > gives NLnet Labs and Cloudflare almost a full month to figure out a fix,
> > release it, and for operators to deploy it in their networks during the
> > holidays. The perfect excuse to escape any unwanted Christmas dinner. ;-)
> >
> > I propose some of us continue the discussion at sidrops at ietf.org,
> > where through wordsmithing in the draft-ietf-sidrops-6486bis effort we
> > can keep future RPKI implementers from walking into the same problem.
> >
> > Kind regards,
> >
> > Job
> >
> > [1]: https://pkgs.org/search/?q=rpki-client
> > [2]: https://lists.nlnetlabs.nl/pipermail/rpki/2020-December/000238.html
> > [3]: https://blog.apnic.net/2020/11/10/rpki-manifests-securely-declare-contents/
> >
>