[RPKI] deep dive on manifest handling (Was: APNIC had an unexpected drop in VRP 00:00 - 02:00)

Tony Tauber ttauber at 1-4-5.net
Wed Dec 2 22:32:43 UTC 2020


Thank you, Job, for your excellent, detailed, and constructive analysis!

Now get some rest.  :-)

Definitely explains why we saw what we saw (Routinator affected, RIPE
Validator not).
At this point we are pivoting to FORT, and perhaps rpki-client as
well, given the recent experiences.

Thanks again!


Tony

On Wed, Dec 2, 2020 at 4:33 PM Job Snijders <job at ntt.net> wrote:

> Hi all,
>
> First of all to be very clear: there was no 'APNIC outage', APNIC did
> nothing wrong. This was a 'validator outage', and local outages like
> this can recur at any moment until fixed versions are released and
> deployed. Note: network operators who run FORT
> or OpenBSD rpki-client side-by-side with routinator/octorpki will have
> seen a stable VRP merged set item count on their EBGP routers. In this
> situation RPKI validator software diversity helped the Internet remain
> more stable.
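A minimal sketch of why the merged VRP count stays stable, assuming the router takes the union of the VRP sets it receives from each validator. The function name `merged_vrps` and the prefixes and AS numbers below are hypothetical illustration values, not data from the incident.

```python
def merged_vrps(*validator_outputs):
    """Union of the VRP sets delivered by several validators.

    Assumption: the router merges what each validator session provides.
    Each VRP is modelled as a (prefix, max_length, origin_asn) tuple.
    """
    merged = set()
    for vrps in validator_outputs:
        merged |= set(vrps)
    return merged


# Hypothetical example: one validator dropped a publication point,
# the other kept it.
affected_validator = {("192.0.2.0/24", 24, 64496)}
unaffected_validator = {
    ("192.0.2.0/24", 24, 64496),
    ("198.51.100.0/24", 24, 64497),
}

combined = merged_vrps(affected_validator, unaffected_validator)
print(len(combined))  # prints 2
```

As long as at least one validator still produces a VRP, the union keeps it, which is the effect described above for side-by-side deployments.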
>
> APNIC staff deserve credit for spotting an opportunity to implement
> a workaround for this routinator 0.8.1 quirk, but APNIC is just one of
> the tens of thousands of Certificate Authorities in the RPKI ecosystem.
> In short: the observed state of December 1st, 2020 00:00 UTC is an
> expected and normal state in the RPKI ecosystem.
>
> I appreciate George for reaching out to the community to draw more
> attention to the situation, as it seems we can learn from exploring this
> situation in great detail. For many in the community RPKI is a new
> technology. Also it appears a similar issue exists in Cloudflare's
> OctoRPKI, so I notified their developers too about the problem &
> solution. Since multiple implementations have a bug in the same
> equivalence class, this case is best handed over to the IETF.
>
> While keeping in mind our human perception of the concept of time
> generally is somewhat incompatible with how time works in the X.509 /
> RPKI crypto world... here are my lengthy debug notes. :-)
>
> TL;DR: the VRP drop is an implementation issue in some RPKI validators
> and can happen again.
> Solution: wait for a fixed version, or run multiple different RPKI
> validator implementations side by side.
> There is a bit of time pressure: this bug potentially interacts
> negatively with Juniper PR1483097.
>
> Every 20 minutes I copy all RPKI data from the Internet, run rpki-client
> [1], and store the original RPKI data files, the program's execution
> log, and the resulting VRP list as individual ZFS snapshots for
> post-mortem analysis. A copy of my data can be downloaded: it is an
> exact snapshot of all input data from that moment, to replay the event
> in various implementations.
> http://sobornost.net/~job/rpki-20201201-0001-adrian.sobornost.net.tar.gz
>
> Searching the log of the run that started at midnight on December 1st,
> 2020 for the string 'apnic':
>
>     root@adrian:/tank/rpkirepositories/.zfs/snapshot/20201201-0001# fgrep apnic output/log
>     Dec 01 00:00:01 rpki-client: https://tal.apnic.net/apnic.cer: https schema ignored
>     Dec 01 00:00:01 rpki.apnic.net/repository: pulling from network
>     Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository: loaded from cache
>     Dec 01 00:00:03 rpki-client: rpki.apnic.net/member_repository: pulling from network
>     Dec 01 00:00:03 rpki-client: rpki.sub.apnic.net/repository: pulling from network
>     Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer: certificate has expired
>     Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/9lv88f3YSSS6iXQmzBvPX6hvnQM.cer: certificate has expired
>     Dec 01 00:00:03 rpki-client: rpki.rand.apnic.net/repo: pulling from network
>     Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/pBp2e-TKxusbiXQjNgwrQ1OsH_s.cer: certificate has expired
>     Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZnMLuaQLNc_lmxGF9iLb0JAMbZA.cer: certificate has expired
>     Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/yZYCtJIcaINWT0smUVwdY-TPNkQ.cer: certificate has expired
>     Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/WFBPIARWFTaBikTQvkFutQVej0g.cer: certificate has expired
>     Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/QmfPXQMASo_v3yE5XQ_oJFSLE8E.cer: certificate has expired
>     Dec 01 00:00:05 rpki-client: rpki.sub.apnic.net/repository: loaded from cache
>     Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/maB2Nu64AHCDMDGWpYxBvsxoj4A.cer: certificate has expired
>     Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/d0JlIBzwsNjMdvAm-Ir2i1XpkO4.cer: certificate has expired
>     Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B3A24F201D6611E28AC8837C72FD1FF2/0I2GgcK-TUfCopBV9m5olVhGF_c.cer: certificate has expired
>     Dec 01 00:00:06 rpki-client: rpki.rand.apnic.net/repo: loaded from cache
>     Dec 01 00:00:12 rpki-client: rpki.apnic.net/member_repository: loaded from cache
>
> (At the end of the process's run it had observed 62,154 VRPs under the
> APNIC TAL. A CSV & JSON file of the validation process output with all
> VRPs from that moment is also included in the tar.gz file.)
>
> In the above log we see that a number of certificates have expired.
> According to Tom's message [2], these certificates represent APNIC
> members whose membership has been closed (for example: companies going
> out of business, or mergers & acquisitions). It is expected for
> organizations issuing cryptographic products to tie business events to
> validity periods in certificates.
>
> For the purpose of these notes I'll focus only on following the
> validation process towards 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer' in a manual
> fashion using command line utilities.
>
> After having pulled the RPKI data from the web (which operationally speaking
> end-to-end is a multi-hour process to get the data from signer to
> validator), a number of process steps have to be performed in order to
> produce a list of Validated ROA Payloads (VRPs). None of these steps can
> be skipped, and the order is important too.
>
> A single manifest file (https://tools.ietf.org/html/rfc6486) actually is
> a bundle of a few things: a start & end date of the file listing, a list
> of filenames and sha256 hashes, an EE certificate (which also has its
> own embedded start & end date!), a serial number, and references to
> other things such as which entity signed it.
>
> The first step is to figure out whether a given manifest file is 'valid'
> (are the signatures right), 'current' (the timestamp on the
> validator's wall clock is between the manifest's embedded start &
> end date AND within the EE certificate's validity dates), and the
> 'latest' (should the validator have to choose between two versions of
> the file, both valid and current, pick the one with the highest
> manifest number).
>
> So at December 1st 00:00:03 UTC, the manifest's start & end date, and
> the EE certificate's start and end date were:
>
>     $ tar fxz rpki-20201201-0001-adrian.sobornost.net.tar.gz
>     $ cd 20201201-0001/data/rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2
>
>     $ ls -lahtr DmWk9f02tb1o6zySNAiXjJB6p58.mft
>     -rw-r--r--  1 job  wheel   214K Nov 30 23:01 DmWk9f02tb1o6zySNAiXjJB6p58.mft
>
> This file's modification time (mtime) appears to be November 30th, 23:01.
>
>     # check manifest's econtent start & end date
>     $ strings DmWk9f02tb1o6zySNAiXjJB6p58.mft | head -2
>     20201130230107Z
>     20201202230107Z
>
> December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd
> 23:01:07: check!
>
>     # check the manifest's embedded EE certificate start & end date:
>     $ test-mft -vp DmWk9f02tb1o6zySNAiXjJB6p58.mft | openssl x509 -text | grep -A2 Validity
>         Validity
>             Not Before: Nov 30 23:01:07 2020 GMT
>             Not After : Dec  2 23:01:07 2020 GMT
>
> December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd
> 23:01:07: check!
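The two 'current' checks above can be sketched in a few lines. This is an illustration only, using the timestamps shown in the transcript; the helper names `parse_generalized_time` and `is_current` are hypothetical, and treating the window boundaries as inclusive is an assumption.

```python
from datetime import datetime, timezone


def parse_generalized_time(value):
    """Parse a timestamp such as '20201130230107Z' as a UTC datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


def is_current(now, windows):
    """True only if `now` lies inside every (start, end) validity window.

    Assumption: both boundaries are treated as inclusive.
    """
    return all(start <= now <= end for start, end in windows)


# Manifest eContent start & end date, as printed by `strings` above.
this_update = parse_generalized_time("20201130230107Z")
next_update = parse_generalized_time("20201202230107Z")

# EE certificate Not Before / Not After, as printed by openssl above.
ee_not_before = datetime(2020, 11, 30, 23, 1, 7, tzinfo=timezone.utc)
ee_not_after = datetime(2020, 12, 2, 23, 1, 7, tzinfo=timezone.utc)

# The validator's wall clock at the time of the run.
wall_clock = datetime(2020, 12, 1, 0, 0, 3, tzinfo=timezone.utc)

windows = [(this_update, next_update), (ee_not_before, ee_not_after)]
print(is_current(wall_clock, windows))  # prints True
```

Both windows must contain the wall clock; a manifest whose listing is current but whose EE certificate has expired (or the reverse) would fail this check.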
>
> With the dates and signatures of the manifest file checking out as 'all
> lights green', the next step is to process the manifest's file listing.
> A manifest 'file listing' is checked through two steps:
>
>     - is the listed file present?
>     - is the sha256 hash (in base64 format) listed on the manifest the
>       same as the sha256 hash computed by the validator using a copy of
>       the listed file?
>
>     # looking at manifest file listing:
>     $ test-mft -v DmWk9f02tb1o6zySNAiXjJB6p58.mft | grep -A1 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>     95: ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>         hash YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=
>
>     # checking whether file is present:
>     $ ls -alhtr ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>     -rw-r--r--  1 job  wheel   1.5K Nov 30 23:01 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>
>     # compute sha256 hash of the file
>     $ sha256 -b ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
>     SHA256 (ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer) = YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=
>
> Indeed, the 'YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=' hash computed
> from the referenced certificate file is the same one as listed in the
> manifest file (which we inspected with test-mft)! Note that at this
> stage of the validation process the 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer'
> file has not been processed in any way other than the equivalent
> of that 'sha256' OpenBSD utility.
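For readers without the OpenBSD 'sha256' utility, the presence-and-hash check can be sketched as follows. This is a rough equivalent under stated assumptions: the helper names `sha256_base64` and `listing_entry_ok` are hypothetical, and the demo uses placeholder bytes rather than a real certificate.

```python
import base64
import hashlib
import os
import tempfile


def sha256_base64(path):
    """Base64-encoded SHA-256 digest of a file, read as opaque bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def listing_entry_ok(directory, filename, expected_hash):
    """Check one manifest entry: file present and hash equal to the listing.

    The file content is never parsed here, only hashed.
    """
    path = os.path.join(directory, filename)
    return os.path.isfile(path) and sha256_base64(path) == expected_hash


# Demo with placeholder content (not a real certificate).
with tempfile.TemporaryDirectory() as directory:
    sample = os.path.join(directory, "example.cer")
    with open(sample, "wb") as handle:
        handle.write(b"placeholder bytes, not a real certificate")
    listed_hash = sha256_base64(sample)
    print(listing_entry_ok(directory, "example.cer", listed_hash))  # prints True
    print(listing_entry_ok(directory, "missing.roa", listed_hash))  # prints False
```

Run against the snapshot directory, `sha256_base64("ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer")` would be the value to compare with the hash listed on the manifest.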
>
> These 'jumps' from certificate to manifest to certificate using hashes &
> signatures serve multiple purposes: by first confirming a hash matches,
> the validator does not (yet) need to attempt any file content parsing
> (which would potentially be sensitive computing operations on a file
> that is, at that point in time, unknown and potentially dangerous), and
> secondly: by checking the presence and hash of each file, the publication
> point's completeness and integrity are confirmed. Missing .roa files can
> result
> in network outages [3].
>
> At this point the manifest file has been completely processed, and the
> next step in the validation process can commence. Each and every
> referenced file is opened by the validator, embedded certificates and
> signatures are verified, and then the file contents are processed in
> turn (could be manifests, certificates, CRLs, or ROA files).
>
> Let's inspect ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer:
>
>     $ openssl x509 -in ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer -inform DER -text | grep -A2 Validity
>         Validity
>             Not Before: Oct 23 10:14:32 2019 GMT
>             Not After : Dec  1 00:00:00 2020 GMT
>
> As the validator's wall clock was December 1st 00:00:03, we can see that
> ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer expired '3 seconds ago'. Note that
> we observed earlier that the manifest file which referenced this .cer
> file was created on November 30th; at that time this
> ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer certificate was valid, present, and
> current!
>
> One could say that ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer is a child of
> DmWk9f02tb1o6zySNAiXjJB6p58.mft. The ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
> file might not even be under control of the entity which generated
> DmWk9f02tb1o6zySNAiXjJB6p58.mft. A child's expiry does not result in the
> death of the parent. If a validator considers all referenced files on a
> manifest to be invalid, solely because *upon further inspection* a file
> contained an expired EE certificate, I'd say it is an
> 'overreaction', a simple software defect. After all, there was a valid
> current manifest which listed a hash and that hash matched the file, so
> the file became eligible for X.509 certificate validation in the first
> place!
>
> It appears that Routinator conflates two distinct steps in the
> validation process:
>
>     step 1) checking the validity of an RPKI manifest
>     step 2) checking the validity of a file referenced from the manifest
>             validated in step 1
>
> A valid manifest referencing a (now expired) certificate is a legitimate
> state of being. What is not valid is for the manifest listing itself to
> be expired, or the manifest's EE certificate to be expired, or its CRL
> to be expired, or its parent certificate to be expired, or for any files
> listed on the manifest to be missing, or for any sha256 hashes to be
> different than listed on the manifest.  Phew....  that's a mouthful of
> conditions! We're gonna have to work in the IETF to capture this in
> simpler English.
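The separation of the two steps can be sketched as below. This is a deliberately simplified model, not how any particular validator is implemented: signature, CRL, and parent certificate checks are folded into a single `manifest_ok` flag, and the names `ListedObject` and `validate_publication_point` as well as the sibling file 'example.roa' are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ListedObject:
    name: str
    hash_matches: bool   # step 1: present, and sha256 equals the manifest listing
    not_after: datetime  # step 2: the object's own expiry date


def validate_publication_point(manifest_ok, objects, now):
    """Return the names of usable objects, or [] if the manifest itself fails."""
    # Step 1: a manifest-level failure (invalid or expired manifest, missing
    # file, hash mismatch) rejects the whole publication point.
    if not manifest_ok or not all(obj.hash_matches for obj in objects):
        return []
    # Step 2: each listed object stands or falls on its own; an expired
    # child is skipped without affecting its siblings.
    return [obj.name for obj in objects if now <= obj.not_after]


now = datetime(2020, 12, 1, 0, 0, 3, tzinfo=timezone.utc)
objects = [
    # Expiry taken from the openssl output above.
    ListedObject("ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer", True,
                 datetime(2020, 12, 1, 0, 0, 0, tzinfo=timezone.utc)),
    # Hypothetical sibling object with a later expiry.
    ListedObject("example.roa", True,
                 datetime(2021, 6, 1, 0, 0, 0, tzinfo=timezone.utc)),
]

usable = validate_publication_point(True, objects, now)
print(usable)  # prints ['example.roa']
```

In this model the expired certificate is dropped in step 2 while the rest of the publication point survives; only a step 1 failure empties the result.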
>
> Conclusion
> ==========
>
> I'm not saying validators should accept expired data, they shouldn't!
> But it is *expected* that Certificate Authorities (like LIRs, NIRs, or
> even RIRs) set the expiration dates on cryptographic objects to be
> aligned with the reality of business contracts. This is a *critical*
> feature of the RPKI and makes RPKI superior to IRR data: finally there
> /are/ expiration dates on the equivalent of 'route:' objects.
>
> A repeat of the 'December 1st' VRP drop situation can come into
> existence at any future moment under any Trust Anchor, under any
> Certificate Authority. Simply put: networks solely relying on current
> versions of octorpki or routinator are somewhat at risk when billing
> cycles end. Also, I do not recommend downgrading to older versions
> because of https://www.nlnetlabs.nl/projects/rpki/security-advisories/
> (which perversely is a bug that *is not* resolved with rpki software
> diversity).
>
> I suspect it is OK for network operators to choose to sit this one out
> and just wait for a fixed version, provided it can be released in a
> matter of weeks. Because of Juniper PR1483097 (which probably still
> affects many currently deployed internet routers) the complete
> disappearance of VRPs can negatively impact internet traffic forwarding
> in the default-free zone, but as mentioned before impact is avoided
> through multi-instance validator deployment combined with validator
> software diversity.
>
> There is a silver lining in all this: the most likely next occurrence
> of this type of situation is January 1st, 2021, as then all kinds of
> LIR, NIR, or RIR business contracts are likely to start or stop. This
> gives nlnetlabs and cloudflare almost a full month to figure out a fix,
> release it, and for operators to deploy it in their networks during the
> holidays. The perfect excuse to escape any unwanted Christmas dinner. ;-)
>
> I propose some of us continue the discussion at sidrops at ietf.org,
> where through wordsmithing in the draft-ietf-sidrops-6486bis effort we
> can help keep future RPKI implementers from walking into the same problem.
>
> Kind regards,
>
> Job
>
> [1]: https://pkgs.org/search/?q=rpki-client
> [2]: https://lists.nlnetlabs.nl/pipermail/rpki/2020-December/000238.html
> [3]: https://blog.apnic.net/2020/11/10/rpki-manifests-securely-declare-contents/
>