Wellnhofer is incoherent
If his concern is about loading schemas over the network, he should disable unencrypted HTTP as well.
More than a decade after implementing support for secure HTTPS connections on its website, the World Wide Web Consortium (W3C) is finally planning to begin redirecting insecure HTTP connections to the more protected spec. The organization, which gets hundreds of millions of requests per day to its website, had delayed that …
1. Are you offering to write and maintain the HTTPS support, as suggested by Wellnhofer? If no, then why complain?
2. If you followed the link and read it, you would see a couple of points. Wellnhofer wrote a year ago that "adding HTTP support directly to the library was a mistake". And six months ago he wrote "I'll probably start by disabling HTTP and FTP support by default."
He is far from being incoherent. He has made his position clear. You can always fork the code and provide your own solution instead of lobbing in uninformed insults.
Correct, and not the only instance where libxml2's bugs are dismissed with contradictory excuses.
Also, lack of HTTPS support affects not only validation but a whole bunch of other issues as well, wherever URLs are not just opaque identifiers. One example that pops into my head is applying XSLT transforms to remotely hosted documents, where this means the difference between a simple .xslt and having to roll out a whole pile of ad hoc logic to download, process and delete the source files, with the corresponding error checking, etc.
Plus, the library is quite outdated anyway, what with no XPath / XSLT 2.0 support. They'd probably be better off deprecating the library in favour of better maintained solutions.
I get your drift, but if we had software running within my place of employment that was unmodifiable, and hard-coded to load static files from external sites, I'd first try and get it removed.
Failing that, I'd not use a proxy - I'd set up a "spoofed copy" of the site, and populate it with the relevant files, and audit them.
Then make sure the hard-coded software was using a suitably modified DNS to access this internal clone, and that only. No way would I allow any software to depend on external sites for anything, unless management overrode me in writing.
It's not just security - reliability / support would be issues too!
(Not my downvote!)
Tech A: "You should update that." Maintainer: "You're right." Tech B: "That broke my deployment." Maintainer: "Fix your shit." Tech B: "NO."
Maintainer to tech A "Sorry, you can't have nice things because that guy is an idiot."
The lazy ones already hacked together something with Squid. The smart ones are either also lazy, or unwilling to run a strict validation test over an unvalidatable connection.
Corollary: We are all using the part of the Internet built for the dumbest half of the web developers. That's the whole internet. No other part exists.
On the other hand though, we're talking about HTTP redirect upgrades, so a MitM attacker can still just return whatever they want in place of said redirect, unless the libraries also support HSTS and it's implemented correctly. Which, considering the hassle of getting them to support redirects and HTTPS at all, seems unlikely to my cynical mind.
If consumers are willing to change all the external references to HTTPS they can presumably already do that.
Without one of those, this change causes issues without actually solving anything from what I can tell? Apologies if I'm missing something obvious.
Critical component "A" REQUIRES JDK 8.
Without critical component "A", none of us have jobs and the owners get sued for failing to deliver.
Higher level JDKs are NOT an option at this time, and even the NEWER versions of that component require JDK 11 as their portability base, so they will NOT be using newer functionality.
I don't envy the anonymous source. It looks like he has infrastructure that handles internal communications, but relies on the Internet being up, and a remote third party service being available, and the third party service not deciding to switch protocols in a way that's hard for him to implement. He got caught by the third condition today, but the other two are no less problematic.
I'd have just stored the schemas locally. I really don't like the idea of my application automatically and silently updating bits that I don't control and that can affect its behavior.
At the very least, use a local store as a fallback for when the remote service is unavailable.
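A minimal sketch of that fallback idea in Python (the cache directory name and the filename scheme here are assumptions for illustration, not anything a real parser or the W3C prescribes):

```python
import os
import urllib.request

def load_schema(url, cache_dir="schema-cache"):
    """Return schema bytes, preferring a local cache over the network.

    Hypothetical helper: the cache layout is an assumption made up for
    this sketch, not an API of any existing XML library.
    """
    # Derive a cache filename from the last path segment of the URL.
    name = url.rstrip("/").rsplit("/", 1)[-1] or "schema"
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    # Only fall back to the network on a cache miss, then populate
    # the cache so the remote service being down doesn't break us next time.
    data = urllib.request.urlopen(url).read()
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return data
```

With the cache pre-populated from a vetted copy, the remote host never needs to be reachable at all.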
One problem is a vast corpus of extant XML documents with embedded schemaLocation attributes that specify http-scheme URLs, particularly when you don't control those documents – for example, because a partner sends them to you for automatic processing.
For that matter, you might have a schema document stored locally which itself references http://www.w3.org/2001/XMLSchema.xsd as its own schema. The W3C still provides http-scheme URLs as the official ones for various schemas and DTDs in many places.
Updating all of those to refer to local copies instead could be quite a lot of work. And for signed XML documents, it's a non-starter.
Now, you could use the proxy-interception technique that others have discussed to serve those documents from a local server, which would be faster and safer than fetching them from the W3C. That seems like the most plausible solution to me. But it's not trivial for many organizations which don't already have expertise in that area.
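For anyone auditing what their documents would actually try to fetch before setting up such interception, a small stdlib-only Python sketch that collects the schema URLs a validating parser would resolve (the example document and its example.com URLs are made up; only the xsi namespace is the real W3C one):

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def schema_urls(doc):
    """Collect URLs from xsi:schemaLocation / xsi:noNamespaceSchemaLocation."""
    root = ET.fromstring(doc)
    urls = []
    for el in root.iter():
        # schemaLocation alternates namespace URIs with schema URLs;
        # the schema URLs are the odd-indexed whitespace-separated tokens.
        pairs = el.get("{%s}schemaLocation" % XSI, "").split()
        urls.extend(pairs[1::2])
        loc = el.get("{%s}noNamespaceSchemaLocation" % XSI)
        if loc:
            urls.append(loc)
    return urls

doc = """<order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://example.com/ns http://example.com/order.xsd"/>"""
print(schema_urls(doc))  # ['http://example.com/order.xsd']
```

Running something like this over a corpus tells you exactly which hosts an interception proxy or local mirror would need to cover.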
> It's surprising that modern software that makes HTTP requests wouldn't have the ability to handle redirects or https.
It absolutely is not.
> Please make sure your software is up to date...
I believe I speak for all sysadmins everywhere when I say: No. Stop breaking [y]our shit.
HTTP is easy. SSL is hard. (Sorry, I mean TLS or OPP or whatever they may have arbitrarily renamed it to this week).
"Surprised?" Surprised that systems haven't been updated, where previously they sent 3 lines of text over a socket but now those lines must be wrapped in a TLS protocol, with manual-or-scripted updating of signed certificates from a third-party server (almost certainly directly or indirectly controlled by Google unless you have bags of cash) and a perpetually-updating-for-security-reasons stack of libraries to handle the ever-shifting realities of which protocols have turned out to be insecure after all?
Nobody should be surprised that going from something dirt simple to something incredibly complex is taking a long time.
To say nothing of a world where your certificate authority can, any time they want, make you disappear...for 100% of the resources on the internet. Yeah I'm sure that won't come back to haunt us in any way whatsoever. This is fine.
Not sure how many people have worked with XML.
An XML doc may reference various schemas and namespaces that define properties of the document.
They can be full of these sort of references e.g.
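(The inline example appears to have been stripped by the comment renderer; a typical document header of the kind being described looks roughly like this, with the example.com URLs as placeholders and only the xsi namespace being the real W3C one:)

```xml
<invoice xmlns="http://example.com/invoice"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://example.com/invoice http://example.com/invoice.xsd">
  <item>widget</item>
</invoice>
```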
Note they are specifying an http URL and it can be remote (most XML docs will by default have a few references to w3.org URLs, as one of my examples above does)
Issues come if you do validation of the XML...
Then the s/w will attempt to open the various defined http URLs and validate parts* of the document against them.
.. which is where the fun starts
If it's old software then it's quite possible the read of the http doc may fail if it's then redirected to https - especially if the code inspects the URL and then calls an http or https connection depending on that (often the case in old s/w)
If it copes with the redirect there may be other fun where it validates the URL it ends up at against the expected URL, and the match fails due to the http/https difference.
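A sketch of that legacy dispatch pattern in Python (illustrative only, not any particular library's code): the connection class is chosen up front by inspecting the URL's scheme, so a 301/302 pointing at the other scheme just comes back unhandled and the fetch dead-ends.

```python
import http.client
import urllib.parse

def fetch_once(url):
    """Old-style single fetch with no redirect handling - a sketch of the
    legacy pattern described above, not code from any real library."""
    parts = urllib.parse.urlsplit(url)
    # The scheme is inspected once, up front, to pick the connection class.
    if parts.scheme == "http":
        conn = http.client.HTTPConnection(parts.netloc)
    elif parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc)
    else:
        raise ValueError("unsupported scheme: " + parts.scheme)
    conn.request("GET", parts.path or "/")
    resp = conn.getresponse()
    # A redirect to an https:// Location arrives here as a plain 301/302
    # response; unless the caller notices and re-dispatches with the other
    # connection class, the schema fetch simply fails.
    return resp.status, resp.getheader("Location")
```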
.. and you really want to validate your doc, as a doc may be well-formed XML but not match the referenced schemas - e.g. the doc has 2 instances of element <widget> but the schema specifies a max of one occurrence; without validation you'd never notice.
* may be validating the whole doc, or just the elements of a certain namespace - depends what's defined
And here we are again: production systems that cannot sustain their operation on their own.
It is mind-boggling that we went from a mainframe mentality where all code was documented and accounted for, to a "I'll take any piece of code that suits me from anywhere without checking" mentality and, above all, that that mentality is now, apparently, a standard.
No wonder Russian and Chinese hackers are making a mint.
I'm not even sure why it needs to be downloaded if keeping a local copy would have done the job perfectly well.
Software which gets confused by a redirect isn't going to be able to cope if the W3C changes something in a DTD or what have you.
Because that's how the examples and tutorials did it, and indeed what they recommended, for schemas and DTDs and the like. And now we have vast corpora of documents and myriad interoperating business systems which use those references.
This article really boils down into "Not all UAs are browsers", a lesson that most developers haven't learned, and in many cases apparently can't understand.
a mainframe mentality where all code was documented and accounted for
I work with a lot of mainframe-using sites, and this is a hilarious fantasy. I don't know how many times I've heard that they don't know which version of the source built the binaries they're running; or what parts of their vast archive of source code they actually use; or even that they're sure, after investigating, that they've lost the source code, and could we recommend a decompiler?
Some mainframe shops are tight ships. Many are not.
And a great many exchange data with third parties, and a lot of that is XML, and they're in this same boat – except they're probably using IBM parsers which I believe support redirects and HTTPS (though I haven't bothered to confirm that).
"To have greater certainty that you are receiving the actual W3C documents, and they haven't been modified in transit."
Firstly (beyond the specific W3C case), so many web hosts use free Let's Encrypt certificates, which offer no authenticity guarantee as anyone can get a certificate with minimal verification of identity and none of probity. Secondly, modification in transit is actually a pretty rare attack vector compared with host and library element contamination (which HTTPS doesn't protect against, as the attack comes from a certified source).
The overriding downsides however of forcing everything to HTTPS are the vast task of amending the mass of existing web sites that don't actually need to be ultra-secure, and the potential for shutting out those users stuck for legitimate reasons with less up-to-the-minute browsers when encryption gets 'updated'. In both cases the purpose of the web is thwarted, and enforced HTTPS can also make security scanning more difficult to accomplish reliably.
So by all means make HTTPS available, but keep the options flexible and open so hosts can choose for themselves what to do. I'll probably get down voted for this, but it's important to be clear about what the web is really there for - to serve content (ideally system agnostically as was originally intended) or to "be secure" whether or not that limits legitimate use of the resource. As a security professional, I am permanently conscious of the need to balance these two factors - the most secure computer is the one that's switched off, but it's no use to anyone in that state.
"so many web hosts use free Let's Encrypt certificates which offer no authenticity guarantee as anyone can get a certificate with minimal verification of identity and none of probity"
Certificates serve multiple purposes. In the cases we are talking about, it can be broken down into three categories (there are more):
- Encryption: The certificate provides the key set required to encrypt and decrypt a traffic stream. This does NOT require identity validation etc. It is merely end to end encryption.
- Identity validation: This effectively means that you are trusting the providing authority who issued the certificate to validate that the owner of the certificate (that is used on the site you are visiting) is who they say they are. The only reliable certificates for this are the Extended Validation (EV) or better certificates, typically represented in browsers by a full green bar in the certificate area. As the name implies, this is representing the assumption that the certificate issuer has done background checks etc. to validate that the entity is, in fact, who they say they are. The caveat here is that if you do not trust (as a person) that the vetting organisation is doing the right thing (e.g. Symantec), it is worthless for that purpose. These are also generally more expensive than other types.
- Code-signing: Similar to encryption, but for the purpose of determining if the code has been altered from when it was signed. The same methods are used for DKIM and the like.
So saying that LE certs are unfit for purpose for transport encryption is incorrect.
The only reliable certificates for this are the Extended Validation (EV) or better certificates, typically represented in browsers by a full green bar in the certificate area.
2010 would like its myth back.
EV certificates were a CA scam. In practice the EV requirements have been shown to offer little additional security. Chrome stopped signaling the EV/OV/DV difference to users years ago, on the grounds that most users had no idea what it meant.
And considering the huge list of trusted CAs that most browsers ship, referring to any sort of server certificate as "reliable" is, well, laughable.
LE and other zero-cost CAs have certainly been used for typosquatting and other confusion attacks, but if you can get an LE-issued certificate for w3.org I'd be impressed. And since we're talking about URLs embedded in existing and automatically-generated documents, for the vast majority of cases, typosquatting isn't a viable attack in this situation.
Certainly PKIX is a horrible mess (albeit a somewhat less horrible mess following the broad adoption of Certificate Transparency), and X.509 itself is a horrible mess. And certainly the mere presence of a certificate which a typical browser will accept proves very little, though it does usually mean adequate protection against passive eavesdroppers for that connection. (Frankly, this was true before LE and the HTTPS Everywhere movement, because browsers ship with such a huge list of trusted CAs, many of them dubious or potentially subject to coercion.)
That doesn't mean that enforcing HTTPS for w3.org would have no benefit, however. On the other hand, it does come with a cost. The same is certainly true of HTTPS Everywhere, which provides some protection against, for example, script-injection via DNS hijacking when on untrusted networks (the classic browsing-in-a-café attack); but as you say penalizes many small sites which have no information that requires privacy.
Signing is irrelevant, because the UAs in this case – XML parsers – don't know to check for a signature; so if they received a malicious schema or DTD document without a signature, they'd proceed to parse it.
And there would be no point in making XML parsers require signed schemas. (DTDs aren't XML, and AFAIK there's no specification for signing DTD documents, so they're out of the question anyway.) Many organizations create schemas for all sorts of purposes, and requiring signatures would have been a prohibitively expensive step,[1] so schemas simply wouldn't have been used. People who wanted validation would have stuck with DTDs or an alternative XML schema mechanism (e.g. Schematron, which was a real thing).
Now, you can certainly argue that we'd be better off if XML Schema had required signing, and therefore had withered on the vine. But the XML Schema Working Group wouldn't see it that way, so there was never any incentive for them to consider requiring signatures.
Also, XML Schema predates XML Signature – XSD 1.0 in 2001, XML Signature Recommendation in 2002.
[1] I've worked extensively on a production code-signing system, and read a great deal of the academic and industry research on code signing. This would effectively be an application of code signing. Code signing is a big problem for a lot of organizations, and even more so for developers. Requiring signatures for schemas would have hugely increased the cost of using schemas.
Certainly fetching a schema or DTD by plaintext HTTP opens the client up to denial-of-service; just fail the request or serve a document that rejects valid inputs. Whether it's worth compromising HTTP (e.g. by DNS cache poisoning) to do that in the general case is dubious, but for specific targets it might well be worth some attacker's while.
Schemas are themselves XML documents, so if someone is using a validating parser that has external-entity support enabled, hijacking a request for a schema could be used to mount an XXE attack. That's just one example of a more-dangerous attack.
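For what it's worth, Python's stdlib SAX parser illustrates the standard mitigation: refuse to resolve external general entities at all, so a hijacked schema or DTD can't pull extra content in over the network. (This has been the default in recent Python versions; it's set explicitly here for clarity, and the external entity's URL is never fetched.)

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

class TextCollector(ContentHandler):
    """Accumulate character data so we can see what actually got parsed."""
    def __init__(self):
        super().__init__()
        self.text = []
    def characters(self, content):
        self.text.append(content)

DOC = b"""<?xml version="1.0"?>
<!DOCTYPE d [<!ENTITY ext SYSTEM "http://www.w3.org/2001/XMLSchema.dtd">]>
<d>&ext;hello</d>"""

handler = TextCollector()
parser = xml.sax.make_parser()
# Refuse to resolve external general entities: &ext; is skipped instead of
# triggering a network fetch (an XXE-style injection point).
parser.setFeature(feature_external_ges, False)
parser.setContentHandler(handler)
parser.parse(io.BytesIO(DOC))
print("".join(handler.text))  # hello
```

The document still parses; the externally-defined entity simply contributes nothing, which closes off the attack described above at the cost of losing entity expansion.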
Broadly speaking, for a lot of applications a validating parser can be leveraged into an HTTPS-to-HTTP downgrade attack over multiple connections (to distinct entity servers). That's not as easy to chain as a downgrade over a single connection (or over a multiple-connection link to a single entity server), but it's still a vulnerability.
That said, I don't offhand recall any discussions of this being seen in the wild, at least not specifically against resources hosted at w3.org. But that doesn't prove anything either; I may have missed it, or it may not have been published, or attackers might not have been exploiting it before but could in the future.
Oh this is familiar! A couple of salient points here may give context.
First, the Java URL class does not handle redirects across schemes: if an HTTP request returns a 301 or 302 redirect to an HTTPS URL, it will not be followed and the request will terminate instead. This is partly due to the API design of these particular classes, some of Java's worst. It means Java isn't going to handle this behind the scenes.
Second, the W3C throttles requests to the URLs used as public identifiers for their schemas. Sensible, as they would probably see many millions of requests a day otherwise. So any XML-based process that needs them likely has to have some sort of local version, which means there's some manual processing going on there too. The schemas are not just for validation; they define entities too, such as the named character entities, and in many cases you've gotta have them.
Summation: while this sounds a bit daft, there are some real issues here which aren’t obvious unless you’ve spent time in this particular quagmire. I have some sympathy for your source for this story.
W3C is so engrossed in generating new specs and changing old ones that it can't tell you -- or perhaps even itself -- whether a given website is W3C-compliant. It has a web browser and editor, "Amaya", which you can use to self-check your web pages' W3C compliance.
The latest version, per the Amaya home page is: "Amaya 11.4.4 (18 January 2012)"
They would need to make a copy of Chrome only without Google's hundreds of programmers and billions to spend on crazy ideas which should never really be in a web browser. And as Chrome has the market wrapped up, web developers would ignore it anyway as it seems only testing with one browser has been part of their job description for a couple of decades.
Compliant, how quaint. Ever put a DIV inside a P or a SCRIPT or STYLE not in the HEAD? That’s not compliant, but it doesn’t really matter - HTML5 has deterministic parsing for any input, so however badly the page is formed, the content will be parsed and rendered in a predictable way.
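Python's stdlib html.parser isn't a full HTML5 tree-construction implementation, but it shows the consume-anything spirit: deliberately non-compliant markup is tokenized without any error being raised.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every start tag the tokenizer emits."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

# Non-compliant markup: a div nested inside a p, no html/head/body wrapper,
# and an unclosed p - it is still consumed deterministically, never rejected.
p = TagLogger()
p.feed("<p>intro<div>block</div>")
print(p.tags)  # ['p', 'div']
```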
Remember HTML is now a WHATWG spec rather than a W3C one.