Description
Currently, the draft spec says:
5.4 Simple type IdentifierType
A value of the IdentifierType simple type refers to a specific attribute category, attribute, data type, function, notice, status code, combining algorithm or XPath version.
<xs:simpleType name="IdentifierType">
<xs:restriction base="xs:string">
<xs:pattern value="[^{}]*({[A-Za-z][0-9A-Za-z]*(-[0-9A-Za-z]+)*}[^{}]*)*"/>
</xs:restriction>
</xs:simpleType>
A value of this simple type is either:
- an absolute URI [RFC2396],
- the name of a short identifier or
- a character string with one or more short identifier names enclosed by curly brackets (i.e., { and }; U+007B and U+007D), optionally preceded, followed and/or separated by other characters allowed in a URI.
A few issues:
- An XSD-aware XML editor (e.g. IntelliJ IDEA) reports an error in the pattern at the second left curly brace: Unexpected start of quantifier '{'. A simple fix is to escape the curly braces:
<xs:pattern value="[^{}]*(\{[A-Za-z][0-9A-Za-z]*(-[0-9A-Za-z]+)*\}[^{}]*)*"/>
- RFC 2396 is obsoleted by RFC 3986, so we may update the reference for the absolute URI (note: the reference for xs:anyURI has been updated to RFC 3986 in XSD 1.1).
- Some values match the regex although they are not supposed to be valid, e.g. the empty string, whitespace, :, :x, 0:, ...
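The third issue can be reproduced outside an XSD validator. The following is a minimal sketch in Python, assuming that `re.fullmatch` is an adequate stand-in for XSD whole-string matching on this particular pattern (the helper name `xsd_match` is hypothetical, not part of any spec or library):

```python
import re

# Draft pattern with the curly braces escaped, as in the fix above.
IDENTIFIER = r"[^{}]*(\{[A-Za-z][0-9A-Za-z]*(-[0-9A-Za-z]+)*\}[^{}]*)*"


def xsd_match(pattern: str, value: str) -> bool:
    """Emulate XSD pattern matching: the whole string must match."""
    return re.fullmatch(pattern, value) is not None


# Values that should not be valid identifiers.
suspicious = ["", " ", ":", ":x", "0:"]
accepted = [v for v in suspicious if xsd_match(IDENTIFIER, v)]
print(accepted)
```

Every value in `suspicious` ends up in `accepted`, because the leading `[^{}]*` matches any string without curly braces.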
We should build the regex incrementally, defining one regex per case:
Case 1: absolute URI (with fragment)
The fragment part is required, in particular to support the standard XACML datatypes from XML Schema, e.g. http://www.w3.org/2001/XMLSchema#string.
RFC 2396 is the reference for URI in XACML 3.0. However, it has been obsoleted by RFC 3986 which gives the following ABNF definition for an absolute URI with optional fragment (Appendix A):
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
A corresponding regex can be found here, and the regex for each component in the URI as well (scheme, hier-part, query, fragment). We can reuse that but with a few tweaks:
- The non-capturing group operator (?:...) is not supported in XSD patterns, therefore all these operators must be removed from the regex;
- In XML content (the XSD pattern in this case), '&' is a special character which must be replaced with &amp;.
- The regex for the dec-octet ABNF rule (decimal in IPv4 address) looks wrong because it matches 00, 01, 001, etc., and therefore needs to be fixed in the XSD pattern.
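To illustrate the dec-octet point, here is a small Python sketch comparing two candidate sub-patterns. The `LOOSE` form is an assumption about what a commonly circulated regex looks like; the `STRICT` form follows the RFC 3986 ABNF for dec-octet, which has no leading zeros. Both constant names are made up for this example.

```python
import re

# Assumed "loose" form: tolerates leading zeros.
LOOSE = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
# Stricter form following the dec-octet ABNF rule (0-255, no leading zeros).
STRICT = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

samples = ["0", "9", "10", "99", "100", "199", "255", "00", "01", "001", "256"]

# Values accepted by the loose form but rejected by the strict one.
loose_only = [
    s for s in samples
    if re.fullmatch(LOOSE, s) and not re.fullmatch(STRICT, s)
]
print(loose_only)
```

Only the values with leading zeros differ between the two forms; both reject 256.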
The regex excludes curly braces '{}' in particular, which prevents any mix-up with the other two cases of Identifier (ShortId or ShortId-based URI expression).
To be clear, the regex checks the syntax according to the RFC 3986 only, i.e. the generic scheme-agnostic syntax for URI, and nothing more. Any scheme-specific syntax rule is out of scope (defined in other RFCs). It would be impractical to try and capture in a single XSD pattern every scheme-specific rule of every possible URI scheme (every RFC) out there. Ultimately, XACML applications need to implement their own scheme-specific checks if there are schemes they're interested in particularly.
N.B.: XSD has a standard datatype, anyURI, for URI validation, but it is not sufficient in this case because:
- anyURI represents any URI, and even any IRI in XSD 1.1, including relative URIs, which we don't allow here; hence the need for an XSD pattern that matches only absolute URIs.
- The anyURI type allows characters which are illegal in a URI, provided that they can be escaped to produce a legal URI by using the escaping procedure specified in the XSD standard (e.g. XSD 1.0 refers to the escaping procedure defined in Section 5.4 Locator Attribute of [XML Linking Language]). In other words, the anyURI type may accept these characters, including curly braces {}, either escaped or unescaped; hence the need for the XSD pattern to exclude {}, as mentioned before. See also Michael Kay's post on Stack Overflow (an anyURI is a wannabe URI).
The regex is quite large, so it seems more readable to define a dedicated XSD type for this case (AbsoluteUriType) and to define IdentifierType as a union type that combines it with the others, rather than mixing all three cases in one regex. This also makes it possible to derive the dedicated type from the standard anyURI, which is not possible for the second case (ShortId name). Besides, AbsoluteUriType could be used for identifiers that don't support ShortId, such as PolicyId identifiers.
The proposal for a new XSD type for absolute URIs, using the proper regex (as a reminder, an XSD pattern is matched against the entire string, so the ^ and $ anchors are omitted):
<xs:simpleType name="AbsoluteUriType">
<xs:restriction base="xs:anyURI">
<!--
Absolute URI with optional fragment (fragment is needed to support the standard XACML datatypes from XML Schema, e.g. 'http://www.w3.org/2001/XMLSchema#string').
We need to define a pattern for further restriction because the anyURI type is for any URI including relative ones, and actually this type accepts also characters which are illegal in URI, including the curly braces '{}' we want to exclude, provided that they can be escaped to produce a
legal URI according to the escaping procedure specified in the XSD standard (e.g. XSD 1.0 refers to the section 5.4 Locator Attribute of [XML Linking Language]). In other words, the anyURI type may accept these characters (e.g. `{}`) either escaped or unescaped.
Pattern for an absolute URI (with optional fragment) based on ABNF definition of 'URI' in RFC 3986 (de facto, this pattern checks the syntax according to the RFC only, i.e. the generic scheme-agnostic syntax, while scheme-specific syntax rules are out of scope, defined in other RFCs):
^<scheme> : <hier-part> (\? <query> )? (# <fragment> )?$
with the following sub-patterns:
- <scheme>:
[A-Za-z][A-Za-z0-9+\-.]*
- <hier-part>:
(// <authority> <path-abempty> | <path-absolute> | <path-rootless> | <path-empty> )
- <authority>:
( <userinfo> @)? <host> (: <port> )?
- <userinfo>:
( <unreserved> | <pct-encoded> | <sub-delims> |:)*
expanded to (after merging <unreserved>, <sub-delims> and ':'):
([A-Za-z0-9\-._~!$&'()*+,;=:]|%[0-9A-Fa-f]{2})*
- <unreserved>:
[A-Za-z0-9\-._~]
- <pct-encoded>:
%[0-9A-Fa-f]{2}
- <sub-delims>:
[!$&'()*+,;=]
- <host>:
( <IP-literal> | <IPv4address> | <reg-name> )
<IPv4address> pattern is included in <reg-name> (any IPv4 address matches <reg-name>), therefore the pattern can be simplified to:
( <IP-literal> | <reg-name> )
- <IP-literal>:
\[( <IPv6address> | <IPvFuture> )\]
- <IPv6address> (<ls32> is factored out from the first 7 lines of the ABNF rule):
(
(
(<h16>:){6}
| ::(<h16>:){5}
| <h16>?::(<h16>:){4}
| ( (<h16>:){0,1} <h16> )?::(<h16>:){3}
| ( (<h16>:){0,2} <h16> )?::(<h16>:){2}
| ( (<h16>:){0,3} <h16> )?::<h16>:
| ( (<h16>:){0,4} <h16> )?::
) <ls32>
| ( (<h16>:){0,5} <h16> )?:: <h16>
| ( (<h16>:){0,6} <h16> )?::
)
- <h16>:
[0-9A-Fa-f]{1,4}
- <ls32>:
( <h16> : <h16> | <IPv4address> )
- <IPvFuture> (leading "v" is case-insensitive, cf. section 3.2.2 (Host)):
[Vv][0-9A-Fa-f]+\.( <unreserved> | <sub-delims> |:)+
expanded to (after merging <unreserved>, <sub-delims> and ':'):
[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&'()*+,;=:]+
- <IPv4address>:
( <dec-octet> \.){3} <dec-octet>
- <dec-octet>:
(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])
- <reg-name>:
( <unreserved> | <pct-encoded> | <sub-delims> )*
expanded to (after merging <unreserved> and <sub-delims>):
([A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*
- <port>: [0-9]*
- <path-abempty>:
(/<segment>)*
- <segment>:
<pchar>*
- <pchar>:
( <unreserved> | <pct-encoded> | <sub-delims> |:|@)
expanded to (after merging <unreserved>, <sub-delims>, ':' and '@'):
([A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})
- <path-absolute>:
/( <segment-nz> (/<segment>)*)?
- <segment-nz>:
<pchar>+
- <path-rootless>:
<segment-nz> (/<segment>)*
- <path-empty>: empty string (zero character)
- <query>:
( <pchar> | / | ? )*
expanded to (after merging <pchar>, '/' and '?'):
([A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*
- <fragment>: same as <query>
The pattern requires a colon ':' and forbids curly braces {}, therefore this type of string cannot be confused with ShortIdNameType or ShortIdBasedUriExpressionType.
An XSD pattern is matched against the entire string, so the ^ and $ anchors are omitted.
Since '&' is a special character in XML content, it must be replaced with '&amp;' in the XSD pattern.
-->
<xs:pattern
value="[A-Za-z][A-Za-z0-9+\-.]*:(//(([A-Za-z0-9\-._~!$&amp;'()*+,;=:]|%[0-9A-Fa-f]{2})*@)?(\[(((([0-9A-Fa-f]{1,4}:){6}|::([0-9A-Fa-f]{1,4}:){5}|([0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){4}|(([0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){3}|(([0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){2}|(([0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}:|(([0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::)([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]))|(([0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}|(([0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::)|[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&amp;'()*+,;=:]+)\]|([A-Za-z0-9\-._~!$&amp;'()*+,;=]|%[0-9A-Fa-f]{2})*)(:[0-9]*)?(/([A-Za-z0-9\-._~!$&amp;'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*|/(([A-Za-z0-9\-._~!$&amp;'()*+,;=:@]|%[0-9A-Fa-f]{2})+(/([A-Za-z0-9\-._~!$&amp;'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)?|([A-Za-z0-9\-._~!$&amp;'()*+,;=:@]|%[0-9A-Fa-f]{2})+(/([A-Za-z0-9\-._~!$&amp;'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*|)(\?([A-Za-z0-9\-._~!$&amp;'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*)?(#([A-Za-z0-9\-._~!$&amp;'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*)?"/>
</xs:restriction>
</xs:simpleType>
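Because the flattened pattern is hard to review by eye, it may help to assemble it from the named sub-patterns listed in the XSD comment and try a few inputs. The following Python sketch does that under two assumptions: `re.fullmatch` stands in for XSD whole-string matching, and the helper names (`rep`, `opt`, `is_absolute_uri`) are hypothetical. It is a review aid, not a replacement for validating against the schema.

```python
import re

# Sub-patterns named after the RFC 3986 ABNF rules (Appendix A).
CHARS = r"A-Za-z0-9\-._~!$&'()*+,;="  # <unreserved> and <sub-delims> merged
PCT = r"%[0-9A-Fa-f]{2}"              # <pct-encoded>
H16 = r"[0-9A-Fa-f]{1,4}"
DEC_OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4 = "(" + DEC_OCTET + r"\.){3}" + DEC_OCTET
LS32 = "(" + H16 + ":" + H16 + "|" + IPV4 + ")"


def rep(n: int) -> str:
    """Exactly n repetitions of '<h16>:'."""
    return "(" + H16 + ":){" + str(n) + "}"


def opt(n: int) -> str:
    """Optional prefix of up to n '<h16>:' groups followed by one <h16>."""
    return "((" + H16 + ":){0," + str(n) + "}" + H16 + ")?"


IPV6 = (
    "((" + rep(6)
    + "|::" + rep(5)
    + "|(" + H16 + ")?::" + rep(4)
    + "|" + opt(1) + "::" + rep(3)
    + "|" + opt(2) + "::" + rep(2)
    + "|" + opt(3) + "::" + H16 + ":"
    + "|" + opt(4) + "::"
    + ")" + LS32
    + "|" + opt(5) + "::" + H16
    + "|" + opt(6) + "::)"
)
IPVFUTURE = r"[Vv][0-9A-Fa-f]+\.[" + CHARS + ":]+"
REG_NAME = "([" + CHARS + "]|" + PCT + ")*"
HOST = r"(\[(" + IPV6 + "|" + IPVFUTURE + r")\]|" + REG_NAME + ")"
USERINFO = "([" + CHARS + ":]|" + PCT + ")*"
AUTHORITY = "(" + USERINFO + "@)?" + HOST + "(:[0-9]*)?"
PCHAR = "([" + CHARS + ":@]|" + PCT + ")"
SEGMENTS = "(/" + PCHAR + "*)*"        # (/<segment>)*
HIER_PART = (
    "(//" + AUTHORITY + SEGMENTS       # "//" authority path-abempty
    + "|/(" + PCHAR + "+" + SEGMENTS + ")?"  # path-absolute
    + "|" + PCHAR + "+" + SEGMENTS     # path-rootless
    + "|)"                             # path-empty
)
QUERY = "([" + CHARS + ":@/?]|" + PCT + ")*"  # also used for <fragment>
SCHEME = r"[A-Za-z][A-Za-z0-9+\-.]*"
ABSOLUTE_URI = SCHEME + ":" + HIER_PART + r"(\?" + QUERY + ")?(#" + QUERY + ")?"


def is_absolute_uri(value: str) -> bool:
    """Whole-string match, as an XSD pattern facet would do."""
    return re.fullmatch(ABSOLUTE_URI, value) is not None


for candidate in [
    "http://www.w3.org/2001/XMLSchema#string",
    "urn:oasis:names:tc:xacml:1.0:subject:subject-id",
    "http://[2001:db8::1]:8080/x",
    "relative/path",
    "http://example.com/{id}",
]:
    print(candidate, is_absolute_uri(candidate))
```

Building the expression from small pieces keeps each ABNF rule individually reviewable; the flattened XSD pattern can then be compared against the assembled string.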
Possible improvement:
The host part of an RFC 3986 URI may be an IP address (IP-literal and IPv4address rules) or a registered host name (reg-name rule). Such a hostname doesn't have to follow the syntax for DNS domain names because it is meant to be generic / DNS-agnostic in the RFC, therefore the syntax is far more permissive. I think IP addresses should not be allowed in an absolute URI used as a unique identifier for attributes, categories, datatypes, etc., and the hostname should follow the DNS naming syntax. My suggestion would be to apply these two rules to the pattern above. Removing the IP address patterns in particular would simplify the regex significantly.
Case 2: regex for ShortId
Reusing the definition from the current XACML 4.0 draft, section 5.3 Element ShortId:
<xs:simpleType name="ShortIdNameType">
<xs:restriction base="xs:string">
<xs:pattern value="[A-Za-z][0-9A-Za-z]*(-[0-9A-Za-z]+)*"/>
</xs:restriction>
</xs:simpleType>
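A quick way to see what the ShortId pattern accepts and rejects is to try it on a few names. This is a sketch in Python, assuming `re.fullmatch` behaves like the XSD pattern facet for this simple expression; the sample names and the helper `is_short_id` are made up for illustration.

```python
import re

# Pattern copied from ShortIdNameType.
SHORT_ID = r"[A-Za-z][0-9A-Za-z]*(-[0-9A-Za-z]+)*"


def is_short_id(value: str) -> bool:
    """Whole-string match, as an XSD pattern facet would do."""
    return re.fullmatch(SHORT_ID, value) is not None


valid = ["string", "subject-id", "a-b-c", "x509"]
invalid = ["", "1abc", "a--b", "a-", "-a", "a:b", "{a}"]

print([v for v in valid if is_short_id(v)])
print([v for v in invalid if is_short_id(v)])
```

Since the pattern allows neither ':' nor curly braces, a ShortId name cannot also match the absolute URI case, which always contains a colon.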
Case 3: regex for a string containing one or more {shortId}s, optionally preceded, followed and/or separated by characters allowed in a URI
In this case, the string is expected to be an expression of an absolute URI, based on short identifiers (the short identifiers are the variables waiting to be expanded in this expression), therefore the suggested name for the XSD type: ShortIdBasedUriExpressionType.
Based on the ABNF definition of 'URI' in Appendix A of RFC 3986, a regex for matching a character allowed in a URI:
[A-Za-z0-9\-._~!$&'()*+,;=:/?#\[\]@%]
The proposed XSD type for ShortId-based URI expressions:
<xs:simpleType name="ShortIdBasedUriExpressionType">
<xs:restriction base="xs:string">
<!--
Pattern for a sequence of one or more {ShortId}, each one possibly preceded, separated and/or followed by valid URI character(s) (also curly braces are special characters in regex, therefore must be
escaped):
^( <URI_char>* \{ <ShortId> \} )+ <URI_char>*$
with the following sub-patterns:
- <ShortId> (pattern for a ShortId from ShortIdNameType definition):
[A-Za-z][0-9A-Za-z]*(-[0-9A-Za-z]+)*
- <URI_char> (pattern for a valid URI character based on the ABNF definition of 'URI' in Appendix A of RFC 3986):
[A-Za-z0-9\-._~!$&'()*+,;=:/?#\[\]@%]
The pattern requires curly braces {} therefore this type of string cannot be confused with AbsoluteUriType or ShortIdNameType where {} are forbidden.
An XSD pattern is matched against the entire string, so the ^ and $ anchors are omitted.
In XML content, special character '&' must be replaced with '&amp;'.
-->
<xs:pattern value="([A-Za-z0-9\-._~!$&amp;'()*+,;=:/?#\[\]@%]*\{[A-Za-z][0-9A-Za-z]*(-[0-9A-Za-z]+)*\})+[A-Za-z0-9\-._~!$&amp;'()*+,;=:/?#\[\]@%]*"/>
</xs:restriction>
</xs:simpleType>
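The expression pattern can be checked the same way. The sketch below, in Python, rebuilds it from the two sub-patterns named in the XSD comment, again assuming `re.fullmatch` is an acceptable stand-in for XSD matching; the sample values and the helper `is_short_id_expression` are hypothetical.

```python
import re

# Sub-patterns from the XSD comment.
SHORT_ID = r"[A-Za-z][0-9A-Za-z]*(-[0-9A-Za-z]+)*"
URI_CHAR = r"[A-Za-z0-9\-._~!$&'()*+,;=:/?#\[\]@%]"

# ( <URI_char>* \{ <ShortId> \} )+ <URI_char>*
EXPRESSION = "(" + URI_CHAR + r"*\{" + SHORT_ID + r"\})+" + URI_CHAR + "*"


def is_short_id_expression(value: str) -> bool:
    """Whole-string match, as an XSD pattern facet would do."""
    return re.fullmatch(EXPRESSION, value) is not None


valid = ["{a}", "{a}{b}", "urn:{a}:{b}", "http://example.com/{ns}#{name}"]
invalid = ["", "urn:no:braces", "{1a}", "{a", "{}", "{{a}}", "{a} b"]

print([v for v in valid if is_short_id_expression(v)])
print([v for v in invalid if is_short_id_expression(v)])
```

At least one well-formed {ShortId} is required, so strings without curly braces fall through to the other two member types of the union.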
New IdentifierType definition
The resulting new IdentifierType definition, based on previous types, would be:
<xs:simpleType name="IdentifierType">
<xs:union memberTypes="AbsoluteUriType ShortIdNameType ShortIdBasedUriExpressionType" />
</xs:simpleType>