The regex works fine but fails when placed in XML schema

I have a simple file, doc.xml, that contains one root element with a Timestamp attribute:

<?xml version="1.0" encoding="utf-8"?>
<root Timestamp="04-21-2010 16:00:19.000" />

      

I would like to validate this document against my simple schema, schema.xsd, to make sure the timestamp is in the correct format:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:attribute name="Timestamp" use="required" type="timeStampType"/>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="timeStampType">
    <xs:restriction base="xs:string">
      <xs:pattern value="(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3} ([0-1]{1}[0-9]{1}|2[0-3]{1}):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}" />
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

      

So I'm using the Python lxml module and trying to do a simple schema validation and report any errors:

from lxml import etree

schema = etree.XMLSchema( etree.parse("schema.xsd") )
doc = etree.parse("doc.xml")

if not schema.validate(doc):
    for e in schema.error_log:
        print(e.message)

      

My XML document fails validation with the following error messages:

Element 'root', attribute 'Timestamp': [facet 'pattern'] The value '04-21-2010 16:00:19.000' is not accepted by the pattern '(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3} ([0-1]{1}[0-9]{1}|2[0-3]{1}):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}'.
Element 'root', attribute 'Timestamp': '04-21-2010 16:00:19.000' is not a valid value of the atomic type 'timeStampType'.

      

So it looks like my regex must be wrong. But when I test the same expression in the Python interpreter, it passes:

>>> import re
>>> pat = '(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3} ([0-1]{1}[0-9]{1}|2[0-3]{1}):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}'
>>> assert re.match(pat, '04-21-2010 16:00:19.000')
>>> 

      

I know XSD regexes don't support every feature, but the documentation I've found indicates that every feature I use is supported.

So what am I misunderstanding and why is my document failing?

2 answers


Your | (alternation) has lower precedence than you think, so it covers more of the pattern than you intended.

(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3}

      

is parsed as:



(0[0-9]{1})
    -or-
(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3}

      

You need to add more grouping if you want to avoid this, e.g.:

((0[0-9]{1})|(1[0-2]{1}))-((3[0-1]{1}|[0-2]{1}[0-9]{1}))-[2-9]{1}[0-9]{3} (([0-1]{1}[0-9]{1}|2[0-3]{1})):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}
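The difference can be checked in Python. This is only a sketch: Python's re module is a different regex dialect from XSD, and re.fullmatch is used here as an assumed stand-in for the XSD rule that a pattern must match the whole value. The variable names are illustrative.

```python
import re

# The pattern from the question, split over two lines for readability.
original = (r'(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3} '
            r'([0-1]{1}[0-9]{1}|2[0-3]{1}):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}')

# The same pattern with the extra grouping suggested above.
grouped = (r'((0[0-9]{1})|(1[0-2]{1}))-((3[0-1]{1}|[0-2]{1}[0-9]{1}))-[2-9]{1}[0-9]{3} '
           r'(([0-1]{1}[0-9]{1}|2[0-3]{1})):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}')

value = '04-21-2010 16:00:19.000'

# re.match only needs a match at the start, so the first alternative is enough.
prefix_match = re.match(original, value)
print(prefix_match.group(0))

# Requiring the whole value to match exposes the precedence problem.
whole_original = re.fullmatch(original, value)
whole_grouped = re.fullmatch(grouped, value)
print(whole_original is None, whole_grouped is not None)
```

With re.match, the original pattern succeeds because its first alternative alone matches the leading two digits; once the whole value has to match, only the grouped version is accepted.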

      



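The expression has several errors.

  • You allow 00 as a valid month.
  • A|BC matches A or BC, not AC and BC. Hence your expression, which starts with (0[0-9]{1})|, accepts the bare two characters 00 through 09 as a complete alternative. What you want is (0[1-9]|1[0-2])-, which matches only 01 through 12 followed by a dash.
  • You allow 00 as a valid day.
  • An XSD pattern must match the entire value, whereas Python's re.match is anchored only at the beginning and accepts a match that stops early. This is why your test using Python succeeded: it matched just 04. Add $ to the end of the pattern in the Python test to reproduce the failure. XSD regexes have no ^ and $ anchors, so leave them out of the schema.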
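A tightened pattern along these lines can be tried in Python before it goes into the schema. This is a sketch under assumptions: the name TIMESTAMP and the helper is_valid are hypothetical, the day range and the escaped dot are my own additions in the same spirit as the month fix, and re.fullmatch only approximates how an XSD pattern facet matches the whole value.

```python
import re

# Hypothetical tightened pattern: months 01-12, days 01-31, hours 00-23,
# and an escaped dot so that only a literal '.' precedes the milliseconds.
TIMESTAMP = (r'(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])-[2-9][0-9]{3} '
             r'([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\.[0-9]{3}')


def is_valid(value):
    """Return True only if the whole value matches the pattern."""
    return re.fullmatch(TIMESTAMP, value) is not None


samples = [
    '04-21-2010 16:00:19.000',  # well-formed
    '00-21-2010 16:00:19.000',  # month 00
    '04-00-2010 16:00:19.000',  # day 00
    '04-21-2010 16:00:19x000',  # wrong separator before milliseconds
]
for sample in samples:
    print(sample, is_valid(sample))
```

Note that a pattern like this still accepts impossible dates such as 02-31; checking calendar validity is beyond what a regex facet does well.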


By the way, why don't you use xs:dateTime? It has a very similar format: yyyy-mm-ddThh:mm:ss.fff, I think.
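If switching to xs:dateTime is an option, the existing timestamps would need to be rewritten in that layout. A minimal sketch using only the standard library follows; the function name to_xs_datetime is hypothetical, it assumes Python 3.6 or later for the timespec argument, and it produces a value without a time zone.

```python
from datetime import datetime


def to_xs_datetime(value):
    """Convert 'mm-dd-yyyy hh:mm:ss.fff' to 'yyyy-mm-ddThh:mm:ss.fff'."""
    parsed = datetime.strptime(value, '%m-%d-%Y %H:%M:%S.%f')
    # timespec='milliseconds' keeps exactly three fractional digits.
    return parsed.isoformat(timespec='milliseconds')


print(to_xs_datetime('04-21-2010 16:00:19.000'))
```

A side benefit of parsing with strptime is that impossible dates such as day 00 or month 13 raise a ValueError instead of passing silently.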
