  1. XNAT
  2. XNAT-5147

Session builder specifies all XML as UTF-8, which can break non-UTF-8 data



    • Bug
    • Resolution: Unresolved
    • Critical
    • 1.8.5
    • 1.7.4
    • Importer, Prearchive
    • None


      With the upgrade of XNAT Vagrant to use Ubuntu 16.04 by default, certain data import operations started failing with MalformedByteSequenceExceptions:

      {code:java}Caused by: org.nrg.action.ServerException: Invalid byte 2 of 3-byte UTF-8 sequence.
      at org.nrg.xnat.helpers.PrearcImporterHelper.call(PrearcImporterHelper.java:185)
      at org.nrg.xnat.restlet.actions.SessionImporter.importToPrearc(SessionImporter.java:107)
      ... 111 more
      Caused by: org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 132; Invalid byte 2 of 3-byte UTF-8 sequence.
      at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
      at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
      at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
      at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
      at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
      at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
      at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
      at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
      at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
      at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
      at org.nrg.xdat.bean.reader.XDATXMLReader.parse(XDATXMLReader.java:474)
      at org.nrg.xnat.helpers.prearchive.PrearcTableBuilder.parseSession(PrearcTableBuilder.java:38)
      at org.nrg.xnat.restlet.actions.PrearcImporterA$PrearcSession.<init>(PrearcImporterA.java:120)
      at org.nrg.xnat.helpers.PrearcImporterHelper.call(PrearcImporterHelper.java:183)
      ... 112 more
      Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
      at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
      at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
      at org.apache.xerces.impl.XMLEntityScanner.scanLiteral(Unknown Source)
      at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
      at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
      at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
      at org.apache.xerces.impl.XMLNSDocumentScannerImpl$NSContentDispatcher.scanRootElementHook(Unknown Source)
      ... 125 more{code}

      I don't know why the 14.04 to 16.04 upgrade was the critical factor in making this happen, but failure is actually the right behavior in this case: whenever this has happened, the data has header (0008,0005) SpecificCharacterSet set to ISO_IR 100, which maps to ISO-8859-1. The values extracted from the DICOM are then inserted directly into the generated XML, but the encoding declared in the XML header is UTF-8. This is fine as long as the values are limited to standard ASCII characters, but any extended character, e.g. ° (B0) or é (E9), will blow the parser up because those single bytes aren't the valid UTF-8 representations (C2 B0 and C3 A9 respectively). A lone E9 in particular reads as the lead byte of a 3-byte UTF-8 sequence, which matches the "Invalid byte 2 of 3-byte UTF-8 sequence" message above.
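
To illustrate the failure mode outside of XNAT, here is a minimal sketch (not XNAT code; the class name, method name, and sample value are made up for the example). It checks the same string encoded as ISO-8859-1 and as UTF-8 against a strict UTF-8 decoder, which is roughly what the XML parser does when the header declares UTF-8:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of XNAT.
final class Latin1AsUtf8Demo {

    /** Returns true only if the bytes are well-formed UTF-8 under strict decoding. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Renée", written with an escape so the source file encoding does not matter.
        String value = "Ren\u00e9e";

        // Bytes as they would sit in a DICOM file tagged ISO_IR 100: 52 65 6E E9 65
        byte[] latin1 = value.getBytes(StandardCharsets.ISO_8859_1);
        // Bytes the XML should contain if it declares UTF-8: 52 65 6E C3 A9 65
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);

        System.out.println("Latin-1 bytes valid as UTF-8: " + isValidUtf8(latin1)); // false
        System.out.println("UTF-8 bytes valid as UTF-8:   " + isValidUtf8(utf8));   // true
    }
}
```

Pure ASCII values pass either way, which is why this only shows up on data with extended characters.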

      The fix is to read the SpecificCharacterSet value for every processed DICOM file. Before a value is written into a bean or XML or whatever, it should be decoded using that character set and then persisted as UTF-8. With this, the XML encoding can stay UTF-8 and the values will comply with that encoding.
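
A rough sketch of that conversion step, assuming the raw attribute bytes and the SpecificCharacterSet value are both available at the point where values are extracted. The class and method names are hypothetical, and the lookup table is deliberately partial (only the default repertoire, ISO_IR 100, and ISO_IR 192); a real fix would need the full set of defined terms or an existing DICOM library's handling:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not an existing XNAT class.
final class DicomTextDecoder {

    // Partial mapping from SpecificCharacterSet defined terms to Java charsets.
    private static final Map<String, Charset> CHARSETS = new HashMap<>();
    static {
        CHARSETS.put("", StandardCharsets.US_ASCII);            // absent or empty: default repertoire
        CHARSETS.put("ISO_IR 100", StandardCharsets.ISO_8859_1);
        CHARSETS.put("ISO_IR 192", StandardCharsets.UTF_8);
    }

    /** Looks up the Java charset for a single-valued SpecificCharacterSet. */
    static Charset charsetFor(String specificCharacterSet) {
        String key = specificCharacterSet == null ? "" : specificCharacterSet.trim();
        Charset charset = CHARSETS.get(key);
        if (charset == null) {
            // Fail loudly instead of guessing and writing mis-decoded text.
            throw new IllegalArgumentException("Unsupported SpecificCharacterSet: " + key);
        }
        return charset;
    }

    /** Decodes raw attribute bytes into a Java String using the file's character set. */
    static String decode(byte[] rawValue, String specificCharacterSet) {
        return new String(rawValue, charsetFor(specificCharacterSet));
    }

    /** Re-encodes raw attribute bytes as UTF-8, ready to be written into UTF-8 XML. */
    static byte[] toUtf8(byte[] rawValue, String specificCharacterSet) {
        return decode(rawValue, specificCharacterSet).getBytes(StandardCharsets.UTF_8);
    }
}
```

With this, `DicomTextDecoder.toUtf8(new byte[] {(byte) 0xE9}, "ISO_IR 100")` yields the two bytes C3 A9. Once the value is a correctly decoded Java String, any writer configured for UTF-8 will emit valid bytes, so the decode step is the part that matters.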

      The DICOM specs for specific character set and value encoding outline ways that multiple encodings may be specified. We probably don't need to support multiple encodings now, but this should be investigated so we don't code ourselves into a place where we can't handle multiple character sets later if required.
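
One way to keep that door open, sketched under the assumption that the SpecificCharacterSet arrives as a single raw string with backslash-separated values (the usual DICOM multi-value delimiter): parse it into a list from the start and let callers check whether more than one value is present. The class and method names below are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not an existing XNAT class.
final class SpecificCharacterSetValues {

    /** Splits a raw SpecificCharacterSet string into its values, keeping empty ones. */
    static List<String> parse(String raw) {
        if (raw == null) {
            // Attribute absent: treat as one empty value (default repertoire).
            return Collections.singletonList("");
        }
        List<String> values = new ArrayList<>();
        // "\\\\" is the regex for one literal backslash; -1 keeps trailing empty values.
        for (String value : raw.split("\\\\", -1)) {
            values.add(value.trim());
        }
        return values;
    }

    /** True when more than one character set is declared for the data set. */
    static boolean isMultiValued(List<String> values) {
        return values.size() > 1;
    }
}
```

A first implementation could handle only the single-valued case and reject or log anything where `isMultiValued` is true, without having to change the method signatures later.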


              david.maffitt@wustl.edu Dave Maffitt (Inactive)
              jrherrick@wustl.edu Rick Herrick