(Mads Freek Petersen)
Metadata files are becoming large, and this is a problem for doing metadata validation on AWS, where there is a limit of 500 MB of memory. Golang has very weak support for XML, so the option is libxml2. I'd like to have a Go XML parser, but I need to be able to do canonicalization and schema checking.
It was a fun exercise to see how difficult it is, in the limited context of metadata for eduGAIN.
The idea is a streaming parser that emits the tokens one gets from the XML.
You don't need to do it all at once. A critical part of this can be done in a streaming fashion: you parse a piece, hash it and throw it away.
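(Editor's sketch, not the code shown in the session.) The parse, hash and throw away idea can be illustrated in Python with the standard library's incremental SAX parser and `hashlib`. The names `HashingHandler` and `stream_digest` are made up for this example, and the serialisation is deliberately simplified: it sorts attributes by qualified name and ignores namespaces, so it is not real XML canonicalization.

```python
import hashlib
import xml.sax
from xml.sax.saxutils import escape


class HashingHandler(xml.sax.ContentHandler):
    """Serialises each SAX event and feeds it straight into a running hash."""

    def __init__(self):
        super().__init__()
        self.digest = hashlib.sha256()

    def _emit(self, token):
        # The token is hashed and then dropped; nothing is accumulated.
        self.digest.update(token.encode("utf-8"))

    def startElement(self, name, attrs):
        self._emit("<" + name)
        # Simplification: sort by qualified name, no namespace handling.
        for key in sorted(attrs.getNames()):
            value = escape(attrs.getValue(key), {'"': "&quot;"})
            self._emit(f' {key}="{value}"')
        self._emit(">")

    def characters(self, content):
        self._emit(escape(content))

    def endElement(self, name):
        # Empty elements arrive as a start/end pair, so <b/> becomes <b></b>.
        self._emit("</" + name + ">")


def stream_digest(chunks):
    """Feed byte chunks to the parser and return the hex digest of the tokens."""
    handler = HashingHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return handler.digest.hexdigest()
```

Because both the parser and the hash accept data incrementally, the digest does not depend on how the input is split into chunks, and memory use stays bounded by the chunk size rather than the size of the feed.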
It's a lot easier than I first thought: about 50 lines of PHP code make the changes I needed, mostly about sorting the attributes. The one problem I found was inclusive canonicalization for SignedInfo, and I wanted to find out what that means. It has to do with how namespaces are declared.
The metadata feed has about 10 thousand entities and is 70 MB. Here I ran this saxvalidate PHP script to see how long it takes.
I: You parse the XML and emit parts of it: this tag is now complete, so you emit this part, grab it, canonicalize it and get out the hash. But how does the hashing work?
M: You don't wait until the end of the tag. The problem is the start of the tag and the namespaces. You just emit whatever you have, and the hash is also a streaming thing: you can keep adding new material to the hash, and at the end you can ask what the final hash is. In Go, you don't have the streaming hashing. In PHP it's about 50 lines of code, much easier than I thought, and it handles both the inclusive and the exclusive variant.
I: You probably shouldn't do that. There were some issues where you can hide CDATA tags and the parser gets confused. Parsers make mistakes, so it's always safer to remove the comments.
M: It doesn't do any fancy XML things. It doesn't do external things at all and it doesn't replace entities. It's just for a plain metadata feed, a limited set of XML documents, not the full set of everything.
I: You wrote the parser?
M: No, it's called XMLReader. It's bindings to libxml2, if you can really call them bindings: they have their own API and they use libxml2 behind the scenes.
I: Okay so the code?
shows him the code
It takes care of the signature. It resolves the attributes, sorts them and emits them. The sorting is by the long namespace (the URI) and not the short one (the prefix), but you have to use the short one to generate the hash.
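(Editor's sketch.) The ordering rule described here, sorting by namespace URI while still emitting the prefix, might look like this in Python. The function names and the `prefixes` mapping (prefix to namespace URI, as in scope for the element) are assumptions for illustration; namespace declarations themselves and value escaping are left out.

```python
def sort_attributes(attrs, prefixes):
    """Order qualified attribute names by (namespace URI, local name)."""

    def sort_key(qname):
        prefix, _, local = qname.rpartition(":")
        # Attributes without a prefix have no namespace and sort first.
        uri = prefixes[prefix] if prefix else ""
        return (uri, local)

    return sorted(attrs, key=sort_key)


def render_attributes(attrs, prefixes):
    """Emit the attributes in sorted order, keeping the original prefixes."""
    return "".join(
        f' {qname}="{attrs[qname]}"' for qname in sort_attributes(attrs, prefixes)
    )
```

For example, with `prefixes = {"a": "urn:zzz", "b": "urn:aaa"}`, the attribute `b:x` is emitted before `a:y` because its URI sorts first, even though its prefix sorts last.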
I: If I parse this document and use different namespace prefixes, the validation will not work, which is very bad. There is a whole paper proposing a new canonicalization algorithm that replaces every prefix with its namespace.
In the SAML standard, there’s an extension for XML messages.
M: Emit them, and if it's the last one you have to end the tag. If there's nothing, an empty element, then emit the name. If it's whitespace, then just replace the symbols, like & with '&amp;'. When I take the metadata and do it in a streaming way directly from the network, it fails, but if I download the file and try it, it works.
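(Editor's sketch.) The symbol replacement mentioned here can be written as two small Python helpers. The replacement tables follow the editor's reading of the character escaping rules in Canonical XML 1.0; the function names are made up for the example.

```python
def escape_text(value):
    """Escape character data for canonical output."""
    return (
        value.replace("&", "&amp;")  # must come first
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\r", "&#xD;")
    )


def escape_attr(value):
    """Escape an attribute value for canonical output."""
    return (
        value.replace("&", "&amp;")  # must come first
        .replace("<", "&lt;")
        .replace('"', "&quot;")
        .replace("\t", "&#x9;")
        .replace("\n", "&#xA;")
        .replace("\r", "&#xD;")
    )
```

The ampersand is replaced first so that the ampersands introduced by the other replacements are not escaped a second time.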
From a Go viewpoint, I just need to find a good parser. We have what we call generative XPath: if the things an expression refers to are not there, they will be created in the XML, and it's easier to do that one attribute at a time. That's why I believe that I might be able to do it in a pure Go way.
I: You get the tag, you process it, and for the next one you recurse until you get to a leaf element, and then you roll back.
M: This is what XMLReader does.
I: I know that the Python parser can do this, emit tags, and with that we could do something similar.
M: One of the reasons I did this was that Scott said he had this problem with pyFF. Unless you create the signature, you otherwise just wait for the network.
I: XML is complicated; the whole spec defines many things. If you look at it as a canonicalization format, things become much simpler. It's not a tree, it's more than that, and that makes things harder to implement.
M: There isn’t a standard to map it to a tree, so everyone makes their own.
M: XPath is the only thing implemented. I like XML and XPath and it’s very nice.
L: If we had JSONPath that would be nice. XPath is really nice, implementations can be very clever, and it's very nice to have it.
M: We wanted to have an XML editor for that, for the key-value thing and XML, AttributeConsumingService and so on, and there's a rule, but I didn't want to write it twice. You can handle one repeating attribute at a time, and then you have an XPath, so we parse and look at the sequence, and the whole run takes 45 seconds, which is extremely fast.
L: Why would you use the schema to generate the editor? You have laid out the data that you want in the schema, which defines how your data is structured. You generate the editor from it, but you still have a single place where everything is defined, and from that you can generate other things. Besides an editor and validation, you can also generate data: once you know the types of the things, it's very easy to generate a string or an array, and that gives you a starting point for generative testing. I run tests and see if things explode, and you don't have to write individual tests.
The other thing is that everyone is still thinking that we have these big XML things and need a way to get a signature. You did this in an event-driven manner. It might not be the same project, but if you emit the tags it's the same way of thinking; the only point is that by doing that you emit things that are incomplete. Many people can't think this way, so it's a very nice thing you have.
In Python we have the SAML library, it’s huge and it’s a bit complex because of Roland.