On a Good Xml Format, ZHubert, and Webservices

As I’ve said before, I am very glad to see that Zack Hubert is making his data available through Xml. The inevitable question when you’re dealing with Xml is: what schema will you follow? That’s a good question...

But first, let me back up. Why should data be made available through Xml webservices? First, this means I can use your data and you can use mine, and I don’t have to go to the trouble of installing it on my server first. That’s convenience. Second, if I keep a copy of your data on my server and you change your data, my copy doesn’t get updated. But if I get your data via webservices, then when you change your data I get all the latest and greatest immediately. Third, we get platform independence. Zack wrote his site in PHP. I just wrote a library that fetches Zack’s data, parses it, and creates nice little objects for me to use in a .NET application. I wrote this in C# (and if anyone wants the library, just ask; I’ll put it up on the web at some point for download, but I can email it before then). It just doesn’t matter that we’re writing our stuff on different platforms. Xml helps solve the platform proliferation issue.
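To make the parse-it-into-objects idea concrete, here is a minimal sketch in Python rather than C#. It is not the library mentioned above, and the sample Xml is invented for illustration: the verse and word elements, the ref, lemma, and parse attributes, and the parsing codes are all assumptions, not Zack’s actual format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Hypothetical response from a webservice. Every element name, attribute
# name, and parsing code here is made up for the sake of the example.
SAMPLE_XML = """
<verse ref="John.1.1">
  <word lemma="en" parse="PREP">En</word>
  <word lemma="arche" parse="N-DSF">arche</word>
</verse>
"""


@dataclass
class Word:
    """One word of the text together with its morphological data."""
    text: str
    lemma: str
    parse: str


def parse_verse(xml_text: str) -> List[Word]:
    """Turn the (assumed) verse Xml into a list of Word objects."""
    root = ET.fromstring(xml_text.strip())
    return [
        Word(
            text=element.text or "",
            lemma=element.get("lemma", ""),
            parse=element.get("parse", ""),
        )
        for element in root.iter("word")
    ]


words = parse_verse(SAMPLE_XML)
for word in words:
    print(word.text, word.lemma, word.parse)
```

The consumer only ever sees the Xml, so nothing in this code cares what language the server was written in.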

So I think what Zack is doing is a really good thing. But now we have to deal with the whole schema issue. Zack’s blog entry on this issue drew two comments. First, someone mentioned using a TEI standard for the format. I’ve only looked at the TEI stuff for a minute, and I didn’t really see anything that fit. But I’m going to email the author of the comment right after I post this to see if he sees something that I don’t. I will, of course, let everyone know if something can be found. Another recommendation was OSIS. Now, I really like OSIS quite a bit, but I don’t think their schema handles morphological data. It’s a very rich schema for marking up texts, but it’s not for this kind of data (unless I’m missing something). I would not be surprised if they extended it or came up with a different schema for that. I had a brief discussion with Lowery on the issue at SBL last year, and he said they’re working on schemas for syntactical markup, so I imagine they would cover this kind of data as well. But, for the moment, I’m not sure they have anything that fits.

But also note that, ultimately, we don’t have to pick a particular schema. If there were multiple standards that would work well, then our projects could target all of them. It means more development time, but there’s really no reason to choose one over another necessarily.

So what do we do? I’m open to other suggestions, but in the absence of such a standard we’ll have to come up with one. As much as possible, though, I would say that we stick with OSIS practices. For example, let’s do stuff like use their abbreviations for books (under "Normative Abbreviations for canonical and deuterocanonical books" in their manual) and follow camel casing for multi-word element names ("thisIsAnElementName" rather than "thisisanelementname").
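As a rough illustration of those two conventions, here is a small Python sketch. The helper names (to_camel_case, osis_ref) and the tiny lookup table are hypothetical; the table holds only a handful of the OSIS abbreviations, so check the manual for the full normative list. The dotted Book.Chapter.Verse form follows the way OSIS writes references.

```python
# A handful of OSIS book abbreviations, for illustration only.
# The OSIS manual has the complete normative list.
OSIS_BOOKS = {
    "Genesis": "Gen",
    "Matthew": "Matt",
    "John": "John",
    "Romans": "Rom",
}


def to_camel_case(phrase: str) -> str:
    """Build a camel-cased element name from a multi-word phrase."""
    words = phrase.split()
    if not words:
        return ""
    # First word stays lowercase; each later word gets a leading capital.
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])


def osis_ref(book: str, chapter: int, verse: int) -> str:
    """Build a dotted reference using the OSIS book abbreviation."""
    return f"{OSIS_BOOKS[book]}.{chapter}.{verse}"


element_name = to_camel_case("this is an element name")
reference = osis_ref("John", 1, 1)
print(element_name, reference)
```

Whatever format we settle on, small helpers like these would keep element names and references consistent across projects.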

But, as I said, I’m open to suggestions. And due to the nature of the beast, this is something that should be decided in community. So, all you techies out there, speak up!