John Wilbanks, VP of Science for Creative Commons, gave O'Reilly Media an exclusive sneak preview of a joint announcement that Creative Commons and Microsoft will make later today at the O'Reilly Emerging Technology Conference.
According to John, who talked to us shortly after getting off a plane from Brazil, Microsoft will be releasing, under an open source license, Word plugins that will allow scientists to mark up their papers with scientific entities directly.
"The scientific culture is not one, traditionally, where you have hyperlinks," Wilbanks told us. "You have citations. And you don't want to do cross-references of hyperlinks between papers, you want to do links directly to the gene sequences in the database."
Wilbanks says that Science Commons has been working for several years to build up a library of these scientific entities. "What Microsoft has done is to build plugins that work essentially the same way you'd use spell check, they can check for the words in their paper that have hyperlinks in our open knowledge base, and then mark them up."
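The spell-check analogy can be made concrete with a short Python sketch. This is not the plugin's code, which is not shown here; it only illustrates the general idea of scanning text for terms that appear in a knowledge base and wrapping them in links. The dictionary contents, the example.org URLs, and the function name mark_up are all assumptions made up for illustration.

```python
import re

# Hypothetical stand-in for the open knowledge base: term -> link target.
# The terms and example.org URLs are invented for this illustration.
KNOWLEDGE_BASE = {
    "BRCA1": "https://example.org/entity/BRCA1",
    "TP53": "https://example.org/entity/TP53",
}


def mark_up(text, knowledge_base=KNOWLEDGE_BASE):
    """Wrap each known term in an HTML link, leaving other words untouched."""
    if not knowledge_base:
        return text
    # Try longer terms first so a long term wins over a shorter one it contains.
    terms = sorted(knowledge_base, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b"
    )

    def link(match):
        term = match.group(1)
        return '<a href="{}">{}</a>'.format(knowledge_base[term], term)

    return pattern.sub(link, text)


if __name__ == "__main__":
    print(mark_up("Mutations in BRCA1 and TP53 were observed."))
```

A real tool would presumably resolve terms against curated ontologies rather than a hard-coded dictionary, and would write the links into the Word document rather than emit HTML.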
Wilbanks told us that in addition to open sourcing this plugin, they are open sourcing a second plugin that lets an author easily assign a Creative Commons license to a paper. He says that releasing the source will allow anyone to come along and hack the current plugin, which is largely directed toward the life sciences, and extend it to work with different databases in different disciplines, as well as to publish the data embodied in Word documents in the standard XML formats used in the sciences.
The plugins embed a lot of knowledge about how Word style sheets work, but John believes that because the source is being released under the Microsoft Public License (Ms-PL), developers can mine the code to create similar functionality in other tools, such as OpenOffice, or use the existing "well baked" free software that integrates with the Science Commons data.
"This makes it easy to add scientifically accurate, persistent hyperlinks to articles. Right now, it's really hard to put hyperlinks into your articles that leverage databases. And what this does is make it dead easy. The hope is that by creating these forests of hyperlinks, things like Google actually start to work. Right now Google doesn't work very well on the scholarly literature, because there are no hyperlinks. You can also start to do relevance based searching and aggregation."
He believes that this will bring the power of semantic markup to the scientific masses. "Right now, if you want to use ontologies, if you want to use semantics, you've got to be kind of an alpha geek."
The work on Microsoft's part was done by Microsoft Research. "One of their mandates is to really work closely with the academic and scholarly publishing community. I've been talking off and on with guys like Tony Hey ... about just how important it is to get the semantics in, and he's really bought into that, that we need to connect semantics in, but it has to happen at a really broad scale, we need to get not only the semantic web alpha geeks, but also the everyday scientists using this stuff."
According to Wilbanks, Microsoft was already working on the technology with people at the University of California at San Diego, especially ontology guru Philip Bourne. When the Science Commons team heard about the work, they suggested that "it might be nice, instead of one ontology, to use 75 ontologies at once." John notes that Science Commons already has an integrated set of ontologies and taxonomies and sources, and that it allowed the Microsoft and UCSD team members to "turbo-charge" the work they were already doing.
The Microsoft and UCSD team had already done a lot of work, including a plugin that converts a document to PubMed Central-compliant XML. Wilbanks says that when you put it all together, it starts to allow for some lightweight but very powerful semantic publishing inside Word. "What they did was to take the work they had been doing and tie it to the open databases that we had built."
He is also excited to have secured agreement on a stable and open naming system. Right now, he says, a given gene may be identified in myriad ways in different systems. Wilbanks says that one piece Science Commons is bringing to the table is a new project called the Common Naming Project, which he calls a domain naming system for the sciences. They hope to get fairly rigorous and stable naming conventions into the hands of the community.
Wilbanks said that Word is, in his experience, the dominant publishing system used in the life sciences, although tools like LaTeX are popular in disciplines such as chemistry and physics. Even in those fields, he says, Word is probably where most people prepare drafts. "Almost everything I see when I have to peer review is in a .doc format."