RSS 1.0 and its taxonomy module
Bringing Metadata back into RSS
Eric van der Vlist March 5, 2001

The ability to describe metadata in syndication vocabularies is key to enabling Semantic Web Services.

In this presentation, I will share practical experience showing how RSS 1.0 and its taxonomy module can be used as a pivot format to carry metadata collected in a classical news format such as XMLNews-Story into RDF or relational databases and XTM Topic Maps.


Eric van der Vlist is a consultant and contributing editor for xmlhack (http://xmlhack.com) and XML.com (http://xml.com).

He created and maintains <XML>fr (http://xmlfr.org), a French portal dedicated to XML, and 4xt (http://4xt.org), a resource site for XT users.

Eric is a seasoned software engineer and an active contributor to XML and XSL mailing lists. He is one of the authors of the RSS 1.0 proposal.

He holds an engineering degree (B.Sc.) from Ecole Centrale de Paris.


Introduction/Conclusion

I built XMLfr (http://xmlfr.org), a French site dedicated to and powered by XML, as a showcase for XML technologies, and I will use it as a real-life example throughout this talk.

This dynamic XML/XSLT web site stores its pages in the XMLNews-Story format.

The site structure is described by a set of RSS 1.0 channels, in which the semantic information encoded in the rich XMLNews-Story inline markup is converted into RSS 1.0 taxonomy markup.

These RSS channels can be consolidated in an RDF database, allowing ad hoc semantic queries on the global set of articles.

They feed RDBMS tables used for online, real-time queries that build a dynamic site index and add navigational information to the XHTML pages sent to site users.

They can be transformed into XTM Topic Maps to be displayed by Topic Map visualization systems, and enriched with statistics extracted from the database to propose topic associations.

About RSS

RSS stands for RDF (or Rich) Site Summary.

Netscape introduced RSS 0.9, one of the first RDF vocabularies, as a general-purpose site summary vocabulary for syndicating headlines on its "my.netscape" portal.

It was quickly followed by RSS 0.91, which added syndication features but dropped the RDF syntax.

Both releases are still widely used for syndication by sites such as Userland, Moreover and Meerkat, but the vocabulary seemed to have reached a dead end by mid-2000.

After the many additions of RSS 0.91, the language had lost its focus. Many requests for improvement were being made with no structure or selection process to advance them, these requests pulled in different directions at the risk of losing still more focus, and there was no way to add metadata.

The RSS 1.0 Working Group (Gabe Beged-Dov, Dan Brickley, Rael Dornfest, Ian Davis, Leigh Dodds, Jonathan Eisenzopf, David Galbraith, R.V. Guha, Ken MacLeod, Eric Miller, Aaron Swartz and Eric van der Vlist) was created with a charter to define an extensible specification built on a refocused RDF core vocabulary and a mechanism facilitating the construction of specific modules.

This specification (http://purl.org/rss/1.0/) was published in December 2000, together with a Dublin Core module and a set of supporting tools.

A taxonomy module is under discussion and the format used by XMLfr is based on the current Working Draft.

From XMLNews-Story to RSS 1.0

The RSS 1.0 channels are generated by an XSLT transformation that draws on three different sources of information:

The XMLNews-Story "/nitf/body/body.head" element contains more information than is needed to describe an RSS item, including Dublin Core elements such as dc:creator, dc:date, dc:description …

The interesting point is the possible usage of the in-line markup to generate more semantic information.

XMLfr uses three of these elements, which are pertinent to its domain: org, person and object.title.

Extracting these elements makes it possible to generate dc:subject elements that provide a list of keywords.

The goal of the taxonomy module is to replace the words traditionally used within dc:subject elements by unique identifiers.

XMLfr does so by prefixing the element name and the element value with a base URI.

Example of item description using RSS 1.0 and the DC and taxonomy modules:

<item rdf:about="http://xmlfr.org/actualites/tech/010222-0001">
  <title>Mises &#224; jour 4Suite.</title>
  <link>http://xmlfr.org/actualites/tech/010222-0001</link>
  <dc:description>Uche Ogbuji a annonc&#233; une …/…</dc:description>
  <dc:creator>Par Michael Smith, xmlhack - traduit par Eric van der Vlist, Dyomedea (vdv@dyomedea.com).</dc:creator>
  <dc:date>2001-02-22</dc:date>
  <dc:subject>4Suite Server, 4Suite, Uche Ogbuji, .../… </dc:subject>
  <taxo:topics>
    <rdf:Bag>
      <rdf:li resource="http://xmlfr.org/index/object.title/4suite+server/"/>
      <rdf:li resource="http://xmlfr.org/index/object.title/4suite/"/>
      <rdf:li resource="http://xmlfr.org/index/person/uche+ogbuji/"/>
      <rdf:li resource="http://xmlfr.org/index/object.title/python/"/>
            …/…
    </rdf:Bag>
  </taxo:topics>
  <dc:publisher>XMLfr</dc:publisher>
  <dc:type>text</dc:type>
  <dc:language>fr</dc:language>
</item>
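
The topic identifiers in the item above can be reproduced with a few lines of Python. This is only a sketch: the base URI is taken from the example, while the normalization (lowercasing the value and encoding spaces as "+") is inferred from the URIs shown and is an assumption, as is the function name topic_uri.

```python
from urllib.parse import quote_plus

# Base URI as it appears in the example item.
BASE_URI = "http://xmlfr.org/index/"


def topic_uri(element_name: str, value: str, base: str = BASE_URI) -> str:
    """Build a topic identifier from an inline element name and its text."""
    # Assumption: values are lowercased and spaces become '+',
    # as suggested by the example URIs (not a documented rule).
    slug = quote_plus(value.strip().lower())
    return f"{base}{element_name}/{slug}/"


if __name__ == "__main__":
    print(topic_uri("person", "Uche Ogbuji"))
    print(topic_uri("object.title", "4Suite Server"))
```

Because the element name is part of the identifier, the same word marked up as an org and as an object.title yields two distinct topics.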
 

Publishing such a channel lets aggregators such as Meerkat from O'Reilly (http://www.oreillynet.com/meerkat/) display your titles.

XMLfr also uses these channels internally to display its lists of articles.

RDF databases

RSS 1.0 is fully compliant with RDF and can be loaded directly into RDF databases such as rdfDB or Squish, which let you query the set of predicates using an SQL-like query language.

Such a query language is a convenient way to walk through the entire set of RDF triples, and it gives access to all the available information through joins between related objects:

load RDF file http://xmlfr.org/actualites/general.rss10 into newrss</>
0
0 </>
select ?item from newrss where
  (http://purl.org/rss/1.0/modules/taxonomy/#topics ?item  ?bag),
  (http://www.w3.org/1999/02/22-rdf-syntax-ns##li 
    ?bag  http://xmlfr.org/index/person/uche+ogbuji/) 
</>
?item
http://xmlfr.org/actualites/tech/010222-0001
0 </>
select ?topic from newrss where 
 (http://www.w3.org/1999/02/22-rdf-syntax-ns##li
    ?bag  http://xmlfr.org/index/person/uche+ogbuji/)
 (http://www.w3.org/1999/02/22-rdf-syntax-ns##li ?bag ?topic)
</>
?topic
http://xmlfr.org/index/org/fourthought/
…/…
http://xmlfr.org/index/object.title/python/
http://xmlfr.org/index/person/uche+ogbuji/
http://xmlfr.org/index/object.title/4suite/
http://xmlfr.org/index/object.title/4suite+server/
0 </>
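
To make the join semantics of these queries concrete, here is a small in-memory sketch in Python. It is not how rdfDB works internally; it only mimics the way shared variables such as ?bag tie two triple patterns together. The triples are a subset of the example item, the bag identifier "bag1" is hypothetical, and the function names match and query are made up for this illustration.

```python
RDF_LI = "http://www.w3.org/1999/02/22-rdf-syntax-ns##li"
TOPICS = "http://purl.org/rss/1.0/modules/taxonomy/#topics"
UCHE = "http://xmlfr.org/index/person/uche+ogbuji/"
ITEM = "http://xmlfr.org/actualites/tech/010222-0001"

# (predicate, subject, object), in the same order as the queries above.
# "bag1" is a hypothetical identifier for the anonymous rdf:Bag.
TRIPLES = [
    (TOPICS, ITEM, "bag1"),
    (RDF_LI, "bag1", UCHE),
    (RDF_LI, "bag1", "http://xmlfr.org/index/object.title/4suite/"),
    (RDF_LI, "bag1", "http://xmlfr.org/index/object.title/python/"),
]


def match(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple, or return None."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def query(patterns, triples):
    """Join the patterns: keep only bindings that satisfy all of them."""
    results = [{}]
    for pattern in patterns:
        extended = []
        for bindings in results:
            for triple in triples:
                new = match(pattern, triple, bindings)
                if new is not None:
                    extended.append(new)
        results = extended
    return results


if __name__ == "__main__":
    # Items that carry the "uche ogbuji" topic (first query above).
    for row in query([(TOPICS, "?item", "?bag"), (RDF_LI, "?bag", UCHE)], TRIPLES):
        print(row["?item"])
    # Topics found in the same bag as "uche ogbuji" (second query above).
    for row in query([(RDF_LI, "?bag", UCHE), (RDF_LI, "?bag", "?topic")], TRIPLES):
        print(row["?topic"])
```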

XMLfr ran for several months with rdfDB as the backend storage for its dynamic index system, through JrdfDB, a Java interface developed for this purpose that interfaces with XT so that it can be called from XSLT transformations.

However, rdfDB lacks several features that are badly needed for scalability and for developing additional applications.

RDBMS

An RDF database is not needed to keep track of the relations between topics and stories: an RDBMS with a straightforward table design has all the qualities required of an online backend store for this purpose.

XMLfr has migrated its dynamic index to a couple of PostgreSQL tables:

test=> \d topics
Table    = topics
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| channel                          | varchar()                        |   255 |
| item                             | varchar()                        |   255 |
| topic                            | varchar()                        |   255 |
+----------------------------------+----------------------------------+-------+
test=> \d items
Table    = items
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| item                             | varchar()                        |   255 |
| dcdate                           | date                             |     4 |
| title                            | varchar()                        |   255 |
| description                      | varchar()                        |   255 |
+----------------------------------+----------------------------------+-------+
 

These tables are loaded from text dumps generated by two simple XSLT transformations run against the RSS 1.0 channels.
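
XMLfr produces these dumps with XSLT. The Python sketch below is a swapped-in illustration of the same extraction, not the site's actual stylesheets: it reads a trimmed copy of the example item and emits one tab-separated line per row, with columns in the order of the two tables above. The channel's rdf:about value and the function name dump_rows are assumptions.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "taxo": "http://purl.org/rss/1.0/modules/taxonomy/",
}
RDF = "{%s}" % NS["rdf"]

# Trimmed copy of the example item; the channel URI is an assumption.
SAMPLE = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/">
  <channel rdf:about="http://xmlfr.org/actualites/general.rss10"/>
  <item rdf:about="http://xmlfr.org/actualites/tech/010222-0001">
    <title>Mises &#224; jour 4Suite.</title>
    <dc:description>Uche Ogbuji a annonc&#233; une …/…</dc:description>
    <dc:date>2001-02-22</dc:date>
    <taxo:topics>
      <rdf:Bag>
        <rdf:li resource="http://xmlfr.org/index/object.title/4suite/"/>
        <rdf:li resource="http://xmlfr.org/index/person/uche+ogbuji/"/>
      </rdf:Bag>
    </taxo:topics>
  </item>
</rdf:RDF>"""


def dump_rows(rss_text):
    """Return (items_rows, topics_rows) as lists of tab-separated lines."""
    root = ET.fromstring(rss_text)
    channel = root.find("rss:channel", NS).get(RDF + "about")
    items, topics = [], []
    for item in root.findall("rss:item", NS):
        about = item.get(RDF + "about")
        # Column order of the "items" table: item, dcdate, title, description.
        items.append("\t".join([
            about,
            item.findtext("dc:date", "", NS),
            item.findtext("rss:title", "", NS),
            item.findtext("dc:description", "", NS),
        ]))
        for li in item.findall("taxo:topics/rdf:Bag/rdf:li", NS):
            # The example uses an unqualified 'resource' attribute;
            # the qualified rdf:resource form is accepted as well.
            topic = li.get("resource") or li.get(RDF + "resource")
            # Column order of the "topics" table: channel, item, topic.
            topics.append("\t".join([channel, about, topic]))
    return items, topics


if __name__ == "__main__":
    item_rows, topic_rows = dump_rows(SAMPLE)
    print("\n".join(item_rows))
    print("\n".join(topic_rows))
```

A real dump would also need to escape tabs and newlines inside titles and descriptions, which this sketch ignores.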

The dynamic index system is reached through a table of keywords displayed with each article.

These keywords link to dynamic index pages that list the matching articles found in the database.

Topic Maps

An RSS 1.0 channel with taxonomy happens to carry all the information needed to generate an XTM 1.0 Topic Map:

<topic id="person-uche+ogbuji">
  <instanceOf>
    <topicRef xlink:href="#person"/>
  </instanceOf>
  <baseName>
    <baseNameString>uche ogbuji (person)</baseNameString>
  </baseName>
  <occurrence id="person-uche+ogbuji-1">
    <instanceOf>
      <topicRef xlink:href="#story"/>
    </instanceOf>
    <resourceRef xlink:href="http://xmlfr.org/actualites/tech/010222-0001"/>
  </occurrence>
</topic>
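
The mapping from a taxonomy topic to the XTM topic above is mechanical, as the following Python sketch suggests. It assumes that the topic id, type and base name are all derived from the two path segments of the topic URI, which matches the example but is not confirmed as the exact rule used on XMLfr; the function name xtm_topic is hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
BASE_URI = "http://xmlfr.org/index/"
HREF = "{%s}href" % XLINK

# Serialize the XLink namespace with the usual "xlink" prefix.
ET.register_namespace("xlink", XLINK)


def xtm_topic(topic_uri, stories):
    """Build a <topic> element for one taxonomy topic and its stories."""
    # ".../index/person/uche+ogbuji/" -> ("person", "uche+ogbuji")
    kind, slug = topic_uri[len(BASE_URI):].strip("/").split("/", 1)
    topic_id = f"{kind}-{slug}"
    topic = ET.Element("topic", id=topic_id)

    # The topic is an instance of its element type (person, org, ...).
    type_ref = ET.SubElement(ET.SubElement(topic, "instanceOf"), "topicRef")
    type_ref.set(HREF, f"#{kind}")

    name = ET.SubElement(ET.SubElement(topic, "baseName"), "baseNameString")
    name.text = f"{slug.replace('+', ' ')} ({kind})"

    # One occurrence of type "story" per article that mentions the topic.
    for n, story in enumerate(stories, start=1):
        occurrence = ET.SubElement(topic, "occurrence", id=f"{topic_id}-{n}")
        story_ref = ET.SubElement(ET.SubElement(occurrence, "instanceOf"), "topicRef")
        story_ref.set(HREF, "#story")
        ET.SubElement(occurrence, "resourceRef").set(HREF, story)
    return topic


if __name__ == "__main__":
    element = xtm_topic(
        "http://xmlfr.org/index/person/uche+ogbuji/",
        ["http://xmlfr.org/actualites/tech/010222-0001"],
    )
    print(ET.tostring(element, encoding="unicode"))
```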

This Topic Map can be loaded into a Knowledge Server such as empolis K42™.

Topic Maps and Topic Aerial Photographs

This Topic Map maps the site content and gives the same picture, under a different syntax, as the dynamic index system available online.

This picture is derived directly from the markup used in the articles published on the site: adding a new keyword marked up as "org", "object.title" or "person" is sufficient to create a new topic.

The obvious things missing from this Topic Map are the topic associations.

However, even if we do not know the nature of the topic associations, we can guess at their existence by looking at the topics that most often appear together in the articles.

This is easily achieved with SQL grouping and aggregates, and has been implemented on XMLfr through a very simple algorithm: for each topic, the 15 topics most often found associated with it are displayed.

The accuracy of this list is surprising.

As another example, Tim Berners-Lee is associated with XML, W3C, RDF, SVG, URI, W3C, XLink, DOM, HTML, HTTP, Java, SGML, Semantic Web, XPath and ISO, which I think is a fairly good description for such a simple algorithm.

Unfortunately, the same algorithm cannot be used directly on current RDF databases, which lack aggregates and grouping.
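
The grouping query behind such a list might look like the sketch below. It uses Python's sqlite3 module in place of PostgreSQL so that it runs standalone, and it reuses the topics table layout shown earlier; the sample rows, the shortened topic and item names, and the function name related_topics are invented for the illustration, and the actual query used on XMLfr is not shown in this paper.

```python
import sqlite3

# For one topic, count the items it shares with every other topic and
# keep the most frequent ones. DISTINCT guards against an item being
# listed in several channels.
QUERY = """
SELECT other.topic, COUNT(DISTINCT other.item) AS shared
FROM topics AS cur
JOIN topics AS other
  ON other.item = cur.item AND other.topic <> cur.topic
WHERE cur.topic = ?
GROUP BY other.topic
ORDER BY shared DESC, other.topic
LIMIT ?
"""


def related_topics(conn, topic, limit=15):
    """Return (topic, shared_item_count) pairs, most frequent first."""
    return conn.execute(QUERY, (topic, limit)).fetchall()


def demo_connection():
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE topics (channel VARCHAR(255), item VARCHAR(255), topic VARCHAR(255))"
    )
    rows = [
        ("ch", "story-a", "person/uche+ogbuji"),
        ("ch", "story-a", "object.title/4suite"),
        ("ch", "story-a", "object.title/python"),
        ("ch", "story-b", "person/uche+ogbuji"),
        ("ch", "story-b", "object.title/4suite"),
        ("ch", "story-c", "object.title/python"),
    ]
    conn.executemany("INSERT INTO topics VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for topic, shared in related_topics(demo_connection(), "person/uche+ogbuji"):
        print(topic, shared)
```

The query only counts co-occurrences; it says nothing about why two topics appear together, which is the limitation discussed next.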

It can be used to generate associations in our XTM Topic Map, though:

<association id="assoc-person-uche+ogbuji-2">
  <instanceOf>
    <topicRef xlink:href="#related"/>
  </instanceOf>
  <member>
    <roleSpec>
      <topicRef xlink:href="#from"/>
    </roleSpec>
    <topicRef xlink:href="#person-uche+ogbuji"/>
  </member>
  <member>
    <roleSpec>
      <topicRef xlink:href="#to"/>
    </roleSpec>
    <topicRef xlink:href="#object.title-4suite"/>
  </member>
</association>
 

These associations are similar to the curves that can be seen on an aerial photograph: human intervention is needed to say whether a curve is a road or a river, but I think they are usable as a first step toward finding topic associations.

In this Topic Map they have been created as almost anonymous associations (related/from/to), and they could be manually updated to transform the Topic Aerial Photograph into a Topic Map.

         
 
