A little while back I was tasked with solving a problem that went something like this:
Create content in a blog engine and have that content show up in Alfresco transformed to a predefined canonical form.
Blog users, not necessarily inside the corporate firewall, author content using familiar tools of their choice (like WordPress), and something pulls that content into Alfresco. If that content is then pushed through an approval process and from Alfresco to some content delivery infrastructure, say the corporate web site, then effectively any external blogger can contribute content to the website without needing a VPN setup or a corporate account.
Assume a setup where Alfresco is already being used to store enterprise content and it already has a model for content representation. Furthermore there exists a publishing mechanism to push that content to the edge for serving.
Obviously the first thing to do was to look at what Alfresco has out of the box in terms of blog integration. A quick look at the code shows something related to blog integration, and the wiki explains:
This basically allows one to take a piece of content within Alfresco, add some blog-specific metadata to it, and publish it to TypePad or WordPress. This is the reverse of what I was trying to do, so I had to find another way.
Basically, the problem can be distilled to: pull new blog entries from one or more blogs, transform the content to the designated canonical form, then store it in Alfresco based on rules (more on that later).
The first thing that came to mind was to check if Mule had an RSS or Atom transport, and indeed it does. Mule has a community transport for RSS that is able to pull down an RSS feed into ROME feed objects. The transport can be found on the Mule Forge here: http://mule.mulesource.org/display/RSS/Home
All that was needed then was to pull down the feed, split it into messages (one message per post), run each message through an XSLT (easily done in Mule), and drop the transformed blog entries into Alfresco over CIFS.
However, that left me with two problems: (i) the blog poller needs to be idempotent (don’t pull down the same blog entry twice); (ii) it needs to handle custom namespaces and custom fields in the feed.
The first problem was addressed by writing an idempotent receiver inbound router. The router quite simply remembers the date and time of the last blog post it received and uses that to let only newer posts through.
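The idea behind that router can be sketched in plain Java, outside of Mule. This is only an illustration of the "remember the newest timestamp" technique, not the actual router: `Post` and `LastSeenFilter` are hypothetical names, and no Mule or ROME classes are used.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a feed entry; not a Mule or ROME type.
final class Post {
    final String title;
    final Instant published;

    Post(String title, Instant published) {
        this.title = title;
        this.published = published;
    }
}

// Remembers the publication time of the newest post seen so far and
// lets through only posts that are strictly newer than that.
final class LastSeenFilter {
    private Instant lastSeen = Instant.EPOCH;

    synchronized List<Post> acceptNew(List<Post> polled) {
        List<Post> fresh = new ArrayList<>();
        Instant newest = lastSeen;
        for (Post p : polled) {
            if (p.published.isAfter(lastSeen)) {
                fresh.add(p);
                if (p.published.isAfter(newest)) {
                    newest = p.published;
                }
            }
        }
        // Advance the high-water mark only after the whole batch is scanned,
        // so that posts arriving in any order within one poll are all kept.
        lastSeen = newest;
        return fresh;
    }
}
```

Two caveats with this sketch: the timestamp lives in memory, so a restart would let old posts through again unless it is persisted somewhere, and a post published with exactly the same timestamp as the last one seen would be dropped.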
The second problem was a bit trickier to solve. Extending ROME with custom modules is certainly possible, and though it would solve the problem of pulling in custom fields, it’s a bit cumbersome, and I would have to update these modules every time the fields in the source RSS feed changed.
What I was really after was segmentation of the RSS feed into individual blog posts, and the transformation of those individual snippets of XML into a predefined canonical form.
So all I really needed was a simple XML feed splitter: an outbound router that splits the RSS XML feed into individual posts, plus a couple of transformers that convert messages from an XML byte array to a JDOM Document and back. That is all it took to make it happen.
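A minimal sketch of the splitting step follows. It is not the router I wrote: to stay self-contained it uses the JDK's built-in DOM and transformer APIs in place of JDOM, and `FeedSplitter` is a hypothetical name. It assumes an RSS 2.0 feed in which every post is an `<item>` element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FeedSplitter {
    // Splits an RSS 2.0 document into one XML string per <item>.
    static List<String> split(String rssXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document feed = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));

        // An identity transformer, used here only to serialize each subtree.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        NodeList items = feed.getElementsByTagName("item");
        List<String> posts = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(items.item(i)), new StreamResult(out));
            posts.add(out.toString());
        }
        return posts;
    }
}
```

Because each `<item>` is copied as a whole subtree, any custom fields inside it come along as ordinary child elements, and the splitter itself never needs to know their names.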
Mule pulled everything together quite nicely: an HTTP connector polls periodically for posts, an XML splitter segments the RSS feed, and an idempotent router ensures that only new posts make it through. Next comes an XML transformation that converts each blog post to the canonical representation, and finally a file transport drops the post into Alfresco.
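The last two stages can be illustrated outside of Mule with the JDK's own XSLT support. Everything specific here is an assumption made for the example: the `<article>` canonical form, the stylesheet, and the `CanonicalWriter` name are invented, and the target directory simply stands in for wherever the Alfresco CIFS share is mounted.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CanonicalWriter {
    // Hypothetical stylesheet mapping an RSS <item> to an invented <article> form.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/item'>"
        + "<article>"
        + "<headline><xsl:value-of select='title'/></headline>"
        + "<body><xsl:value-of select='description'/></body>"
        + "</article>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to a single post and returns the canonical XML.
    static String toCanonical(String itemXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(itemXml)),
                new StreamResult(out));
        return out.toString();
    }

    // Writes the canonical XML into the given directory, for example a
    // locally mounted CIFS share, and returns the path of the new file.
    static Path drop(Path directory, String fileName, String canonicalXml) throws Exception {
        Path target = directory.resolve(fileName);
        return Files.write(target, canonicalXml.getBytes(StandardCharsets.UTF_8));
    }
}
```

In the Mule setup these are separate components wired together by configuration rather than method calls, which is what makes it easy to swap the stylesheet or the destination without touching the rest of the flow.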