How To Build a Universal Feed Reader

We will detail the steps for building a feed reader that recognizes all formats, using the XML capabilities of PHP 5. Knowledge of the structure of an RSS file is essential for this study.


Structure of an RSS file

Any syndication file contains a list of items (articles, notes or other documents) and a description of the source site, known as the channel. For the channel as well as for each item, the file provides a title, a description and a URL.

Articles or documents

In all formats, the basic data are included: the link to the article, its title, and a summary.

<item> 
    <title>RSS Tutorials</title>
    <link>http://www.scriptol.com/universal-reader.php</link>     
    <description>Tutorials for building and using RSS feeds</description> 
</item>

The names of the tags differ depending on the format used. Other data can be provided, such as the author, a logo, etc.

The channel, or website providing contents

The feed includes a description of the source, that is, the site where the documents were published: its URL, the title of the home page, and a description of the site.

<channel>
    <title></title>
    <link>http://www.scriptol.com/</link>     
    <description></description> 
</channel>

Here again, the names of the tags depend on the format used.
The article items are placed after the description of the channel, as seen in the various formats below.

Differences between formats

An overall difference between RSS 2.0 and Atom is that RSS 2.0 wraps the channel in an rss container, while Atom uses only the channel as root. The other differences are the names of the tags.
RSS 1.0, which is based on RDF, has a syntax that differs considerably from that of the two other formats.

Format RSS 2.0

The example is based on the one in the RSS 2.0 specification hosted at Harvard.

View the example

Format RSS 1.0 based upon RDF

RSS 1.0 uses the same tag names as 2.0, which will facilitate the construction of a universal reader. However, the structure differs. First, the rdf container belongs to a namespace of the same name. Second, the structure is defined in the channel tag, but the items themselves are described after it, outside the channel.

The example below is based on the specification of the standard RSS 1.0.

View the example

Even though the format is more complex, using it remains simple with the XML and DOM functions of PHP.

Structure of the Atom format

The Atom format uses the channel directly as the root container. The channel tag is feed and the items are entry tags.

View the example

As one can see, Atom uses its own tag names while the two RSS formats share the same ones. We will use this to identify the format of a feed file.
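
To make the difference between the containers concrete, here is a minimal sketch that parses three tiny feeds and prints the name of each root tag. The feed contents are hand-written for illustration (they are not the examples from the specifications), and rootName is a hypothetical helper.

```php
<?php
// Three minimal, hand-written feeds (illustrative only).
$samples = array(
    "RSS 2.0" => '<rss version="2.0"><channel><title>Site</title>'
               . '<item><title>A</title></item></channel></rss>',
    "RSS 1.0" => '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
               . ' xmlns="http://purl.org/rss/1.0/"><channel><title>Site</title></channel>'
               . '<item><title>A</title></item></rdf:RDF>',
    "Atom"    => '<feed xmlns="http://www.w3.org/2005/Atom"><title>Site</title>'
               . '<entry><title>A</title></entry></feed>',
);

// Hypothetical helper: name of the root container of an XML string.
function rootName($xml)
{
    $doc = new DOMDocument("1.0");
    $doc->loadXML($xml);
    return $doc->documentElement->nodeName;
}

foreach ($samples as $format => $xml) {
    echo $format, ": ", rootName($xml), "\n";
}
```

The three root names are rss, rdf:RDF and feed, which is what the rest of this study relies on.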

Using DOM with PHP 5

The Document Object Model makes it possible to extract tags from an XML or HTML document. We will use the getElementsByTagName function to get the list of tags whose name is given as a parameter. This function returns a DOMNodeList, which contains DOMNode objects. It applies to the whole document or to a single DOMNode element, and can thus extract a part of the file (the channel or an item) and, within that part, a list of tags.

Extracting the RSS channel
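$doc = new DOMDocument("1.0");                       // DOMDocument
$doc->load("http://www.scriptol.com/rss.xml");       // load the feed before querying it
$channel = $doc->getElementsByTagName("channel");    // DOMNodeList

For Atom, we will use the parameter "feed" instead of "channel". Note that the class names in the comments are for informational purposes only; the PHP code does not need them.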

Extracting the first element
$element = $channel->item(0);    // DOMElement

The item() method is declared as returning a DOMNode, but for a tag the object returned is in fact a DOMElement. The advantage is that DOMElement has properties and methods to access the contents of the element.

Extracting all elements
for($i = 0; $i < $channel->length; $i++)
{
    $element = $channel->item($i);
}
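
As an alternative to the indexed loop, a DOMNodeList can also be traversed with foreach. The sketch below assumes a small hand-written RSS 2.0 feed; the feed contents are illustrative only.

```php
<?php
$doc = new DOMDocument("1.0");
$doc->loadXML('<rss version="2.0"><channel><title>Site</title>'
    . '<item><title>First</title></item>'
    . '<item><title>Second</title></item></channel></rss>');

$items = $doc->getElementsByTagName("item");

// foreach visits the same nodes as the indexed loop with item($i)
$count = 0;
foreach ($items as $element) {
    $count++;
    echo $element->nodeName, " ", $count, "\n";
}
```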

Using the data of an element

For each item, as for the channel, the components are extracted with the same method and with the firstChild property. For example, the title:

$title = $element->getElementsByTagName("title");   // getting the list of title tags
$title = $title->item(0);  // getting one tag
$title = $title->firstChild->textContent;  // getting its content

Without a method for extracting a single element, getElementsByTagName is used to extract a list that actually contains one element, and by using item, we get this element.
In XML, the content of a tag is treated as a child node, so we use the firstChild property to get the content of an XML element, and textContent for the text itself.

It remains to apply these methods on the channel and on each element of the feed to retrieve its contents.
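
A feed may omit a tag, or leave it empty, in which case item(0) or firstChild is null and the chained call fails. The following is a minimal sketch of a more defensive variant; firstTagText is a hypothetical helper name and the sample feed is hand-written.

```php
<?php
// Hypothetical helper: text of the first $tag inside $element, or "" if absent.
function firstTagText($element, $tag)
{
    $list = $element->getElementsByTagName($tag);
    if ($list->length == 0) {
        return "";                       // the tag is missing
    }
    // textContent on the element itself also works when the tag is empty
    return $list->item(0)->textContent;
}

$doc = new DOMDocument("1.0");
$doc->loadXML('<rss version="2.0"><channel><title>Site</title>'
    . '<item><title>RSS Tutorials</title></item></channel></rss>');
$item = $doc->getElementsByTagName("item")->item(0);

$title   = firstTagText($item, "title");         // the tag exists
$summary = firstTagText($item, "description");   // the tag is absent here
```

Reading textContent on the element rather than on firstChild is a design choice of this sketch: it returns the same text for a simple tag and avoids a null firstChild.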

For more general use, the function returns the contents in a two-dimensional array. It is then up to the programmer to display it directly in a Web page, or to perform some processing on the array.

How to identify the format

Identifying the format is very simple once we know that RSS 1.0 and 2.0 use the same tags, and therefore that the same functions can apply to both formats. We recognize Atom by the feed container, while RSS 2.0 uses rss and RSS 1.0 uses rdf:RDF.
Because both RSS versions use the channel tag, the feed tag is enough to recognize Atom.

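$doc = new DOMDocument("1.0");                      // DOMDocument
$doc->load("http://www.scriptol.com/rss.xml");      // load the feed before querying it
$channel = $doc->getElementsByTagName("feed");      // DOMNodeList
$isAtom = ($channel->length > 0);                   // the list is empty when there is no feed tag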

We try to extract the feed tag. If the parser finds this tag, the DOMNodeList will contain an element and the isAtom flag is set to true; otherwise we treat the feed as RSS, without distinguishing between versions.
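
The test can be tried without a network connection. The sketch below wraps it in a hypothetical helper, isAtomFeed, and applies it to two hand-written feeds loaded from strings; the feed contents are assumptions made for the example.

```php
<?php
// Hypothetical helper: true when the XML string is an Atom feed.
function isAtomFeed($xml)
{
    $doc = new DOMDocument("1.0");
    $doc->loadXML($xml);
    // getElementsByTagName always returns a DOMNodeList, so we test its length
    return $doc->getElementsByTagName("feed")->length > 0;
}

$atom = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Site</title></feed>';
$rss  = '<rss version="2.0"><channel><title>Site</title></channel></rss>';

var_dump(isAtomFeed($atom));
var_dump(isAtomFeed($rss));
```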

Reading data channel

We know how to extract the channel. The same function can be used with the string "feed" or "channel" as parameter. It is assumed that the pointer to the document is the global variable $doc.

function extractChannel($chan)
{
   global $doc;                                      // the document loaded earlier
   $channel = $doc->getElementsByTagName($chan);     // DOMNodeList
   return $channel->item(0);
}

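With the following function, called with the name of each tag as parameter, we can then read the title and the description of the channel. It assumes that the global variable $channel holds the element returned by extractChannel.

function getTag($tag)
{
   global $channel;                                  // element returned by extractChannel
   $content = $channel->getElementsByTagName($tag);
   $content = $content->item(0);
   return($content->firstChild->textContent);
}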

We then call the function successively with "title", "link", "description" and so on as parameter.
The names depend on the format: it will be "summary" for Atom and "description" for the others.
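
One way to handle the differing names is a small lookup table indexed by format. The sketch below is an assumption about how this could be organized, not the code of the reader itself; $tagNames and the sample entry are hypothetical. It also shows that in Atom the URL is carried by the href attribute of the link tag rather than by its text.

```php
<?php
// Hypothetical table: tag that carries each piece of data, per format.
$tagNames = array(
    "rss"  => array("item" => "item",  "summary" => "description"),
    "atom" => array("item" => "entry", "summary" => "summary"),
);

$doc = new DOMDocument("1.0");
$doc->loadXML('<feed xmlns="http://www.w3.org/2005/Atom"><title>Site</title>'
    . '<entry><title>RSS Tutorials</title>'
    . '<link href="http://www.scriptol.com/universal-reader.php"/>'
    . '<summary>Tutorials for building and using RSS feeds</summary>'
    . '</entry></feed>');

$format = ($doc->getElementsByTagName("feed")->length > 0) ? "atom" : "rss";
$names  = $tagNames[$format];
$entry  = $doc->getElementsByTagName($names["item"])->item(0);

$summary = $entry->getElementsByTagName($names["summary"])->item(0)->textContent;

// Atom stores the URL in the href attribute of <link>, not in its text
$link = $entry->getElementsByTagName("link")->item(0)->getAttribute("href");
```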

Reading data elements

The principle is the same, but we have to loop over a list of items, whereas there is only one channel.

We must also take into account the fact that RSS 1.0 puts the descriptions of the items outside the channel tag, while they are contained inside it in the other formats. The items are contained in feed in Atom and in channel in RSS 2.0, but in rdf:RDF in RSS 1.0.

The extractItems function extracts the list of items; its parameter is "item" for RSS and "entry" for Atom:

function extractItems($tag)
{
   global $doc;                                  // the document loaded earlier
   $dnl = $doc->getElementsByTagName($tag);      // DOMNodeList
   return $dnl;
}
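
Because getElementsByTagName is applied to the whole document, the position of the items relative to the channel does not matter. The sketch below checks this on three hand-written feeds (illustrative contents only) and prints the parent of the first item in each format.

```php
<?php
// Minimal hand-written feeds: format => array(item tag, XML).
$feeds = array(
    "RSS 2.0" => array("item", '<rss version="2.0"><channel><title>Site</title>'
        . '<item><title>A</title></item></channel></rss>'),
    "RSS 1.0" => array("item", '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
        . ' xmlns="http://purl.org/rss/1.0/"><channel><title>Site</title></channel>'
        . '<item><title>A</title></item><item><title>B</title></item></rdf:RDF>'),
    "Atom"    => array("entry", '<feed xmlns="http://www.w3.org/2005/Atom"><title>Site</title>'
        . '<entry><title>A</title></entry></feed>'),
);

$parents = array();
$counts  = array();
foreach ($feeds as $format => $pair) {
    list($tag, $xml) = $pair;
    $doc = new DOMDocument("1.0");
    $doc->loadXML($xml);
    $list = $doc->getElementsByTagName($tag);      // searched in the whole document
    $counts[$format]  = $list->length;
    $parents[$format] = $list->item(0)->parentNode->nodeName;
    echo $format, ": ", $counts[$format], " ", $tag, "(s) inside ", $parents[$format], "\n";
}
```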

The returned list is used to access each item, which is pushed into the array $a. Example with the RSS format:

$a = array();
$items = extractItems("item");
for($i = 0; $i < $items->length; $i++)
{
    array_push($a, $items->item($i));
}

One can also directly create an array of the tags of an item (title, link, description) for each item and place it in a two-dimensional array.
To do this, we use a generic version of the getTag function defined earlier:

function getTag($item, $tag)
{
   $content = $item->getElementsByTagName($tag);
   $content = $content->item(0);
   return($content->firstChild->textContent);
}

$FeedArray = array();
for($i = 0; $i < $items->length; $i++)
{
    $a = array();
    $item = $items->item($i);
    array_push($a,  getTag($item, "title"));
    // ... and so on for each tag of the item...

    array_push($FeedArray, $a);
}

We placed each article in a two-dimensional array that can be simply displayed or used as we want. The loop will be put in the getTags function.
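
Put together, the construction of the two-dimensional array can be sketched as follows. The names tagText and collectItems are hypothetical and the sample feed is hand-written; this is an illustration of the principle under those assumptions, not the getTags function of the actual reader.

```php
<?php
// Text of the first $tag inside $item, or "" when the tag is absent.
function tagText($item, $tag)
{
    $list = $item->getElementsByTagName($tag);
    return ($list->length > 0) ? $list->item(0)->textContent : "";
}

// Hypothetical helper: one row per item, one column per requested tag.
function collectItems($doc, $itemTag, $tags)
{
    $rows = array();
    foreach ($doc->getElementsByTagName($itemTag) as $item) {
        $row = array();
        foreach ($tags as $tag) {
            array_push($row, tagText($item, $tag));
        }
        array_push($rows, $row);
    }
    return $rows;
}

$doc = new DOMDocument("1.0");
$doc->loadXML('<rss version="2.0"><channel><title>Site</title>'
    . '<item><title>First</title><link>http://www.scriptol.com/</link>'
    . '<description>One</description></item>'
    . '<item><title>Second</title><link>http://www.scriptol.com/</link></item>'
    . '</channel></rss>');

$FeedArray = collectItems($doc, "item", array("title", "link", "description"));
print_r($FeedArray);
```

For an Atom feed, the same call would take "entry" as the item tag and "summary" in place of "description".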

Functions of the complete reader

We now have a list of all functions useful for the universal reader.

ExtractChannel Extracts the tag of the channel into an object.
ExtractItems Extracts the items of the document as an object.
GetTag Reads data from a tag.
GetTags Places the contents of an element (article or channel) in an array.

With the appropriate parameters, these functions are used for all formats.

Universal_Reader Encompasses the entire process for a given feed, whatever its format.
Universal_Display Customizable function to display a feed in an HTML page.
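
To give an idea of what a display function may look like, here is a minimal sketch that turns a two-dimensional array into an HTML list. It is not the code of Universal_Display; displayRows is a hypothetical name, and the row layout (title, link, description) is an assumption.

```php
<?php
// Hypothetical display helper: $rows holds (title, link, description) triples.
function displayRows($rows)
{
    $html = "<ul>\n";
    foreach ($rows as $row) {
        list($title, $link, $description) = $row;
        // escape the feed data before putting it into the page
        $html .= '<li><a href="' . htmlspecialchars($link) . '">'
               . htmlspecialchars($title) . "</a> "
               . htmlspecialchars($description) . "</li>\n";
    }
    return $html . "</ul>\n";
}

$rows = array(
    array("RSS Tutorials", "http://www.scriptol.com/universal-reader.php",
          "Tutorials for building & using RSS feeds"),
);
echo displayRows($rows);
```

Escaping with htmlspecialchars is a choice of this sketch: the text of a feed comes from a third party and should not be inserted as raw HTML.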

Loading the feed

In the most basic case, the feed is intended to be integrated into a Web page, either when the page is loaded or later, at the user's request.

Whatever the format, and especially for feeds in languages with accents, care must be taken over the compatibility of the encodings, which is most often UTF-8 for the feed and sometimes ISO-8859-1 or windows-1252 for the page where it will appear. It is better to give the page the UTF-8 encoding to avoid a bad display of accented characters.

The encoding is given by the content-type meta with a line in the following format:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">
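
If the page cannot be served as UTF-8, the text extracted from the feed has to be converted, since the DOM extension returns its strings in UTF-8. The following is a minimal sketch using iconv, assuming the iconv extension is available and that the target page is in ISO-8859-1.

```php
<?php
// DOM hands back text as UTF-8, whatever the encoding declared by the feed.
$utf8 = "Caf\xC3\xA9";                 // "Café" encoded in UTF-8 (5 bytes)

// Convert only if the page really cannot be served as UTF-8.
$latin1 = iconv("UTF-8", "ISO-8859-1", $utf8);

echo strlen($utf8), " bytes in UTF-8, ", strlen($latin1), " bytes in ISO-8859-1\n";
```

Characters that do not exist in the target encoding cannot be converted this way, which is one more reason to prefer a UTF-8 page.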

Loading with the page

To display a page that includes a feed, insert the following code in the HTML code:

<?php
include("universal-reader.php");
Universal_Reader("http://www.scriptol.com/rss.xml");
echo Universal_Display();
?>

See the demonstration given below.

Loading at request

This case arises when the visitor chooses a feed from a list or enters the name of the feed.
The loading can be done with Ajax for an asynchronous display, or in PHP alone by displaying the whole page again.
We will use a form with a text input field to give the URL of the feed, or a single link (or a choice of links) on which one clicks to see a feed.
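
For the PHP-only case with a choice of links, the selection can be sketched as below. The list $feeds, the query parameter name "feed" and the helper chooseFeed are assumptions made for the example. Restricting the choice to a known list is a design decision of this sketch; a free text field would need its own validation of the URL.

```php
<?php
// Hypothetical list of feeds offered to the visitor.
$feeds = array(
    "scriptol" => "http://www.scriptol.com/rss.xml",
);

// Hypothetical helper: URL for the requested key, or null if unknown.
function chooseFeed($feeds, $request)
{
    $key = (isset($request["feed"]) && is_string($request["feed"]))
         ? $request["feed"] : "";
    return isset($feeds[$key]) ? $feeds[$key] : null;
}

$url = chooseFeed($feeds, $_GET);      // e.g. page.php?feed=scriptol
if ($url !== null) {
    // Here the page would call the reader, as in "Loading with the page":
    // include("universal-reader.php");
    // Universal_Reader($url);
    // echo Universal_Display();
    echo "Loading ", $url, "\n";
}
```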

Demonstration
