Sitemaps
The latest version of the starter pack generates a sitemap for your documentation using the sphinx-sitemap extension.
This page goes over the nuances of configuring sitemaps, as well as how the extension must be configured in your starter pack project.
Read the Docs-generated sitemaps
Read the Docs (RTD) generates a basic sitemap that points to the index page and relies on crawlers to index the rest of the site. This is sufficient for some projects, but RTD does not generate sitemaps for subprojects.
This means any project under the Ubuntu documentation library project must generate its own sitemap.
sphinx-sitemap-generated sitemaps
The standard Starter Pack uses the dirhtml builder for the Sphinx recipes in the
project’s Makefile.
If your project uses an older version of the Starter Pack or
changes the builder, the links in the generated sitemap will be malformed. Either
update to the latest version of the Starter Pack or
ensure your project’s recipes use the dirhtml builder, not html.
Ensure sphinx-sitemap has been added to your docs/requirements.txt file.
Add sphinx_sitemap to extensions in your configuration file (docs/conf.py):
extensions = ['sphinx_sitemap']
Sitemap configuration
The Sphinx starter pack’s configuration file (docs/conf.py) includes default sitemap configuration.
The sphinx-sitemap extension requires the html_baseurl variable to be set.
This is set by default as follows:
html_baseurl = os.environ.get("READTHEDOCS_CANONICAL_URL", "/")
When building on Read the Docs, this sets html_baseurl dynamically to the value of the
READTHEDOCS_CANONICAL_URL environment variable, which resolves to the full URL of the documentation
including the version and language (if applicable).
In local builds and builds on other hosts, html_baseurl defaults to /.
The sitemap_url_scheme variable is set to '{link}' by default. This uses the value of html_baseurl to generate
the full URL for each page for the sitemap.
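To make the interaction between html_baseurl and sitemap_url_scheme concrete, here is a minimal sketch of how the two values can combine into a page URL. It is an illustration only, not the extension's actual code: the sitemap_url function and the example.readthedocs.io address are hypothetical names chosen for this example.

```python
def sitemap_url(base_url, link, scheme="{link}", lang="", version=""):
    """Illustrative only: fill in the URL scheme and append it to the base URL.

    base_url stands in for html_baseurl, and scheme for sitemap_url_scheme.
    """
    return base_url + scheme.format(lang=lang, version=version, link=link)


# Hypothetical canonical URL, as Read the Docs might provide it.
rtd_base = "https://example.readthedocs.io/en/latest/"
print(sitemap_url(rtd_base, "how-to/sitemaps/"))

# In a local build, html_baseurl falls back to "/".
print(sitemap_url("/", "how-to/sitemaps/"))
```

With the fallback value, the resulting links are relative to the site root, so a locally built sitemap is mainly useful for checking which pages are listed rather than for publishing.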
Note
If you are implementing a sitemap on an RTD instance that is not a subproject,
and it uses {link} for the sitemap_url_scheme, RTD will replace your
sitemap with its own.
This is a known bug. The only current workaround is to use a different
sitemap name and a custom robots.txt pointing to it.
lastmod configuration
As of version 2.7.0, the sitemap extension supports adding a lastmod date.
Make sure that your configuration file has:
sitemap_show_lastmod = True
Exclude pages
Pages can be excluded from the sitemap by adding them to sitemap_excludes in docs/conf.py:
sitemap_excludes = [
'404/',
'genindex/',
'search/',
]
Wildcards are supported. For example, _modules/* excludes the path _modules/ and all paths such as _modules/foo/bar/. For details, see Excluding Pages.
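To preview which paths a pattern would catch before building, you can approximate the matching locally. The sketch below assumes the wildcards behave like shell-style patterns as implemented by Python's fnmatch module; that is an assumption for illustration, so consult Excluding Pages for the extension's exact rules. The is_excluded helper is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Same entries as the example above, plus the wildcard pattern.
sitemap_excludes = ["404/", "genindex/", "search/", "_modules/*"]


def is_excluded(path, patterns=sitemap_excludes):
    """Return True if the path matches any exclusion pattern (shell-style)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


for candidate in ["_modules/", "_modules/foo/bar/", "how-to/sitemaps/", "search/"]:
    print(candidate, is_excluded(candidate))
```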
Validate your sitemap
A sitemap will be available at different locations, depending on how it is generated.
Sitemaps generated by Read the Docs are available at the base domain of a project, while sitemaps generated with this extension are placed at the base of the URL schema used.
For example, two sitemaps are generated for the sphinx-sitemap documentation, as it is hosted on RTD:
The first is generated by RTD and is available at the root of the domain: https://sphinx-sitemap.readthedocs.io/sitemap.xml
The second is generated by the sphinx-sitemap extension and is available at the base of the URL schema used by the RTD instance: https://sphinx-sitemap.readthedocs.io/en/latest/sitemap.xml
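Once you know where the sitemap lives, a quick way to inspect it is to list the URLs it contains and flag any that are not absolute. The following is a minimal sketch using only the Python standard library; the sitemap_locations and relative_locations helpers and the example.com sample are hypothetical, and in practice you would pass in the contents of your built sitemap.xml.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locations(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def relative_locations(urls):
    """Return the URLs that lack a scheme, which crawlers cannot resolve reliably."""
    return [url for url in urls if not url.startswith(("http://", "https://"))]


# Hypothetical sample; the second entry mimics a build where html_baseurl was "/".
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/en/latest/</loc></url>
  <url><loc>/how-to/sitemaps/</loc></url>
</urlset>"""

urls = sitemap_locations(sample)
print(urls)
print(relative_locations(urls))
```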
How to specify a sitemap
A robots.txt file tells crawlers which sitemap to use when indexing a website. You can use a custom robots.txt file by creating your own and adding it to html_extra_path in your configuration file, so that it is copied to the root of the built site. An example can be found in the Ubuntu documentation library project.
Support multiple versions
The sphinx-sitemap extension doesn’t support multiple versions by default. Configuring each versioned build to generate its sitemap for the appropriate version may be sufficient, because search engines and other web systems also discover pages by crawling the site.
If you want sitemaps for all your documentation’s versions, you need to deploy your own
robots.txt file and sitemap index. Supporting multiple versions is recommended for
documentation with LTS releases, as it makes past versions more prominent to search
engines.
For this task, we’ll use the Starter Pack as an example. Let’s assume it has three
versions, 1.0, 2.0, and 3.0, and uses the URL schema of <version>/<filename>.
First, ensure each version of your documentation has a sitemap generated by this extension with the appropriate version.
Next, create a sitemapindex.xml file in the same directory as the configuration
file, and point to the sitemap files of each of your documentation sets:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://canonical-starter-pack.readthedocs-hosted.com/stable/sitemap.xml</loc>
</sitemap>
<sitemap>
<loc>https://canonical-starter-pack.readthedocs-hosted.com/3.0/sitemap.xml</loc>
</sitemap>
<sitemap>
<loc>https://canonical-starter-pack.readthedocs-hosted.com/2.0/sitemap.xml</loc>
</sitemap>
<sitemap>
<loc>https://canonical-starter-pack.readthedocs-hosted.com/1.0/sitemap.xml</loc>
</sitemap>
</sitemapindex>
Create a robots.txt file in the same directory as the configuration file.
If necessary, block any paths you don’t want crawled. Google describes how to do this in How to write and submit a robots.txt file.
At the end of robots.txt, point to the future path of sitemapindex.xml:
Sitemap: https://canonical-starter-pack.readthedocs-hosted.com/stable/sitemapindex.xml
Lastly, add both new files to the configuration file:
html_extra_path = [
"sitemapindex.xml",
"robots.txt",
]
This provides a sitemapindex.xml file which points to the sphinx-sitemap
generated sitemap for each version.
You may want to automate the generation of the sitemapindex.xml file. To see how
this is done for the Ubuntu documentation library project, which generates a sitemap
containing subproject sitemaps, see the script here.
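As a starting point for such automation, the sketch below builds the same sitemapindex.xml content shown earlier from a list of versions. It is not the script used by the Ubuntu documentation library project; the build_sitemap_index function and the hard-coded VERSIONS list are assumptions for illustration, and a real script would likely discover the versions instead.

```python
import xml.etree.ElementTree as ET

# Values taken from the example above; adjust for your own project.
BASE_URL = "https://canonical-starter-pack.readthedocs-hosted.com"
VERSIONS = ["stable", "3.0", "2.0", "1.0"]


def build_sitemap_index(base_url, versions):
    """Return a sitemap index document with one <sitemap> entry per version."""
    root = ET.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for version in versions:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = f"{base_url}/{version}/sitemap.xml"
    ET.indent(root)  # Pretty-print; requires Python 3.9 or later.
    return ET.tostring(root, encoding="unicode")


print(build_sitemap_index(BASE_URL, VERSIONS))
```

Writing the returned string to sitemapindex.xml next to the configuration file before the Sphinx build runs would let html_extra_path pick it up as described above.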