Fine Tuning Crawl Configuration (v 1.0)

Tuning Crawl

Because Confluence content is crawled by SharePoint's generic web site crawler, some content may be crawled only after additional configuration.

The following steps are likely needed for Confluence attachments to be crawled.

The crawler may skip some content unless you add a new "action" file type to the list of file types that are crawled. To do this, go to the Search Settings within your Shared Service Provider (from SharePoint Central Administration) and click the "File Types" link. From there you can add the "action" file type.
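
To picture why the file type matters, the check can be thought of as a simple extension lookup. The Python sketch below is only an illustration and not SharePoint's actual implementation: the function names, the sample file types set, the example host name, and the treatment of URLs without an extension are all assumptions.

```python
from posixpath import splitext
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the lowercase extension of the URL path, without the dot."""
    path = urlsplit(url).path  # the query string is not part of the path
    return splitext(path)[1].lstrip(".").lower()


def is_crawlable(url: str, file_types: set[str]) -> bool:
    """Hypothetical model: a URL is crawled only if its extension is listed."""
    ext = url_extension(url)
    # Assumption: URLs without an extension (such as a site root) are crawled.
    return ext == "" or ext in file_types


# Illustrative subset of a file types list; a real list is longer.
file_types = {"html", "htm", "doc", "pdf"}
url = "http://confluence.example.com/pages/viewpage.action?pageId=123"

print(is_crawlable(url, file_types))               # False
print(is_crawlable(url, file_types | {"action"}))  # True
```

Under this model, a Confluence page whose path ends in ".action" is skipped until "action" is added to the list.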

Preventing SharePoint from crawling some Confluence content

After adding the "action" file type, the crawl can take considerably longer because many more pages are now crawled, including administrative pages. To keep these unwanted pages out of the crawl, you must set up exclude crawl rules. To do this, go to the Search Settings within your Shared Service Provider (from SharePoint Central Administration) and click the "Crawl Rules" link. From there, add the following exclusions:

  • <Confluence URL>/download/temp/*
  • <Confluence URL>/spaces/usage/*
  • <Confluence URL>/admin/*
  • <Confluence URL>/users/*

To locate the crawl rules configuration, from SharePoint Central Administration:

  1. Click on the shared service provider, likely named "SharedServices1", in the quick launch bar along the left side of the web page.
  2. Click on Search settings.
  3. In the Configure Search Settings page click on Crawl rules.

Note that the crawl rule you already had for your Confluence site should remain unchanged, but the rules must be ordered so that the exclude rules are above the include rule. Note also that you should not check the "Include complex URLs" checkbox in any of the exclude crawl rules.
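
The reason the order matters can be sketched in Python. This is a simplified model, assuming that rules are evaluated top to bottom and that the first matching rule decides the outcome. The host name, the include pattern, the sample admin URL, and the use of `fnmatchcase` as a stand-in for crawl rule wildcard matching are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

BASE = "http://confluence.example.com"  # hypothetical Confluence URL

# Exclude rules are listed first, followed by the existing include rule.
RULES = [
    ("exclude", BASE + "/download/temp/*"),
    ("exclude", BASE + "/spaces/usage/*"),
    ("exclude", BASE + "/admin/*"),
    ("exclude", BASE + "/users/*"),
    ("include", BASE + "/*"),  # assumed form of the existing include rule
]


def decide(url: str, rules: list[tuple[str, str]]) -> str:
    """Return the action of the first rule whose pattern matches the URL."""
    for action, pattern in rules:
        if fnmatchcase(url.lower(), pattern.lower()):
            return action
    return "include"  # assumption: URLs matching no rule are crawled


admin_page = BASE + "/admin/console.action"
print(decide(admin_page, RULES))                  # exclude
print(decide(admin_page, list(reversed(RULES))))  # include
```

With the include rule on top, it matches every Confluence URL first, so the exclude rules below it never take effect in this model.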

After making the above changes you can start another full crawl of your content source.

Ideally we would also exclude "<Confluence URL>/spaces/*", but that appears to prevent attachments from being crawled. This can be misleading: attachments in the demonstration space are still crawled with this exclude rule in place, but other attachments do not appear to be crawled.

Crawling attachments

If the above changes do not help with crawling attachments, try changing your Confluence content source to use a start address that points to the "/spaces/listattachmentsforspace.action" page. Then start another full crawl of your content source.

This start address should allow all Confluence content to be crawled, not just attachments. If that is not the case in your environment, you may want to put both your base Confluence URL and the "list attachments for space" URL in the start address field.
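
As a small aid, the start addresses can be derived from the base URL as in the Python sketch below. The helper name `start_addresses` and the example host name are hypothetical; only the "/spaces/listattachmentsforspace.action" path comes from the text above.

```python
def start_addresses(base_url: str, include_base: bool = True) -> list[str]:
    """Build the start address list for the Confluence content source."""
    base = base_url.rstrip("/")  # avoid a double slash when joining
    attachments = base + "/spaces/listattachmentsforspace.action"
    return [base, attachments] if include_base else [attachments]


# Hypothetical Confluence URL; substitute your own.
for address in start_addresses("http://confluence.example.com/"):
    print(address)
```

Passing `include_base=False` yields only the "list attachments for space" address, which corresponds to the first suggestion above.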

RELATED TOPICS:

Search Configuration for Search Server 2008
SharePoint Search Prerequisite Updates
