Using Crawler
To successfully crawl a website, the following conditions must be met:
- Website is publicly accessible (no login or IP restrictions)
- Server returns valid and recognizable HTML content
- Homepage contains links to internal pages
- Website allows crawler access (not blocked via robots.txt or firewall rules)
Crawling begins with the homepage. The crawler discovers and processes internal links found there and continues navigating the site structure based on those links.
The homepage may include a redirect within the same hostname (for example, from http://example.com to https://example.com), and this will still be processed correctly.
The root URL is the base address used to access your website on a web server. It consists of two required components:
- Protocol scheme (usually https:// or http://)
- Domain name (e.g., website.tld)
Valid root URL examples:
- https://website.tld
- http://subdomain.website.tld
We support:
- all domain types (TLDs, ccTLDs, and subdomains at any level)
- Internationalized Domain Names (IDNs) for most languages, including Arabic, Indic, and Cyrillic domains
You do not need to convert IDNs to Punycode - simply enter the URL in its original language form (for example, https://пример.рф can be entered as-is; its Punycode form is https://xn--e1afmkfd.xn--p1ai).
Optionally, the root URL may include a language indicator. This applies only to the Product Feed Generator.
Only one language-version format is supported. The language indicator must be a root-level folder and contain:
- a two-letter language code (ISO 639-1)
- optionally followed by a region code (ISO 3166-1 Alpha-2), separated by a dash
Examples of valid language URLs:
- https://mydomain.com/en
- https://mydomain.com/en-US
MySitemapGenerator supports both HTTP and HTTPS.
Please note: in accordance with the XML Sitemaps protocol specification, crawling and data generation are performed only for the protocol specified in the root URL (for example, if the root URL is https://website.tld, pages served over http:// are not included).
Yes, the crawler can follow robots.txt rules. This behavior is optional but enabled by default.
When enabled, the crawler follows the Allow and Disallow rules defined in:
- the general User-agent: * section
- or a crawler-specific section, if applicable
Crawler-specific user-agent sections (such as Googlebot or Yandex) are taken into account when the corresponding crawler identification mode is selected.
You may also define rules specifically for our crawler:
User-agent: Mysitemapgenerator
Example robots.txt:
#Prevent all bots from crawling a specific directory
User-agent: *
Disallow: /noindex-directory/
#Google-specific rule
User-agent: Googlebot
Disallow: /noindex-directory/disallow-google.html
#Yandex-specific rules
User-agent: Yandex
Disallow: /noindex-directory/
Allow: /noindex-directory/allow-yandex.html
#Mysitemapgenerator rules
User-agent: Mysitemapgenerator
Disallow: /noindex-directory/
Allow: /noindex-directory/*.html
The Deep Web (also known as the Invisible Web) includes pages that are not indexed by search engines because they are not reachable through standard hyperlinks.
Examples include:
- pages generated via HTML forms
- content loaded inside frames or iframes
To discover and include such pages, enable the following options:
- Crawl HTML forms (forms are submitted without filling in any input)
- Crawl framed content (<frameset> and <iframe>)
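For illustration, a page fragment like the following (paths are hypothetical) contains content these options can reach: the form is submitted without any input, and the framed document is fetched and crawled like a regular page:
<form action="/search" method="get">
  <input type="text" name="q" />
  <button type="submit">Search</button>
</form>
<iframe src="/embedded/catalog.html"></iframe>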
When this option is enabled (the default behavior), nofollow links are ignored (not followed).
You may also choose to:
- ignore only noindex
- ignore only nofollow
- or handle both independently
Nofollow link sources include:
- HTML links with the rel="nofollow" attribute
- links placed on pages marked with a nofollow robots directive
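For example, a link such as the following (URL is illustrative) would not be followed while this option is enabled:
<a href="https://website.tld/members-only.html" rel="nofollow">Members area</a>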
When this option is enabled (the default), pages are processed according to:
- robots meta tags
- X-Robots-Tag HTTP headers
You can independently control processing of noindex and nofollow.
Crawler-specific meta directives (e.g. for Googlebot) are considered when selecting the crawler identification mode.
You may also use meta tags intended specifically for MySitemapGenerator.
Robots meta tag examples:
<meta name="robots" content="noindex" />
<meta name="robots" content="nofollow" />
<meta name="robots" content="noindex,nofollow" />
X-Robots-Tag HTTP header examples:
X-Robots-Tag: noindex
X-Robots-Tag: nofollow
X-Robots-Tag: noindex, nofollow
The crawler recognizes the following HTTP status codes:
- 301 Moved Permanently
- 302 Found
- 303 See Other
- 307 Temporary Redirect
If a page redirects within the same domain, the crawler indexes the destination URL.
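For illustration (URLs are hypothetical), a same-domain redirect response such as the following causes the crawler to index the destination URL, https://website.tld/new-page.html:
HTTP/1.1 301 Moved Permanently
Location: https://website.tld/new-page.html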
Yes, the crawler supports canonical URLs - this is enabled by default.
When active, canonical directives are respected and non-canonical URLs are excluded from crawl results.
Canonical references are processed both:
- in HTML (via the <link rel="canonical"> tag)
- in HTTP headers (via the Link header)
HTML example:
<link rel="canonical" href="http://www.website.tld/canonical_page.html"/>
HTTP header example:
Link: <http://www.website.tld/canonical_page.html>; rel="canonical"
Technically, canonical references are treated similarly to a server-side redirect (HTTP 303) and may appear in reports as a “soft” redirect.
If the crawler encounters issues, a detailed error report is generated.
The report includes:
- grouped lists of crawl errors (e.g. “Page not found”, server errors)
- detected redirects
Note: Error reports are available to registered users only.
Crawling speed depends on many dynamic factors, such as the responsiveness of your web server and the size of the pages being loaded, so it cannot be estimated in advance. The structure of your site's internal linking also has a significant impact on the total crawl time.
By default, the crawler automatically adjusts speed based on server responsiveness.
You can manually set the crawl load level:
- Maximum - recommended for stable, paid hosting environments
- Average - suitable for moderate server capacity
- Low - minimal server load, recommended for free or limited hosting (note: this may significantly slow down crawling)
You may choose how the crawler identifies itself:
- Standard browser (default, recommended)
- Googlebot (Googlebot/2.1)
- YandexBot (YandexBot/3.0)
- Baiduspider
- Mysitemapgenerator (direct identification)
Behavior depends on the selected identification:
- When using Googlebot, YandexBot, Baiduspider, or Mysitemapgenerator, only rules for that specific user-agent are applied
- General rules (User-agent: *) are used only if no crawler-specific rules exist
- When using Standard browser or Mysitemapgenerator, only the Mysitemapgenerator or general section is considered
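For illustration, with a robots.txt like the one below (directory names are hypothetical), the Googlebot identification mode applies only the Googlebot section, while the Standard browser mode falls back to the general section:
#General rules
User-agent: *
Disallow: /private/
#Googlebot-specific rules
User-agent: Googlebot
Disallow: /private/no-google/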
If your site uses client-side rendering, the crawler may attempt to process dynamically generated content when:
- JavaScript processing is enabled
- or automatically detected as necessary
JavaScript processing limitations:
- External scripts from other domains (CDNs, APIs, subdomains) are not executed
- User-triggered interactions (scrolling, clicking) are not simulated
- Only HTML <a> elements with an href attribute are treated as links
- Navigation implemented using non-standard link mechanisms will not be crawled.
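For example, of the two navigation elements below (path and function name are illustrative), only the first is treated as a link; the second relies on JavaScript and will not be crawled:
<a href="/catalog/page-2.html">Next page</a>
<span onclick="loadPage('/catalog/page-2.html')">Next page</span>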