About CheckMark Network’s Crawler

CheckMark Network uses a crawler to extract data from the web. A crawler is a web crawling bot (sometimes also called a “spider”) that discovers new pages to be added to our knowledge base for further processing.

How it works

CheckMark Network’s crawler discovers sites by following links from page to page. It was developed to keep its memory footprint and processor time consumption as small as possible, and it typically places a very light load on the web servers it visits (for example, we pace our crawler moderately so that web servers can continue serving other visitors).

The CheckMarkNetwork.com crawler follows the Googlebot specification and can be identified as follows:


User-Agent: CheckMarkNetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)

As with Googlebot, the IP addresses used by checkmarknetwork.com change from time to time. The most reliable way to identify accesses by checkmarknetwork.com is therefore the user-agent string.
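For example, the following minimal Python sketch counts requests from our crawler in a web server access log (the log file name and the simple substring match are assumptions for illustration, not part of our specification):

# count_crawler_hits.py - tally requests whose user-agent string
# identifies the CheckMarkNetwork crawler ("access.log" is a
# hypothetical log path; adjust for your server).
CRAWLER_TOKEN = "CheckMarkNetwork/1.0"

hits = 0
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        # In the common combined log format, the user-agent is the
        # last quoted field of each line; a substring test is
        # enough for a rough tally.
        if CRAWLER_TOKEN in line:
            hits += 1

print(f"{hits} requests from the CheckMarkNetwork crawler")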

We follow the Googlebot user-agent specification because it is a well-established convention for crawlers, and because many website owners already have robots.txt rules in place that cover a crawler like ours.

Policy

Our crawler always respects common crawling norms, including the following:
Our crawler accesses each site page by page, pausing between requests. Currently, we fetch only a limited number of pages from each server.

Although we interleave the crawling processes with processes for detecting host aliases, an aliased server may occasionally be accessed under different host names at the same time.

Our crawler always reads the robots.txt file and never crawls restricted pages.

You can give directives to the crawler in a robots.txt file at the top level of your site. For example, the following directive forbids our crawler from retrieving any content from your site:


User-agent: CheckMarkNetwork
Disallow: /
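
You can also restrict only part of your site rather than all of it. As a sketch (the /private/ path is just a placeholder for illustration), the following standard robots.txt pattern blocks our crawler from one directory while leaving the rest of the site crawlable:

User-agent: CheckMarkNetwork
Disallow: /private/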

Once you’ve created or updated your robots.txt file, there may be a short delay before our crawler discovers your changes.

If checkmarknetwork.com is still crawling content you’ve blocked in robots.txt, check that the robots.txt file is in the correct location. It must be in the top-level directory of the server (for example, www.example.com/robots.txt); placing the file in a subdirectory won’t have any effect.
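As an illustration (www.example.com is a placeholder):

Read by our crawler:     https://www.example.com/robots.txt
Ignored by our crawler:  https://www.example.com/docs/robots.txt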

We exercise great care in managing web pages and related content. Collected web pages and content are registered in our databases, which are for limited use only: access is restricted to authorized users, and advanced security measures are in place to prevent unauthorized access.

Contact

For any questions or comments, please e-mail support@checkmarknetwork.com.

Please be sure to include the host name(s) and IP address(es) of your site in your email so we can better assist you.

Thank you for your continued cooperation and support.