Gutenberg:Information About Robot Access to our Pages
From Project Gutenberg, the first producer of free electronic books (ebooks).
There are about 45,000 visitors to the Project Gutenberg web site every day.
They like fast response times and high download rates. If you access our site with a robot program, you will slow the site down for all of these people. Please keep that in mind before robotting our site.
Better Alternatives
Robot access to our site should be a last resort, used only when everything else has failed. Also, remember that the Project Gutenberg web site is copyrighted.
Before robotting our web site, please consider the alternatives we offer for the most common tasks. These alternatives are both easier for you and produce less load on our servers, whilst giving you the same or better results:
- Get an offline version of the Project Gutenberg web site.
- Get all Project Gutenberg ebook files.
- Get the Project Gutenberg catalog data.
If you are robotting our site for other reasons, consider contacting the webmaster instead. We can show you a better alternative in most cases.
Rules for Robotting
If you still think you must robot our site despite the alternatives we offer, we ask that you follow these rules:
- Configure your robot to obey /robots.txt.
If your robot does not obey /robots.txt it will download everything from our site. For instance, it will download all pages from our online book viewer, which (at a very conservative estimate of 100 pages per ebook) will give you 1,600,000 pages, all of them perfectly useless because they duplicate the contents of the ebook files.
- Configure your robot to wait at least 2 seconds between requests.
If you robot our site slowly, we will accept the slight performance degradation your robot causes for other users. If you wait 2 seconds between requests, you will achieve a download rate of ~40,000 requests a day, which is quite enough. By comparison, Google makes only 15,000 requests a day when it indexes our site.
You may want to read the manual that came with your robot to learn how to make it respect these rules.
If you do not comply with these rules we'll have to block your IP address (or IP range) to protect our other users. This means you won't be able to access the Project Gutenberg web site from that computer (or organisation) any more.
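The two rules above can be sketched in Python as follows. This is a minimal illustration, not part of Project Gutenberg's tooling: the `PoliteGate` class, the `example-bot` user agent and the sample robots.txt rules (including the `/viewer/` path) are hypothetical. Only the 2-second minimum wait comes from this page. The sketch uses the standard library's `urllib.robotparser` to check each URL and enforces the delay before every request.

```python
import time
import urllib.robotparser

MIN_DELAY_SECONDS = 2.0  # the minimum wait this page asks for


class PoliteGate:
    """Checks URLs against robots.txt rules and paces requests (hypothetical helper)."""

    def __init__(self, robots_lines, user_agent, delay=MIN_DELAY_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        self._parser = urllib.robotparser.RobotFileParser()
        self._parser.parse(robots_lines)  # rules supplied as a list of lines
        self._user_agent = user_agent
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def allowed(self, url):
        """Return True if the robots.txt rules permit fetching `url`."""
        return self._parser.can_fetch(self._user_agent, url)

    def wait_turn(self):
        """Block until at least `delay` seconds have passed since the last request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self._delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


# Hypothetical rules for illustration; a real robot would load the live /robots.txt.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /viewer/",
]

gate = PoliteGate(SAMPLE_ROBOTS, "example-bot")
print(gate.allowed("http://www.gutenberg.org/robot/harvest"))  # True
print(gate.allowed("http://www.gutenberg.org/viewer/page1"))   # False
```

A robot would call `gate.allowed(url)` before each fetch and `gate.wait_turn()` immediately before sending the request. At one request every 2 seconds, a robot makes at most 86,400 / 2 = 43,200 requests a day, in line with the ~40,000 figure above.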
Getting an Offline Version of our Site
We maintain a copy of the whole web site for offline viewing. This package does not include the ebook files. You can download the package, unpack it on your PC and use any browser to view the web site from your disk.
Getting All EBook Files
You can get all our eBooks in zipped files by pointing your robot at
http://www.gutenberg.org/robot/harvest
You will also get all our mp3 files, which we do not zip.
Here is an estimate of the data volume (as of Nov 2004):
| Type | Files | GB | Estimated download time (DSL, 1 Mbit/s) |
|---|---|---|---|
| zip | 24,160 | 14.5 | 48 hours |
| mp3 | 12,865 | 91.5 | 9 days |
Unpacking the zip files will produce another 70,000 files.
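The download times in the table can be sanity-checked with a little arithmetic. The sketch below is one rough model, not necessarily how the figures above were derived: it assumes 1 GB = 10^9 bytes, a fully used 1 Mbit/s line, and the 2-second wait per file requested in the rules above. The function name `estimated_seconds` is made up for illustration.

```python
def estimated_seconds(gigabytes, files, mbit_per_s=1.0, wait_s=2.0):
    """Rough download time: raw transfer time plus a fixed wait per file."""
    transfer = gigabytes * 1e9 * 8 / (mbit_per_s * 1e6)  # bits / (bits per second)
    waiting = files * wait_s                             # politeness delay
    return transfer + waiting


zip_hours = estimated_seconds(14.5, 24160) / 3600
mp3_days = estimated_seconds(91.5, 12865) / 86400

print(f"zip: about {zip_hours:.1f} hours")  # zip: about 45.6 hours
print(f"mp3: about {mp3_days:.1f} days")    # mp3: about 8.8 days
```

Under these assumptions the results land close to the 48 hours and 9 days given in the table.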
This is an example of how to get all files using wget:
wget -w 2 -m http://www.gutenberg.org/robot/harvest
wget is free software and available for Linux and Windows at www.gnu.org/software/wget/.
If you don't want the mp3 files, say:
wget -w 2 -m -R "mp3" http://www.gutenberg.org/robot/harvest
If you want only some types of files, say:
wget -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&filetypes[]=html"
Replace txt and html with the file types you are interested in.
If you want only files in a given language, say:
wget -w 2 -m "http://www.gutenberg.org/robot/harvest?langs[]=de"
Replace de with the ISO language code you are interested in. Tip: you can find the code for any language in the Project Gutenberg catalog by looking at the status window of your browser while moving your cursor over the language on this page.
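If you generate these commands from a script, the query string and the shell quoting are easy to get wrong, because `?`, `[`, `]` and `&` are all special to most shells. The following Python sketch builds the harvest URLs shown above and a quoted wget command. The helper names `harvest_url` and `wget_command` are hypothetical, and the sketch assumes only the `filetypes[]` and `langs[]` parameters documented on this page.

```python
import shlex
from urllib.parse import urlencode

HARVEST_URL = "http://www.gutenberg.org/robot/harvest"  # from this page


def harvest_url(name, values):
    """Build a harvest URL with one `name[]=value` pair per value."""
    pairs = [(f"{name}[]", value) for value in values]
    # safe="[]" keeps the brackets literal, matching the URLs shown above.
    return f"{HARVEST_URL}?{urlencode(pairs, safe='[]')}"


def wget_command(url, wait_seconds=2):
    """Return a shell-quoted wget mirror command for `url`."""
    return shlex.join(["wget", "-w", str(wait_seconds), "-m", url])


print(harvest_url("filetypes", ["txt", "html"]))
# http://www.gutenberg.org/robot/harvest?filetypes[]=txt&filetypes[]=html
print(wget_command(harvest_url("langs", ["de"])))
# wget -w 2 -m 'http://www.gutenberg.org/robot/harvest?langs[]=de'
```

Quoting the URL keeps the shell from treating `&` as a background operator or `[]` as a glob pattern.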
Mirroring EBook Files
If you want to robot our eBook files on a regular basis, e.g. to maintain a mirror site, read the mirroring howto. It explains how to use rsync or wget to do this.
Getting Catalog Data
If you are robotting the web site to extract catalog data, you are wasting both your time and our resources. You can get the data much more easily if you just grab the Project Gutenberg catalog in machine-readable format. The catalog data is licensed under the GNU GPL.
N.B. The Project Gutenberg web site is copyrighted. You are not allowed to use any data you harvest directly from the web site for anything except personal use. This is another good reason to grab the machine-readable catalog instead.