I am a simple downloader that downloads images from URLs in a CSV and names them by the given column (after ensuring all its values are unique). I can organize your images into subfolders based on any column in your CSV, and will check for images already downloaded in your target folder. If you need square images for modeling, I'll create a second directory (organized in the same format) with downsized copies of your images. Patience is a virtue, so I will wait a designated time before re-requesting an image after receiving an error on my retry list; if all retries are expended or I receive another error, I log that for your review and move on. I also keep a log of all successful responses. After download, sum-buddy helps me gather and record checksums for all downloaded images. If the source CSV has a checksum column, I can then do a buddy-check to verify all expected images are downloaded intact. At a minimum, I check the number of expected images matches the number sum-buddy counts.
The Cautious Robot Logo was designed using Canva Magic Media.
Python 3.10+
pip install cautious-robotCautious-robot will check the provided CSV for IMG_NAME, URL, and SUBFOLDERS (if provided), then download all images that have a value in the IMG_NAME column. Note that choice of image filename should be unique; cautious-robot will refuse the request if the filename column selected is not unique within the dataset. It will also check if the images already exist in the provided OUTPUT folder to avoid overwriting existing files. Images that have a filename but no URL are recorded in the error log; the user is prompted whether to ignore or address missing filenames for URLs prior to downloading. Logs are saved in the same directory as the source CSV (logging is done by adding to an existing JSON, so it will not overwrite existing logs with the same name in case of a restarted download). Please note that if the streamed response is interrupted before the image is downloaded in its entirety this error may not be recorded in the error log, but the verifier would register them as missing.
If desired, a secondary output directory (OUTPUT_downsized) will be created with square copies of the images downsized to the specified size (e.g., 256 x 256). The folder structure of this secondary output directory will match that of the un-processed images. Parameters such as time to wait between retries on a failed download, the maximum number of times to retry downloading an image, and which index of the CSV to start with can all also be passed. Cautious-robot will retry image downloads when receiving one of the following HTTP response status codes: 429, 500, 502, 503, 504.
After downloading the images, cautious-robot calls sum-buddy to calculate and record checksums of the OUTPUT folder contents. It prints the number of images contained in the OUTPUT folder along with the expected number (based on a count of the unique, non-null filenames in the source file). If provided a column with checksums in the source file, it will then further verify that all expected images are downloaded through an inner merge on the checksum and filename columns of the source file with the checksum CSV (thus avoiding confusion in case of duplicate images).
usage: cautious-robot [-h] -i [INPUT_FILE] -o [OUTPUT_DIR] [-s [SUBDIR_COL]] [-n [IMG_NAME_COL]] [-u [URL_COL]] [-w WAIT_TIME]
[-r MAX_RETRIES] [-l SIDE_LENGTH] [-a CHECKSUM_ALGORITHM] [-v [VERIFIER_COL]]
options:
-h, --help show this help message and exit
required arguments:
-i [INPUT_FILE], --input-file [INPUT_FILE]
path to CSV file with urls.
-o [OUTPUT_DIR], --output-dir [OUTPUT_DIR]
main directory to download images into.
optional arguments:
-s [SUBDIR_COL], --subdir-col [SUBDIR_COL]
name of column to use for subfolders in image directory (defaults to flat directory if left blank)
-n [IMG_NAME_COL], --img-name-col [IMG_NAME_COL]
column to use for image filename (default: filename)
-u [URL_COL], --url-col [URL_COL]
column with URLs to download (default: file_url)
-w WAIT_TIME, --wait-time WAIT_TIME
seconds to wait between retries for an image (default: 3)
-r MAX_RETRIES, --max-retries MAX_RETRIES
max times to retry download on a single image (default: 5)
-l SIDE_LENGTH, --side-length SIDE_LENGTH
number of pixels per side for resized square images (default: no resized images created)
-a CHECKSUM_ALGORITHM, --checksum-algorithm CHECKSUM_ALGORITHM
checksum algorithm to use on images (default: md5, available: sha256, sha384, md5-sha1, blake2b, sha512,
sha1, sm3, sha3_256, sha512_256, sha224, sha3_224, ripemd160, sha3_384, shake_128, blake2s, md5, sha3_512,
sha512_224, shake_256)
-v [VERIFIER_COL], --verifier-col [VERIFIER_COL]
name of column in source CSV with checksums (same hash as -a) to verify download
Sample CSVs [1] are provided in the examples/ directory to test the CLI.
- Defaults:
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
Output:
100%|██████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 4.18it/s] Images downloaded from examples/HCGSD_testNA.csv to examples/test_images. Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl. Calculating md5 checksums on examples/test_images: 100%|███████████████████████████████████████████| 16/16 [00:00<00:00, 3133.00it/s] md5 checksums for examples/test_images written to examples/HCGSD_testNA_checksums.csv There are 8 files in examples/test_images. Based on examples/HCGSD_testNA.csv, there should be 8 images.
head -n 9 examples/HCGSD_testNA_checksums.csv
Output:
filepath,filename,md5 examples/test_images/10429021_V_lowres.png,10429021_V_lowres.png,c6aeb9d2f6db412ff5be0eb0b5435b83 examples/test_images/10428595_D_lowres.png,10428595_D_lowres.png,55882a0f3fdf8a68579c07254395653b examples/test_images/10428321_D_lowres.png,10428321_D_lowres.png,fbeeed30274e424831b06360b587ceb3 examples/test_images/10428250_V_lowres.png,10428250_V_lowres.png,14ac99b1a9913a9d420f21b94d6136d6 examples/test_images/10428140_V_lowres.png,10428140_V_lowres.png,c11538f2de5a5e2d6013fc800848d43a examples/test_images/10428803_D_lowres.png,10428803_D_lowres.png,d8bfb73f2d3556390de04aa98822b815 examples/test_images/10428169_V_lowres.png,10428169_V_lowres.png,042c9dc294d589ce3f140f14ddab0166 examples/test_images/10428972_V_lowres.png,10428972_V_lowres.png,0047e7454ce444f67fee1c90cc3ba9cb
- Download Images to Subfolders Based on Column Value:
cautious-robot -i examples/HCGSD_testNA.csv -o examples/test_images_subdirs --subdir-col Species
Output:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.47it/s] Images downloaded from examples/HCGSD_testNA.csv to examples/test_images_subdirs. Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl. Calculating md5 checksums on examples/test_images_subdirs: 100%|█████████████████████████████████████████████| 8/8 [00:00<00:00, 3106.60it/s] md5 checksums for examples/test_images_subdirs written to examples/HCGSD_testNA_checksums.csv There are 8 files in examples/test_images_subdirs. Based on examples/HCGSD_testNA.csv, there should be 8 images.
ls examples/test_images_subdirs
Output:
erato melpomene
head -n 9 examples/HCGSD_testNA_checksums.csv
Output:
filepath,filename,md5 examples/test_images_subdirs/melpomene/10428321_D_lowres.png,10428321_D_lowres.png,fbeeed30274e424831b06360b587ceb3 examples/test_images_subdirs/melpomene/10428250_V_lowres.png,10428250_V_lowres.png,14ac99b1a9913a9d420f21b94d6136d6 examples/test_images_subdirs/melpomene/10428140_V_lowres.png,10428140_V_lowres.png,c11538f2de5a5e2d6013fc800848d43a examples/test_images_subdirs/melpomene/10428169_V_lowres.png,10428169_V_lowres.png,042c9dc294d589ce3f140f14ddab0166 examples/test_images_subdirs/erato/10429021_V_lowres.png,10429021_V_lowres.png,c6aeb9d2f6db412ff5be0eb0b5435b83 examples/test_images_subdirs/erato/10428595_D_lowres.png,10428595_D_lowres.png,55882a0f3fdf8a68579c07254395653b examples/test_images_subdirs/erato/10428803_D_lowres.png,10428803_D_lowres.png,d8bfb73f2d3556390de04aa98822b815 examples/test_images_subdirs/erato/10428972_V_lowres.png,10428972_V_lowres.png,0047e7454ce444f67fee1c90cc3ba9cb
- Image Checksum Mismatch: one value is intentionally altered in the source CSV
cautious-robot -i examples/HCGSD_test_MD5_mismatch.csv -o examples/test_images_md5_mismatch --subdir-col Species -v "md5"
Output:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 4.23it/s] Images downloaded from examples/HCGSD_test_MD5_mismatch.csv to examples/test_images_md5_mismatch. Download logs are in examples/HCGSD_test_MD5_mismatch_log.jsonl and examples/HCGSD_test_MD5_mismatch_error_log.jsonl. Calculating md5 checksums on examples/test_images_md5_mismatch: 100%|████████████████████████████████| 8/8 [00:00<00:00, 4159.98it/s] md5 checksums for examples/test_images_md5_mismatch written to examples/HCGSD_test_MD5_mismatch_checksums.csv There are 8 files in examples/test_images_md5_mismatch. Based on examples/HCGSD_test_MD5_mismatch.csv, there should be 8 images. Image mismatch: 1 image(s) not aligned, see examples/HCGSD_test_MD5_mismatch_missing.csv for missing image info and check logs.
# Check on that mis-aligned image
head -n 2 examples/HCGSD_test_MD5_mismatch_missing.csvOutput:
nhm_specimen,species,subspecies,sex,file_url,filename,md5,in_img_dir 10428972,erato,petiverana,male,https://github.com/Imageomics/dashboard-prototype/raw/main/test_data/images/ventral_images/10428972_V_lowres.png,10428972_V_lowres.png,mismatch,False
- Download Partially Existing Images: some (or all) images may already exist in the output directory
# 1. Download the images
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_images
# 2. Remove some of the images
rm ./examples/test_images/104281*
# 3. Download the same set of images to get only those removed at 2
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_imagesOutput:
There are 6 of the desired files already in examples/test_images. Based on examples/HCGSD_testNA.csv, 2 images should be downloaded. 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:32<00:00, 76.36s/it] Images downloaded from examples/HCGSD_testNA.csv to examples/test_images. Download logs are in examples/HCGSD_testNA_log.jsonl and examples/HCGSD_testNA_error_log.jsonl. Calculating md5 checksums on examples/test_images: 100%|████████████████████████████████| 8/8 [00:00<00:00, 4159.98it/s] md5 checksums for examples/test_images written to examples/HCGSD_testNA_checksums.csv There are 8 files in examples/test_images. Based on examples/HCGSD_testNA.csv, there should be 8 images.
# Attempt to download the same set of images
cautious-robot --input-file examples/HCGSD_testNA.csv --output-dir examples/test_imagesOutput:
'examples/test_images' already contains all images. Exited without executing.
To develop the package further:
- Clone the repository and create a branch.
- Install with dev dependencies:
pip install -e ".[dev]"- Install pre-commit hook:
pre-commit install
pre-commit autoupdate # optionally update- Run tests:
pytest[1] The test images are from the Cuthill Gold Standard Dataset, which was processed from Cuthill, et. al. (original dataset available at doi:10.5061/dryad.2hp1978).
