Let's be straightforward about this topic, shall we? None of this
"kitten picture downloading" crap.
At various times I've used HTTrack (even with debug output, it's
difficult to find out why certain links don't get followed), PicPluck
(crashes), and various other tools (none of which provide dynamic file
renaming, a KEY feature). I've been through practically the whole list
at Softpedia.
What I need:
-- ability to specify spider's behavior: depth, obey/disobey
robots.txt, go up/down/TLD/offsite, etc
-- specify filetypes and min size
-- dynamic file renaming, either with an MD5 hash (or somesuch) of the
site/path or a user-defined structure (as w/ HTTrack)
-- drag and drop URLs from Firefox
-- multithreading
-- stability
-- minimal "cleanup" (not leaving various "project" and "log" files everywhere)
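
To make the renaming item concrete, here is a rough Python sketch of what I mean by naming a file after an MD5 of its site/path. The function name hashed_name and the choice to ignore the query string and lowercase the host are my own assumptions for illustration, not how HTTrack or any other tool does it.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def hashed_name(url: str) -> str:
    """Build a filename from the MD5 of the URL's host and path.

    Hypothetical helper: the query string and fragment are ignored,
    and the original file extension is kept (lowercased).
    """
    parts = urlsplit(url)
    # Host names are case-insensitive, so normalize before hashing.
    key = parts.netloc.lower() + parts.path
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # Keep the extension so the file still opens in the right program.
    ext = posixpath.splitext(parts.path)[1].lower()
    return digest + ext


if __name__ == "__main__":
    print(hashed_name("http://example.com/gallery/img001.jpg"))
```

Two files both called img001.jpg on different sites (or in different directories) end up with different names, so nothing gets overwritten and nothing needs a "(1)" suffix.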
Where is my holy grail?