ArchiveBot

ArchiveBot, an IRC bot for archiving websites

  1. ArchiveBot

    Coders, I have a question. Or, a request, etc. I spent some time with xmc discussing something we could do to make things easier around here. What we came up with is a trigger for a bot, which can be triggered by people with ops. You tell it a website. It crawls it. WARC. Uploads it to archive.org. Boom. I can supply machine as needed. Obviously there’s some sanitation issues, and it is root all the way down or nothing. I think that would help a lot for smaller sites. Sites where it’s 100 pages or 1000 pages even, pretty simple. And just being able to go “bot, get a sanity dump”

  2. More info

ArchiveBot has two major backend components: the control node, which runs the IRC interface and bookkeeping programs, and the crawlers, which do all the Web crawling. ArchiveBot users communicate with ArchiveBot by issuing commands in an IRC channel.
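The split between the two components can be pictured with a small sketch. This is only an illustration of the division of labour described above, not ArchiveBot's actual code: the `Job`, `ControlNode`, and `Crawler` names, the in-memory queue, and the `!archive <url>` command syntax are all assumptions made for the example.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    url: str
    requested_by: str


def parse_command(nick: str, message: str) -> Optional[Job]:
    """Turn a chat line into a Job, or None if it is not a crawl request.

    The "!archive <url>" syntax is assumed here for illustration only.
    """
    parts = message.split()
    if (
        len(parts) == 2
        and parts[0] == "!archive"
        and parts[1].startswith(("http://", "https://"))
    ):
        return Job(url=parts[1], requested_by=nick)
    return None


class ControlNode:
    """Hypothetical control node: chat interface plus bookkeeping, no crawling."""

    def __init__(self) -> None:
        self.pending = queue.Queue()  # jobs waiting for a crawler
        self.log = []                 # simple bookkeeping trail

    def handle_message(self, nick: str, message: str) -> bool:
        job = parse_command(nick, message)
        if job is None:
            return False
        self.pending.put(job)
        self.log.append(f"queued {job.url} for {nick}")
        return True


class Crawler:
    """Hypothetical crawler: takes one job at a time from the control node."""

    def __init__(self, name: str) -> None:
        self.name = name

    def run_once(self, control: ControlNode) -> Optional[str]:
        try:
            job = control.pending.get_nowait()
        except queue.Empty:
            return None
        # A real crawler would fetch the site and write a WARC at this point.
        return f"{self.name} would crawl {job.url}"
```

In this sketch the control node only parses chat messages and records jobs, while any number of `Crawler` instances can drain the queue independently, which mirrors the separation between the IRC-facing side and the crawling side.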

User’s guide: http://archivebot.readthedocs.org/en/latest/

Control node installation guide: INSTALL.backend

Crawler installation guide: INSTALL.pipeline

  3. License

Copyright 2013 David Yip; made available under the MIT license. See LICENSE for details.

  4. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to GNU Wget. Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot’s current web crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and tracking down performance problems at scale.

Other thanks go to the following projects:

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far. Don’t look down, never look away; ArchiveBot’s like the wind.

