Medusa is a web spider framework that can spider a domain and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized spider tasks quickly and easily.
- Multi-threaded design for high performance
- Tracks 301 HTTP redirects
- Built-in BFS algorithm for determining page depth
- Allows exclusion of URLs based on regular expressions
- Choose the links to follow on each page with focus_crawl()
- HTTPS support
- Records response time for each page
- CLI program can list all pages in a domain, calculate page depths, and more
- Obeys robots.txt
- In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis
- Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
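
The page-depth feature above relies on breadth-first search. The following is a minimal, library-independent sketch of that idea, not Medusa's actual implementation: the `LINKS` graph, the `page_depths` helper, and the example.com URLs are all hypothetical and exist only for illustration.

```ruby
# Hypothetical in-memory link graph: each URL maps to the URLs it links to.
LINKS = {
  "http://example.com/"  => ["http://example.com/a", "http://example.com/b"],
  "http://example.com/a" => ["http://example.com/c"],
  "http://example.com/b" => ["http://example.com/c", "http://example.com/"],
  "http://example.com/c" => []
}.freeze

# Breadth-first search from the root. The first time a page is reached
# gives its minimum number of clicks (its depth) from the root.
def page_depths(links, root)
  depths = { root => 0 }
  queue = [root]
  until queue.empty?
    url = queue.shift
    links.fetch(url, []).each do |target|
      next if depths.key?(target) # already reached at an equal or smaller depth
      depths[target] = depths[url] + 1
      queue << target
    end
  end
  depths
end

DEPTHS = page_depths(LINKS, "http://example.com/")
DEPTHS.each { |url, depth| puts "#{depth} #{url}" }
```

Because the queue is processed in first-in, first-out order, every page at depth n is visited before any page at depth n + 1, so the recorded depth is always the shortest one.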
See the scripts under the lib/Medusa/cli directory for examples of several useful Medusa tasks.
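
As a rough illustration of two features listed above (excluding URLs by regular expression and choosing which links to follow on each page), here is a plain-Ruby sketch that does not call Medusa at all. The patterns, the `skip_link?` and `links_to_follow` helpers, and the `limit` parameter are assumptions made for this example. In Medusa a rule like this would presumably be supplied through `focus_crawl()`, whose exact signature is not shown here.

```ruby
require "uri"

# Hypothetical exclusion patterns: any URL matching one of them is skipped.
SKIP_PATTERNS = [/\.pdf\z/i, %r{/private/}].freeze

# Returns true when the URL matches at least one exclusion pattern.
def skip_link?(url, patterns = SKIP_PATTERNS)
  patterns.any? { |pattern| pattern.match?(url.to_s) }
end

# Hypothetical focus rule: drop excluded links, keep only links on the
# same host as the current page, and follow at most `limit` of them.
def links_to_follow(page_url, links, limit: 10)
  host = URI(page_url).host
  links
    .reject { |link| skip_link?(link) }
    .select { |link| URI(link).host == host }
    .first(limit)
end

candidates = [
  "http://example.com/a",
  "http://example.com/doc.pdf",
  "http://other.org/x",
  "http://example.com/private/y"
]
puts links_to_follow("http://example.com/", candidates)
```

Keeping the exclusion test and the focus rule as separate small functions makes each one easy to check on its own before wiring it into a crawl.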
- [ ] Simplify storage module using Moneta, see #1
- [ ] Add a multiverse of Ruby versions and runtimes to the test suite
- [ ] Solve memory issues with a persistent Queue
- [ ] Improve docs & examples
- [ ] Allow controlling the crawler, e.g. “stop”, “resume”
- [ ] Improve logging facilities to collect stats, catch errors & failures
- [ ] Add the concept of “bots” or drivers to interact with pages (e.g. Capybara)
Do you have an idea? Open an issue so we can discuss it.
To test and develop this gem, additional requirements are:
- rspec
- fakeweb
- tokyocabinet
- kyotocabinet-ruby
- mongo
- redis
- sqlite3