Medusa web-spider framework

Medusa

Medusa is a web spider framework that can spider a domain and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized spider tasks quickly and easily.

Features

  • Multi-threaded design for high performance
  • Tracks 301 HTTP redirects
  • Built-in BFS algorithm for determining page depth
  • Allows exclusion of URLs based on regular expressions
  • Lets you choose which links to follow on each page with focus_crawl()
  • HTTPS support
  • Records response time for each page
  • CLI program can list all pages in a domain, calculate page depths, and more
  • Obeys robots.txt
  • In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis
  • Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options)
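
To make the depth, exclusion, and focus_crawl() features above more concrete, here is a minimal sketch in plain Ruby. It is not Medusa's implementation and does not use Medusa's API: the SITE hash, the page_depths method, and its skip: parameter are hypothetical names invented for this illustration, and the "site" is an in-memory link graph rather than real HTTP responses. It only shows the general technique: a breadth-first walk assigns each page the length of the shortest link path from the start page, URLs matching a regular expression are never visited, and an optional block narrows the links followed from each page.

```ruby
# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
  "http://example.com/"       => ["http://example.com/a", "http://example.com/b", "http://example.com/logout"],
  "http://example.com/a"      => ["http://example.com/c", "http://example.com/b"],
  "http://example.com/b"      => ["http://example.com/c"],
  "http://example.com/c"      => ["http://example.com/"],
  "http://example.com/logout" => ["http://example.com/secret"],
  "http://example.com/secret" => []
}.freeze

# Breadth-first walk over +site+ starting at +start+.
# Returns a hash of { url => depth }, where depth is the number of links
# on the shortest path from +start+.
#
# skip:  array of regular expressions; matching URLs are never visited.
# block: optional; receives (url, links) and returns the links to follow,
#        loosely mirroring the idea behind focus_crawl().
def page_depths(site, start, skip: [])
  depths = { start => 0 }
  queue  = [start]

  until queue.empty?
    url   = queue.shift
    links = site.fetch(url, [])
    links = yield(url, links) if block_given?

    links.each do |link|
      next if depths.key?(link)                            # already discovered
      next if skip.any? { |pattern| pattern.match?(link) } # excluded URL
      depths[link] = depths[url] + 1
      queue << link
    end
  end

  depths
end

# Exclude anything that looks like a logout link.
DEPTHS = page_depths(SITE, "http://example.com/", skip: [/logout/])

# Follow only the first link found on each page.
FOCUSED = page_depths(SITE, "http://example.com/") { |_url, links| links.first(1) }

DEPTHS.each { |url, depth| puts "#{depth} #{url}" }
```

Because the queue is processed first-in, first-out, a page is always recorded at the shallowest depth at which it is reachable. Note that skipping the logout URL also makes the page reachable only through it invisible to the walk.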

Examples

See the scripts under the lib/medusa/cli directory for examples of several useful Medusa tasks.

TODO

  • [ ] Simplify storage module using Moneta, see #1
  • [ ] Add a multiverse of Ruby versions and runtimes to the test suite
  • [ ] Solve memory issues with a persistent Queue
  • [ ] Improve docs & examples
  • [ ] Allow controlling the crawler, e.g. "stop" and "resume"
  • [ ] Improve logging facilities to collect stats and catch errors and failures
  • [ ] Add the concept of "bots" or drivers to interact with pages (e.g. Capybara)

Do you have an idea? Open an issue so we can discuss it.

Requirements

  • nokogiri
  • robots

Development

To test and develop this gem, additional requirements are:

  • rspec
  • fakeweb
  • tokyocabinet
  • kyotocabinet-ruby
  • mongo
  • redis
  • sqlite3

You will need Kyoto Cabinet and Tokyo Cabinet installed on your system, and MongoDB and Redis installed and running.

Top Contributors

chriskite brutuscat tilsammans mislav spk rb2k nehhen jasonkim skojin ZirconCode paresharma brownbeagle bernd lpradovera

Releases

  • v0.7.2
  • v0.7.1
  • v0.7.0
  • v0.6.0
  • v0.5.0
  • v0.4.0
  • 0.0.1