LumberMill
==========

Python LogParser

.. image:: https://readthedocs.org/projects/lumbermill/badge/?version=latest
   :target: http://lumbermill.readthedocs.org/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: https://travis-ci.org/dstore-dbap/LumberMill.svg?branch=master
   :target: https://travis-ci.org/dstore-dbap/LumberMill

.. image:: https://coveralls.io/repos/dstore-dbap/LumberMill/badge.svg?branch=master&service=github
   :target: https://coveralls.io/github/dstore-dbap/LumberMill?branch=master

Introduction
""""""""""""

Collect, parse and store logs with a configurable set of modules. Inspired by `logstash <https://github.com/elasticsearch/logstash>`_, but with a smaller memory footprint and faster startup time.

Compatibility and Performance
"""""""""""""""""""""""""""""

To run LumberMill you will need Python 2.7+. For better performance, I heartily recommend running LumberMill with pypy. Running single-processed, the performance gain can be up to 5-6 times the events/s throughput. Tested with pypy-2.4, pypy-2.5 and pypy-4.1. A small benchmark comparing the performance of different python/pypy versions and logstash-1.4.2 can be found `here <http://www.netprojects.de/simple-benchmark-of-lumbermill/>`_.

Installation
""""""""""""

via pypi
^^^^^^^^

::

 pip install LumberMill

manually
^^^^^^^^

Clone the github repository to /opt/LumberMill (or any other location that fits you better :):

::

 git clone https://github.com/dstore-dbap/LumberMill.git /opt/LumberMill

Install LumberMill and its dependencies:

::

 cd /opt/LumberMill
 python setup.py install

You may need the MaxMind geo database. Install it with:

::

 mkdir /usr/share/GeoIP
 cd /usr/share/GeoIP
 wget "http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz"
 gunzip GeoLiteCity.dat.gz

Now you can give LumberMill a testdrive with:

::

 wget https://raw.githubusercontent.com/dstore-dbap/LumberMill/master/conf/example-stdin.conf
 echo "I'm a lumberjack, and I'm okay" | lumbermill -c ./example-stdin.conf

If you get a "command not found" error, please check your PYTHONPATH setting. Depending on how you installed LumberMill, the executable can be found either in the bin dir of your python environment (e.g. /usr/lib64/pypy-2.4.0/bin/lumbermill) or in your default path (e.g. /usr/local/bin/lumbermill).

Other basic configuration examples: https://github.com/dstore-dbap/LumberMill/tree/master/conf/.

For a how-to on running LumberMill, Elasticsearch and Kibana on CentOS, feel free to visit http://www.netprojects.de/collect-visualize-your-logs-with-lumbermill-and-elasticsearch-on-centos/.

Configuration example (with explanations)
"""""""""""""""""""""""""""""""""""""""""

Below, I will explain each section in more detail.

::

# Sets number of parallel LumberMill processes.
- Global:
   workers: 2

# Listen on all interfaces, port 5151.
- TcpServer:
   port: 5151
   receivers:
    - RegexParser

# Listen on all interfaces, port 5152.
- TcpServer:
   port: 5152
   mode: stream
   chunksize: 32768

# Decode msgpacked data.
- MsgPackParser:
   mode: stream

# Extract fields.
- RegexParser:
   source_field: data
   hot_rules_first: True
   field_extraction_patterns:
    - httpd_access_log: '(?P<remote_ip>\d+\.\d+\.\d+\.\d+)\s+(?P<identd>\w+|-)\s+(?P<user>\w+|-)\s+\[(?P<datetime>\d+\/\w+\/\d+:\d+:\d+:\d+\s.\d+)\]\s+\"(?P<url>.*)\"\s+(?P<http_status>\d+)\s+(?P<bytes_send>\d+)'
    - http_common_access_log: '(?P<remote_ip>\d+\.\d+\.\d+\.\d+)\s(?P<x_forwarded_for>\d+\.\d+\.\d+\.\d+)\s(?P<identd>\w+|-)\s(?P<user>\w+|-)\s\[(?P<datetime>\d+\/\w+\/\d+:\d+:\d+:\d+\s.\d+)\]\s\"(?P<url>.*)\"\s(?P<http_status>\d+)\s(?P<bytes_send>\d+)'
    - iptables: '(?P<syslog_prival>\<\d+\>)(?P<log_timestamp>\w+\s+\d+\s+\d+:\d+:\d+)\s+(?P<host>[\w\-\._]+)\s+kernel:.*?\ iptables\ (?P<iptables_action>.*?)\ :\ IN=(?P<iptables_in_int>.*?)\ OUT=(?P<iptables_out_int>.*?)\ SRC=(?P<iptables_src>.*?)\ DST=(?P<iptables_dst>.*?)\ LEN=(?P<iptables_len>.*?)\ .*?PROTO=(?P<iptables_proto>.*?)\ SPT=(?P<iptables_spt>.*?)\ DPT=(?P<iptables_dpt>.*?)\ WINDOW=.*'
   receivers:
    - SimpleStats:
       filter: $(lumbermill.event_type) != 'Unknown'
    # Print out messages that did not match
    - StdOutSink:
       filter: $(lumbermill.event_type) == 'Unknown'

# Print out some stats every 10 seconds.
- SimpleStats:
   interval: 10

# Extract the syslog prival from events received via syslog.
- SyslogPrivalParser:
   source_field: syslog_prival

# Add a timestamp field.
- AddDateTime:
   format: '%Y-%m-%dT%H:%M:%S.%f'
   target_field: "@timestamp"

# Add geo info based on the lookup_fields. The first field in <source_fields> that yields a result from geoip will be used.
- AddGeoInfo:
   geoip_dat_path: /usr/share/GeoIP/GeoLiteCity.dat
   source_fields: [x_forwarded_for, remote_ip]
   geo_info_fields: ['latitude', 'longitude', 'country_code']

# Nginx logs request time in seconds with milliseconds as float. Apache logs microseconds as int.
# At least cast nginx to integer.
- Math:
   filter: if $(server_type) == "nginx"
   target_field: request_time
   function: float($(request_time)) * 1000

# Map field values of <source_field> to values in <map>.
- ModifyFields:
   filter: if $(http_status)
   action: map
   source_field: http_status
   map: {100: 'Continue', 200: 'OK', 301: 'Moved Permanently', 302: 'Found', 304: 'Not Modified', 400: 'Bad Request', 401: 'Unauthorized', 403: 'Forbidden', 404: 'Not Found', 500: 'Internal Server Error', 502: 'Bad Gateway'}

# Kibana's 'bettermap' panel needs an array of floats in order to plot events on a map.
- ModifyFields:
   filter: if $(latitude)
   action: merge
   source_fields: [longitude, latitude]
   target_field: geoip

# Extract some fields from the user agent data.
- UserAgentParser:
   source_fields: user_agent

# Parse the url into its components.
- UrlParser:
   source_field: uri
   target_field: uri_parsed
   parse_querystring: True
   querystring_target_field: params

# Store events in elastic search.
- ElasticSearchSink:
   nodes: [localhost]
   store_interval_in_secs: 5

- StdOutSink
Let me explain it in more detail:

::

# Sets number of parallel LumberMill processes.
- Global:
   workers: 2
The default number of workers is CPU_COUNT - 1.

::

# Listen on all interfaces, port 5151.
- TcpServer:
   port: 5151
   receivers:
    - RegexParser
Each module comes with a set of default values, so you only need to provide the settings you want to customize. For a description of a module's default values, refer to the README.md in the modules directory or to its docstring. By default, a module sends its output to the next module in the configuration. To set a custom receiver, set the receivers value. This TcpServer module sends its output directly to RegexParser.

::

# Listen on all interfaces, port 5152.
- TcpServer:
   port: 5152
   mode: stream
   chunksize: 32768
The first TcpServer uses a newline as the separator for each received event (this is the default). This second server runs in stream mode: it reads at most 32k of data at a time and passes it on to the next module.

::

# Decode msgpacked data.
- MsgPackParser:
   mode: stream
Decodes msgpacked data, as produced e.g. by `python-beaver <https://github.com/josegonzalez/python-beaver>`_.

::

# Extract fields.
- RegexParser:
   source_field: data
   hot_rules_first: True
   field_extraction_patterns:
    - httpd_access_log: '(?P<remote_ip>\d+\.\d+\.\d+\.\d+)\s+(?P<identd>\w+|-)\s+(?P<user>\w+|-)\s+\[(?P<datetime>\d+\/\w+\/\d+:\d+:\d+:\d+\s.\d+)\]\s+\"(?P<url>.*)\"\s+(?P<http_status>\d+)\s+(?P<bytes_send>\d+)'
    - http_common_access_log: '(?P<remote_ip>\d+\.\d+\.\d+\.\d+)\s(?P<x_forwarded_for>\d+\.\d+\.\d+\.\d+)\s(?P<identd>\w+|-)\s(?P<user>\w+|-)\s\[(?P<datetime>\d+\/\w+\/\d+:\d+:\d+:\d+\s.\d+)\]\s\"(?P<url>.*)\"\s(?P<http_status>\d+)\s(?P<bytes_send>\d+)'
    - iptables: '(?P<syslog_prival>\<\d+\>)(?P<log_timestamp>\w+\s+\d+\s+\d+:\d+:\d+)\s+(?P<host>[\w\-\._]+)\s+kernel:.*?\ iptables\ (?P<iptables_action>.*?)\ :\ IN=(?P<iptables_in_int>.*?)\ OUT=(?P<iptables_out_int>.*?)\ SRC=(?P<iptables_src>.*?)\ DST=(?P<iptables_dst>.*?)\ LEN=(?P<iptables_len>.*?)\ .*?PROTO=(?P<iptables_proto>.*?)\ SPT=(?P<iptables_spt>.*?)\ DPT=(?P<iptables_dpt>.*?)\ WINDOW=.*'
   receivers:
    - SimpleStats:
       filter: $(lumbermill.event_type) != 'Unknown'
    # Print out messages that did not match
    - StdOutSink:
       filter: $(lumbermill.event_type) == 'Unknown'
If an extraction pattern matches, the name of that pattern (e.g. httpd_access_log) is stored in the lumbermill.event_type field; events that matched no pattern keep the type 'Unknown'. Such an unmatched event would look like this:

::

 {'data': '...',
  'lumbermill': {'event_id': '90818a85f3aa3af302390bbe77fbc1c87800',
                 'event_type': 'Unknown',
                 'pid': 7800,
                 'received_by': 'vagrant-centos65.vagrantup.com',
                 'received_from': '127.0.0.1:61430',
                 'source_module': 'TcpServer'}}
You may want to drop these internal fields prior to storing the event in some backend.
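
The named capturing groups in the extraction patterns above become event fields. A quick way to test a pattern outside of LumberMill is plain Python; this sketch applies the httpd_access_log regex from the config to a sample access log line:

```python
import re

# The httpd_access_log pattern from the configuration above, unchanged.
pattern = re.compile(
    r'(?P<remote_ip>\d+\.\d+\.\d+\.\d+)\s+(?P<identd>\w+|-)\s+(?P<user>\w+|-)\s+'
    r'\[(?P<datetime>\d+\/\w+\/\d+:\d+:\d+:\d+\s.\d+)\]\s+\"(?P<url>.*)\"\s+'
    r'(?P<http_status>\d+)\s+(?P<bytes_send>\d+)')

line = ('192.168.2.20 - - [28/Jul/2006:10:27:10 -0300] '
        '"GET /wiki/Monty_Python/?spanish=inquisition HTTP/1.0" 200 3395')
fields = pattern.match(line).groupdict()
print(fields['remote_ip'], fields['http_status'], fields['bytes_send'])
# 192.168.2.20 200 3395
```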

::

# Print out some stats every 10 seconds.
- SimpleStats:
   interval: 10
Prints out some simple stats every interval seconds.

::

# Extract the syslog prival from events received via syslog.
- SyslogPrivalParser:
   source_field: syslog_prival
Parses the syslog prival as described in `RFC5424 <http://tools.ietf.org/html/rfc5424>`_.
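
Per RFC 5424, the prival encodes facility and severity in a single integer (facility * 8 + severity). A minimal decoding sketch, independent of LumberMill's actual implementation (the output field names here are made up):

```python
# Decode a syslog PRI value into facility and severity (RFC 5424, section 6.2.1).
def decode_prival(prival: int) -> dict:
    return {
        "syslog_facility": prival // 8,   # e.g. 20 = local4
        "syslog_severity": prival % 8,    # e.g. 5 = notice
    }

# "<165>" in a syslog header means facility local4 (20), severity notice (5).
print(decode_prival(165))  # {'syslog_facility': 20, 'syslog_severity': 5}
```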

::

# Add a timestamp field.
- AddDateTime:
   format: '%Y-%m-%dT%H:%M:%S.%f'
   target_field: "@timestamp"
If you want to use Kibana to analyze your event data, this field is required.
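
The format string is a standard Python strftime pattern, so the module presumably produces timestamps like this:

```python
from datetime import datetime

# Same format string as in the AddDateTime config above.
fmt = '%Y-%m-%dT%H:%M:%S.%f'
ts = datetime(2006, 7, 28, 10, 27, 10).strftime(fmt)
print(ts)  # 2006-07-28T10:27:10.000000
```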

::

# Add geo info based on the lookup_fields. The first field in <source_fields> that yields a result from geoip will be used.
- AddGeoInfo:
   geoip_dat_path: /usr/share/GeoIP/GeoLiteCity.dat
   source_fields: [x_forwarded_for, remote_ip]
   geo_info_fields: ['latitude', 'longitude', 'country_code']
The first field in source_fields that yields a geoip result will be used.

::

# Nginx logs request time in seconds with milliseconds as float. Apache logs microseconds as int.
# At least cast nginx to integer.
- Math:
   filter: if $(server_type) == "nginx"
   target_field: request_time
   function: float($(request_time)) * 1000
Only events with server_type set to nginx will be handled by this module.

::

# Map field values of <source_field> to values in <map>.
- ModifyFields:
   filter: if $(http_status)
   action: map
   source_field: http_status
   map: {100: 'Continue', 200: 'OK', 301: 'Moved Permanently', 302: 'Found', 304: 'Not Modified', 400: 'Bad Request', 401: 'Unauthorized', 403: 'Forbidden', 404: 'Not Found', 500: 'Internal Server Error', 502: 'Bad Gateway'}
In this example, numeric http status codes are mapped to human readable values.

::

# Kibana's 'bettermap' panel needs an array of floats in order to plot events on a map.
- ModifyFields:
   filter: if $(latitude)
   action: merge
   source_fields: [longitude, latitude]
   target_field: geoip
This merges the longitude and latitude fields into the geoip field.

::

# Extract some fields from the user agent data.
- UserAgentParser:
   source_fields: user_agent
   target_field: user_agent_info
The parsed fields will be accessible via user_agent_info.browser.name etc.

::

# Parse the url into its components.
- UrlParser:
   source_field: uri
   target_field: uri_parsed
   parse_querystring: True
   querystring_target_field: params
The parsed fields will be accessible via uri_parsed.scheme, uri_parsed.path, uri_parsed.query etc.
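
Python's standard library offers the same decomposition, which is presumably what a result like uri_parsed and the params field boils down to (the host in this example is made up):

```python
from urllib.parse import urlparse, parse_qs

uri = 'http://some.host/wiki/Monty_Python/?spanish=inquisition'
parsed = urlparse(uri)
print(parsed.path)             # /wiki/Monty_Python/
print(parse_qs(parsed.query))  # {'spanish': ['inquisition']}
```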

::

# Store events in elastic search.
- ElasticSearchSink:
   nodes: [localhost]
   store_interval_in_secs: 5
nodes lists the elasticsearch nodes to connect to.

::

- StdOutSink

Events received by this module will be printed out to stdout. The RegexParser module was configured to send unmatched events to this module.

The different modules can be combined in any order.

For inter-process communication, ZeroMQ is used instead of the default multiprocessing.Queue; this yielded nearly three times the throughput compared to multiprocessing.Queue.

Working modules
"""""""""""""""

Event inputs
^^^^^^^^^^^^

  • ElasticSearch, get documents from elasticsearch.
  • Kafka, receive events from apache kafka.
  • NmapScanner, scan network with nmap and emit result as new event.
  • RedisChannel, read events from redis channels.
  • RedisList, read events from redis lists.
  • Sniffer, sniff network traffic.
  • Spam, what it says on the can - spams LumberMill for testing.
  • SQS, read messages from Amazon's Simple Queue Service.
  • StdIn, read stream from standard in.
  • TcpServer, read stream from a tcp socket.
  • UdpServer, read data from udp socket.
  • UnixSocket, read stream from a named socket on unix like systems.
  • Zmq, read events from a zeromq.

Event parsers
^^^^^^^^^^^^^

  • Base64Parser, parse base64 data.
  • CollectdParser, parse collectd binary protocol data.
  • CSVParser, parse a char separated string.
  • DomainNameParser, parse a domain name or url to tld, subdomain etc. parts.
  • InflateParser, inflates any fields with supported compression codecs.
  • JsonParser, parse a json formatted string.
  • LineParser, split lines at a separator and emit each line as a new event.
  • MsgPackParser, parse a msgpack encoded string.
  • RegexParser, parse a string using regular expressions and named capturing groups.
  • SyslogPrivalParser, parse the syslog prival value (RFC5424).
  • UrlParser, parse the query string from a url.
  • UserAgentParser, parse a http user agent string.
  • XPathParser, parse an XML document via an xpath expression.

Event modifiers
^^^^^^^^^^^^^^^

  • AddDateTime, adds a timestamp field.
  • AddDnsLookup, adds dns data.
  • AddGeoInfo, adds geo info fields.
  • DropEvent, discards event.
  • ExecPython, execute custom python code.
  • Facet, collect all encountered variations of an event value over a configurable period of time.
  • HttpRequest, execute an arbitrary http request and store the result.
  • Math, execute arbitrary math functions.
  • MergeEvent, merge multiple events to one single event.
  • ModifyFields, some methods to change extracted fields, e.g. insert, delete, replace, castToInteger etc.
  • Permutate, takes a list in the event data and emits events for all possible permutations of that list.

Outputs
^^^^^^^

  • DevNullSink, discards all data that it receives.
  • ElasticSearchSink, stores data entries in an elasticsearch index.
  • FileSink, store events in a file.
  • GraphiteSink, send metrics to graphite server.
  • LoggerSink, sends data to lumbermill internal logger for output.
  • MongoDbSink, stores data entries in a mongodb index.
  • RedisChannelSink, publish incoming events to redis channel.
  • RedisListSink, publish incoming events to redis list.
  • StdOutSink, prints all received data to standard out.
  • SQSSink, sends events to Amazon's Simple Queue Service.
  • SyslogSink, send events to syslog.
  • WebHdfsSink, store events in hdfs via webhdfs.
  • ZmqSink, sends incoming event to zeromq.

Misc modules
^^^^^^^^^^^^

  • EventBuffer, store received events in a persistent backend until the event was successfully handled.
  • KeyValueStore, simple wrapper around the python simplekv module.
  • RedisStore, use redis to store and retrieve values, e.g. to store the result of the XPathParser module.
  • SimpleStats, simple statistic module just for event rates etc.
  • Statistics, more versatile. Configurable fields for collecting statistic data.
  • Tarpit, slows event propagation down - for testing.
  • Throttle, throttle event count over a given time period.

Cluster modules
^^^^^^^^^^^^^^^

  • Pack, base pack module. Handles pack leader and pack member discovery.
  • PackConfiguration, syncs leader configuration to pack members.

Webserver modules
^^^^^^^^^^^^^^^^^

  • WebGui, a web interface to LumberMill.
  • WebserverTornado, base webserver module. Handles all incoming requests.

Event flow basics
"""""""""""""""""

  • an input module receives an event.
  • the event data will be wrapped in a default event dictionary of the following structure: { "data": payload, "lumbermill": { "event_id": unique event id, "event_type": "Unknown", "received_from": ip address of sender, "source_module": caller_class_name } }
  • the input module sends the new event to its receivers, either by adding it to a queue or by calling the receiver's handleEvent method.
  • if no receivers are configured, the next module in config will be the default receiver.
  • each following module will process the event via its handleEvent method and pass it on to its receivers.
  • each module can have an input filter and an output filter to manage event propagation through the modules.
  • output modules cannot have receivers.
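
The flow described above can be sketched in a few lines of plain Python. This is illustrative only; apart from handleEvent, the class and method names are made up:

```python
class Module:
    """Minimal stand-in for a LumberMill module: process an event, pass it on."""
    def __init__(self):
        self.receivers = []

    def process(self, event):
        return event  # overridden by concrete modules

    def handleEvent(self, event):
        event = self.process(event)
        for receiver in self.receivers:
            receiver.handleEvent(event)

class SetEventType(Module):
    """Stand-in for a parser that sets the event type."""
    def process(self, event):
        event['event_type'] = 'httpd_access_log'
        return event

class Collector(Module):
    """Stand-in for a sink; just remembers what it received."""
    def __init__(self):
        super().__init__()
        self.events = []
    def process(self, event):
        self.events.append(event)
        return event

parser, sink = SetEventType(), Collector()
parser.receivers.append(sink)  # explicit receiver, as in the config examples
parser.handleEvent({'data': 'raw log line'})
print(sink.events[0]['event_type'])  # httpd_access_log
```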

Configuration basics
""""""""""""""""""""

Every module configuration follows the same pattern:

::

- SomeModuleName:
    id: AliasModuleName                     # <default: ""; type: string; is: optional>
    filter: if $(cache_status) == "-"
    add_fields: {'my_new_field': 'my_new_value'}
    delete_fields: ['drop_this_field', 'drop_that_field']
    event_type: my_custom_type
    receivers:
     - ModuleName
     - ModuleAlias:
         filter: if $('event_type') == 'httpd_access_log'
  • module: specifies the module name and maps to the class name of the module.
  • id: used to set an alias name if you run more than one instance of a module.
  • filter: applies a filter to incoming events. Only matching events will be handled by this module.
  • add_fields: if the event is handled by the module, add these fields to the event.
  • delete_fields: if the event is handled by the module, delete these fields from the event.
  • event_type: if the event is handled by the module, set event_type to this value.
  • receivers: ModuleName or id of the receiving modules. If a filter is provided, only matching events will be sent to the receiver. If no receivers are configured, the next module in the config will be the default receiver.

For modules that support the storage of intermediate values in redis:

  • configuration['redis-client']: name of the redis client as set in the configuration.
  • configuration['redis-key']: key used to store the data in redis.
  • configuration['redis-ttl']: ttl of the stored data in redis.

For configuration details of each module refer to its docstring.

Event field notation
""""""""""""""""""""

The following examples refer to this event data:

::

{'bytes_send': '3395',
 'data': '192.168.2.20 - - [28/Jul/2006:10:27:10 -0300] "GET /wiki/Monty_Python/?spanish=inquisition HTTP/1.0" 200 3395\n',
 'datetime': '28/Jul/2006:10:27:10 -0300',
 'lumbermill': {
                'event_id': '715bd321b1016a442bf046682722c78e',
                'event_type': 'httpd_access_log',
                'received_from': '127.0.0.1',
                'source_module': 'StdIn',
  },
 'http_status': '200',
 'identd': '-',
 'remote_ip': '192.168.2.20',
 'url': 'GET /wiki/Monty_Python/?spanish=inquisition HTTP/1.0',
 'fields': ['nobody', 'expects', 'the'],
 'params':  { u'spanish': [u'inquisition']},
 'user': '-'}

Notation in configuration fields like source_field or target_field
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Just use the field name. If referring to a nested dict or a list element, use dots:

::

- RegexParser:
    source_field: fields.2

- RegexParser:
    source_field: params.spanish
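
A rough sketch of how such a dotted path resolves against the sample event above (illustrative only, not LumberMill's actual lookup code):

```python
# Resolve a dotted field path like "params.spanish.0" against an event dict.
def get_field(event, path):
    value = event
    for part in path.split('.'):
        if isinstance(value, list):
            value = value[int(part)]  # numeric parts index into lists
        else:
            value = value[part]       # other parts are dict keys
    return value

event = {'fields': ['nobody', 'expects', 'the'],
         'params': {'spanish': ['inquisition']}}
print(get_field(event, 'fields.2'))          # the
print(get_field(event, 'params.spanish.0'))  # inquisition
```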

Notation in strings
^^^^^^^^^^^^^^^^^^^

Use $(variable_name) notation. If referring to a nested dict, use dots:

::

- ElasticSearchSink:
    index_name: 1perftests
    doc_id: $(fields.0)-$(params.spanish.0)

Notation in module filters
^^^^^^^^^^^^^^^^^^^^^^^^^^

Use $(variable_name) notation. If referring to a nested dict, use dots:

::

- StdOutSink:
    filter: if $(fields.0) == "nobody" and $(params.spanish.0) == 'inquisition'

Filters
"""""""

Modules can have an input filter:

::

- StdOutSink:
    filter: if $(remote_ip) == '192.168.2.20' and re.match('^GET', $(url))

Modules can have an output filter:

::

- RegexParser:
    ...
    receivers:
      - StdOutSink:
          filter: if $(remote_ip) == '192.168.2.20' and re.match('^GET', $(url))
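
Conceptually, a filter is a Python expression in which $(field) placeholders are replaced by event lookups. A rough sketch of that idea for top-level fields (not LumberMill's actual implementation):

```python
import re

def matches(filter_expr, event):
    # Strip the leading "if" and rewrite $(field) to a dict lookup.
    # Note: this naive rewrite only handles top-level fields.
    expr = re.sub(r'^\s*if\s+', '', filter_expr)
    expr = re.sub(r'\$\(([\w.]+)\)', r'event["\1"]', expr)
    return bool(eval(expr, {'re': re, 'event': event}))

event = {'remote_ip': '192.168.2.20', 'url': 'GET /wiki/ HTTP/1.0'}
print(matches("if $(remote_ip) == '192.168.2.20' and re.match('^GET', $(url))",
              event))  # True
```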

A rough sketch for using LumberMill with syslog-ng
""""""""""""""""""""""""""""""""""""""""""""""""""

Send e.g. apache access logs to syslog (/etc/httpd/conf/httpd.conf):

::

...
CustomLog "| /usr/bin/logger -p local1.info -t apache2" common
...

Configure syslog-ng to forward these messages to LumberMill (/etc/syslog-ng.conf):

::

...
destination d_gambolputty { tcp( localhost port(5151) ); };
filter f_httpd_access { facility(local1); };
log { source(s_sys); filter(f_httpd_access); destination(d_gambolputty); flags(final);};
... 

Configure LumberMill to listen on port 5151 (./conf/lumbermill.conf):

::

...
- TcpServer:
    interface: localhost
    port: 5151
...
