smart_open travis-ci python

Utils for streaming large files (S3, HDFS, gzip, bz2...)


smart_open – utils for streaming large files

|Travis|_ |License|_

.. |Travis| image:: .. |License| image:: .. _Travis: .. _License:


smart_open is a Python 2 & Python 3 library for efficient streaming of very large files from/to S3, HDFS, WebHDFS or local (compressed) files. It is well tested (using moto <>_), well documented and sports a simple, Pythonic API:

.. code-block:: python

stream lines from an S3 object

for line in smart_open.smart_open(‘s3://mybucket/mykey.txt’): … print line

you can also use a boto.s3.key.Key instance directly:

key = boto.connect_s3().get_bucket(“my_bucket”).get_key(“my_key”) with smart_open.smart_open(key) as fin: … for line in fin: … print line

can use context managers too:

with smart_open.smart_open(‘s3://mybucket/mykey.txt’) as fin: … for line in fin: … print line … # seek to the beginning … print # read 1000 bytes

stream from HDFS

for line in smart_open.smart_open(‘hdfs://user/hadoop/my_file.txt’): … print line

stream from WebHDFS

for line in smart_open.smart_open(‘webhdfs://host:port/user/hadoop/my_file.txt’): … print line

stream content into S3 (write mode):

with smart_open.smart_open(‘s3://mybucket/mykey.txt’, ‘wb’) as fout: … for line in [‘first line’, ‘second line’, ‘third line’]: … fout.write(line + ‘\n’)

stream content into WebHDFS (write mode):

with smart_open.smart_open(‘webhdfs://host:port/user/hadoop/my_file.txt’, ‘wb’) as fout: … for line in [‘first line’, ‘second line’, ‘third line’]: … fout.write(line + ‘\n’)

stream from/to local compressed files:

for line in smart_open.smart_open(‘./foo.txt.gz’): … print line

for line in smart_open.smart_open(‘~/foo.txt.gz’): … print line

with smart_open.smart_open(‘/home/radim/foo.txt.bz2’, ‘wb’) as fout: … fout.write(“some content\n”)

Since going over all (or select) keys in an S3 bucket is a very common operation, there’s also an extra method smart_open.s3_iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):

.. code-block:: python

get all JSON files under “mybucket/foo/”

bucket = boto.connect_s3().get_bucket(‘mybucket’) for key, content in s3_iter_bucket(bucket, prefix=‘foo/’, accept_key=lambda key: key.endswith(‘.json’)): … print key, len(content)

For more info (S3 credentials in URI, minimum S3 part size…) and full method signatures, check out the API docs:

.. code-block:: python

import smart_open help(smart_open.smart_open_lib)

S3-Specific Options

There are a few optional keyword arguments that are useful only for S3 access.

.. code-block:: python

smart_open.smart_open(‘s3://’, host=‘’) smart_open.smart_open(‘s3://’, profile_name=‘my-profile’)

These are both passed to boto.s3_connect() as keyword arguments. The S3 reader supports gzipped content, as long as the key is obviously a gzipped file (e.g. ends with “.gz”).


Working with large S3 files using Amazon’s default Python library, boto <>_, is a pain. Its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded in RAM, no streaming). There are nasty hidden gotchas when using boto’s multipart upload functionality, and a lot of boilerplate.

smart_open shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make.


The module has no dependencies beyond Python >= 2.6 (or Python >= 3.3), boto and requests::

pip install smart_open

Or, if you prefer to install from the source tar.gz <>_::

python test  # run unit tests
python install

To run the unit tests (optional), you’ll also need to install mock <>_ , moto <>_ and responses <> (pip install mock moto responses). The tests are also run automatically with Travis CI <>_ on every commit push & pull request.


smart_open is an ongoing effort. Suggestions, pull request and improvements welcome!

On the roadmap:

  • better documentation for the default file:// scheme

Comments, bug reports

smart_open lives on github <>_. You can file issues or pull requests there.

smart_open is open source software released under the MIT license <>. Copyright © 2015-now Radim Řehůřek <>.

Related Repositories



Utils for streaming large files (S3, HDFS, gzip, bz2...) ...

Top Contributors

piskvorky tmylk ziky90 VincTheSecond asieira salilb pombredanne mpenkov robcowie nikicc jsphpl gojomo andreycizov Janrain-Colin msabramo petedmarsh ellimilial kalaidin yupbank


-   1.3.5 zip tar
-   1.3.4 zip tar
-   1.3.3 zip tar
-   1.3.2 zip tar
-   1.3.1 zip tar
-   1.3.0 zip tar
-   1.3.0rc1 zip tar
-   1.2.1 zip tar