Recover mojibake text using a reverse-mapping table

= MojiBake

== Description

In English text, mojibake most often results from misinterpreting or incorrectly transcoding between Windows-1252, ISO-8859-1, and UTF-8. This module provides a table that maps mojibake sequences back to their original characters, and a utility to recover mojibake’d text.
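As an illustration of how such sequences arise, the following minimal sketch uses only the standard ruby String encoding API. The helper names garble and ungarble are hypothetical and are not part of this gem. It reinterprets UTF-8 bytes as Windows-1252 and transcodes them again, which is one common way the damage happens:

```ruby
# encoding: utf-8

# Hypothetical helper (not part of the mojibake gem): simulate a bad
# transcoding by treating UTF-8 bytes as Windows-1252 and then
# transcoding the result to UTF-8.
def garble(text)
  text.dup.force_encoding('Windows-1252').encode('UTF-8')
end

# Hypothetical helper: undo a single round of the damage above by
# encoding back to Windows-1252 bytes and relabeling them as UTF-8.
def ungarble(text)
  text.encode('Windows-1252').force_encoding('UTF-8')
end

puts garble('café')            # prints "cafÃ©"
puts garble('it’s')            # prints "itâ€™s"
puts ungarble(garble('café'))  # prints "café"
```

This round trip only works when every byte involved is defined in Windows-1252, which is one reason a lookup table can be more robust than naive re-transcoding.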

Testing has been done with English text, but other Latin-based languages where Windows-1252 is found in the wild should also benefit.
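The reverse-mapping idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the gem's actual implementation: build_table and recover_sketch are hypothetical names, and the candidate character list is arbitrary and far smaller than a real table would be.

```ruby
# encoding: utf-8

# Hypothetical sketch: build a table of mojibake sequence => original
# character by garbling each candidate character once.
def build_table(chars)
  table = {}
  chars.each do |ch|
    begin
      moji = ch.dup.force_encoding('Windows-1252').encode('UTF-8')
      table[moji] = ch
    rescue EncodingError
      # Some UTF-8 bytes have no Windows-1252 mapping; this sketch
      # simply skips those characters.
    end
  end
  table
end

# Hypothetical sketch: replace every known mojibake sequence with its
# original character, trying longer sequences first.
def recover_sketch(text, table)
  pattern = Regexp.union(table.keys.sort_by { |k| -k.length })
  text.gsub(pattern) { |m| table[m] }
end

table = build_table(['é', 'ñ', '’', '“', '”'])
puts recover_sketch('cafÃ© itâ€™s', table)  # prints "café it’s"
```

Matching against a table leaves text that is already correct untouched, since only known damaged sequences are replaced.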

== Dependencies

Generating the mojibake mappings requires the String encoding support introduced in ruby 1.9, as provided by:

  • ruby >= 1.9.2 (tested 1.9.2p290, 1.9.3p392, 2.0.0p247 Linux)
  • jruby ~> 1.6.5 or ~> 1.7.5 (tested 1.6.8, 1.7.5 Linux)
  • rubinius >= 2.0.0 (tested via Travis CI)

Note: jruby versions 1.7.0-1.7.4 have various encoding support regressions. Recovery of text with default settings is still supported in ruby 1.8 mode, and on other jruby versions, by using the pre-generated table.json.

== Synopsis

gem install mojibake

require 'mojibake'
mapper = MojiBake::Mapper.new
mapper.recover( '“quotedâ€�' ) #=> '“quoted”'

Or via cli:

mojibake -h

List the mojibake mapping table (output in UTF-8):

mojibake -t

Recover from a text file:

mojibake input.txt

== License

Copyright (c) 2011-2015 David Kellum

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
