associations

Python 3 module to identify and analyze associations in any data set.

Associations

Associations is a Python 3 module used to identify and analyze associations in any data set. It was originally created to aid in solving a specific set of problems. I have not included that original implementation code to respect the employer’s wishes.

As you examine this module, you will see there are some aspects of it that could be made more encapsulated, efficient, concise, or generalized. Once I start working on a reference implementation, I will begin to resolve these issues. I urge you to make pull requests if you see anything that could be improved.

Installation

The latest release can be found here. For a direct download of the development version (latest revision, not latest release), click here.

  • Dependencies: NumPy, matplotlib, multiprocessing
  • Build Dependencies: git (only for Arch Linux python-associations-git package)

This is not compatible with Python 2.

Universal

Run this if you would like to install associations directly into Python without the use of a package manager. This should be compatible with any system.

$ python setup.py sdist
# python setup.py install

Arch Linux

If you would like to install the latest development version (latest revision), you should install the python-associations-git package. All you have to do is download the PKGBUILD and the ABS will automatically download the source and install the package. You can keep reusing the same PKGBUILD. It will automatically update the version number based on the revision.

To install python-associations-git, run this in an empty directory:

$ wget https://raw.githubusercontent.com/dnut/PKGBUILDs/master/assocations/python-associations-git/PKGBUILD
$ makepkg -si

If you would like to install the latest release, you should install the python-associations package. You can download it here and install using the included PKGBUILD. To update this package, you will need to download the release from that page.

To install python-associations, cd into the python-associations directory and run this command:

$ makepkg -si

You can also download the source code for the latest revision manually and install using the included PKGBUILD. I only recommend this if you are either contributing to development or forking your own local version of the package.

Overview

The module counts occurrences with a histogram, finds associations between different fields, and provides tools that aid in analyzing the resulting data.

libassoc.py

This file contains the most generic procedures that do not belong in any created classes. They are convenient procedures for Python’s fundamental data structures.

histogram.py - Histogram()

The primary job of a Histogram() object is to traverse a CSV file and create a NumPy array with as many dimensions as fields we wish to record and to fill that array with the count for every possible occurrence. This is accomplished with the count() method. Access to the internal data structure is provided via the get() method.

| Attribute | Description |
| --- | --- |
| fields | Table fields that we want to measure. |
| histogram | NumPy array containing counts. |
| valists | List of lists containing strings of each field’s values. |
| valdicts | List of dicts, inverted valists (key = string, val = int). |
| valists_dict | Dict of valists keyed by field names. |
| valdicts_dict | Dict of valdicts keyed by field names. |
| field_index | Keys field values to field names. |
| field_index_int | Keys field values to valists/valdicts index (int). |
| nonzero_indices | Indices for all nonzero values in the histogram. |

| Method | Description |
| --- | --- |
| count() | Count all occurrences for every possible situation. |
| useful_stuff() | Expose the string values for quantitative internal data structure. |
| reduce() | Return new Histogram() with provided NumPy array. Used by simplify() and slice(). |
| simplify() | Return new Histogram() with fewer dimensions by summing undesired dimensions. For example, create a histogram that drops the sex dimension. All remaining fields have combined value for both male and female. |
| slice() | Return new Histogram() with fewer dimensions by isolating a specific situation. For example, create a histogram representing only males with no data for females. |
| nonzeros() | Generator function that iterates through every nonzero element, optionally providing string representations. |
| get() | Retrieve count for any field value combination. |
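
The following is a minimal sketch of the idea behind count(), get(), simplify(), slice(), and nonzeros(), not the module's actual code. The sample data, the free function `count`, and the variable names are assumptions made for illustration; only the NumPy calls are real API.

```python
import csv
import io

import numpy as np

# Hypothetical sample data; the real module reads a CSV file from disk.
CSV_TEXT = """sex,day,diagnosis
male,Tue,fracture
male,Tue,fracture
female,Tue,burn
male,Wed,burn
female,Wed,fracture
"""


def count(rows, fields):
    """Build an N-dimensional array of counts, one axis per field."""
    rows = list(rows)
    # valists: sorted list of distinct value strings for each field.
    valists = [sorted({row[f] for row in rows}) for f in fields]
    # valdicts: inverse lookup, value string -> index along that axis.
    valdicts = [{v: i for i, v in enumerate(vals)} for vals in valists]
    hist = np.zeros([len(vals) for vals in valists], dtype=int)
    for row in rows:
        index = tuple(d[row[f]] for d, f in zip(valdicts, fields))
        hist[index] += 1
    return hist, valists, valdicts


fields = ["sex", "day", "diagnosis"]
hist, valists, valdicts = count(csv.DictReader(io.StringIO(CSV_TEXT)), fields)

# Like get(): the count for one combination of field values.
male_tue_fracture = hist[
    valdicts[0]["male"], valdicts[1]["Tue"], valdicts[2]["fracture"]
]

# Like simplify(): drop the sex dimension by summing over its axis.
without_sex = hist.sum(axis=0)

# Like slice(): keep only males by indexing into the sex axis.
males_only = hist[valdicts[0]["male"]]

# Like nonzeros(): every nonzero cell together with its string labels.
nonzero_cells = [
    (tuple(valists[axis][i] for axis, i in enumerate(index)), int(hist[index]))
    for index in zip(*np.nonzero(hist))
]
```

Both reductions return an array with one dimension fewer. Summing keeps the data for every sex combined, while indexing discards everything except the chosen value.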

associations.py

Contains two classes that serve to identify associations in a Histogram(). Associator() finds associations for a specific field combination and Associations() uses Associator() objects to find all associations.

Associator() is a distinct class rather than integrating its methods into Associations() because Associations() uses multiprocessing to dramatically improve execution time on multi-core systems, and it needs relatively isolated objects to be passed to subprocesses. This implementation is intended to be superior to the redundancy of many Associations() objects or the complexity of queues and pipes without hurting code legibility or efficiency.
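
The pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's implementation: `helper` here is a placeholder that returns a label instead of running Associator().find(), and the `find_all` signature is invented for the example. Only `multiprocessing.Pool` and `itertools.combinations` are real library calls.

```python
from itertools import combinations
from multiprocessing import Pool


def helper(field_pair):
    """Stand-in for the per-combination work done in a subprocess.

    Hypothetical placeholder: a real worker would compute association
    ratios; this one only reports which pair it handled.
    """
    first, second = field_pair
    return field_pair, f"{first} x {second}"


def find_all(fields, processes=2):
    """Hand every pair of field names to a worker process."""
    pairs = list(combinations(fields, 2))
    with Pool(processes=processes) as pool:
        # Each task is self-contained, so results come back as plain values
        # and no queues or pipes need to be managed by hand.
        return dict(pool.map(helper, pairs))


if __name__ == "__main__":
    results = find_all(["sex", "day", "diagnosis"])
    for pair, summary in results.items():
        print(pair, summary)
```

The worker must be a top-level function (or an object that can be pickled) so that it can be sent to a subprocess, which is one reason a small, isolated per-combination object is convenient.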

Associator()

The Associator() object identifies associations between different field values (e.g. fatalities and amputations) by comparing one group to a larger group that encompasses it.

Knowing that white males are injured on Tuesday more frequently than black males is not very useful information because it is likely caused by there being more white males than black males. Furthermore, knowing that white males are injured more often on Tuesday than on other days doesn’t tell us whether white males and Tuesday are associated, because it may be that Tuesdays have more injuries overall. Therefore, we must establish a standardized numerical value that represents the actual association between two fields by taking into consideration the overall populations we are sampling from.

As another example, if we want to find the association between amputations and fatalities (diagnosis and disposition), we need to take the same approach. While the likelihood that an amputation is fatal is valuable information, we are more interested in the relative fatality of different diagnoses. Amputations may have a very low likelihood of fatality, but we must compare it to the likelihood that any other diagnosis leads to fatality before we discover whether amputations are relatively likely to be fatal. Therefore, we must take into consideration the extreme infrequency of fatalities in general to get a standardized numerical representation of how associated each field is.

There are two mathematically equivalent approaches to resolve our dilemma. One approach is to divide the number of fatal amputations by the number of amputations with any disposition, which yields the likelihood that an amputation is fatal. But we want to normalize this likelihood by scaling it according to the likelihood that anything may be fatal. To do so, we divide it by the overall likelihood of fatality (total fatalities / total of everything), which yields the association ratio between amputation and fatality.

Identical results would be reached by first dividing fatal amputations by all fatalities (likelihood that a fatality is caused by amputation) and then dividing that by the average likelihood that an amputation is the cause of any disposition (total amputations / total of everything). This results in the exact same association ratio as the first approach.

Both approaches are the same algorithm run in opposite directions. They are also mathematically equivalent since they both result in the same calculation:

association between amputations and fatalities = [(fatal amputations) * (total of everything)] / [(fatalities) * (amputations)]
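
As a quick check of the equivalence, here is a worked example with made-up counts. The numbers are purely hypothetical and chosen to keep the arithmetic easy; exact fractions are used so the three results can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical counts, chosen only to illustrate the arithmetic.
fatal_amputations = 3
amputations = 200   # amputations with any disposition
fatalities = 50     # fatalities with any diagnosis
total = 10_000      # total of everything

# Approach 1: likelihood that an amputation is fatal,
# divided by the likelihood that anything is fatal.
approach_1 = Fraction(fatal_amputations, amputations) / Fraction(fatalities, total)

# Approach 2: likelihood that a fatality is an amputation,
# divided by the likelihood that anything is an amputation.
approach_2 = Fraction(fatal_amputations, fatalities) / Fraction(amputations, total)

# General formula: (joint count * total) / (fatalities * amputations).
general = Fraction(fatal_amputations * total, fatalities * amputations)

print(approach_1, approach_2, general)
```

All three values come out to 3. A ratio above 1 means the combination occurs more often than the two overall frequencies alone would predict, and a ratio below 1 means it occurs less often.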

Originally, for efficiency, I used a specialized version of the aforementioned algorithm (calculate likelihoods then divide) in order to naturally cache totals and subtotals for multiple situations. Unfortunately, this led to a very complex and confusing algorithm.

To keep the algorithm simple, I have written a new one optimized to use the general formula as efficiently as possible. I have actually gotten it to be more efficient than the original algorithm. This algorithm is significantly less complex. It is more maintainable and easier to understand and use, so it is favored.

I still see some potential to optimize a few places in the algorithm to improve efficiency even further, but this would require a lot of benchmarking and will probably not be a huge improvement, so it is not my top priority.

| Attribute | Description |
| --- | --- |
| notable | Minimum association ratio (or inverse) to be included. |
| significant | Minimum number of occurrences (statistical significance). |
| assoc | Associations organized by association then subgroup. |
| subpops | Associations organized by subgroup then association. |
| hist | Histogram() object to extract data from. |

| Method | Description |
| --- | --- |
| add() | Save association ratio. |
| find() | Find the association ratio for every field value combination among a specific field name combination. |
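
To illustrate how the notable and significant thresholds might interact with the general formula, here is a small sketch. It is not the module's find() method. The function signature, the `joint_counts` dictionary, the default thresholds, and the sample counts are all assumptions, and reading "or inverse" as "ratio at or below 1 / notable" is my interpretation of the attribute description.

```python
from collections import Counter


def find(joint_counts, notable=2.0, significant=5):
    """Return {(a, b): ratio} for pairs that pass both thresholds.

    joint_counts maps (value_a, value_b) -> number of occurrences.
    """
    total = sum(joint_counts.values())
    totals_a, totals_b = Counter(), Counter()
    for (a, b), n in joint_counts.items():
        totals_a[a] += n
        totals_b[b] += n

    found = {}
    for (a, b), n in joint_counts.items():
        if n < significant:
            continue  # too few occurrences to trust the ratio
        ratio = n * total / (totals_a[a] * totals_b[b])
        # Keep strong positive (>= notable) or strong negative
        # (<= 1 / notable) associations; drop everything near 1.
        if ratio >= notable or ratio <= 1 / notable:
            found[(a, b)] = ratio
    return found


# Hypothetical counts for diagnosis x disposition.
counts = {
    ("amputation", "fatal"): 6,
    ("amputation", "released"): 94,
    ("fracture", "fatal"): 4,
    ("fracture", "released"): 896,
}
print(find(counts))
```

With these counts only the amputation and fatal pair is reported. The fracture and fatal pair has a low ratio but is skipped because it falls below the minimum number of occurrences.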

Associations()

Attributes: self.pairs and self.subpops contain all association ratios.

>>> self.pairs
{
	pair_type: {
		frozenset(association_pair): {
			frozenset(subgroup/subpopulation): association_ratio
		}
	}
}
>>> self.subpops
{
	subgroup_type: {
		frozenset(subgroup/subpopulation): {
			frozenset(association_pair): association_ratio
		}
	}
}

| Method | Description |
| --- | --- |
| find_all() | Use multiprocessing pool to test every field name combination using Associator().find(). |
| helper() | Runs Associator().find(). Needed for multiprocessing. |
| add() | Add entire Associator()’s data structures to Associations() object using merge(). |
| merge() | Lower level dictionary processor than add(). |
| report() | Report associations between two fields. |
| subgroup_report() | Report associations for any pairs within a subgroup/subpopulation. |
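
A recursive merge over nested dictionaries shaped like self.pairs could look roughly like this. It is a sketch under assumptions, not the module's merge() method: the function body, the sample keys, the ratio values, and the use of an empty frozenset for "no subgroup" are all invented for illustration.

```python
def merge(target, source):
    """Recursively fold the nested dict `source` into `target` in place."""
    for key, value in source.items():
        if isinstance(value, dict):
            # Descend, creating the intermediate level if it is missing.
            merge(target.setdefault(key, {}), value)
        else:
            target[key] = value
    return target


# Hypothetical results from two separate workers.
pair_type = frozenset({"diagnosis", "disposition"})
pairs = {}
merge(pairs, {pair_type: {frozenset({"amputation", "fatal"}): {frozenset(): 6.0}}})
merge(pairs, {pair_type: {frozenset({"amputation", "fatal"}): {frozenset({"male"}): 5.2}}})

# frozenset keys are hashable and ignore order, so the pair can be
# looked up with its members written either way around.
ratio = pairs[pair_type][frozenset({"fatal", "amputation"})][frozenset({"male"})]
print(ratio)
```

Because each level is merged rather than overwritten, results for different subgroups of the same association pair end up side by side under one key.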

analysis.py

Contains two classes, Analysis() and AsciiTable().

Analysis()

Analyze data from Histogram() and Associations().

| Attribute | Description |
| --- | --- |
| hist | Histogram() |
| assoc | Associations() |
| gen_assoc | Average association ratios for combo types. |
| maxes and mins | Max and min association ratios for combo types. |

| Method | Description |
| --- | --- |
| make_hist() | Create data structure for a histogram plot. |
| prep_hist() | Used by make_hist() to include only notable data. |
| plot_hist() | Use data from make_hist() to create an actual plot. |
| plot_assoc() | Use make_hist() and plot_hist() for specific purpose of plotting association ratios between two field names. |
| nice_plot_assoc() | Try plot_assoc() with various notable values to create a legible plot containing meaningful data. |
| plot_all() | Run nice_plot_assoc() for every field combination. |
| max_helper() | Find mins and maxes while making hists. |
| most_common() | Most common occurrences. |
| most_assoc() | Most associated occurrences. |
| extremes() | Most associated occurrences (broader). |

AsciiTable()

| Attribute | Description |
| --- | --- |
| tables | List of table strings. |

| Method | Description |
| --- | --- |
| table() | Draw ascii table. |
| table_section() | Format data into a section to be interpreted by table(). |
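
For readers unfamiliar with the idea, drawing an ASCII table can be as simple as the sketch below. This is a generic illustration, not the AsciiTable() implementation: the function signature, the border style, and the sample rows are all assumptions.

```python
def table(rows, header):
    """Return a plain ASCII table as a single string."""
    # Transpose so each column's cells can be measured together.
    columns = list(zip(header, *rows))
    widths = [max(len(str(cell)) for cell in column) for column in columns]
    rule = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def line(cells):
        padded = (f" {str(c).ljust(w)} " for c, w in zip(cells, widths))
        return "|" + "|".join(padded) + "|"

    body = [line(row) for row in rows]
    return "\n".join([rule, line(header), rule, *body, rule])


# Hypothetical rows, for illustration only.
print(table(
    [("amputation", "fatal", 6.0), ("fracture", "fatal", 0.44)],
    header=("diagnosis", "disposition", "ratio"),
))
```

Each column is padded to the width of its longest cell, so the borders line up regardless of the data.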
