Friday, April 29, 2011

Java in-memory compression

Sometimes it would be nice to compress seldom-used objects in memory, trading time for space (compression and decompression add some overhead). Especially in chemoinformatics applications, where objects often carry a lot of compressible information such as textual representations of structures, the compression ratio is quite noticeable.

Sure, one could do it for every variable individually, but how about compressing whole objects using the standard serialization mechanism?

If you use WEKA, you have the SerializedObject.

If not - enter the CompressedReference:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedReference<T extends Serializable> implements Serializable {

    private static final long serialVersionUID = 7967994340450625830L;

    // GZIP-compressed serialized form of the referent
    private byte[] theCompressedReferent = null;

    public CompressedReference(T referent) {
        try {
            compress(referent);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    // Size of the compressed representation in bytes
    public int size() {
        return theCompressedReferent.length;
    }

    // Returns a freshly deserialized copy, or null if decompression failed
    public T get() {
        try {
            return decompress();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return null;
    }

    private void compress(T referent) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream zos = new GZIPOutputStream(bos);
        ObjectOutputStream ous = new ObjectOutputStream(zos);

        ous.writeObject(referent);

        // Closing the outermost stream flushes the object stream first and
        // then finishes the GZIP stream, so no buffered bytes are lost.
        ous.close();

        theCompressedReferent = bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    private T decompress() throws IOException, ClassNotFoundException {
        ByteArrayInputStream bis = new ByteArrayInputStream(theCompressedReferent);
        GZIPInputStream zis = new GZIPInputStream(bis);
        ObjectInputStream ois = new ObjectInputStream(zis);
        try {
            return (T) ois.readObject();
        } finally {
            ois.close();
        }
    }
}

A quick test shows a size of 528 bytes for a String of 250 characters (a Java char takes two bytes) and 64 bytes after compression, a ratio of about 8:1. The only requirement is that the object stored in the reference implements Serializable.
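A comparison of this kind can be tried with a small standalone harness. What follows is only a sketch, not the test that produced the numbers above: the class name SizeCheck, its helper methods and the sample string are made up for illustration. It compares the serialized size with the GZIP-compressed serialized size. Java serialization writes Strings as modified UTF-8, so the byte counts it prints need not match the figures quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of the original class.
final class SizeCheck {

    // Serialize an object, optionally routing the bytes through GZIP.
    static byte[] serialize(Serializable obj, boolean gzip) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        OutputStream sink = gzip ? new GZIPOutputStream(bos) : bos;
        ObjectOutputStream oos = new ObjectOutputStream(sink);
        oos.writeObject(obj);
        oos.close(); // flushes the object stream and writes the GZIP trailer
        return bos.toByteArray();
    }

    // Reverse of serialize(obj, true).
    static Object deserializeGzip(byte[] data) throws IOException, ClassNotFoundException {
        ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)));
        try {
            return ois.readObject();
        } finally {
            ois.close();
        }
    }

    // 250 characters of repetitive text (made-up sample data).
    static String sample() {
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 250) {
            sb.append("c1ccccc1");
        }
        return sb.substring(0, 250);
    }

    public static void main(String[] args) throws Exception {
        String s = sample();
        byte[] plain = serialize(s, false);
        byte[] packed = serialize(s, true);
        System.out.println("serialized: " + plain.length + " bytes");
        System.out.println("compressed: " + packed.length + " bytes");
        System.out.println("round trip ok: " + s.equals(deserializeGzip(packed)));
    }
}
```

The ratio depends heavily on how repetitive the data is, so a sample like this one will look better than typical real-world objects.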

And yes, I have to clear those TODO reminders... ;-)
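One possible way to clear them, as a sketch only: let failures surface as unchecked exceptions instead of printing a stack trace and returning null. The class name StrictCompressedReference and the choice of exception types are assumptions made for this example, and it needs Java 8 or later (try-with-resources and UncheckedIOException).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical variant of CompressedReference; name and error policy are assumptions.
final class StrictCompressedReference<T extends Serializable> implements Serializable {

    private static final long serialVersionUID = 1L;

    private final byte[] compressed;

    public StrictCompressedReference(T referent) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(bos))) {
            oos.writeObject(referent);
        } catch (IOException e) {
            // Fail loudly instead of leaving an unusable reference behind.
            throw new UncheckedIOException("could not compress referent", e);
        }
        // The streams are closed at this point, so the GZIP trailer has been written.
        compressed = bos.toByteArray();
    }

    public int size() {
        return compressed.length;
    }

    @SuppressWarnings("unchecked")
    public T get() {
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed)))) {
            return (T) ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException("could not decompress referent", e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class of referent is not available", e);
        }
    }

    public static void main(String[] args) {
        StrictCompressedReference<String> ref =
                new StrictCompressedReference<String>("2-Fluoronaphthalene");
        System.out.println(ref.size() + " bytes, value: " + ref.get());
    }
}
```

With this policy a caller never sees a null from get(); whether an exception is preferable to a null is a design choice that depends on the application.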

Update
The compression ratio for a CDK Molecule representation of 2-Fluoronaphthalene is about 3:1.
The compression ratio for the V2000 molfile of 2-Fluoronaphthalene is about 7:1.