Hachoir is the French name for a mincer: a tool used by butchers to cut meat. Hachoir is also a tool written for hackers to cut up a file or any binary stream. A file is split into a tree of fields, where the smallest field can be a single bit. There are various field types: integer, string, bits, padding, sub-file, etc.
This document presents the Hachoir API. It tries to show the most interesting parts of the tool, but it is not exhaustive. OK, let's start!
To split data, we first need to get data :-) So this section presents the "hachoir.stream" API.
In most cases we work on files using the FileInputStream function, which takes one argument: a Unicode filename. For practical reasons, however, this documentation uses the StringInputStream function instead.
>>> data = "point\0\3\0\2\0"
>>> from hachoir_core.stream import StringInputStream, LITTLE_ENDIAN
>>> stream = StringInputStream(data)
>>> stream.source
'<string>'
>>> len(data), stream.size
(10, 80)
>>> data[1:6], stream.readBytes(8, 5)
('oint\x00', 'oint\x00')
>>> data[6:8], stream.readBits(6*8, 16, LITTLE_ENDIAN)
('\x03\x00', 3)
>>> data[8:10], stream.readBits(8*8, 16, LITTLE_ENDIAN)
('\x02\x00', 2)
The first big difference between a string and a Hachoir stream is that sizes and addresses are expressed in bits, not bytes. The difference is a factor of eight, which is why we write "6*8" to address the byte at offset 6, for example. You don't need to know anything else to use Hachoir, so let's play with fields!
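The bit-address arithmetic can be checked without Hachoir. The following is a minimal sketch in plain Python 3 (unlike the Python 2 sessions in this document); read_bytes and read_uint_le are hypothetical helper names, not part of the Hachoir API, and they only handle byte-aligned addresses and whole-byte sizes.

```python
def read_bytes(data, address, nbytes):
    """Return nbytes bytes starting at a bit address (must be byte-aligned)."""
    if address % 8:
        raise ValueError("this sketch only handles byte-aligned addresses")
    start = address // 8              # convert the bit address to a byte offset
    return data[start:start + nbytes]


def read_uint_le(data, address, nbits):
    """Read an unsigned little-endian integer of nbits bits at a bit address."""
    if nbits % 8:
        raise ValueError("this sketch only handles whole-byte sizes")
    raw = read_bytes(data, address, nbits // 8)
    return int.from_bytes(raw, "little")


data = b"point\0\3\0\2\0"
size_in_bits = len(data) * 8          # sizes are counted in bits, not bytes
name_tail = read_bytes(data, 8, 5)    # bit address 8 is byte offset 1
x = read_uint_le(data, 6 * 8, 16)     # 16-bit value at byte offset 6
y = read_uint_le(data, 8 * 8, 16)     # 16-bit value at byte offset 8
```

Here size_in_bits, name_tail, x and y mirror the values returned by stream.size, readBytes and readBits in the session above.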
We will parse the data used in the last section.
>>> from hachoir_core.field import Parser, CString, UInt16
>>> class Point(Parser):
...     endian = LITTLE_ENDIAN
...     def createFields(self):
...         yield CString(self, "name", "Point name")
...         yield UInt16(self, "x", "X coordinate")
...         yield UInt16(self, "y", "Y coordinate")
...
>>> point = Point(stream)
>>> for field in point:
...     print "%s) %s=%s" % (field.address, field.name, field.display)
...
0) name="point"
48) x=3
64) y=2
point is the root of our field tree. This tree is really simple: it has just one level and three fields (name, x and y). Hachoir stores a lot of information in each field. In this example we only show the address, name and display attributes, but a field has more attributes:
>>> x = point["x"]
>>> "%s = %s" % (x.path, x.value)
'/x = 3'
>>> x.parent == point
True
>>> x.description
'X coordinate'
>>> x.index
1
>>> x.address, x.absolute_address
(48, 48)
The index is the position of a field in its parent's field list: '1' means that it is the second field, since indexes start at zero.
Now that we know the basic API, let's look at a more complex parser: one with sub-field sets.
>>> from hachoir_core.field import FieldSet, UInt8, Character, String
>>> class Entry(FieldSet):
...     def createFields(self):
...         yield Character(self, "letter")
...         yield UInt8(self, "code")
...
>>> class MyFormat(Parser):
...     endian = LITTLE_ENDIAN
...     def createFields(self):
...         yield String(self, "signature", 3, charset="ASCII")
...         yield UInt8(self, "count")
...         for index in xrange(self["count"].value):
...             yield Entry(self, "point[]")
...
>>> data = "MYF\3a\0b\2c\0"
>>> stream = StringInputStream(data)
>>> root = MyFormat(stream)
This example presents many interesting features of Hachoir. First of all, you can see that you can have two or more levels of fields. Here we have a tree with two levels:
>>> def displayTree(parent):
...     for field in parent:
...         print field.path
...         if field.is_field_set: displayTree(field)
...
>>> displayTree(root)
/signature
/count
/point[0]
/point[0]/letter
/point[0]/code
/point[1]
/point[1]/letter
/point[1]/code
/point[2]
/point[2]/letter
/point[2]/code
A field set is also a field, so it has the same attributes as any other field (name, address, size, path, etc.), plus some new ones such as stream or root.
Hachoir is written in Python so it should be slow and eat a lot of CPU and memory, and it does. But in most cases, you don't need to explore an entire field set and read all values; you just need to read some values of some specific fields. Hachoir is really lazy: no field is parsed before you ask for it, no value is read from stream before you read a value, etc. To inspect this behaviour, you can watch "current_length" (number of read fields) and "current_size" (current size in bits of a field set):
>>> root = MyFormat(stream)   # Rebuild our parser
>>> print (root.current_length, root.current_size)
(0, 0)
>>> print root["signature"].display
"MYF"
>>> print (root.current_length, root.current_size, root["signature"].size)
(1, 24, 24)
Just after its creation, a parser is empty (0 fields). When we read the first field, the parser's current size becomes the size of that field. Some operations require reading more fields:
>>> print root["point[0]/letter"].display
'a'
>>> print (root.current_length, root.current_size)
(3, 48)
Reading point[0] requires reading the "count" field first, so root now contains three fields (signature, count and point[0]) for a total of 24 + 8 + 16 = 48 bits.
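The lazy behaviour can be imitated with a plain Python generator. The sketch below is not Hachoir's implementation: LazyFieldSet and its _create_fields method are hypothetical names, the layout is hard-coded to the MyFormat example, and each field is reduced to an (address, size) pair. It is written for Python 3.

```python
class LazyFieldSet(object):
    """Toy field set that parses fields only when asked (not Hachoir's code)."""

    def __init__(self, data):
        self.data = data
        self._fields = []                  # names of the fields parsed so far
        self._by_name = {}                 # name -> (address, size), in bits
        self._generator = self._create_fields()
        self.current_size = 0              # size in bits of the parsed fields

    @property
    def current_length(self):
        return len(self._fields)

    def _create_fields(self):
        # Layout of the MyFormat example: 3-byte signature, 1-byte count,
        # then `count` entries of 2 bytes each (letter + code).
        yield ("signature", 24)
        yield ("count", 8)
        count = self.data[3]               # only read once "count" is passed
        for index in range(count):
            yield ("point[%d]" % index, 16)

    def __getitem__(self, name):
        # Pull fields from the generator until the requested one appears.
        while name not in self._by_name:
            try:
                field_name, size = next(self._generator)
            except StopIteration:
                raise KeyError(name)
            self._by_name[field_name] = (self.current_size, size)
            self._fields.append(field_name)
            self.current_size += size
        return self._by_name[name]


root = LazyFieldSet(b"MYF\3a\0b\2c\0")
before = (root.current_length, root.current_size)
signature = root["signature"]
after_signature = (root.current_length, root.current_size)
point0 = root["point[0]"]
after_point0 = (root.current_length, root.current_size)
```

Asking for "point[0]" forces the generator past "count", which is why the counters jump from one field to three, as in the session above.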
Number:
Text:
Timestamp (date and time):
Timedelta (duration):
- TimedeltaWin64: 64-bit Windows, number of 1/10 microseconds
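As an illustration of the unit used by TimedeltaWin64, a count of 1/10 microseconds can be converted with the standard library alone. This is a sketch in Python 3; win64_ticks_to_timedelta is a hypothetical helper, not a Hachoir function, and it assumes a non-negative count.

```python
from datetime import timedelta


def win64_ticks_to_timedelta(ticks):
    """Convert a count of 1/10-microsecond units to a timedelta."""
    # timedelta has microsecond resolution, so the tenths of a microsecond
    # are dropped by the integer division.
    return timedelta(microseconds=ticks // 10)


one_second = win64_ticks_to_timedelta(10 * 1000 * 1000)
```

Ten million units make one second, so one_second equals timedelta(seconds=1).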
Padding and raw bytes:
To create your own type, you can use:
Read only attributes:
Read only and lazy attributes:
Method that can be replaced:
Aliases (method):
Other methods:
Read only attributes:
Read only and lazy attributes:
Methods:
Lazy methods: