Binary description of the document format for a known document type
-------------------------------------------------------------------
This format is designed to be compact and simple to implement, while still
supporting streaming of large documents and fast access.
1) Value types
--------------
Base types can be little-endian (LE) or big-endian (BE).
| CODE | NAME | DESCRIPTION | ENDIANNESS|
| -----|-----------|------------------------------------------------------|-----------|
| 0 | Bit bool | 1 bit interpreted as a boolean value | |
| 1 | Byte bool | 1 byte interpreted as a boolean value | |
| 2 | Int8 | signed byte | |
| 3 | UInt8 | unsigned byte | |
| 4 | Int16 | signed short (2 bytes) | LE |
| 5 | UInt16 | unsigned short (2 bytes) | LE |
| 6 | Int32 | signed integer (4 bytes) | LE |
| 7 | UInt32 | unsigned integer (4 bytes) | LE |
| 8 | Int64 | signed long (8 bytes) | LE |
| 9 | UInt64 | unsigned long (8 bytes) | LE |
| 10 | Float | IEEE 754 (4 bytes) | LE |
| 11 | Double | IEEE 754 (8 bytes) | LE |
| 12 | Int16 | signed short (2 bytes) | BE |
| 13 | UInt16 | unsigned short (2 bytes) | BE |
| 14 | Int32 | signed integer (4 bytes) | BE |
| 15 | UInt32 | unsigned integer (4 bytes) | BE |
| 16 | Int64 | signed long (8 bytes) | BE |
| 17 | UInt64 | unsigned long (8 bytes) | BE |
| 18 | Float | IEEE 754 (4 bytes) | BE |
| 19 | Double | IEEE 754 (8 bytes) | BE |
| 20 | VarUInt | variable-size unsigned integer (1 to 8 bytes) | |
| 21 | Text | for strings | |
| 22 | Doc | for sub documents | |
| 23 | Variable | varying type, which can take any of the above values | |
| 0x80 | Var Bits | mask used for 1- to 64-bit fields | |
All other types are defined as documents.
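
As an illustration of the table above, the fixed-size codes (2 to 19) map naturally onto format strings of Python's `struct` module. This is only a sketch: the names `FIXED_FORMATS` and `read_fixed` are hypothetical and not part of the format.

```python
import struct

# Type code -> struct format string, following the table above.
# Codes 0, 1, 20 and up are left out: they are not fixed-size numbers.
FIXED_FORMATS = {
    2: "b", 3: "B",
    4: "<h", 5: "<H", 6: "<i", 7: "<I", 8: "<q", 9: "<Q", 10: "<f", 11: "<d",
    12: ">h", 13: ">H", 14: ">i", 15: ">I", 16: ">q", 17: ">Q", 18: ">f", 19: ">d",
}

def read_fixed(code: int, data: bytes, offset: int = 0):
    """Decode one fixed-size value and return (value, next_offset)."""
    fmt = FIXED_FORMATS[code]
    (value,) = struct.unpack_from(fmt, data, offset)
    return value, offset + struct.calcsize(fmt)
```

For example, `read_fixed(6, b"\x01\x00\x00\x00")` and `read_fixed(14, b"\x00\x00\x00\x01")` both return `(1, 4)`.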
*Text value*
Text is a very common structure, encoded as:
| TYPE | CARD | DESCRIPTION |
|-----------|-------|-----------------------|
| VarUInt | (1,1) | data array size |
| UByte | (1,1) | array holding the text data |
The character encoding is given by the document field type.
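
A minimal Python sketch of the Text layout follows. The bit layout of VarUInt is not defined in this document, so the sketch assumes a LEB128-style encoding (7 data bits per byte, low group first, high bit set when more bytes follow). That assumption agrees with the `[0x03,'Z','I','P']` example given for the encapsulated form, but other encodings would agree with it too. All function names are hypothetical.

```python
def write_varuint(value: int) -> bytes:
    """ASSUMED VarUInt encoding: LEB128-style, low 7-bit group first."""
    if value < 0:
        raise ValueError("VarUInt cannot be negative")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # more bytes follow
        else:
            out.append(group)
            return bytes(out)

def read_varuint(data: bytes, offset: int = 0):
    """Inverse of write_varuint; returns (value, next_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

def encode_text(text: str, encoding: str = "utf-8") -> bytes:
    """Size prefix (in bytes, not characters) followed by the encoded text."""
    raw = text.encode(encoding)
    return write_varuint(len(raw)) + raw

def decode_text(data: bytes, offset: int = 0, encoding: str = "utf-8"):
    """Returns (text, next_offset)."""
    size, offset = read_varuint(data, offset)
    return data[offset:offset + size].decode(encoding), offset + size
```

The `encoding` parameter stands in for the character encoding carried by the field type.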
*Varying value*
The varying type is unknown before reading the document.
Each varying value is composed of a field description document
followed by the value itself.
| TYPE | CARD | DESCRIPTION |
|-----------|-------|-----------------------|
| Doc | (1,1) | field value type |
| XXXXX | (1,1) | data |
2) Binary structure - Streamed
-----------------------------
The streamed form allows fast writing with no backward cursor movement.
The downside is that the document size is unknown until reading is finished.
This form is best suited to tiny and very large documents.
```
Bytes 0   1               X               Y
      +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+
      | 0 |  FIELD 1 DEF  | FIELD 1 VALUE |  FIELD 2 DEF  | FIELD 2 VALUE |  FIELD N DEF  | FIELD N VALUE |
      +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+
```
|Offset| Description |
|-----|---------------------------------------------------------|
| 0-1 | signature, 0x00 for streamed |
| 1-X | field definition if the type does not define all elements |
| X-Y | field value if values are present |
*PSEUDO-CODE*
```
FUNCTION Document readDocument(Input in, DocumentType docType)
    Document doc = Document.new()
    FOR EACH field IN docType.fields
        # read number of occurrences
        Int nbOcc = readNbOccurence(in, field)
        IF nbOcc != 0 THEN
            # read field value doc type if undefined
            DocumentType fieldDocType = field.docType
            IF field.type = 'document' AND field.docType = null THEN
                Document encodedDocType = readDocument(in, SCHEMA_DOCTYPE)
                fieldDocType = toDocumentType(encodedDocType)
            END IF
            # read field values
            IF nbOcc = 1 AND field.maxOcc = 1 THEN
                # read a single value
                doc.setFieldValue(field.id, readValue(in, field))
            ELSE IF nbOcc = -1 THEN
                # read a streamed collection, a zero byte marks the end
                List values = List.new()
                WHILE in.readByte() != 0
                    values.add(readValue(in, field))
                END WHILE
                doc.setFieldValue(field.id, values)
            ELSE
                # read a defined size collection of exactly nbOcc values
                List values = List.new(nbOcc)
                FOR i = 0 TO nbOcc-1
                    values.add(readValue(in, field))
                END FOR
                doc.setFieldValue(field.id, values)
            END IF
        END IF
    END FOR
    RETURN doc
END FUNCTION

FUNCTION Int readNbOccurence(Input in, Field field)
    # fixed number of occurrences by default, nothing is read
    Int nbOcc = field.maxOcc
    IF field.minOcc = 0 AND field.maxOcc = 1 THEN
        # optional field, a single bit tells if the value is present
        IF in.readBit() = 0 THEN
            nbOcc = 0
        ELSE
            nbOcc = 1
        END IF
    ELSE IF field.minOcc != field.maxOcc THEN
        Int n = in.readVarUInt()
        IF n = 0 THEN
            # streamed collection of unknown size
            nbOcc = -1
        ELSE
            nbOcc = field.minOcc + n - 1
        END IF
    END IF
    RETURN nbOcc
END FUNCTION

FUNCTION Object readValue(Input in, Field field)
    # read array size, working on a copy so the field definition stays unchanged
    Int[] arraySize = field.arraySize
    IF arraySize != null THEN
        arraySize = arraySize.copy()
        FOR i = 0 TO arraySize.length-1
            # a size <= 0 means the dimension is stored in the stream
            IF arraySize[i] <= 0 THEN
                arraySize[i] = in.readVarUInt()
            END IF
        END FOR
    END IF
    # field values
    IF arraySize != null THEN
        RETURN readArrayValues(in, field, arraySize, 0)
    ELSE
        RETURN readSingleValue(in, field)
    END IF
END FUNCTION

FUNCTION Object readArrayValues(Input in, Field field, Int[] arraySize, Int depth)
    Object[] values = new Object[arraySize[depth]]
    FOR i = 0 TO values.length-1
        IF depth = arraySize.length-1 THEN
            # last dimension, read the values themselves
            values[i] = readSingleValue(in, field)
        ELSE
            values[i] = readArrayValues(in, field, arraySize, depth+1)
        END IF
    END FOR
    RETURN values
END FUNCTION

FUNCTION Object readSingleValue(Input in, Field field)
    SWITCH field.type
        CASE 'Bit' : RETURN in.readBit() != 0
        CASE 'ByteBool' : RETURN in.readByte() != 0
        CASE 'Int8' : RETURN in.readByte()
        CASE 'UInt8' : RETURN in.readUByte()
        CASE 'Int16_BE' : RETURN in.readShortBE()
        CASE 'UInt16_BE' : RETURN in.readUShortBE()
        CASE 'Int32_BE' : RETURN in.readIntBE()
        CASE 'UInt32_BE' : RETURN in.readUIntBE()
        CASE 'Int64_BE' : RETURN in.readLongBE()
        CASE 'UInt64_BE' : RETURN in.readULongBE()
        CASE 'Float_BE' : RETURN in.readFloatBE()
        CASE 'Double_BE' : RETURN in.readDoubleBE()
        CASE 'Int16_LE' : RETURN in.readShortLE()
        CASE 'UInt16_LE' : RETURN in.readUShortLE()
        CASE 'Int32_LE' : RETURN in.readIntLE()
        CASE 'UInt32_LE' : RETURN in.readUIntLE()
        CASE 'Int64_LE' : RETURN in.readLongLE()
        CASE 'UInt64_LE' : RETURN in.readULongLE()
        CASE 'Float_LE' : RETURN in.readFloatLE()
        CASE 'Double_LE' : RETURN in.readDoubleLE()
        CASE 'VarUInt' : RETURN in.readVarUInt()
        CASE 'VarBits' : RETURN in.readBits(field.nbBits)
        CASE 'Text' : RETURN Chars.new(in.readBytes(in.readVarUInt()), field.charEncoding)
        CASE 'Document' : RETURN readDocument(in, field.docType, field.inline)
        CASE 'Variable' :
            Document typeDoc = readDocument(in, SCHEMA_FIELDVALUETYPE)
            FieldValueType type = toFieldValueType(typeDoc)
            RETURN readValue(in, type)
    END SWITCH
END FUNCTION
```
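
The occurrence-count rule of `readNbOccurence` can be sketched in Python as below. The `Reader` class is a hypothetical stand-in for the pseudo-code's `Input`. Three things are assumptions, because the document does not define them: bits are consumed most-significant first, byte reads start on the next byte boundary, and VarUInt uses a LEB128-style encoding.

```python
from dataclasses import dataclass

@dataclass
class Field:
    min_occ: int
    max_occ: int

class Reader:
    """Hypothetical input; bit order and VarUInt layout are ASSUMED."""
    def __init__(self, data: bytes):
        self.data = data
        self.bit_pos = 0  # absolute position, in bits

    def read_bit(self) -> int:
        byte = self.data[self.bit_pos // 8]
        bit = (byte >> (7 - self.bit_pos % 8)) & 1  # MSB first (assumed)
        self.bit_pos += 1
        return bit

    def read_byte(self) -> int:
        self.bit_pos = (self.bit_pos + 7) // 8 * 8  # align to a byte (assumed)
        byte = self.data[self.bit_pos // 8]
        self.bit_pos += 8
        return byte

    def read_varuint(self) -> int:
        result, shift = 0, 0
        while True:
            byte = self.read_byte()
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result
            shift += 7

STREAMED = -1  # collection terminated by a zero flag byte

def read_nb_occurrence(reader: Reader, field: Field) -> int:
    if field.min_occ == 0 and field.max_occ == 1:
        return reader.read_bit()  # optional field: one presence bit
    if field.min_occ != field.max_occ:
        n = reader.read_varuint()
        return STREAMED if n == 0 else field.min_occ + n - 1
    return field.max_occ  # fixed count: nothing is stored
```

With a field declared as `Field(2, 10)`, a stored value of 3 decodes to 4 occurrences (`2 + 3 - 1`), and a stored value of 0 selects the streamed collection.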
3) Binary structure - Indexed
-----------------------------
The indexed form allows quick skipping and access to properties. The downside is
a slightly bigger file and backward cursor positioning when writing.
```
Bytes 0   1   2               X               Y               Z
      +---+---+ - - - + - - - +---+ - - - +---+---+ - - - +---+
      | 1 | S | SIZE1 | SIZEN |  FIELD N DEF  | FIELD N VALUE |
      +---+---+ - - - + - - - +---+ - - - +---+---+ - - - +---+
```
|Offset| Description |
|-----|------------------------------------------------------------------------|
| 0-1 | signature, 0x01 for indexed |
| 1-2 | number of bytes used to store a field size |
| 2-X | size in bytes of each document field, the number of values is given by |
| | the document type, the total document size can be calculated using the |
| | formula : sum ( size1 ... sizeN ) + 2 |
| X-Y | field definition if the type does not define all elements |
| Y-Z | field value if values are present |
The field definitions and values use the same structure as in the streamed form.
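
The size table is what makes skipping cheap: a reader can locate a field without decoding the ones before it. The Python sketch below shows the idea under several assumptions that this document does not settle: size entries are stored little-endian, each entry covers the definition and value bytes of one field, and the number of entries (`field_count`) comes from the document type. Function names are hypothetical.

```python
def read_field_sizes(data: bytes, field_count: int, byteorder: str = "little"):
    """Return (sizes, payload_offset) for an indexed document starting at data[0]."""
    if data[0] != 0x01:
        raise ValueError("not an indexed document")
    width = data[1]  # number of bytes used to store one field size
    sizes = []
    pos = 2
    for _ in range(field_count):
        sizes.append(int.from_bytes(data[pos:pos + width], byteorder))
        pos += width
    return sizes, pos

def field_offset(sizes, payload_offset: int, index: int) -> int:
    """Offset of field `index`, found by adding up the sizes of earlier fields."""
    return payload_offset + sum(sizes[:index])
```

For a two-field document with 2-byte size entries `3` and `5`, the second field starts at offset `2 + 2*2 + 3 = 9`.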
4) Binary structure - Encapsulated
----------------------------------
The encapsulated form is intended for compression and encryption needs.
The encapsulated document is a document in any form, which allows combining
several layers of encapsulation.
```
Bytes 0   1               X               Y
      +---+---+ - - - +---+---+ - - - +---+
      | 2 |    METHOD     |   ENC. SIZE   |
      +---+---+ - - - +---+---+ - - - +---+
```
|Offset| Description |
|-----|-------------------------------------------------------------------------|
| 0-1 | signature, 0x02 for encapsulated |
| 1-X | String in UTF-8 to identify the method, |
| | for example [0x03,'Z','I','P'] or [0x03,'A','E','S'] |
| X-Y | variable size integer to indicate the encapsulated document size |
If the encapsulated size is not zero, the complete encapsulated document is in the
next bytes.
```
Bytes Y               Z
      +---+ - - - +---+
      |  COMP. DOC.   |
      +---+ - - - +---+
```
|Offset| Description |
|-----|---------------------------------------------------------------------------|
| Y-Z | the encapsulated document on N bytes, N being the number defined at X-Y |
If the encapsulated size is zero, the encapsulated document is split into blocks of
a fixed size.
```
Bytes Y               Z               T   T+1
      +---+ - - - +---+---+ - - - +---+---+---+ - - - +---+---+
      |  BLOCK SIZE   |    BLOCK 1    | F |    BLOCK N    | F |
      +---+ - - - +---+---+ - - - +---+---+---+ - - - +---+---+
```
| Offset| Description |
|-------|-------------------------------------------------------|
| Y-Z | variable size integer to define size of the blocks |
| Z-T | block |
| T-T+1 | block flag, if value is zero, this was the last block |
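
Both payload layouts can be read with one small routine, sketched here in Python. It assumes a LEB128-style VarUInt (not specified by this document) and that every block, including the last one, is exactly BLOCK SIZE bytes long. The returned payload is still compressed or encrypted, since applying the method is out of scope here. Function names are hypothetical.

```python
def read_varuint(data: bytes, offset: int):
    """ASSUMED VarUInt encoding: LEB128-style; returns (value, next_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

def read_encapsulated(data: bytes):
    """Return (method, raw_payload) for an encapsulated document at data[0]."""
    if data[0] != 0x02:
        raise ValueError("not an encapsulated document")
    length, pos = read_varuint(data, 1)
    method = data[pos:pos + length].decode("utf-8")
    pos += length
    size, pos = read_varuint(data, pos)
    if size != 0:
        # Single run: the whole encapsulated document follows.
        return method, data[pos:pos + size]
    # Size zero: fixed-size blocks, each followed by a one-byte flag.
    block_size, pos = read_varuint(data, pos)
    payload = bytearray()
    while True:
        payload += data[pos:pos + block_size]
        pos += block_size
        flag = data[pos]
        pos += 1
        if flag == 0:  # zero flag: this was the last block
            return method, bytes(payload)
```

The blocked layout lets a writer emit compressed output without knowing its final size in advance, which matches the no-backward-seek goal of the streamed form.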
5) Binary structure - Reference
-----------------------------
In some cases it is necessary to define cyclic, backward or distant references.
The reference binary structure encodes a UTF-8 string which points to the
document. Common cases include URLs, URNs or file paths, but references are not restricted to these.
```
Bytes 0   1               X
      +---+---+ - - - +---+
      | 3 |   REFERENCE   |
      +---+---+ - - - +---+
```
|Offset| Description |
|-----|-------------------------------------------|
| 0-1 | signature, 0x03 for reference |
| 1-X | Reference String in UTF-8 |
6) Binary structure - Deleted
-----------------------------
Documents can be deleted. This particular structure allows document files to be
modified without rewriting the entire file. Decoders must skip those documents
when they occur.
```
Bytes 0   1               X               Y
      +---+---+ - - - +---+---+ - - - +---+
      |255|   DOC SIZE    |    PADDING    |
      +---+---+ - - - +---+---+ - - - +---+
```
|Offset| Description |
|-----|-----------------------------------------------------------------------|
| 0-1 | signature, 0xFF for deleted |
| 1-X | VarUInt, size of the deleted document, the size includes only the |
| | padding length |
| X-Y | bytes to skip, may contain any kind of data, encoders should fill it |
| | with random or constant values for security reasons. |
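
A decoder can skip deleted documents with a loop like the Python sketch below. It assumes a LEB128-style VarUInt, which this document does not specify, and the function names are hypothetical.

```python
def read_varuint(data: bytes, offset: int):
    """ASSUMED VarUInt encoding: LEB128-style; returns (value, next_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

def skip_deleted(data: bytes, offset: int = 0) -> int:
    """Return the offset of the first non-deleted document at or after `offset`."""
    while offset < len(data) and data[offset] == 0xFF:
        padding, offset = read_varuint(data, offset + 1)
        offset += padding  # the size covers the padding bytes only
    return offset
```

Because the size counts only the padding, a deleted document with an empty padding still occupies two bytes: the signature and a zero size.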