Document : encoding

Binary description of the document format for a known document type
-------------------------------------------------------------------

This format is defined to be compact and simple to implement, but also capable of supporting large document streaming or fast access.

1) Value types
--------------

Base types can be little-endian (LE) or big-endian (BE).

| CODE | NAME      | DESCRIPTION                                          | ENDIANNESS |
|------|-----------|------------------------------------------------------|------------|
| 0    | Bit bool  | 1 bit interpreted as a boolean value                 |            |
| 1    | Byte bool | 1 byte interpreted as a boolean value                |            |
| 2    | Int8      | signed byte                                          |            |
| 3    | UInt8     | unsigned byte                                        |            |
| 4    | Int16     | signed short (2 bytes)                               | LE         |
| 5    | UInt16    | unsigned short (2 bytes)                             | LE         |
| 6    | Int32     | signed integer (4 bytes)                             | LE         |
| 7    | UInt32    | unsigned integer (4 bytes)                           | LE         |
| 8    | Int64     | signed long (8 bytes)                                | LE         |
| 9    | UInt64    | unsigned long (8 bytes)                              | LE         |
| 10   | Float     | IEEE 754 (4 bytes)                                   | LE         |
| 11   | Double    | IEEE 754 (8 bytes)                                   | LE         |
| 12   | Int16     | signed short (2 bytes)                               | BE         |
| 13   | UInt16    | unsigned short (2 bytes)                             | BE         |
| 14   | Int32     | signed integer (4 bytes)                             | BE         |
| 15   | UInt32    | unsigned integer (4 bytes)                           | BE         |
| 16   | Int64     | signed long (8 bytes)                                | BE         |
| 17   | UInt64    | unsigned long (8 bytes)                              | BE         |
| 18   | Float     | IEEE 754 (4 bytes)                                   | BE         |
| 19   | Double    | IEEE 754 (8 bytes)                                   | BE         |
| 20   | VarUInt   | variable size unsigned integer (1 to 8 bytes)        |            |
| 21   | Text      | for strings                                          |            |
| 22   | Doc       | for sub documents                                    |            |
| 23   | Variable  | varying type, which can take any of the above values |            |
| 0x80 | Var Bits  | Mask used for 1 to 64 bit fields                     |            |

All other types are defined as documents.
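
As a rough illustration of the table above, the fixed-width codes (1 to 19) can be mapped onto Python `struct` format strings. This is only a sketch: the names `FIXED_FORMATS` and `read_fixed` are hypothetical, and the bit-level and variable-size types (codes 0, 20 to 23 and 0x80) are left out because their byte layout is not given in this section.

```python
import struct

# Type code -> struct format string, following the value type table.
# '<' selects little-endian and '>' big-endian, both with standard sizes.
FIXED_FORMATS = {
    1: "<?",   # Byte bool (non-zero byte decodes as True)
    2: "<b",   # Int8
    3: "<B",   # UInt8
    4: "<h", 5: "<H", 6: "<i", 7: "<I",        # 16 and 32 bit integers, LE
    8: "<q", 9: "<Q", 10: "<f", 11: "<d",      # 64 bit integers, Float, Double, LE
    12: ">h", 13: ">H", 14: ">i", 15: ">I",    # 16 and 32 bit integers, BE
    16: ">q", 17: ">Q", 18: ">f", 19: ">d",    # 64 bit integers, Float, Double, BE
}


def read_fixed(code, data, offset=0):
    """Decode one fixed-width value and return (value, next_offset)."""
    fmt = FIXED_FORMATS[code]
    (value,) = struct.unpack_from(fmt, data, offset)
    return value, offset + struct.calcsize(fmt)
```

For instance, `read_fixed(6, b"\x01\x00\x00\x00")` and `read_fixed(14, b"\x00\x00\x00\x01")` both return `(1, 4)`: the same Int32 value stored in the two byte orders.
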

*Text value*

Text is a very common structure, encoded as:

| TYPE      | CARD  | DESCRIPTION                 |
|-----------|-------|-----------------------------|
| VarUInt   | (1,1) | data array size             |
| UByte     | (1,1) | array holding the text data |

The character encoding is given by the document field type.

*Varying value*

The varying type is unknown before reading the document.
All varying values are composed of a field description document followed by the value itself.

| TYPE      | CARD  | DESCRIPTION           |
|-----------|-------|-----------------------|
| Doc       | (1,1) | field value type      |
| XXXXX     | (1,1) | data                  |

2) Binary structure - Streamed
------------------------------

The streamed form allows quick writing with no backward cursor movement.
The downside is that the document size is unknown until reading is finished.
This form is best suited for tiny and very large documents.

```
Bytes
0   1               X               Y
+---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+
| 0 |  FIELD 1 DEF  | FIELD 1 VALUE |  FIELD 2 DEF  | FIELD 2 VALUE |  FIELD N DEF  | FIELD N VALUE |
+---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+---+ - - - +---+
```

|Offset| Description                                               |
|-----|-----------------------------------------------------------|
| 0-1 | signature, 0x00 for streamed                              |
| 1-X | field definition if the type does not define all elements |
| X-Y | field value if values are present                         |

*PSEUDO-CODE*

```
FUNCTION Document readDocument(Input in, DocumentType docType)
    Document doc = Document.new()
    FOR EACH field IN docType.fields
        # read number of occurrences
        Int nbOcc = readNbOccurence(in, field)
        IF nbOcc != 0 THEN
            # read field value doc type if undefined
            DocumentType fieldDocType = field.docType
            IF field.type = 'Document' AND field.docType = null THEN
                Document encodedDocType = readDocument(in, SCHEMA_DOCTYPE)
                fieldDocType = toDocumentType(encodedDocType)
            END IF
            # read field values
            IF nbOcc = 1 AND field.maxOcc = 1 THEN
                # read a single value
                doc.setFieldValue(field.id, readValue(in, field))
            ELSE IF nbOcc = -1 THEN
                # read a streamed collection, a zero byte marks the end
                List values = List.new()
                WHILE in.readByte() != 0
                    values.add(readValue(in, field))
                END WHILE
                doc.setFieldValue(field.id, values)
            ELSE
                # read a defined size collection
                List values = List.new(nbOcc)
                FOR i = 0 TO nbOcc-1
                    values.add(readValue(in, field))
                END FOR
                doc.setFieldValue(field.id, values)
            END IF
        END IF
    END FOR
    RETURN doc
END FUNCTION

FUNCTION Int readNbOccurence(Input in, Field field)
    Int nbOcc = field.maxOcc
    IF field.minOcc = 0 AND field.maxOcc = 1 THEN
        # optional single value, presence is stored on one bit
        IF in.readBit() = 0 THEN
            nbOcc = 0
        ELSE
            nbOcc = 1
        END IF
    ELSE IF field.minOcc != field.maxOcc THEN
        Int n = in.readVarUInt()
        IF n = 0 THEN
            # zero marks a streamed collection
            nbOcc = -1
        ELSE
            nbOcc = field.minOcc + n - 1
        END IF
    END IF
    RETURN nbOcc
END FUNCTION

FUNCTION Object readValue(Input in, Field field)
    # read array size, dimensions <= 0 are stored in the stream
    Int[] arraySize = field.arraySize
    IF arraySize != null THEN
        FOR i = 0 TO arraySize.length-1
            IF arraySize[i] <= 0 THEN
                arraySize[i] = in.readVarUInt()
            END IF
        END FOR
    END IF
    # field values
    IF arraySize != null THEN
        RETURN readArrayValues(in, field, arraySize, 0)
    ELSE
        RETURN readSingleValue(in, field)
    END IF
END FUNCTION

FUNCTION Object readArrayValues(Input in, Field field, Int[] arraySize, Int depth)
    Object[] values = new Object[arraySize[depth]]
    FOR i = 0 TO values.length-1
        IF depth = arraySize.length-1 THEN
            values[i] = readSingleValue(in, field)
        ELSE
            values[i] = readArrayValues(in, field, arraySize, depth+1)
        END IF
    END FOR
    RETURN values
END FUNCTION

FUNCTION Object readSingleValue(Input in, Field field)
    SWITCH field.type
        CASE 'Bit' :       RETURN in.readBit() != 0
        CASE 'ByteBool' :  RETURN in.readByte() != 0
        CASE 'Int8' :      RETURN in.readByte()
        CASE 'UInt8' :     RETURN in.readUByte()
        CASE 'Int16_BE' :  RETURN in.readShortBE()
        CASE 'UInt16_BE' : RETURN in.readUShortBE()
        CASE 'Int32_BE' :  RETURN in.readIntBE()
        CASE 'UInt32_BE' : RETURN in.readUIntBE()
        CASE 'Int64_BE' :  RETURN in.readLongBE()
        CASE 'UInt64_BE' : RETURN in.readULongBE()
        CASE 'Float_BE' :  RETURN in.readFloatBE()
        CASE 'Double_BE' : RETURN in.readDoubleBE()
        CASE 'Int16_LE' :  RETURN in.readShortLE()
        CASE 'UInt16_LE' : RETURN in.readUShortLE()
        CASE 'Int32_LE' :  RETURN in.readIntLE()
        CASE 'UInt32_LE' : RETURN in.readUIntLE()
        CASE 'Int64_LE' :  RETURN in.readLongLE()
        CASE 'UInt64_LE' : RETURN in.readULongLE()
        CASE 'Float_LE' :  RETURN in.readFloatLE()
        CASE 'Double_LE' : RETURN in.readDoubleLE()
        CASE 'VarUInt' :   RETURN in.readVarUInt()
        CASE 'VarBits' :   RETURN in.readBits(field.nbBits)
        CASE 'Text' :      RETURN Chars.new(in.readBytes(in.readVarUInt()), field.charEncoding)
        CASE 'Document' :  RETURN readDocument(in, field.docType, field.inline)
        CASE 'Variable' :
            Document typeDoc = readDocument(in, SCHEMA_FIELDVALUETYPE)
            FieldValueType type = toFieldValueType(typeDoc)
            RETURN readValue(in, type)
    END SWITCH
END FUNCTION
```

3) Binary structure - Indexed
-----------------------------

The indexed form allows quick skipping and access to properties.
The downside is a slightly bigger file and backward cursor positioning when writing.

```
Bytes
0   1   2               X               Y               Z
+---+---+ - - - + - - - +---+ - - - +---+---+ - - - +---+
| 1 | S | SIZE1 | SIZEN |  FIELD N DEF  | FIELD N VALUE |
+---+---+ - - - + - - - +---+ - - - +---+---+ - - - +---+
```

|Offset| Description |
|-----|-------------|
| 0-1 | marker, 0x01 for indexed |
| 1-2 | number of bytes used to store a field size |
| 2-X | size in bytes of each document field, the number of values is given by the document type, the total document size can be calculated using the formula : sum ( size1 ... sizeN ) + 2 |
| X-Y | field definition if the type does not define all elements |
| Y-Z | field value if values are present |

The field definitions and values use the same structure as in the streamed form.

4) Binary structure - Encapsulated
----------------------------------

The encapsulated form is intended to be used for compression and encryption needs.
The encapsulated document is a document in any form, which allows several layers of encapsulation to be combined.

```
Bytes
0   1               X               Y
+---+---+ - - - +---+---+ - - - +---+
| 2 |    METHOD     |   ENC. SIZE   |
+---+---+ - - - +---+---+ - - - +---+
```

|Offset| Description |
|-----|-------------|
| 0-1 | signature, 0x02 for encapsulated |
| 1-X | String in UTF-8 to identify the method, for example [0x03,'Z','I','P'] or [0x03,'A','E','S'] |
| X-Y | variable size integer to indicate the encapsulated document size |

If the encapsulated size is not zero, the complete encapsulated document is in the next bytes.

```
Bytes
Y               Z
+---+ - - - +---+
|  COMP. DOC.   |
+---+ - - - +---+
```

|Offset| Description |
|-----|-------------|
| Y-Z | the encapsulated document on N bytes, N being the number defined at offset X-Y |

If the encapsulated size is zero, the encapsulated document is split into blocks of fixed size.

```
Bytes
Y               Z               T   T+1
+---+ - - - +---+---+ - - - +---+---+---+ - - - +---+---+
|  BLOCK SIZE   |    BLOCK 1    | F |    BLOCK N    | F |
+---+ - - - +---+---+ - - - +---+---+---+ - - - +---+---+
```

| Offset| Description                                           |
|-------|-------------------------------------------------------|
| Y-Z   | variable size integer to define size of the blocks    |
| Z-T   | block                                                 |
| T-T+1 | block flag, if value is zero, this was the last block |

5) Binary structure - Reference
-------------------------------

In some cases it is necessary to define cyclic, backward or distant references.
The reference binary structure encodes a UTF-8 string which points toward the document.
Common cases include URLs, URNs or file paths, but references are not restricted to these.

```
Bytes
0   1               X
+---+---+ - - - +---+
| 3 |   REFERENCE   |
+---+---+ - - - +---+
```

|Offset| Description                               |
|-----|-------------------------------------------|
| 0-1 | signature, 0x03 for reference             |
| 1-X | Reference String in UTF-8                 |

6) Binary structure - Deleted
-----------------------------

Documents can be deleted. This structure allows document files to be modified without rewriting the entire file.
Decoders must skip these documents when they occur.

```
Bytes
0   1               X               Y
+---+---+ - - - +---+---+ - - - +---+
|255|   DOC SIZE    |    PADDING    |
+---+---+ - - - +---+---+ - - - +---+
```

|Offset| Description |
|-----|-------------|
| 0-1 | signature, 0xFF for deleted |
| 1-X | VarUInt, size of the deleted document, the size includes only the padding length |
| X-Y | bytes to skip, may contain any kind of data, encoders should fill it with random or constant values for security reasons |
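
The signature bytes of the five forms, together with the rule that decoders skip deleted documents, could be handled along these lines. This is a minimal sketch under stated assumptions: `FORMS`, `next_live_form` and `one_byte_varuint` are hypothetical names, and since the VarUInt byte layout is not described here, the reader function is passed in by the caller and the demo uses a one-byte stand-in that is not the real encoding.

```python
import io

# Signature byte -> binary structure, as listed in sections 2 to 6.
FORMS = {
    0x00: "streamed",
    0x01: "indexed",
    0x02: "encapsulated",
    0x03: "reference",
    0xFF: "deleted",
}


def next_live_form(stream, read_varuint):
    """Skip deleted documents and return the form of the next live one.

    Returns None at end of input. read_varuint is supplied by the caller
    (assumption: the VarUInt layout is defined elsewhere).
    """
    while True:
        first = stream.read(1)
        if not first:
            return None
        form = FORMS.get(first[0])
        if form is None:
            raise ValueError(f"unknown signature 0x{first[0]:02X}")
        if form != "deleted":
            return form
        # DOC SIZE counts only the padding bytes that follow it.
        padding = read_varuint(stream)
        if len(stream.read(padding)) != padding:
            raise EOFError("truncated deleted document")


def one_byte_varuint(stream):
    """Stand-in for illustration only: reads the size from a single byte."""
    return stream.read(1)[0]


# A deleted document with 3 padding bytes, followed by a reference signature.
demo = io.BytesIO(bytes([0xFF, 0x03, 0xAA, 0xBB, 0xCC, 0x03]))
print(next_live_form(demo, one_byte_varuint))  # reference
```

Passing the VarUInt reader as a parameter keeps the skipping logic independent of how sizes are actually stored.
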