JData for Beginners

  • By Qianqian Fang <q.fang at neu.edu>
  • Status: under-development, please forgive typos and language issues, will polish once complete

JData is here to help! It is not a new format, and it is simple and easy to adopt. Read below to see why.

1. Why do we need JData?
2. What exactly is JData?
2.1. JData in a nutshell
2.2. Introduction by examples
2.2.1. A simple array
2.2.2. A simple array with a binary type
2.2.3. A simple array with compression
2.2.4. A structure
2.2.5. Using binary interface
2.2.6. Quasi-readability of binary JData files
2.2.7. Space savings

1. Why do we need JData?

It is often said that we live in the era of big data. You don't have to agree with that, but we all have to deal with data and files from time to time, and we all know that there are numerous file formats that require specific software to open them. In most cases, you can't just open a file in a text editor and make sense of it, because those files are mostly binary and not directly human-readable.

This is bad for scientists -- people who generate a lot of complicated data and process them to understand what's going on in our world -- because they have to deal with a growing number of data file types from increasingly complicated experiments. This is also bad for software developers -- people who write software to read, process, and display certain types of data and help you understand them -- because they have to constantly add libraries and parsers to handle newly emerged formats and, whenever a format is updated, its later revisions. Keeping up takes a lot of work and overhead, and it is costly too.

As a result, software becomes bulky, with a large number of dependencies, and increasingly difficult to maintain and improve. On the other hand, the numerous language-, application-, and device-specific data files, sometimes storing precious experimental data that cost a fortune to collect, are difficult to share, reuse, and understand, because you need many dependent software tools to open and interpret them. Even worse, many existing data formats use rigid binary structures that are nearly impossible to extend (such as by adding or removing something) without breaking backward compatibility! Once a data format is revised, most data stored in the old format inevitably becomes outdated because no developer is interested in maintaining a parser for an outdated format.

People have noticed this problem and have come up with a number of solutions, but each solution has met its own challenges. For example:

  • Extensible Markup Language (XML): a general-purpose data format that is designed to be easily extensible, human-readable and versatile, but it suffers from verbosity and is difficult to parse
  • JavaScript Object Notation (JSON): a light-weight, simple yet highly extensible, human-readable format for storing complex hierarchical data (probably the most popular data format currently supported), but it suffers from 1) lack of strongly-typed data support, 2) large file sizes and parsing overhead compared to binary files
  • Binary JSON-like formats such as BSON, MessagePack, CBOR: these formats inherit the simplicity and versatility of JSON, and provide better efficiency and smaller file sizes, but nearly all of them lose human-readability
  • Hierarchical Data Format (HDF5): a sophisticated, highly versatile, extensible and high-performance binary format targeted at storing large amounts of scientific data, but to support HDF5 files you have to know how to program its APIs, and those have a learning curve. The file is also not human-readable, and you must have the libraries pre-installed to understand the files. The overhead of storing lightweight datasets can also be significant.

On top of this, most popular programming languages used to write data processing software have developed their own complex data types to efficiently process diverse input data and represent sophisticated relations, but there is currently no common interface for exchanging such complex data types between software written in different languages. For example, the data types that data scientists or researchers often use include

Data types          | Python                      | MATLAB                    | Perl          | JavaScript                     | C++
flexible array      | list()                      | cell()                    | array         | array                          | -
hash/map            | dict()                      | struct()/containers.Map() | %hash         | object, Map()                  | std::map
packed-typed-arrays | numpy.NDArray               | matrices                  | PDL::matrix   | TypedArray, numjs.NdArray      | std::array
tables              | pandas.DataFrame            | table                     | Data::Table   | -                              | -
linked lists        | -                           | -                         | -             | -                              | std::list
graph               | -                           | graph()/digraph()         | -             | -                              | -
blob                | bytes/bytearray             | uint8/int8/char vector    | File::BLOB    | Blob()                         | void*
complex             | complex/numpy.dtype=complex | complex                   | Math::Complex | numjs.ndarray(dtype='complex') | std::complex
sparse array        | scipy.sparse                | sparse                    | -             | Map()                          | -
special matrices    | scipy.sparse                | sparse                    | -             | -                              | -

    • "-" indicates the lack of built-in support and canonical 3rd-party library support

To solve this problem, we need a data representation (and/or) storage format that is:

  1. highly portable - it must be language-/platform-independent
  2. highly extensible - you can add/remove/merge data without breaking existing parsers
  3. highly versatile - to encapsulate complex hierarchical data and allow easy split, merge, and integration
  4. simple and intuitive - that anyone without training can easily understand and adopt, and if possible,
  5. human-readable - human-readability is the ultimate criterion for future-proof data storage, because the data can be easily understood without special tools

The above-mentioned general-purpose data formats - XML, JSON, binary JSON or HDF5 - all have potential, but all need improvements to address their limitations. In addition, we need standardization: even though there are numerous ways to represent a complex data type, the representation needs a clearly defined syntax/specification so that it can be unambiguously encoded, shared, and exchanged among software and programming languages.

That is what we are trying to achieve with JData. JData is a standardized, portable middleware-like data annotation protocol that allows one to store complex data types, ranging from simple numerical arrays and values to sparse, complex, hierarchical graph-like data, easily, efficiently, and unambiguously. Such data annotations can also be stored as compact, human-readable files and shared among different programs.

2. What exactly is JData?

2.1. JData in a nutshell

JData is our solution to help data creators and users, especially scientific researchers who routinely deal with complex data, to easily store, read, understand, share, combine, process, exchange and integrate their data and enable new discoveries. They can save all the extra work of dealing with numerous diverse file formats and focus on the more important questions. Their hard-earned data files can also be shared with a large research community by making them FAIR - findable, accessible, interoperable and reusable - helping other researchers to create bigger studies and enable grander discoveries.

We have developed the JData framework to specifically address the above-mentioned challenges. This framework includes

  1. A set of data specifications that map complex data structures to a standardized "data annotation" that can unambiguously represent a large variety of data types encountered in the scientific world
  2. A standardized data serialization approach based on existing, widely supported data serialization formats (specifically, we picked JSON and binary JSON - yes, dual-interface!), so that the JData-annotated constructs can be efficiently parsed/stored/converted/transmitted without the extra work of writing brand new parsers, and
  3. A series of libraries and encoders/decoders to map complex data types from different programming environments to the common JData annotation format, and to export and import them between programming languages using a JSON/binary JSON file

In a nutshell, the JData Specification represents complex data structures using JSON (JavaScript Object Notation) compatible data constructs, making complex (and lightweight) data structures readable, shareable, extensible and interoperable between software and analysis pipelines. The JData specification extends JSON by adding data containers to support binary strongly-typed data, complex-valued arrays, data compression, data grouping, linking/referencing, etc., entirely in the "semantic layer" without altering the syntax of the format. This makes JData-encoded data files 100% compatible with all existing JSON parsers and readily usable in software where JSON is supported.

To make JData suitable for the many space-limited, performance-sensitive applications, we also defined a binary interface for the JData-defined data annotations, utilizing another widely supported binary JSON-like format - Universal Binary JSON (UBJSON) - to produce smaller files and significantly faster parsing. JSON/text-JData files can be losslessly converted to UBJSON/binary-JData files, and vice versa. The reason we selected UBJSON over other binary alternatives, such as BSON, MessagePack, and CBOR, is largely its "quasi-human-readability", which is unique to this binary format. In UBJSON, all "semantic elements", i.e. the data type markers and data name records, are stored in human-readable form. By using a text editor or a string-printing tool, such as the strings command on Mac or Linux (also available on Windows via Cygwin), one can read the content of the binary data without much difficulty.

Simplicity and human-readability are among our requirements in the design of this framework because we understand that a data standard targeting a general audience has to be simple! The format has to be simple and intuitive; serialization, reading, and writing of the file have to be easy and widely accessible without much programming overhead; and future extension must be inherent - once a data file is created, it must be easily modified without worrying about breaking the parsers, so it can stay usable for a long time. Human-readability is harder to achieve, because it typically conflicts with speed and data size. However, our solution attempts to strike a balance between readability and efficiency, and ensures that all "semantic components", such as data item names, strings, and data types, are easily read and understood, while the binary data payload can be stored compactly using compression or filters.

2.2. Introduction by examples

2.2.1. A simple array

Let's start with a simple example. Assume we have a 2x4 integer array 'mydata', which one can define similarly in several programming languages as

 Python:     mydata=[[1,2,3,4], [5,6,7,8]]
 JavaScript: mydata=[[1,2,3,4], [5,6,7,8]]
 MATLAB:     mydata=[[1,2,3,4]; [5,6,7,8]]
 Perl:      @mydata=([1,2,3,4], [5,6,7,8])

In JData, we can represent this data array using the widely-supported, plain-text-based JSON format as

 {
    "mydata": [[1,2,3,4], [5,6,7,8]]
 }

From the above example, you may notice several major characteristics of JData

  • P1 - First, this representation is quite intuitive and easy to understand; one can use a text-editor and open the file and understand the data without much trouble.
  • P2 - Secondly, JData form is syntactically compatible with JSON, and JSON encoders are freely available in almost all programming languages, and they are easy to program or use.
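
As a minimal sketch of P2, the direct form above can be written and read back with nothing more than Python's built-in json module; no JData-specific library is involved. The variable names are illustrative only.

```python
import json

# The 2x4 integer array from the example above, as a nested Python list.
mydata = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Encode: wrap the array in an object under the name "mydata".
text = json.dumps({"mydata": mydata})

# Decode: any JSON parser recovers the same nested list.
restored = json.loads(text)["mydata"]

print(text)
print(restored == mydata)
```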

2.2.2. A simple array with a binary type

Now, let's make this better - imagine that we want to store this array using an unsigned byte, i.e. uint8, format. The JSON-based form above starts to show its limitations, because the JSON specification does not support strongly-typed binary data.

In JData, we introduce a new data representation - the annotated array format to let JSON support strongly-typed binary data

 {
    "mydata": {
        "_ArrayType_": "uint8",
        "_ArraySize_": [2,4],
        "_ArrayData_": [1,2,3,4,5,6,7,8]
     }
 }

The above annotated format does look slightly more complicated than the first form (referred to as the direct format), but it remains easily understandable - more importantly, the annotated-array remains fully syntactically compatible with the JSON format - in other words, this JSON data representation is now capable of encoding data types.
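
To illustrate how a reader might turn the annotated form back into a native array, here is a rough Python sketch. It assumes that `_ArrayData_` lists the elements in row-major order (the last dimension varies fastest), which is consistent with the 2x4 example above; the helper names `reshape` and `decode_annotated_array` are hypothetical and not part of any JData library.

```python
import json

def reshape(flat, dims):
    """Rebuild a nested list of shape `dims` from a flat row-major list."""
    if len(dims) == 1:
        return list(flat)
    # Number of elements in each sub-block along the first dimension.
    step = len(flat) // dims[0] if dims[0] else 0
    return [reshape(flat[i * step:(i + 1) * step], dims[1:])
            for i in range(dims[0])]

def decode_annotated_array(node):
    """Hypothetical helper: return (type name, nested list) for an annotated array."""
    dims = node["_ArraySize_"]
    flat = node["_ArrayData_"]
    count = 1
    for d in dims:
        count *= d
    if count != len(flat):
        raise ValueError("_ArraySize_ does not match the length of _ArrayData_")
    return node["_ArrayType_"], reshape(flat, dims)

text = ('{"mydata": {"_ArrayType_": "uint8", "_ArraySize_": [2,4], '
        '"_ArrayData_": [1,2,3,4,5,6,7,8]}}')
dtype, values = decode_annotated_array(json.loads(text)["mydata"])
print(dtype, values)
```

This sketch only carries the type name along; a real decoder would also convert the values to the matching native numeric type.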

2.2.3. A simple array with compression

We can go one step further: JData also extends JSON to represent binary data. The binary data in the above array can be stored as an encoded string, and that binary stream can even be compressed to save space, for example:

 {
    "mydata": {
        "_ArrayType_": "uint8",
        "_ArraySize_": [2,4],
        "_ArrayZipType_": "zlib",
        "_ArrayZipSize_": [1,8],
        "_ArrayZipData_": "eJxjZGJmYWVj5wAAAIAAJQ=="
     }
 }

The above JSON construct stores the 8-element uint8 array by first compressing the byte stream using the zlib-deflate algorithm (as indicated by _ArrayZipType_) and then converting the result to ASCII text using base64 encoding. At this point, we are using a JSON-compliant annotation format to store strongly-typed binary data, which was previously not supported by the native JSON specification.
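
The two encoding steps can be undone with Python's standard base64 and zlib modules. The following is only a sketch of the idea for the uint8 case, where each byte is one array element; it is not the code used by any JData library.

```python
import base64
import zlib

# The _ArrayZipData_ string from the example above.
zip_data = "eJxjZGJmYWVj5wAAAIAAJQ=="

# Undo the two encoding steps in reverse order: base64 first, then zlib.
raw = zlib.decompress(base64.b64decode(zip_data))
values = list(raw)  # for uint8, each byte is one array element
print(values)

# Encoding goes the other way: pack the bytes, compress, then base64-encode.
encoded = base64.b64encode(zlib.compress(bytes(values))).decode("ascii")
roundtrip = list(zlib.decompress(base64.b64decode(encoded)))
print(roundtrip == values)
```

The first print lists the eight values 1 through 8. The re-encoded string may differ from the one above if a different compression level is used, but it decodes to the same bytes.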

2.2.4. A structure

Hierarchical data are natively supported by JSON and UBJSON, therefore, it is effortless for JData files to encode and store nested complex data structures. For example, the below commands in respective programming languages

 Python:     mydata={'a':5,'f':1.1,'c':{'d':[[1,2,3,4,5],[6,7,8,9,10]],'s':'a string'}}
 JavaScript: mydata={'a':5,'f':1.1,'c':{'d':[[1,2,3,4,5],[6,7,8,9,10]],'s':'a string'}}
 MATLAB:     mydata=struct('a',5,'f',1.1,'c',struct('d',[[1,2,3,4,5]; [6,7,8,9,10]],'s','a string'))
 Perl:      %mydata=('a'=>5,'f'=>1.1,'c'=>{'d'=>[[1,2,3,4,5], [6,7,8,9,10]],'s'=>'a string'})

produce the native data structure below

 mydata= {
    a =  5
    f =  1.1000
    c =
        d =
           1   2   3   4   5
           6   7   8   9  10
        s = a string
 }

In JSON/text-JData format, the above data can be intuitively represented as

 {
    "mydata": {
        "a": 5,
        "f": 1.1,
        "c": {
           "d": [
            [1,2,3,4, 5], 
            [6,7,8,9,10]
           ],
           "s": "a string"
        }
     }
 }

The above representation is easy to read and understand without needing special tools, libraries or prior knowledge of the data format itself.

2.2.5. Using binary interface

One may argue that, although the above data representation is general, extensible and human-readable, it requires more space to store and is not very efficient when processing large quantities of data.

This concern is valid, and is the main motivation for the "binary JData" interface using UBJSON.

Let's first understand why the above argument is valid. If we take the above JData form and use a tab \t for indentation and a newline \n for line wrapping, it requires a total of 124 bytes to store on disk. Because white space between data records, including newlines and tabs, is optional in JSON, stripping it saves space. Therefore, we can use the "compact" format of JSON/JData,

 {"mydata":{"a":5,"f":1.1,"c":{"d":[[1,2,3,4,5],[6,7,8,9,10]],"s":"a string"}}}

this results in a total of 79 bytes.
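
As a quick check of the compact form, the sketch below produces it with Python's json module. The compact string is 78 characters long, which matches the 79 bytes quoted above if the file ends with a newline (an assumption about how the size was measured). The pretty-printed size from json.dumps will not match the 124 bytes quoted earlier, because json.dumps places every array element on its own line.

```python
import json

mydata = {"a": 5, "f": 1.1,
          "c": {"d": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], "s": "a string"}}

# separators=(",", ":") removes the spaces json.dumps inserts by default.
compact = json.dumps({"mydata": mydata}, separators=(",", ":"))

# For comparison, a pretty-printed form indented with tabs.
pretty = json.dumps({"mydata": mydata}, indent="\t")

print(compact)
print(len(compact.encode("utf-8")), len(pretty.encode("utf-8")))
```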

In the JData specification, we further reduce this storage overhead by converting the above JSON data into its equivalent binary UBJSON representation. To make this data structure easier to read, we use "block notation" here, inserting newlines and white space between data records. Please note that in the actual UBJSON file, all white space and the "[" "]" markers around each "data block" below must be removed.

 [{]
    [U][6][mydata] 
    [{]
        [U][1][a] [U][5]
        [U][1][f] [d][1.1]
        [U][1][c]
        [{]
           [U][1][d] [[]
              [[] [U][1][U][2][U][3][U][4][U][5]  []]
              [[] [U][6][U][7][U][8][U][9][U][10] []]
           []]
           [U][1][s] [S][U][8][a string]
        [}]
    [}]
 [}]

To explain the above notation: UBJSON follows a syntax similar to JSON, except that both the name and value fields accept binary data type markers ([U] for uint8, [d] for float32, [S] for string; [[]...[]] stores an array object, and [{]...[}] stores a map/hash).

The above UBJSON representation requires a total of 73 bytes to store, which is slightly smaller than the text-based JSON format above. One can see that the 2-D array "mydata.c.d" contains numerical elements of the same type; thus, it is a "packed" array, which can take advantage of the "optimized array header" defined in the UBJSON specification to factor out the shared data type marker, in this case [U], and save more space. Using the UBJSON optimized array header, we can store the above data as

 [{]
    [U][6][mydata] 
    [{]
        [U][1][a] [U][5]
        [U][1][f] [d][1.1]
        [U][1][c]
        [{]
           [U][1][d] [[]
              [[] [$][U][#][U][5] [1][2][3][4][5]
              [[] [$][U][#][U][5] [6][7][8][9][10]
           []]
           [U][1][s] [S][U][8][a string]
        [}]
    [}]
 [}]

The above form needs a total of 71 bytes to store, about 10% smaller than the compact-formatted JSON/JData format; in the meantime, all data types are stored in strongly-typed binary form without needing to parse or format when loading or saving.
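
The byte counts for both block-notation listings can be reproduced with a toy encoder. The sketch below is deliberately limited and is not a full UBJSON implementation: it assumes every integer fits in uint8, writes every float as a big-endian float32 ([d]), and applies the optimized header only to flat lists of uint8 values. The function name `encode` and its `optimize` flag are hypothetical.

```python
import struct

def encode(value, optimize=False):
    """Toy UBJSON encoder covering only what the example needs."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        if not 0 <= value <= 255:
            raise ValueError("this sketch only handles uint8 integers")
        return b"U" + bytes([value])
    if isinstance(value, float):
        return b"d" + struct.pack(">f", value)       # float32, big-endian
    if isinstance(value, str):
        data = value.encode("utf-8")
        return b"S" + encode(len(data)) + data       # marker, length, bytes
    if isinstance(value, list):
        packed = bool(value) and all(
            isinstance(v, int) and not isinstance(v, bool) and 0 <= v <= 255
            for v in value)
        if optimize and packed:
            # Optimized header: [$][U] shared type, [#][U][n] count,
            # then the raw bytes with no closing "]".
            return b"[$U#" + encode(len(value)) + bytes(value)
        return b"[" + b"".join(encode(v, optimize) for v in value) + b"]"
    if isinstance(value, dict):
        out = b"{"
        for name, item in value.items():
            key = name.encode("utf-8")
            out += encode(len(key)) + key            # names carry no [S] marker
            out += encode(item, optimize)
        return out + b"}"
    raise TypeError("unsupported type: %r" % type(value))

doc = {"mydata": {"a": 5, "f": 1.1,
                  "c": {"d": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
                        "s": "a string"}}}

plain = encode(doc)
optimized = encode(doc, optimize=True)
print(len(plain), len(optimized))
```

Under these assumptions the two encodings come to 73 and 71 bytes. Note that storing 1.1 as float32 loses some precision compared with the text form.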

2.2.6. Quasi-readability of binary JData files

As we mentioned above, UBJSON is a rather unique binary data format in that it is "quasi-human-readable". To show this, we can run the strings utility on the above binary JData file from the Mac/Linux (or Cygwin on Windows) command line and see the output below

 strings -n 2 mydata.jbat
 {U
 mydata{U
 aU
 fD?
 c{U
 d[[$U#U
 [$U#U
 ]U
 sSU
 a string}}}

If we slightly reformat the above output using a code-formatting text utility, astyle, the text markers extracted from the above binary JData file can be shown as

 strings -n 2 mydata.jbat | astyle
 {   U
     mydata{
         U
         aU
         fD?
         c{
             U
             d[[$U#U
             [$U#U
             ]U
             sSU
             a string}}
 }

The above printed data skeleton contains all "semantic" elements of the data file, i.e. all the data subfield names, string values, and data type markers. In the meantime, the formatted text outline correctly printed the hierarchical structures of the binary data.

This ability to let a user conveniently inspect and understand the overall data structure of a binary file is unique to UBJSON/binary JData and is not supported by almost any other binary data format, including HDF5, BSON, MessagePack, etc. This quasi-human-readability readily makes many binary-only data types easily searchable and findable once converted to the binary JData format.
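
The idea behind the strings utility is simple enough to imitate in a few lines. The sketch below is a rough approximation, not a reimplementation of strings: it collects runs of printable ASCII of a minimum length from a byte string. The sample input is a hand-written UBJSON encoding of {"mydata": {"a": 5}}, used only for illustration.

```python
def printable_runs(data, min_len=2):
    """Return runs of printable ASCII that are at least `min_len` bytes long."""
    runs, current = [], bytearray()
    for byte in data:
        if 32 <= byte <= 126 or byte == 9:   # printable ASCII or tab
            current.append(byte)
        else:
            if len(current) >= min_len:
                runs.append(current.decode("ascii"))
            current = bytearray()
    if len(current) >= min_len:
        runs.append(current.decode("ascii"))
    return runs

# Hand-written UBJSON bytes for {"mydata": {"a": 5}}, used only as sample input.
sample = b"{U\x06mydata{U\x01aU\x05}}"
print(printable_runs(sample))
```

The field names and type markers survive as readable text, while the length and value bytes (here 6, 1 and 5) act as separators between the runs.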

2.2.7. Space savings

In the above toy dataset, the space saving from data compression or from converting to the binary form does not appear to be significant. In this section, we use a real-world dataset to show how JData formats can significantly reduce data file sizes and parsing overhead.

Let's consider a 3-D medical imaging dataset, a head-CT scan provided by the widely used rendering software MRIcroGL (raw data file). This 3-D CT scan of a head volume contains 208 x 256 x 225 voxels, with each voxel represented by a 1-byte gray-scale value (0-255).

To store such a dataset using the widely supported NIfTI-1 data format, the raw .nii data file requires a total of 11,982 kB, including a 348-byte NIfTI-1 binary header and 4-byte extension markers, followed by 208 x 256 x 225 = 11,980,800 bytes of image data. Often, a .nii file is stored as a gzip-compressed .nii.gz file. In this case, the .nii.gz file has a size of 2,809 kB.
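
The raw file size follows directly from the numbers in the paragraph above, as this short calculation shows. The variable names are illustrative only.

```python
header_bytes = 348           # NIfTI-1 binary header
extension_bytes = 4          # extension markers
nx, ny, nz = 208, 256, 225   # volume dimensions in voxels
bytes_per_voxel = 1          # 1-byte gray-scale value

image_bytes = nx * ny * nz * bytes_per_voxel
total_bytes = header_bytes + extension_bytes + image_bytes

print(image_bytes)           # 11980800
print(total_bytes)           # 11981152
```

At 1000 bytes per kB, the total is about 11,981.2 kB, in line with the 11,982 kB quoted above if the size is rounded up (an assumption about how the figure was reported).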

By using a JData-equivalent data container that is 100% compatible with the NIfTI header, i.e. the JNIfTI format, we can convert the above .nii file to a JSON/text-based JData file or a UBJSON/binary-JData file. Optionally, as we showed above, one can use data compression via the addition of _ArrayZip*_ tags to further save space while losslessly storing the binary data. Some of these stored data samples can be found at this folder.

In the table below, we summarize the data file sizes, encoding/saving times and loading/decoding times across different file formats. The best two in size (smaller is better) and in saving and loading times (shorter is better) are marked in bold text in this table. All benchmarks were run in MATLAB on an Ubuntu 16.04 Linux system with a Samsung EVO 970 NVMe drive.

File Formats          | Suffix  | File size | Saving time (s) | Loading time (s)
NIfTI-1 raw           | .nii    | 11,982 kB | 0.058 s†        | 0.156 s†
NIfTI-1+gzip          | .nii.gz | 2,809 kB  | 1.012 s†        | 0.265 s†
Text-JData (zlib)     | .jnii   | 3,492 kB  | 0.583 s*        | 0.136 s*/0.075 s‡
Text-JData (lzma)     | .jnii   | 2,608 kB  | 2.041 s*        | 0.199 s*/0.171 s‡
Binary-JData (no zip) | .bnii   | 11,982 kB | 0.146 s*        | 0.138 s*
Binary-JData (zlib)   | .bnii   | 2,583 kB  | 0.493 s*        | 0.090 s*
Binary-JData (lzma)   | .bnii   | 1,929 kB  | 2.128 s*        | 0.167 s*

3. How to use JData?

The simplest way to use JData in your project is to "drop-in" one of our light-weight JData-compatible libraries, according to your programming language, and use the provided saving/loading interfaces to export and load your data. The currently developed JData-optimized libraries include

As you can see, these toolboxes are very lightweight, with a minimal number of dependencies (where present, mostly lightweight and widely available). Therefore, you can add them to your project without much burden or complexity. All of the above libraries are open-source, with compact source code available if you are interested in understanding how the JData format is handled.

What if your programming language does not yet have a JData-compatible library? You can still read/write JData files easily by incorporating one of the hundreds of freely available JSON parsers or one of nearly 50 free UBJSON parsers. As we mentioned, JData extends JSON/UBJSON in the semantic layer and is syntactically 100% compatible with JSON (and nearly 100% compatible with UBJSON).

If your software already supports reading/writing JSON files, you are already able to load JData files without any additional work. The only extra work that the above JData-compatible libraries do is to recognize JData-defined keywords and reassemble them into native language data structures (i.e. JData-decoding), or to split complex native data into JSON-serializable constructs (i.e. JData-encoding) when saving. If your language does not have a JData encoder/decoder, your code can still fully access the JData-represented data using the JSON interfaces.
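
To make this concrete, the sketch below loads a JData-style document with Python's plain json module, reads a JData field directly, and then walks the parsed tree to locate objects that carry the JData array keywords. The document content and the names "scan", "label", "image" and `find_annotated_arrays` are hypothetical and only serve the illustration.

```python
import json

def find_annotated_arrays(node, path="$"):
    """Yield (path, type, size) for every object carrying JData array keywords."""
    if isinstance(node, dict):
        if "_ArrayType_" in node and "_ArraySize_" in node:
            yield path, node["_ArrayType_"], node["_ArraySize_"]
            return
        for name, child in node.items():
            yield from find_annotated_arrays(child, path + "." + name)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_annotated_arrays(child, "%s[%d]" % (path, index))

text = """
{
  "scan": {
    "label": "demo",
    "image": {"_ArrayType_": "uint8", "_ArraySize_": [2,4],
              "_ArrayData_": [1,2,3,4,5,6,7,8]}
  }
}
"""

doc = json.loads(text)                       # a plain JSON parser is enough
print(doc["scan"]["image"]["_ArrayData_"])   # JData fields are ordinary JSON values
found = list(find_annotated_arrays(doc))
print(found)
```

A JData-aware library would go one step further and replace each located object with a native typed array; without one, the data remain fully accessible as ordinary JSON values.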
