JData 101

Author: Qianqian Fang <q.fang at neu.edu>

Please believe me, JData is here to help! It is not a new format; it is very simple and easy to adopt. Read below to see why.

: 1. What exactly is JData?
: 2. How do you use it?

1. What exactly is JData?

JData is our solution to help data creators and users, especially scientific researchers who routinely deal with complex data, to easily store, read, understand, share, combine, process, exchange and integrate their data and enable new discoveries. They can save all the extra work of dealing with numerous diverse file formats and focus on answering the more important questions. Their hard-earned data files can also be shared with a large research community by making them FAIR (findable, accessible, interoperable and reusable), helping other researchers to create bigger studies and grander discoveries.

We have developed the JData framework to specifically address the above-mentioned challenges. This framework includes:

A set of data specifications that map complex data structures to a standardized "data annotation" that can unambiguously represent a large variety of data types encountered in the scientific world,
A standardized data serialization approach based on existing, widely supported data serialization formats (specifically, we picked JSON and binary JSON - yes, dual-interface!), so that the JData-annotated constructs can be efficiently parsed/stored/converted/transmitted without the extra work of writing brand-new parsers, and
A series of libraries and encoders/decoders that map complex data types from different programming environments to the common JData annotation format, and that export and import data between programming languages using a JSON/binary JSON file.

Simplicity and human-readability are among our requirements in the design of this framework because we understand that a data standard targeting a general audience has to be simple! The format has to be simple and intuitive; serializing, reading and writing the file has to be easy and widely supported, without much programming overhead; and future extension must be inherent - once a data file is created, it must be easy to modify without worrying about breaking the parsers, so it can stay useful for a long time. Human-readability is harder to achieve, because it typically conflicts with speed and data size. Our solution attempts to strike a balance between readability and efficiency: all "semantic components", such as data item names, strings, and data types, remain easy to read and understand, while the binary data payload can be stored compactly using compression or filters.

Let's start with a simple example. Assume we have a 2x4 integer array 'mydata', which one can define similarly in several programming languages as

 Python:     mydata=[[1,2,3,4], [5,6,7,8]]
 JavaScript: mydata=[[1,2,3,4], [5,6,7,8]]
 MATLAB:     mydata=[[1,2,3,4]; [5,6,7,8]]

In JData, we can represent this data array using the widely-supported, plain-text-based JSON format as

 {
    "mydata": [[1,2,3,4], [5,6,7,8]]
 }

First, this representation is quite intuitive and easy to understand; one can open the file in a text editor and understand the data without much trouble. Second, JSON encoders and decoders are freely available in almost all programming languages, and they are easy to program or use.
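
As a minimal sketch of how little code this takes, the following Python snippet writes the array above as plain JSON text and reads it back, using only the standard `json` module. The variable names `text` and `restored` are ours and are used only for illustration.

```python
import json

# The 2x4 integer array from the example above.
mydata = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Encode: wrap the array in a JSON object keyed by its name.
text = json.dumps({"mydata": mydata})

# Decode: nested JSON arrays come back as nested Python lists.
restored = json.loads(text)["mydata"]

print(text)
print(restored == mydata)
```

Note that the decoded values are generic JSON numbers; nothing in this plain form records which integer type the array was meant to have.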

Now, let's make this better - imagine that we want to store this array using an "unsigned byte", i.e., "uint8" format. Here the above JSON-based format starts to meet its limitations, because the JSON specification does not support strongly-typed binary data.

In JData, we introduce a new data representation, the annotated array format, to let JSON support strongly-typed binary data:

 {
    "mydata": {
        "_ArrayType_": "uint8",
        "_ArraySize_": [2,4],
        "_ArrayData_": [1,2,3,4,5,6,7,8]
     }
 }

The above annotated format does look slightly more complicated than the first form (we call it the "direct format"), but it remains easy to understand. More importantly, the annotated array remains fully syntactically compatible with the JSON format. In other words, this JSON data representation is now capable of encoding data types.
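
To illustrate, here is a minimal Python sketch that converts between a nested 2-D list and the annotated form shown above. The helper names `encode_annotated` and `decode_annotated` are hypothetical and are not part of any JData library. The sketch handles only 2-D arrays and assumes the flat `_ArrayData_` list is in row-major order, which is consistent with the example above.

```python
import json

def encode_annotated(rows, dtype):
    """Flatten a 2-D nested list (row-major) into an annotated-array dict."""
    return {
        "_ArrayType_": dtype,
        "_ArraySize_": [len(rows), len(rows[0])],
        "_ArrayData_": [value for row in rows for value in row],
    }

def decode_annotated(node):
    """Rebuild the 2-D nested list from an annotated-array dict."""
    nrow, ncol = node["_ArraySize_"]
    flat = node["_ArrayData_"]
    if len(flat) != nrow * ncol:
        raise ValueError("_ArrayData_ length does not match _ArraySize_")
    # Assumption: elements are stored row by row (row-major order).
    return [flat[r * ncol:(r + 1) * ncol] for r in range(nrow)]

# The annotated example above is still ordinary JSON text.
doc = json.loads("""
{
    "mydata": {
        "_ArrayType_": "uint8",
        "_ArraySize_": [2,4],
        "_ArrayData_": [1,2,3,4,5,6,7,8]
    }
}
""")

mydata = decode_annotated(doc["mydata"])
print(doc["mydata"]["_ArrayType_"], mydata)
```

A standard JSON parser does all of the parsing here; the type and shape information travels as ordinary name/value pairs.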

We can go one step further: JData extends JSON's capability to represent even binary data. For example, the binary data in the above array can be stored in an encoded string, and such binary data can even be compressed to save space:

 {
    "mydata": {
        "_ArrayType_": "uint8",
        "_ArraySize_": [2,4],
        "_ArrayZipType_": "zlib",
        "_ArrayZipSize_": [1,8],
        "_ArrayZipData_": "eJxjZGJmYWVj5wAAAIAAJQ=="
     }
 }

The above JSON construct stores the same 8-element uint8 array by first compressing the byte stream using the zlib-deflate algorithm (as indicated by "_ArrayZipType_") and then converting the result to ASCII text using base64 encoding. At this point, we have used a JSON-compliant annotation format to store strongly-typed binary data, which is not supported by the native JSON specification.
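
The two steps can be reversed with Python's standard `base64` and `zlib` modules. The sketch below is illustrative only: the helper names `decode_zipped_uint8` and `encode_zipped_uint8` are hypothetical, and it handles just the case shown above (a 2-D uint8 array with zlib compression, elements assumed to be in row-major order).

```python
import base64
import zlib

node = {
    "_ArrayType_": "uint8",
    "_ArraySize_": [2, 4],
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [1, 8],
    "_ArrayZipData_": "eJxjZGJmYWVj5wAAAIAAJQ==",
}

def decode_zipped_uint8(node):
    """Undo base64, then zlib, then reshape the bytes to _ArraySize_."""
    if node["_ArrayType_"] != "uint8" or node["_ArrayZipType_"] != "zlib":
        raise ValueError("this sketch only handles zlib-compressed uint8 data")
    raw = zlib.decompress(base64.b64decode(node["_ArrayZipData_"]))
    nrow, ncol = node["_ArraySize_"]
    if len(raw) != nrow * ncol:
        raise ValueError("decoded byte count does not match _ArraySize_")
    values = list(raw)  # each byte is one uint8 element
    return [values[r * ncol:(r + 1) * ncol] for r in range(nrow)]

def encode_zipped_uint8(rows):
    """Flatten, zlib-compress, then base64-encode a 2-D list of 0..255 ints."""
    flat = bytes(value for row in rows for value in row)
    return {
        "_ArrayType_": "uint8",
        "_ArraySize_": [len(rows), len(rows[0])],
        "_ArrayZipType_": "zlib",
        "_ArrayZipSize_": [1, len(flat)],
        "_ArrayZipData_": base64.b64encode(zlib.compress(flat)).decode("ascii"),
    }

mydata = decode_zipped_uint8(node)
print(mydata)
```

For an array this small the compressed string is longer than the raw data; compression only pays off for larger arrays.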
