JData 101

Author: Qianqian Fang <q.fang at neu.edu>

Please believe me, JData is here to help! It is not a new format, and it is very simple and easy to adopt. Read below to see why.

: 1. Why do we need JData?
: 2. What exactly is JData?
: 3. How do you use it?

1. Why do we need JData?

Whether or not you believe that we are living in the era of big data, we all have to deal with data and files from time to time, and we all know that there are numerous file formats that require specific software to open. In most cases, you can't just open a file in a text editor and make sense of it, because the files are mostly binary and not directly human-readable.

This is bad for scientists -- people who generate a lot of complicated data and process it to understand what's going on in our world -- because they have to deal with a growing number of data file types from increasingly complicated experiments. This is also bad for software developers -- people who write software to read/process/display certain type(s) of data and help you understand and process them -- because they have to constantly add libraries or parsers to deal with newly emerged formats and, if the formats are evolving, their future revisions. Keeping up is a lot of work and overhead, and it is costly too.

The result is that software becomes bulky, with a large number of dependencies, and increasingly difficult to maintain and improve. On the other hand, the numerous language-/application-/device-specific data formats, sometimes storing precious experimental data that cost a fortune to collect, are difficult to share, reuse and understand, because you need many dependent software tools to open and interpret them. Even worse, many existing data formats use rigid binary structures that are nearly impossible to extend (such as by adding or removing something) without breaking backward compatibility! Once a data format is revised, most data stored in the old format inevitably becomes outdated, because no developer is interested in maintaining a parser for an outdated format.

People have noticed this problem and have come up with a number of solutions, but each solution has met its own challenges. For example:

Extensible Markup Language (XML): a general-purpose data format that is designed to be easily extensible, human-readable and versatile, but it suffers from verbosity and is difficult to parse
JavaScript Object Notation (JSON): a lightweight, simple yet highly extensible, human-readable format for storing complex hierarchical data (probably the most widely supported data format today), but it suffers from 1) lack of strongly-typed data support, 2) large file sizes and parsing overhead compared to binary files
Binary JSON-like formats such as BSON, MessagePack, CBOR: these formats inherit the simplicity and versatility of JSON and provide better efficiency and smaller file sizes, but nearly all of them have lost human-readability
Hierarchical Data Format (HDF5): a sophisticated, highly versatile, extensible and high-performance binary format targeted at storing large amounts of scientific data, but to support HDF5 files you have to know how to program its APIs, and those have a learning curve. The files are also not human-readable, and you must have the libraries pre-installed to read them. The overhead of storing lightweight datasets can also be significant.
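
The two JSON limitations listed above can be seen with nothing but the Python standard library. The snippet below is only a minimal sketch: the record contents and the choice of 1000 double-precision values are arbitrary illustrations, not taken from any JData material.

```python
import json
import struct

# 1) Lack of strongly-typed data: a round trip through JSON silently
#    changes types that JSON cannot express.
record = {"shape": (2, 3), 1: "one"}
restored = json.loads(json.dumps(record))
print(restored)  # the tuple comes back as a list, the integer key as a string

# 2) File size: the same numbers stored as text versus packed binary doubles.
values = [i / 7 for i in range(1, 1001)]
text_size = len(json.dumps(values).encode("utf-8"))
binary_size = len(struct.pack("<1000d", *values))  # 8 bytes per value
print("JSON text bytes:", text_size, "| packed binary bytes:", binary_size)
```

The packed form is smaller, but it can no longer be read in a text editor, which is exactly the trade-off the binary JSON-like formats make.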

On top of this, most popular programming languages used to write data processing software have developed their own complex data types to efficiently process complicated inputs, but there is currently no common interface for exchanging such complex data types between software written in different languages. For example, the data types that data scientists or researchers often use include

Data types	Python	MATLAB	Perl	JavaScript	C++
flexible array	list()	cell()	array	array	-
associative array	dict()	struct()/containers.Map()	%associative array	object, Map()	std::map
packed-typed-arrays	numpy.ndarray	matrices	PDL::matrix	TypedArray, numjs.NdArray	std::array
tables	pandas.DataFrame	table	Data::Table	-	-
linked lists	not built-in	-	-	-	std::list
graph	-	graph()/digraph()	-	-	-
blob	bytes/bytearray	uint8/int8/char vector	File::BLOB	Blob()	void*
complex	complex/numpy.dtype=complex	complex	Math::Complex	numjs.ndarray(dtype='complex')	std::complex
sparse array	scipy.sparse	sparse	-	Map()	-
special matrices	scipy.sparse	sparse	-	-	-

"-" indicates the lack of built-in support and canonical 3rd-party library support
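
To illustrate the missing common interface, the sketch below tries to pass two of the Python types from the table (complex and blob) through plain JSON. The two encoder functions are hypothetical, ad hoc conventions invented here for illustration; they stand for what two independent programs might each choose on their own.

```python
import json

def try_json(value):
    """Return the JSON text for value, or the error message if it cannot be encoded."""
    try:
        return json.dumps(value)
    except TypeError as err:
        return f"TypeError: {err}"

# Plain JSON has no native representation for these language-level types.
print(try_json(1 + 2j))
print(try_json(b"\x00\x01\x02"))

# Two hypothetical ad hoc conventions for the same complex number.
def encode_complex_as_pair(z):   # convention picked by "program A"
    return [z.real, z.imag]

def encode_complex_as_dict(z):   # convention picked by "program B"
    return {"re": z.real, "im": z.imag}

text_a = json.dumps(encode_complex_as_pair(1 + 2j))
text_b = json.dumps(encode_complex_as_dict(1 + 2j))
print(text_a)
print(text_b)
```

Both outputs are valid JSON, yet neither program can read the other's file correctly without extra knowledge, and the pair form cannot even be told apart from an ordinary two-element list.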

To solve this problem, we need a data representation (and/or) storage format that is:

highly portable - it must be language-/platform-independent
highly extensible - you can add/remove/merge data without breaking existing parsers
highly versatile - to encapsulate complex hierarchical data and allow easy split, merge, and integration
extremely simple - that anyone without training can easily understand and adopt, and ideally/optionally
human-readable - human-readability is the ultimate criterion for long-lived, future-proof data, because the data can be easily understood without special tools

The above-mentioned general-purpose data formats - XML, JSON, binary JSON or HDF5 - all have potential, but all need improvements to address their limitations. In addition, we need standardization - the representation of a complex data type, even though there are numerous ways to do it, needs to have a clearly defined syntax/specification so that it can be unambiguously encoded, shared, and exchanged among software and programming languages.

That is what we are trying to achieve with JData. JData is a standardized, portable middleware-like data annotation protocol that allows one to store complex data types, ranging from simple numerical arrays and values to sparse, complex, hierarchical graph-like data, easily, efficiently, and unambiguously. Such data annotations can also be stored as compact, human-readable files and shared among different programs.
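
As a rough sketch of what "data annotation" can mean in practice, the snippet below wraps a small 2-D array in a JSON-compatible structure that records its element type and dimensions next to the data. The key names `_ArrayType_`, `_ArraySize_` and `_ArrayData_` are modeled on the array keywords of the JData specification and should be checked against it before use; the row-major ordering is likewise an assumption, and the helper functions `annotate_array` and `restore_array` are hypothetical names, not part of any JData library.

```python
import json

def annotate_array(rows, dtype):
    """Wrap a rectangular nested list in an annotated, JSON-compatible dict."""
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [value for row in rows for value in row]  # row-major order (assumed)
    return {
        "_ArrayType_": dtype,             # element type, e.g. "uint8"
        "_ArraySize_": [n_rows, n_cols],  # array dimensions
        "_ArrayData_": flat,              # serialized elements
    }

def restore_array(node):
    """Rebuild the nested list from an annotated dict produced above."""
    n_rows, n_cols = node["_ArraySize_"]
    data = node["_ArrayData_"]
    return [data[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

node = annotate_array([[1, 2, 3], [4, 5, 6]], "uint8")
text = json.dumps(node)  # still plain, human-readable JSON
print(text)
print(restore_array(json.loads(text)))
```

Because the annotation is itself ordinary JSON, any JSON parser in any language can read the file, and a program that understands the keywords can additionally recover the type and shape.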

2. What exactly is JData?

JData is our solution to help data creators and users, especially scientific researchers who routinely deal with complex data, to easily store, read, understand, share, combine, process, exchange and integrate their data and enable new discoveries. They can save all the extra work of dealing with numerous diverse file formats and focus on answering the more important questions. Their hard-won data files can also be shared with a large research community by making them FAIR - findable, accessible, interoperable and reusable - helping other researchers to create bigger studies and grander discoveries.
