Manik Surtani edited this page May 18, 2012 · 3 revisions

Glossary

This page attempts to define some of the terms used in this specification.

Data Grid

Gartner defines data grids, or in-memory data grids (IMDGs), as follows:

IMDGs implement the notion of a "distributed, in-memory virtual data store" (typically called the data grid, but at times called the "cache" or "space" for historical reasons) by clustering the central memory (RAM) of multiple computers over a network. This allows applications to deal with very large (up to multiple petabytes in size, in some user experiences) in-memory data stores, and leverage fast and scalable access to data. IMDGs provide the mechanisms and APIs that presents to applications the memory of the clustered computers as a uniform, integrated data store. Applications don't need to know in which computer's RAM a given data object is stored to retrieve it. The IMDG runtime retrieves the required object across the data grid in a location-transparent way, while managing such issues as security, data integrity, availability and recovery, in case of system crashes.

The IMDG concept has been explored in academia for almost three decades as an example in the context of the research about the "tuple/space" computing model carried out in the 1980s at Yale University (David Gelernter's Project Linda). IMDG products have been in the market since the early 2000s, and thus have reached a notable degree of technical maturity and adoption rate.

Cache

A temporary in-memory store of data that offers high-performance, thread-safe access. JSR 107 (JCACHE, the Java Temporary Caching API) covers this concept in more detail.
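As a rough illustration of this definition, the sketch below shows a thread-safe in-memory store whose entries expire after a fixed time. The class name `SimpleCache` and the `ttlMillis` parameter are hypothetical and chosen for this example only; this is not the JSR 107 API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration of a temporary, thread-safe in-memory store.
// This is NOT the JSR 107 API; all names here are assumptions.
final class SimpleCache<K, V> {

    // A cached value together with the wall-clock time at which it expires.
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    // ConcurrentHashMap provides thread-safe access without external locking.
    private final ConcurrentMap<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final long ttlMillis;

    SimpleCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    void put(K key, V value) {
        store.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    // Returns the cached value, or null if it is absent or has expired.
    V get(K key) {
        Entry<V> entry = store.get(key);
        if (entry == null) {
            return null;
        }
        if (System.currentTimeMillis() >= entry.expiresAtMillis) {
            // Remove only if the mapping is still this exact entry.
            store.remove(key, entry);
            return null;
        }
        return entry.value;
    }
}
```

Entries are temporary in the sense that a reader must always be prepared for `get` to return null and fall back to the authoritative data source.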

Distributed Cache

Data grids are often used as distributed, cluster-aware in-memory caches, usually placed in front of a slower, more expensive data store such as a relational database. Standalone caches are unsuitable when the application tier is clustered: an update made through one node is not seen by the caches on the other nodes, which could then serve stale data. JSR 107 also covers distributed caches to some degree.
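The stale-data problem can be reproduced in a few lines. The sketch below is a hypothetical model, not taken from any product: `AppNode` stands for one application server with its own standalone cache, and a shared `Map` stands in for the slower data store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical application node with its own standalone (non-clustered) cache.
final class AppNode {
    private final Map<String, String> localCache = new ConcurrentHashMap<>();
    private final Map<String, String> database; // stand-in for the slow, shared store

    AppNode(Map<String, String> database) {
        this.database = database;
    }

    // Read path: consult the local cache first, fall back to the database.
    String read(String key) {
        return localCache.computeIfAbsent(key, database::get);
    }

    // Write path: only THIS node's cache learns about the new value.
    void write(String key, String value) {
        database.put(key, value);
        localCache.put(key, value);
    }
}
```

If node B reads a key and node A later writes it, B keeps returning the old value from its local cache even though the database has moved on. A cluster-aware cache would propagate or invalidate the entry on B.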

Cluster

A set of servers connected via a network, usually a LAN.

Distribution

The concept of a cluster spreading data across its various constituent nodes in a manner transparent to any client attempting to locate or use such data.
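One simple way to picture distribution is a function from a key to the node that owns it, so that clients never name a node themselves. The sketch below uses plain modulo hashing and a hypothetical `KeyLocator` class. It is an assumption made for illustration, not a description of how any particular data grid works; real grids often use consistent hashing instead, because modulo hashing reassigns most keys whenever the number of nodes changes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical key-to-node mapping using modulo hashing (illustration only).
final class KeyLocator {
    private final List<String> nodes;

    KeyLocator(List<String> nodes) {
        // Defensive copy so later changes to the caller's list have no effect.
        this.nodes = new ArrayList<>(nodes);
    }

    // Every caller computes the same owner for the same key and node list,
    // which is what makes the data's location transparent to clients.
    String ownerOf(Object key) {
        int index = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(index);
    }
}
```

`Math.floorMod` is used rather than `%` so that keys with negative hash codes still map to a valid index.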

Node

A member of a cluster. A node may be a separate physical machine, a separate virtual machine on the same physical host, or a separate JVM on the same physical or virtual machine. Typically, each node has its own network address, such as an IP address and port, through which other nodes can connect to it.

Transaction

An atomic unit of work. Transactions may be JTA and XA compliant.
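The all-or-nothing property can be sketched without any transaction manager. The example below is a simplified, hypothetical model (`BufferedTransaction` is not a JTA or XA type): writes are buffered and reach the shared store only on commit, and a rollback discards them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical all-or-nothing unit of work; NOT a JTA/XA implementation.
final class BufferedTransaction {
    private final Map<String, String> store;
    private final Map<String, String> pending = new HashMap<>();

    BufferedTransaction(Map<String, String> store) {
        this.store = store;
    }

    // Writes are buffered and invisible to the store until commit.
    void put(String key, String value) {
        pending.put(key, value);
    }

    // Applies every buffered write in one step. The lock makes the batch
    // atomic only with respect to other code that synchronizes on the store.
    void commit() {
        synchronized (store) {
            store.putAll(pending);
        }
        pending.clear();
    }

    // Discards every buffered write; the store is left untouched.
    void rollback() {
        pending.clear();
    }
}
```

A real transaction manager additionally coordinates multiple resources and survives failures partway through a commit, which this sketch does not attempt.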

Colocation

The practice of ensuring that data entries used together in the same transaction are stored on the same nodes in a cluster.
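One common way to achieve colocation is to choose the owning node from a shared group identifier rather than from the full key. The sketch below assumes modulo hashing and uses the hypothetical names `GroupedKey` and `ColocatingLocator`; actual data grids expose their own mechanisms for this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical key that carries a group, e.g. the customer an order belongs to.
final class GroupedKey {
    final String group;
    final String id;

    GroupedKey(String group, String id) {
        this.group = group;
        this.id = id;
    }
}

// Hypothetical locator that places all keys of one group on the same node.
final class ColocatingLocator {
    private final List<String> nodes;

    ColocatingLocator(List<String> nodes) {
        this.nodes = new ArrayList<>(nodes);
    }

    // Only the group is hashed, so the id has no influence on placement.
    String ownerOf(GroupedKey key) {
        int index = Math.floorMod(key.group.hashCode(), nodes.size());
        return nodes.get(index);
    }
}
```

With this scheme, a transaction that touches several entries of the same group involves a single owner node instead of several.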

Map/Reduce

Based on Google's seminal 2004 paper, Map/Reduce breaks a computation over the entire data set into tasks that run on each node, then aggregates their results. It is a divide-and-conquer technique for dealing with large data sets.
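The two phases can be illustrated with the classic word-count example. In the sketch below, each call to `mapLocal` stands for a task running on one node against only that node's data, and `reduce` aggregates the partial results. The class and method names are hypothetical and the code runs in a single JVM, so it shows the shape of the computation rather than a distributed implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-JVM sketch of the map and reduce phases.
final class WordCount {

    // Map phase: runs once per node, over that node's local lines only.
    static Map<String, Integer> mapLocal(List<String> localLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : localLines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Reduce phase: sums the per-node partial counts into one result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }
}
```

Because each map task reads only local data, just the small partial results need to cross the network, not the data set itself.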

Eventual Consistency

Based on Eric Brewer's CAP theorem, which outlines desirable characteristics of distributed systems, Eventual Consistency is the result of attempting to provide high availability even during network partitions: replicas may temporarily disagree, but converge once updates stop and the replicas can communicate again. See Coda Hale's excellent blog post on the subject.
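As a minimal sketch of how replicas can accept writes independently and still converge, the example below uses a last-writer-wins rule. This is one reconciliation strategy among several and is an assumption of this example, as are the `LwwRegister` name and the caller-supplied timestamps; it is not prescribed by the CAP theorem.

```java
// Hypothetical last-writer-wins register; one replica of a single value.
final class LwwRegister {
    private String value;                     // null until the first write
    private long timestamp = Long.MIN_VALUE;

    // Accepts the write only if it is newer than what this replica holds.
    // Equal timestamps are broken by comparing values, so every replica
    // resolves the same conflict in the same way.
    void write(String newValue, long newTimestamp) {
        boolean newer = newTimestamp > timestamp;
        boolean tieWon = newTimestamp == timestamp
                && (value == null || newValue.compareTo(value) > 0);
        if (newer || tieWon) {
            value = newValue;
            timestamp = newTimestamp;
        }
    }

    // Called once replicas can communicate again after a partition.
    void mergeFrom(LwwRegister other) {
        if (other.value != null) {
            write(other.value, other.timestamp);
        }
    }

    String read() {
        return value;
    }
}
```

During a partition the two replicas may return different values; after each has merged from the other, both return the value with the highest timestamp. The price of this rule is that the losing write is silently discarded.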