Thursday, 9 October 2014

3 V's of Big Data

As a catch-all term, “big data” can be pretty nebulous, in the same way that the term “cloud” covers diverse technologies. Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data, the list goes on. Are these all really the same thing?

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They’re a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit them. Most probably you will contend with each of the Vs to one degree or another.

Volume

The benefit gained from the ability to process large amounts of information is the main attraction of big data analytics. Having more data beats out having better models: simple bits of math can be unreasonably effective given large amounts of data. If you could run that forecast taking into account 300 factors rather than 6, could you predict demand better?

This volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage, and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it.

Assuming that the volumes of data are larger than those conventional relational database infrastructures can cope with, processing options break down broadly into a choice between massively parallel processing architectures (data warehouses or databases such as Greenplum) and Apache Hadoop-based solutions. This choice is often informed by the degree to which one of the other “Vs”, variety, comes into play. Typically, data warehousing approaches involve predetermined schemas, suiting a regular and slowly evolving dataset. Apache Hadoop, on the other hand, places no conditions on the structure of the data it can process.

At its core, Hadoop is a platform for distributing computing problems across a number of servers. First developed and released as open source by Yahoo, it implements the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop’s MapReduce involves distributing a dataset among multiple servers and operating on the data: the “map” stage. The partial results are then recombined: the “reduce” stage.

To store data, Hadoop utilizes its own distributed filesystem, HDFS, which makes data available to multiple computing nodes. A typical Hadoop usage pattern involves three stages:

loading data into HDFS,
MapReduce operations, and
retrieving results from HDFS.

This process is by nature a batch operation, suited for analytical or non-interactive computing tasks. Because of this, Hadoop is not itself a database or data warehouse solution, but can act as an analytical adjunct to one.

One of the most well-known Hadoop users is Facebook, whose model follows this pattern. A MySQL database stores the core data. This is then reflected into Hadoop, where computations occur, such as creating recommendations for you based on your friends’ interests. Facebook then transfers the results back into MySQL, for use in pages served to users.

Velocity

The importance of data’s velocity — the increasing rate at which data flows into an organization — has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialized companies such as financial traders have long turned systems that cope with fast moving data to their advantage. Now it’s our turn.

Why is that so? The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider. Online retailers are able to compile large histories of customers’ every click and interaction: not just the final sales. Those who are able to quickly utilize that information, by recommending additional purchases, for instance, gain competitive advantage. The smartphone era increases again the rate of data inflow, as consumers carry with them a streaming source of geolocated imagery and audio data.

It’s not just the velocity of the incoming data that’s the issue: it’s possible to stream fast-moving data into bulk storage for later batch processing, for example. The importance lies in the speed of the feedback loop, taking data from input through to decision. A commercial from IBM makes the point that you wouldn’t cross the road if all you had was a five-minute old snapshot of traffic location. There are times when you simply won’t be able to wait for a report to run or a Hadoop job to complete.

Industry terminology for such fast-moving data tends to be either “streaming data” or “complex event processing.” The latter term was more established in product categories before stream processing gained widespread relevance, and seems likely to diminish in favor of streaming.

There are two main reasons to consider streaming processing. The first is when the input data are too fast to store in their entirety: in order to keep storage requirements practical some level of analysis must occur as the data streams in. At the extreme end of the scale, the Large Hadron Collider at CERN generates so much data that scientists must discard the overwhelming majority of it — hoping hard they’ve not thrown away anything useful. The second reason to consider streaming is where the application mandates immediate response to the data. Thanks to the rise of mobile applications and online gaming this is an increasingly common situation.
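To make the first reason concrete, here is a minimal sketch of analysing data as it streams in, without ever storing the full stream. The function name and the sample readings are made up for illustration; a real stream processor would do far more than keep a running mean.

```python
from typing import Iterable, Iterator, Tuple


def running_mean(readings: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after each reading without storing the stream."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        # Incremental update: only the count and the current mean are kept.
        mean += (value - mean) / count
        yield count, mean


# Hypothetical sensor readings standing in for a live feed.
sample_stream = iter([10.0, 20.0, 30.0, 40.0])
for seen, average in running_mean(sample_stream):
    print(seen, average)
```

However long the stream runs, the memory used stays constant, which is the property that matters when the input is too fast to keep in its entirety.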

Product categories for handling streaming data divide into established proprietary products such as IBM’s InfoSphere Streams, and the less-polished and still emergent open source frameworks originating in the web industry: Twitter’s Storm, and Yahoo S4.

As mentioned above, it’s not just about input data. The velocity of a system’s outputs can matter too. The tighter the feedback loop, the greater the competitive advantage. The results might go directly into a product, such as Facebook’s recommendations, or into dashboards used to drive decision-making.

It’s this need for speed, particularly on the web, that has driven the development of key-value stores and columnar databases, optimized for the fast retrieval of precomputed information. These databases form part of an umbrella category known as NoSQL, used when relational models aren’t the right fit.

Variety

Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in big data systems is that the source data is diverse, and doesn’t fall into neat relational structures. It could be text from social networks, image data, a raw feed directly from a sensor source. None of these things come ready for integration into an application.

Even on the web, where computer-to-computer communication ought to bring some guarantees, the reality of data is messy. Different browsers send different data, users withhold information, they may be using differing software versions or vendors to communicate with you. And you can bet that if part of the process involves a human, there will be error and inconsistency.

A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application. One such example is entity resolution, the process of determining exactly what a name refers to. Is this city London, England, or London, Texas? By the time your business logic gets to it, you don’t want to be guessing.

The process of moving from source data to processed application data involves the loss of information. When you tidy up, you end up throwing stuff away. This underlines a principle of big data: when you can, keep everything. There may well be useful signals in the bits you throw away. If you lose the source data, there’s no going back.

Despite the popularity and well understood nature of relational databases, it is not the case that they should always be the destination for data, even when tidied up. Certain data types suit certain classes of database better. For instance, documents encoded as XML are most versatile when stored in a dedicated XML store such as MarkLogic. Social network relations are graphs by nature, and graph databases such as Neo4J make operations on them simpler and more efficient.

Even where there’s not a radical data type mismatch, a disadvantage of the relational database is the static nature of its schemas. In an agile, exploratory environment, the results of computations will evolve with the detection and extraction of more signals. Semi-structured NoSQL databases meet this need for flexibility: they provide enough structure to organize data, but do not require the exact schema of the data before storing it.

Tuesday, 7 October 2014

Big Data

What is big data?

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

The hot IT buzzword of 2012, big data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variety of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them. To leading corporations, such as Walmart or Google, this power has been in reach for some time, but at fantastic cost. Today’s commodity hardware, cloud architectures and open source software bring big data processing into the reach of the less well-resourced. Big data processing is eminently feasible for even the small garage startups, who can cheaply rent server time in the cloud.

The value of big data to an organization falls into two categories: analytical use, and enabling new products. Big data analytics can reveal insights hidden previously by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers’ transactions, social and geographical data. Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports.

The past decade’s successful web startups are prime examples of big data used as an enabler of new products and services. For example, by combining a large number of signals from a user’s actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business. It’s no coincidence that the lion’s share of ideas and tools underpinning big data have emerged from Google, Yahoo, Amazon and Facebook.

The emergence of big data into the enterprise brings with it a necessary counterpart: agility. Successfully exploiting the value in big data requires experimentation and exploration. Whether creating new products or looking for ways to gain competitive advantage, the job calls for curiosity and an entrepreneurial outlook.

Sunday, 5 October 2014

Hadoop - A Map/Reduce Implementation

Hadoop is a framework for storing and processing large datasets across a cluster of machines, and for analysing that data to get useful results out of it.

1.1. The Magic of HDFS

The idea underpinning map/reduce--bringing compute to the data instead of the opposite--should sound like a very simple solution to the I/O bottleneck inherent in traditional parallelism. However, the devil is in the details, and implementing a framework where a single large file is transparently diced up and distributed across multiple physical computing elements (all while appearing to remain a single file to the user) is not trivial.

Hadoop, perhaps the most widely used map/reduce framework, accomplishes this feat using HDFS, the Hadoop Distributed File System. HDFS is fundamental to Hadoop because it provides the data chunking and distribution across compute elements necessary for map/reduce applications to be efficient. Since we're now talking about an actual map/reduce implementation and not an abstract concept, let's refer to the abstract compute elements now as compute nodes.

HDFS exists as a filesystem that you can copy files to and from in a manner not unlike any other filesystem. Many of the typical commands for manipulating files (ls, mkdir, rm, mv, cp, cat, tail, and chmod, to name a few) behave as you might expect in any other standard filesystem (e.g., Linux's ext4).

The magical part of HDFS is what is going on just underneath the surface. Although it appears to be a filesystem that contains files like any other, in reality those files are distributed across multiple physical compute nodes:

When you copy a file into HDFS as depicted above, that file is transparently sliced into 64 MB "chunks" and replicated three times for reliability. Each of these chunks is distributed to various compute nodes in the Hadoop cluster so that a given 64 MB chunk exists on three independent nodes. Although physically chunked up and distributed in triplicate, all of your interactions with the file on HDFS still make it appear as the same single file you copied into HDFS initially. Thus, HDFS handles all of the burden of slicing, distributing, and recombining your data for you.
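The arithmetic behind this chunking can be sketched in a few lines of Python. The function name and the 1000 MB example file are hypothetical; the 64 MB block size and three-way replication are the figures quoted above.

```python
import math

BLOCK_SIZE_MB = 64   # block (chunk) size quoted above
REPLICATION = 3      # replication factor quoted above


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block_replicas, raw_storage_mb) for a single file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    # The final block only holds the leftover data, so raw usage is
    # the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


print(hdfs_footprint(1000))  # (16, 48, 3000)
```

So a 1000 MB file becomes 16 chunks, stored as 48 chunk replicas spread across the cluster, and takes up roughly 3000 MB of raw disk.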

HDFS's chunk size and replication

The 64 MB chunk (block) size and the choice to replicate your data three times are only HDFS's default values. These decisions can be changed:

The 64 MB block size can be modified by changing the dfs.block.size property in hdfs-site.xml. It is common to increase this to 128 MB in production environments.
The replication factor can be modified by changing the dfs.replication property in hdfs-site.xml. It can also be changed on a per-file basis by specifying -D dfs.replication=1 on your -put command line, or using the hadoop dfs -setrep -w 1 command.

1.2. Map/Reduce Jobs

HDFS is an interesting technology in that it provides data distribution, replication, and automatic recovery in a user-space filesystem that is relatively easy to configure and, conceptually, easy to understand. However, its true utility comes to light when map/reduce jobs are executed on data stored in HDFS.

As the name implies, map/reduce jobs are principally comprised of two steps: the map step and the reduce step. The overall workflow generally looks something like this:

Program flow of a map/reduce application

The left half of the diagram depicts the HDFS magic described in the previous section, where the hadoop dfs -copyFromLocal command is used to move a large data file into HDFS and it is automatically replicated and distributed across multiple physical compute nodes. While this step of moving data into HDFS is not strictly a part of a map/reduce job (i.e., your dataset may already have a permanent home on HDFS just like it would any other filesystem), a map/reduce job's input data must already exist on HDFS before the job can be started.

1.2.1. The Map Step

Once a map/reduce job is initiated, the map step

Launches a number of parallel mappers across the compute nodes that contain chunks of your input data
For each chunk, a mapper then "splits" the data into individual lines of text on newline characters (\n)
Each split (line of text that was terminated by \n) is given to your mapper function
Your mapper function is expected to turn each line into zero or more key-value pairs and then "emit" these key-value pairs for the subsequent reduce step

That is, the map step's job is to transform your raw input data into a series of key-value pairs with the expectation that these parsed key-value pairs can be analyzed meaningfully by the reduce step. It's perfectly fine for duplicate keys to be emitted by mappers.
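As a rough illustration of the map step, here is a word-count style mapper in plain Python. It is a stand-alone sketch rather than code wired into Hadoop: the function name and the sample chunk are made up, and the splitting on newlines is done by hand to mimic the default behavior described above.

```python
from typing import Iterator, Tuple


def word_count_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) pair per word; an empty line emits nothing."""
    for word in line.split():
        yield word.lower(), 1


# Hypothetical chunk of input text, split on newline characters.
chunk = "the quick brown fox\n\nthe lazy dog\n"
pairs = [pair for line in chunk.split("\n") for pair in word_count_mapper(line)]
print(pairs)
```

Note that the key "the" is emitted twice and the empty line produces no pairs at all, which matches the "zero or more key-value pairs" rule.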

Input splitting
The decision to split your input data along newline characters is just the default behavior, which assumes your input data is just an ASCII text file. You can change how input data is split before being passed to your mapper function using alternate `InputFormat`s.

1.2.2. The Reduce Step

Once all of the mappers have finished digesting the input data and have emitted all of their key-value pairs, those key-value pairs are sorted according to their keys and then passed on to the reducers. The reducers are given key-value pairs in such a way that all key-value pairs sharing the same key always go to the same reducer. The corollary is then that if one particular reducer has one specific key, it is guaranteed to have all other key-value pairs sharing that same key, and all those common keys will be in a continuous strip of key-value pairs that reducer received.

Your job's reducer function then does some sort of calculation based on all of the values that share a common key. For example, the reducer might calculate the sum of all values for each key (e.g., the word count example). The reducers then emit key-value pairs back to HDFS where each key is unique, and each of these unique keys' values are the result of the reducer function's calculation.
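A summing reducer for the word count case might look like the sketch below. It is plain Python, not Hadoop code: the function name and sample pairs are hypothetical, and the call to sorted() stands in for the sorting the framework does before the reducer sees any data.

```python
from itertools import groupby
from operator import itemgetter


def sum_reducer(sorted_pairs):
    """Yield one (key, total) pair per distinct key.

    Assumes the pairs arrive sorted by key, so equal keys are adjacent.
    """
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


# Hypothetical mapper output; sorted() stands in for the framework's sort.
mapped = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
reduced = list(sum_reducer(sorted(mapped)))
print(reduced)  # [('dog', 1), ('fox', 1), ('the', 2)]
```

Each key appears exactly once in the output, and its value is the result of the calculation over all values that shared that key.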

The Sort and Shuffle

The process of sorting and distributing the mapper's output to the reducers can be seen as a separate step often called the "shuffle". What really happens is that as mappers emit key-value pairs, the keys are passed through the Partitioner to determine which reducer they are sent to.

The default Partitioner is a function which hashes the key and then takes the modulus of this hash and the number of reducers to determine which reducer gets that key-value pair. Since the hash of a given key will always be the same, all key-value pairs sharing the same key will get the same output value from the Partitioner and therefore wind up on the same reducer.

Once all key-value pairs are assigned to their reducers, the reducers all sort their keys so that a single loop over all of a reducer's keys will examine all the values of a single key before moving on to the next key. As you will see in my tutorial on writing mappers and reducers in Python, this is an essential feature of the Hadoop streaming interface.
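The hash-and-modulus idea can be sketched in Python as follows. This is an illustration of the technique, not Hadoop's own Partitioner: the function name, the choice of three reducers and the use of crc32 as the hash are all assumptions made for the example.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Return the index of the reducer that should receive this key."""
    # crc32 is a stand-in hash chosen because it is stable across runs;
    # Python's built-in hash() for strings is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Hypothetical mapper output and three hypothetical reducers.
pairs = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
buckets = {index: [] for index in range(3)}
for key, value in pairs:
    buckets[partition(key, 3)].append((key, value))

# Each reducer sorts its own pairs so equal keys sit next to each other.
for index in buckets:
    buckets[index].sort()
print(buckets)
```

Whichever bucket "the" lands in, both of its pairs land there together, and after the sort they are adjacent.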

Tuesday, 30 September 2014

Introduction To MapReduce

MapReduce (M/R) is a technique for dividing work across a distributed system. This takes advantage of the parallel processing power of distributed systems, and also reduces network bandwidth as the algorithm is passed around to where the data lives, rather than a potentially huge dataset transferred to a client algorithm. Developers can use MapReduce for things like filtering documents by tags, counting words in documents, and extracting links to related data.

In Riak, MapReduce is one method for non-key-based querying. MapReduce jobs can be submitted through the HTTP API or the Protocol Buffers API. Also, note that Riak MapReduce is intended for batch processing, not real-time querying.
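The general idea of sending the algorithm to the data can be sketched in plain Python. This is not Riak's API: the node names, the documents and both function names are invented for the example, and the "cluster" is just a dictionary.

```python
from functools import reduce

# Hypothetical cluster: each node holds its own documents locally.
nodes = {
    "node-a": [{"id": 1, "tags": ["db", "nosql"]}, {"id": 2, "tags": ["web"]}],
    "node-b": [{"id": 3, "tags": ["nosql"]}, {"id": 4, "tags": ["db"]}],
}


def map_matching_ids(documents, tag):
    """Map phase, run where the documents live: keep only matching ids."""
    return [doc["id"] for doc in documents if tag in doc["tags"]]


def merge(left, right):
    """Reduce phase: combine the small per-node results."""
    return left + right


partials = [map_matching_ids(docs, "nosql") for docs in nodes.values()]
matching = sorted(reduce(merge, partials, []))
print(matching)  # [1, 3]
```

Only the short lists of matching ids would travel over the network, rather than every document.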

Comparing Map/Reduce to Traditional Parallelism

In order to appreciate what map/reduce brings to the table, I think it is most meaningful to contrast it to what I call traditional computing problems. I define "traditional" computing problems as those which use libraries like MPI, OpenMP, CUDA, or pthreads to produce results by utilizing multiple CPUs to perform some sort of numerical calculation concurrently. Problems that are well suited to being solved with these traditional methods typically share two common features:

They are CPU-bound: the part of the problem that takes the most time is doing calculations involving floating point or integer arithmetic
Input data is gigabyte-scale: the data that is necessary to describe the conditions of the calculation are typically less than a hundred gigabytes, and very often only a few hundred megabytes at most

Item #1 may seem trivial; after all, computers are meant to compute, so wouldn't all of the problems that need to be parallelized be fundamentally limited by how quickly the computer can do numerical calculations?

Traditionally, the answer to this question has been yes, but the technological landscape has been rapidly changing over the last decade. Sources of vast, unending data (e.g., social media, inexpensive genome sequencing) have converged with inexpensive, high-capacity hard drives and the advanced filesystems to support them, and now data-intensive computing problems are emerging. In contrast to the aforementioned traditional computing problems, data-intensive problems demonstrate the following features:

Input data is far beyond gigabyte-scale: datasets are commonly on the order of tens, hundreds, or thousands of terabytes
They are I/O-bound: it takes longer for the computer to get data from its permanent location to the CPU than it takes for the CPU to operate on that data

Traditional Parallel Applications:
To illustrate these differences, the following schematic depicts how your typical traditionally parallel application works.

The input data is stored on some sort of remote storage device (a SAN, a file server serving files over NFS, a parallel Lustre or GPFS filesystem, etc; grey cylinders). The compute resources or elements (blue boxes) are abstract units that can represent MPI ranks, compute nodes, or threads on a shared-memory system.

Upon launching a traditionally parallel application,

A master parallel worker (MPI rank, thread, etc) reads the input data from disk (green arrow).
NOTE: In some cases multiple ranks may use a parallel I/O API like MPI-IO to collectively read input data, but the filesystem on which the input data resides must be a high-performance filesystem that can sustain the required device- and network-read bandwidth.

The master worker then divides up the input data into chunks and sends parts to each of the other workers (red arrows).

All of the parallel workers compute their chunk of the input data

All of the parallel workers communicate their results with each other, then continue the next iteration of the calculation

Data-intensive Applications:
The map/reduce paradigm is a completely different way of solving a certain subset of parallelizable problems that gets around the bottleneck of ingesting input data from disk (that pesky green arrow). Whereas traditional parallelism brings the data to the compute, map/reduce does the opposite--it brings the compute to the data:

In map/reduce, the input data is not stored on a separate, high-capacity storage system. Rather, the data exists in little pieces and is permanently stored on the compute elements. This allows our parallel procedure to follow these steps:

We don't have to move any data since it is pre-divided and already exists on nodes capable of acting as computing elements

All of the parallel worker functions are sent to the nodes where their respective pieces of the input data already exist and do their calculations

All of the parallel workers communicate their results with each other, move data if necessary, then continue the next step of the calculation

Thus, the only time data needs to be moved is when all of the parallel workers are communicating their results with each other in step #3. There is no more serial step where data is being loaded from a storage device before being distributed to the computing resources because the data already exists on the computing resources.
Of course, for the compute elements to be able to do their calculations on these chunks of input data, the calculations and data must be all completely independent from the input data on other compute elements. This is the principal constraint in map/reduce jobs: map/reduce is ideally suited for trivially parallel calculations on large quantities of data, but if each worker's calculations depend on data that resides on other nodes, you will begin to encounter rapidly diminishing returns.

Friday, 22 August 2014

Introduction with JSON

What is JSON?

JSON, or JavaScript Object Notation, is a minimal, readable format for structuring data. It is used primarily to transmit data between a server and web application, as an alternative to XML.

Example

Append ?format=json-pretty to the URL of any page on your Squarespace site and you'll see a deluge of JSON data. Here is a small sample of what that might look like:

{
  "collection" : {
    "title" : "Blog",
    "description" : "This is a description of my blog.",
    "categories" : [ "Category-1", "Category-2" ]
  }
}

Keys and Values

The two primary parts that make up JSON are keys and values. Together they make a key/value pair.

Key: A key is always a string enclosed in quotation marks.
Value: A value can be a string, number, boolean, null, array, or object.
Key/Value Pair: A key/value pair follows a specific syntax, with the key followed by a colon followed by the value. Key/value pairs are comma separated.

Let's take one line from the JSON sample above and identify each part of the code.

"title" : "Blog"

This example is a key/value pair. The key is "title" and the value is "Blog".

Types of Values

Array: An ordered list of values.
Boolean: true or false.
Number: An integer or a decimal (floating point) number.
Object: An associative array of key/value pairs.
String: A sequence of plain text characters enclosed in double quotation marks.

Numbers, booleans and strings are self-evident, so we'll skip over those sections. Arrays and Objects are explained in more depth below.
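If you want to see these types in practice, here is a small sketch using Python's standard json module to parse one value of each type. The keys and values are made up for the example.

```python
import json

# Hypothetical JSON text containing one value of each type.
text = """
{
  "title": "Blog",
  "posts": 12,
  "rating": 4.5,
  "public": true,
  "subtitle": null,
  "categories": ["Category-1", "Category-2"],
  "owner": {"name": "Robin"}
}
"""

data = json.loads(text)
for key, value in data.items():
    # Show which Python type each JSON value was parsed into.
    print(key, type(value).__name__)
```

Strings become str, numbers become int or float, booleans become bool, null becomes None, arrays become lists and objects become dictionaries.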

Arrays

Almost every blog has categories and tags. In this example we've added a categories key, but the value might look unfamiliar. Since each post in a blog can have more than one category, an array of multiple strings is returned.

"collection" : {
  "title" : "Blog",
  "categories" : [ "Category-1", "Category-2" ]
}

Objects

An object is indicated by curly brackets. Everything inside of the curly brackets is part of the object. We already learned a value can be an object. So that means "collection" and the corresponding object are a key/value pair.

"collection" : {
  "title" : "Blog"
}

The key/value pair "title" : "Blog" is nested inside the key/value pair "collection" : { ... }. That's an example of a hierarchy in JSON data.
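To walk that hierarchy in code, you follow one key at a time. The sketch below parses the sample from the top of this post with Python's standard json module; the variable names are only for illustration.

```python
import json

sample = """
{
  "collection" : {
    "title" : "Blog",
    "description" : "This is a description of my blog.",
    "categories" : [ "Category-1", "Category-2" ]
  }
}
"""

data = json.loads(sample)
collection = data["collection"]      # the nested object
print(collection["title"])           # Blog
print(collection["categories"][0])   # Category-1
```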

Wednesday, 20 August 2014

Installing MongoDB on Linux

Download

MongoDB installation packages are available for both 32-bit and 64-bit Linux. You can download the one that matches your system and install it.

Here is the link to download the installation packages of MongoDb : http://www.mongodb.org/downloads

Unzip

After downloading the archive, extract it to the folder where you want to install MongoDB.

Create a data directory

MongoDB stores data in a db folder within a data folder (/data/db by default). Since this data folder is not created automatically, you have to create it manually. Remember that the data directory should be created in the root (/).

Run the MongoDB server from the command line

To run the MongoDB server from the command line, execute the mongod binary in the bin folder of your MongoDB installation.

Getting started with administrative shell

To start the administrative shell, enter the bin directory of your MongoDB installation and run the mongo binary. The default administrative shell of MongoDB is a JavaScript shell. When you connect to MongoDB immediately after installation, it connects to the test database.

Since it is a JavaScript shell, you can run simple arithmetic operations.

The db command shows the database you are currently using; show dbs lists all databases.

We will insert a simple record and retrieve the data now.

The first command inserts a document whose z field is 8 into the w3r collection (roughly the equivalent of a table).

MongoDb web interface

You can access a web interface of MongoDB at a port number 1000 higher than the port on which the MongoDB server is running.

If MongoDB is running at the default port 27017, then you can access the web interface at port 28017.

- See more at: http://www.w3resource.com/mongodb/installation-Linux.php

Monday, 18 August 2014

Categories of NoSQL

There are four general types (the most common categories) of NoSQL databases. Each of these categories has its own specific attributes and limitations. No single solution is better than all the others, but some databases are better suited to specific problems. To clarify the NoSQL landscape, let's discuss the most common categories:

Key-value stores
Column-oriented
Graph
Document oriented

Key-value stores

Key-value stores are the most basic type of NoSQL database.
Designed to handle huge amounts of data.
Many are based on the design in Amazon’s Dynamo paper.
Key-value stores allow developers to store schema-less data.
In key-value storage, the database stores data as a hash table where each key is unique and the value can be a string, JSON, a BLOB (binary large object), etc.
In some stores, such as Redis, the values stored against keys can be strings, hashes, lists, sets or sorted sets.
For example, a key-value pair might consist of a key like "Name" that is associated with a value like "Robin".
Key-value stores can be used as collections, dictionaries, associative arrays, etc.
Many key-value stores favor the 'Availability' and 'Partition tolerance' aspects of the CAP theorem.
Key-value stores work well for shopping cart contents, or individual values like color schemes, a landing page URI, or a default account number.

Examples of key-value store databases: Redis, Dynamo, Riak, etc.
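To give a feel for the model, here is a toy in-memory key-value store in Python. It is only a sketch of the idea and not how Redis, Dynamo or Riak work internally; the class name, method names and the cart:42 key are all hypothetical.

```python
import json
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: unique keys, opaque values."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value  # overwrites any previous value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
store.put("Name", b"Robin")
# The store does not look inside values, so a shopping cart can be kept as JSON.
store.put("cart:42", json.dumps({"items": ["sku-1", "sku-2"]}).encode("utf-8"))

print(store.get("Name"))
cart = json.loads(store.get("cart:42"))
print(cart["items"])
```

The store never interprets the value, which is what makes the data schema-less: it is up to the application to know that cart:42 holds JSON.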

Column-oriented databases

Column-oriented databases primarily work on columns, and every column is treated individually.
Values of a single column are stored contiguously.
Column stores keep data in column-specific files.
In column stores, query processors work on columns too.
All data within each column data file has the same type, which makes it ideal for compression.
Column stores can improve query performance because a query reads only the columns it needs.
High performance on aggregation queries (e.g. COUNT, SUM, AVG, MIN, MAX).
Well suited to data warehouses and business intelligence, customer relationship management (CRM), library card catalogs, etc.

Example of Column-oriented databases : BigTable, Cassandra, SimpleDB etc.
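The difference between row and column layout can be sketched with ordinary Python lists. The sales records below are invented for the example, and real column stores add compression and on-disk formats that this sketch ignores.

```python
# Hypothetical sales records in row-oriented form.
rows = [
    {"order_id": 1, "region": "north", "amount": 120},
    {"order_id": 2, "region": "south", "amount": 80},
    {"order_id": 3, "region": "north", "amount": 200},
]

# Column-oriented form: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregation only has to touch the single column it needs.
total = sum(columns["amount"])
average = total / len(columns["amount"])
print(columns["amount"])
print(total, average)
```

A SUM or AVG over amount reads one list and never touches order_id or region, which is why aggregation queries perform well on this layout.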

Graph databases

A graph data structure consists of a finite (and possibly mutable) set of ordered pairs, called edges or arcs, of certain entities called nodes or vertices.

Following picture presents a labeled graph of 6 vertices and 7 edges.

What is a Graph Database?

A graph database stores data in a graph.
It is capable of elegantly representing any kind of data in a highly accessible way.
A graph database is a collection of nodes and edges.
Each node represents an entity (such as a student or business) and each edge represents a connection or relationship between two nodes.
Every node and edge is defined by a unique identifier.
Each node knows its adjacent nodes.
As the number of nodes increases, the cost of a local step (or hop) remains the same.
Indexes are used for lookups.

Here is a comparison between the classic relational model and the graph model :

Relational model	Graph model
Tables	Vertices and Edges set
Rows	Vertices
Columns	Key/value pairs
Joins	Edges

Examples of graph databases: OrientDB, Neo4J, Titan, etc.
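Here is a toy version of the graph model in Python, with nodes that carry key/value pairs and labeled edges between them. It is a sketch of the concept only and not the API of OrientDB, Neo4J or Titan; the class, the FRIEND label and the people are hypothetical.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: nodes with key/value pairs, labeled directed edges."""

    def __init__(self):
        self.properties = {}                # node id -> key/value pairs
        self.adjacency = defaultdict(list)  # node id -> [(label, target id)]

    def add_node(self, node_id, **props):
        self.properties[node_id] = props

    def add_edge(self, source, label, target):
        self.adjacency[source].append((label, target))

    def neighbours(self, node_id, label):
        # A local step only looks at this node's own edge list.
        return [target for edge_label, target in self.adjacency[node_id]
                if edge_label == label]


g = Graph()
for name in ("alice", "bob", "carol"):
    g.add_node(name, kind="student")
g.add_edge("alice", "FRIEND", "bob")
g.add_edge("bob", "FRIEND", "carol")

friends_of_friends = [fof
                      for friend in g.neighbours("alice", "FRIEND")
                      for fof in g.neighbours(friend, "FRIEND")]
print(friends_of_friends)  # ['carol']
```

Where a relational database would need a join for every hop, here each hop is a lookup in the node's own edge list, so its cost does not grow with the total number of nodes.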

Document Oriented databases

A collection of documents
Data in this model is stored inside documents.
A document is a key value collection where the key allows access to its value.
Documents are not typically forced to have a schema and therefore are flexible and easy to change.
Documents are stored into collections in order to group different kinds of data.
Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.

Here is a comparison between the classic relational model and the document model :

Relational model	Document model
Tables	Collections
Rows	Documents
Columns	Key/value pairs
Joins	not available

Example of Document Oriented databases : MongoDB, CouchDB etc.
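The document model can be imitated with a list of Python dictionaries. This is a sketch of the idea rather than the MongoDB or CouchDB API; the find helper and the sample documents are made up for the example.

```python
# Hypothetical collection: documents do not have to share a schema.
collection = [
    {"_id": 1, "name": "Robin", "tags": ["admin", "editor"]},
    {"_id": 2, "name": "Sam", "address": {"city": "London"}},
]


def find(documents, **criteria):
    """Return the documents whose top-level keys match every criterion."""
    return [doc for doc in documents
            if all(doc.get(key) == value for key, value in criteria.items())]


print(find(collection, name="Sam"))
# Nested documents and arrays are reached by following keys.
print(collection[1]["address"]["city"])  # London
print(collection[0]["tags"][0])          # admin
```

The first document has a tags array and no address, the second has a nested address document and no tags, and both sit in the same collection without any schema change.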

- See more at: http://www.w3resource.com/mongodb/nosql.php