NoSQL Databases
 
These are databases that are organized
neither around tables nor around objects as
their primary data structures.
 
Originally NO-SQL meant:
They do not use SQL as the method to access 
data.
 
But some NO-SQL databases later added an SQL layer on top of the
NO-SQL database.  So NO-SQL was quietly reinterpreted as
 
NOT ONLY SQL.
 
 
Recently several of those have become 
popular.
 
Why?
 
The two main problems with relational 
databases are:
 
1) They are often inefficient when many
big joins have to be performed.  And due
to normalization, joins ALWAYS
have to be performed.
 
(Remember this was one of the reasons 
for the OO model too!)
 
2) Much of modern data goes beyond the 
simple data values that are stored in 
tables. People want to store images,
videos, sound files, and whole documents.
 
This lecture is based on:
1) The book "Seven Databases in Seven Weeks"
by Eric Redmond and Jim R. Wilson.
Pragmatic Bookshelf, 2012.
 
2) Numerous Wikipedia pages.
 
3) Home pages of the different systems.
 
This website lists many more NoSQL databases:
 
http://nosql-database.org/
 
Four major kinds:
 
1) Key-Value Store
 
We studied this already in JSON.
You send in the key, you get back the value.
 
Examples: Redis, Riak.
 
 
2) Columnar Databases
 
On disk, all data values of one column 
are stored together.
 
(In a normal relational database data
ROWS are stored together.)
 
Examples: HBase, Cassandra
 
 
3) Document Databases
 
An extension of the Key-Value model.
Very flexible.
 
Examples: MongoDB, CouchDB
 
 
4) Graph Databases
 
Designed for storing "node and link" structures.
 
Example: Neo4j
 
 
 
What is a Key-Value store?
 
When you insert data, you provide pairs of data items.
When you query you provide the first element of a 
pair and expect to get the second element back.
 
{"firstname" : "John", "lastname" : "Smith", "city" : "Newark"}
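The idea can be sketched in Python, with an ordinary dictionary standing in
for the store. The function names put and get are chosen for illustration
only; they are not the API of any particular system.

```python
# A plain dictionary stands in for the key-value store (illustration only).
store = {}

def put(key, value):
    """Insert the pair (key, value), overwriting any earlier value."""
    store[key] = value

def get(key):
    """Return the value stored under key, or None if the key is absent."""
    return store.get(key)

put("firstname", "John")
put("lastname", "Smith")
put("city", "Newark")

print(get("city"))      # Newark
print(get("zipcode"))   # None, because this key was never inserted
```

Note that the only way to find data is by its key. There is no query
like "find all keys whose value is Newark" without scanning everything.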
 
 
Quick Introduction to Six NoSQL Databases
 
MongoDB
-------
 
MongoDB is a document database.  Its name comes from
huMONGOus DataBase.
 
(The name "document" is misleading though.)
 
Mongo is a database of JSON documents.
 
A Mongo document is like a relational table row,
without a schema. The values can be nested to
any depth.
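A small sketch of what "no schema" and "nested" mean, using only Python's
standard json module rather than a MongoDB client. The field names and
values below are invented for illustration.

```python
import json

# Two hypothetical documents in the same collection.
# They do not share the same fields, and values may be nested.
person = {
    "firstname": "John",
    "lastname": "Smith",
    "address": {"city": "Newark", "state": "NJ"},
    "phones": [{"type": "home", "number": "555-0100"}],
}
other = {"firstname": "Jane", "hobbies": ["chess", "hiking"]}

collection = [person, other]

text = json.dumps(collection)   # serialize to a JSON string
restored = json.loads(text)     # parse it back into Python objects

print(restored[0]["address"]["city"])     # Newark
print(restored[0]["phones"][0]["type"])   # home
print("address" in restored[1])           # False
```

In a relational table, the second document would need NULLs for every
column it does not use, and the nested address and phone list would
normally go into separate tables.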
 
MongoDB has been adopted as backend software 
by a number of major websites and services,
including Craigslist, eBay, Foursquare, 
SourceForge, and The New York Times.
 
Unfortunately, at this point MongoDB is "mostly" NOT free
anymore.
 
 
 
 
Riak
----
 
Based on early work of Amazon. Written in the 
programming language Erlang.  Erlang was designed
by Ericsson (the phone company).  
 
Erlang = Ericsson Language (Erlangen is also a city in
Germany)
 
Erlang supports "hot swapping" which means the program can
be changed without stopping it and restarting it.
 
Riak is a Key-Value store that is fault-tolerant by being
replicated on several (typically 3) "nodes" (computers).
 
Riak databases are accessed over the web, with a URL.
The main operations are 
 
POST (that means create)
PUT (update)
GET (read back)
DELETE (delete)
 
(People call these operations generically
CRUD: create, read, update, delete.
So the above shows you "how you say CRUD
in Riak.")
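The mapping from HTTP verbs to CRUD can be sketched with a small in-memory
stand-in. This is NOT Riak's real interface: the function handle, the key
"john" and the choice of return codes are assumptions made for this
example (the codes themselves are standard HTTP status codes).

```python
# In-memory stand-in for a store addressed by HTTP-style verbs.
data = {}

def handle(method, key, body=None):
    """Apply one HTTP-style operation and return (status, value)."""
    if method == "POST":               # create
        data[key] = body
        return 201, None               # 201 Created
    if method == "PUT":                # update
        data[key] = body
        return 204, None               # 204 No Content
    if method == "GET":                # read back
        if key in data:
            return 200, data[key]      # 200 OK
        return 404, None               # 404 Not Found
    if method == "DELETE":             # delete
        if key in data:
            del data[key]
            return 204, None
        return 404, None
    return 405, None                   # 405 Method Not Allowed

print(handle("POST", "john", {"city": "Newark"}))    # (201, None)
print(handle("GET", "john"))                          # (200, {'city': 'Newark'})
print(handle("PUT", "john", {"city": "Hoboken"}))    # (204, None)
print(handle("DELETE", "john"))                       # (204, None)
print(handle("GET", "john"))                          # (404, None)
```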
 
Access is possible from the languages 
Ruby, Java, Erlang, Python, PHP, and C/C++
 
 
Side Comment:
 
Riak supports Mapreduce. (Or MapReduce).
 
What is Mapreduce?
 
"Mapreduce" is a framework for processing 
parallelizable problems across huge datasets
using a large number of computers (nodes), 
collectively referred to as a cluster.
 
The following is a VERY SIMPLIFIED idea. Details
will follow in a separate lecture.
 
"Map" step: The master node takes the input, 
divides it into smaller sub-problems, and
distributes them to worker nodes. 
The worker node processes the smaller problem, 
and passes the answer back to its master node.
 
"Reduce" step: The master node then collects the 
answers to all the sub-problems and
combines them in some way to form the output 
(answer) to the problem it was originally trying 
to solve.
 
Example: Find the largest number of a million 
numbers.  You have 11 nodes (processors). 
One Master Node and 10 worker nodes.
 
Map: Send 100,000 numbers to each node.
So every number sits on one of the 10 worker nodes.
 
Each worker node now finds the largest number of
its 100,000 numbers and sends it back to the master
node.
 
Reduce:
The master node now has 10 numbers and finds
what the largest of them is.
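The example above can be sketched in Python. This is a simulation on one
machine: the ten "worker nodes" are just ten list slices processed one
after the other, and the function names map_step, worker and reduce_step
are made up for this sketch. A real MapReduce framework would run the
workers in parallel on different computers.

```python
import random

def map_step(numbers, workers):
    """Master: split the input into chunks, one per worker node."""
    size = (len(numbers) + workers - 1) // workers   # round up
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]

def worker(chunk):
    """Worker node: solve the small problem (largest number in its chunk)."""
    return max(chunk)

def reduce_step(partial_answers):
    """Master: combine the workers' answers into the final answer."""
    return max(partial_answers)

random.seed(0)
numbers = [random.randint(0, 10**9) for _ in range(1_000_000)]

chunks = map_step(numbers, 10)
partial_answers = [worker(chunk) for chunk in chunks]   # sequential here
result = reduce_step(partial_answers)

print(len(chunks), len(chunks[0]))   # 10 100000
print(result == max(numbers))        # True
```

This works because "largest" can be combined: the largest of the ten
partial answers is the largest of all numbers. Not every problem splits
this easily.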
 
 
HBase
-----
 
A columnar database. It stores whole columns
together. Written in Java. Distributed by the Apache
Software Foundation.
 
ID         Last     First     Bonus
--------------------------------------
1          Doe      John      8000
2          Smith    Jane      4000
3          Beck     Sam       1000
 
In a row-oriented database management system, 
the data would be stored like this:  
1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;   
 
In a column-oriented database management system, 
the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000; 
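The two layouts can be reproduced with a few lines of Python. This is only
a sketch of the ordering of values, not of how any real system writes its
files; the function names row_layout and column_layout are invented here.

```python
rows = [
    (1, "Doe", "John", 8000),
    (2, "Smith", "Jane", 4000),
    (3, "Beck", "Sam", 1000),
]

def row_layout(rows):
    """Store the values row by row."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)

def column_layout(rows):
    """Store the values column by column (zip(*rows) transposes the table)."""
    columns = zip(*rows)
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)

print(row_layout(rows))
# 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
print(column_layout(rows))
# 1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;

# An aggregate over one column touches only that column's values.
bonus_column = list(zip(*rows))[3]
print(sum(bonus_column))   # 13000
```

In the column layout, the three bonus values sit next to each other, so a
query like "total of all bonuses" does not have to read names and IDs.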
 
Basically, HBase is a two level system of key-value pairs.

An HBase table consists of rows, row keys, column families,
columns and values.
 
A key identifies a row. Within each column family there
are several columns. A column identifies a value within
a column family. 
 
See the figures at this web site:
 
http://chase-seibert.github.io/blog/2013/04/26/hbase-schema-design.html
 
Quoted from the page above, explaining rows:
 
...each row is basically a linked list, ordered by column family and then column
name. This is how it is laid down on disk, as well. Missing columns are free,
because there is no space on disk pre-allocated to a null column. Given that,
it is reasonable to design a schema where rows have hundreds or thousands of
columns.
 
...  row keys can be any collection of characters. Ordering of row keys is
alphabetical. This is in contrast to most RDBMS, where
rowkeys are integers and ordered as such.
 
http://www.informit.com/articles/article.aspx?p=2253412
 
 
This is really good for systems with lots of NULL values!
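The "two level" idea can be sketched with nested Python dictionaries: the
outer key is the row key, the inner key is "family:column". This is a toy
model, not the HBase client API; the table contents, the family names
"name" and "pay", and the function get_cell are invented for illustration.

```python
# Level 1: row key -> row.  Level 2: "family:column" -> value.
table = {
    "row2": {"name:first": "Jane", "name:last": "Smith", "pay:bonus": "4000"},
    "row10": {"name:first": "Sam", "name:last": "Beck"},   # no bonus stored
    "row1": {"name:first": "John", "name:last": "Doe", "pay:bonus": "8000"},
}

def get_cell(table, row_key, family, column):
    """Look up one value; a missing cell is simply absent, not a stored NULL."""
    return table.get(row_key, {}).get(family + ":" + column)

print(get_cell(table, "row1", "pay", "bonus"))    # 8000
print(get_cell(table, "row10", "pay", "bonus"))   # None

# Row keys are ordered as strings, not as numbers.
print(sorted(table))   # ['row1', 'row10', 'row2']
```

The row "row10" takes no space for a bonus at all, which is why sparse
data (many NULLs) fits this model well. Note also that "row10" sorts
before "row2", because the comparison is character by character.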
 
HBase is based on Hadoop.
 
Now that is a lecture by itself.
And it is a hot topic.
 
Here is the Wikipedia definition of Hadoop (minimally edited):
................
Apache Hadoop is an open-source software framework for 
distributed storage and distributed processing of 
Big Data on clusters of commodity hardware. 
 
Its Hadoop Distributed File System (HDFS) splits 
files into large blocks (default 64MB or 128MB) 
and distributes the blocks amongst the 
nodes (= computers!) in the cluster. 
 
For processing the data, the Hadoop MapReduce ships code 
(specifically Jar files) to the nodes that have the required data, 
and the nodes then process the data in parallel. 
 
This approach takes advantage of data locality, in contrast to
conventional HPC architecture which usually relies on a parallel 
file system (computing and data separated, but connected with 
high-speed networking).
................
 
In simple words: Hadoop implements MapReduce.
 
 
 
CouchDB
-------
 
Also written in Erlang. Can run on any equipment from an Android
phone to a data center. 
 
Name stands for Cluster Of Unreliable Commodity Hardware.
(A "commodity" is something that is cheap and easy to get.
Like potatoes.)
 
Like MongoDB, CouchDB stores JSON objects.
 
Very fault tolerant.
 
Also an Apache Software Foundation project.
 
Also allows MapReduce.
Queried from JavaScript.
 
Another term you will hear a lot:
 
REST = REpresentational State Transfer.
 
REST is a simple stateless architecture that generally
runs over HTTP.  Stateless means the server does not
remember each client.  
 
Simple HTTP is used to make calls between machines.
 
It is usually introduced as a simpler alternative to
SOAP (Simple Object Access Protocol).
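A sketch of what "stateless" means in practice: every request carries
everything the server needs (method, resource URL, payload), so the server
does not have to remember earlier calls. The example uses Python's
standard urllib.request module and only builds the request objects; it
does not send them. The base URL and the helper build_request are
hypothetical.

```python
import json
import urllib.request

# Hypothetical server address, used only to build example URLs.
BASE_URL = "http://db.example.com/store"

def build_request(method, key, value=None):
    """Build one self-contained HTTP request for the resource named key."""
    body = None if value is None else json.dumps(value).encode("utf-8")
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(
        BASE_URL + "/" + key, data=body, headers=headers, method=method
    )

req = build_request("PUT", "john", {"city": "Newark"})
print(req.get_method())   # PUT
print(req.full_url)       # http://db.example.com/store/john
print(req.data)           # b'{"city": "Newark"}'

req = build_request("GET", "john")
print(req.get_method(), req.data)   # GET None
```

To actually send a request, one would pass it to urllib.request.urlopen,
which requires a real server at that address.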
 
 
 
Neo4j
-----
 
Neo4j is an open-source graph database, implemented in Java. 
Neo4j is a "disk-based, Java persistence engine
that stores data structured in graphs rather than in tables". 
Neo4j is the most popular graph database.
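A "node and link" structure can be sketched in plain Python. This is not
Neo4j's API or query language; the names in the graph, the link labels
and the functions neighbors and reachable are invented for illustration.

```python
from collections import deque

# Each link is (source node, label, target node).
links = [
    ("Alice", "KNOWS", "Bob"),
    ("Bob", "KNOWS", "Carol"),
    ("Carol", "KNOWS", "Dave"),
    ("Alice", "LIKES", "Databases"),
]

def neighbors(node, label):
    """Nodes directly linked from node by a link with the given label."""
    return [dst for src, lab, dst in links if src == node and lab == label]

def reachable(start, label):
    """All nodes reachable from start by following such links (breadth first)."""
    seen, queue = [], deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current, label):
            if nxt != start and nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen

print(neighbors("Alice", "KNOWS"))   # ['Bob']
print(reachable("Alice", "KNOWS"))   # ['Bob', 'Carol', 'Dave']
```

In a relational database, every additional "hop" along the links would
typically be another join. A graph database is built to follow links
directly.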
 
 
Redis
-----
 
Redis is an open-source, in-memory, key-value data store.
It is written in ANSI C.
 
Redis is accessible through almost any programming language.
The official way of saying this is:
Many languages have Redis bindings.
 
OK, you asked for it:
C, C++, C#, Clojure, Common Lisp, Dart, Erlang, Go, Haskell, 
Haxe, Io, Java, JavaScript (Node.js), Lua, Objective-C, 
Perl, PHP, Pure Data, Python, R, Ruby, Scala, Smalltalk
and Tcl
 
 
Redis supports, besides plain strings:
Lists of strings
Sets of strings (non-repeating and unordered)
Sorted sets of strings, ordered by a score number
Field-value maps (called hashes)
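To get a feeling for these data types, here are rough Python analogies.
This sketch uses only Python built-ins and does not talk to Redis; the
variable names and values are invented, and the real Redis types have
their own commands and behavior.

```python
# List of strings: ordered, duplicates allowed.
tasks = ["wash", "dry"]
tasks.append("fold")
print(tasks)                      # ['wash', 'dry', 'fold']

# Set of strings: unordered, no duplicates.
tags = {"nosql", "redis"}
tags.add("redis")                 # adding a duplicate changes nothing
print(len(tags))                  # 2

# Sorted set: every member has a score, members are ordered by score.
scores = {"alice": 120, "bob": 95, "carol": 150}
ranking = sorted(scores, key=scores.get)
print(ranking)                    # ['bob', 'alice', 'carol']

# Hash: a small map of fields to values stored under one key.
user = {"firstname": "John", "city": "Newark"}
print(user["city"])               # Newark
```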
 
Redis typically holds the whole dataset in memory.
But there are two persistence mechanisms: periodic snapshots
of the dataset and an append-only log of write commands.