10gen Mongo For Developers - Schema Design - Week 3 Notes

Here are my notes for week 3 of the 10gen MongoDB course: Schema Design.

You can find week 2's notes here.

MongoDB doesn't support joins directly in the kernel: joins are done in the application itself through code, which is tedious. Mongo can pre-join, or embed, data in documents. There are no constraints in Mongo, but embedding makes constraints unnecessary. Atomic operations are supported within a document, but not across documents. Mongo does not have a declared schema, even though documents within a collection will tend to have a similar, if not identical, structure.

A goal of normalization in the relational database is avoiding bias toward any particular access pattern: the schema should be application-agnostic. Mongo takes the opposite approach: the schema is designed around how the application accesses its data.

Although embedding data in documents sounds like it would cause modification anomalies, there are ways to embed that avoid them. These are the ways that should be pursued in schema design.

If you notice that your Mongo schema looks pretty much identical to how you would have designed it in a relational database, you know you screwed up. You cannot do joins in Mongo, so you are now performing those joins in code, which is tedious, and you have to load all the data into memory instead of selectively accessing it. It is better to embed in a manner that does not cause data anomalies.

Data consistency in a relational database is enabled through foreign key constraints. There are no constraints in Mongo, so there is no guarantee of data integrity. Doctor E.F. Codd would not be happy. If you are going to perform a join through code in the application, you must check to see if the join is valid. The solution to the lack of constraints in Mongo is embedding.
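
To make the "check to see if the join is valid" point concrete, here is a minimal sketch in plain JavaScript. The arrays stand in for two collections; the sample data and the function name joinPeopleToCities are my own assumptions for illustration, not something from the course.

```javascript
// In-memory stand-ins for two collections (hypothetical sample data).
const cities = [{ _id: 'SJC', name: 'San Jose' }];
const people = [
  { _id: 'person1', name: 'Matthew', city_id: 'SJC' },
  { _id: 'person2', name: 'Orphan', city_id: 'XXX' } // dangling reference
];

// Join people to cities in application code and collect the references
// that point at no existing city (hypothetical helper).
function joinPeopleToCities(people, cities) {
  const cityById = new Map(cities.map(c => [c._id, c]));
  const joined = [];
  const dangling = [];
  for (const p of people) {
    const city = cityById.get(p.city_id);
    if (city === undefined) {
      dangling.push(p._id); // no foreign key constraint caught this for us
    } else {
      joined.push({ person: p.name, city: city.name });
    }
  }
  return { joined, dangling };
}

const result = joinPeopleToCities(people, cities);
console.log(result.joined);
console.log(result.dangling);
```

With a foreign key constraint, the second person could never have been inserted; here the application has to notice the bad reference itself.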

MongoDB lacks transaction support. However, Mongo has atomic operations within a single document: when you work on a single document, other users will see either all the changes you made to the document, or none at all. In a relational database with joins across multiple tables, transactions are necessary to lock access to the rows being modified across tables. But because Mongo embeds and doesn't join across collections, the supported atomic operations theoretically give you the same thing.

There are three approaches to overcome the lack of transaction support within Mongo: restructure, implement in software, or tolerate. Restructure means to implement 100% embedding in the schema. Implementing in software means to not embed fully but use joins in software and establish locks via application code. To tolerate means that if the application allows it, simply tolerate the inconsistency.

1:1 Relationships in Mongo

A 1:1 entity relationship where entity A is related to entity B can be modeled in Mongo by either the relational analogy or through embedding. The relational analogy consists of either entity A having a field that points to entity B, or entity B having a field that points to entity A. Embedding consists of either entity B being embedded in entity A, or entity A being embedded in entity B. How you choose to model it depends on how you access the data and how frequently you access each piece of data.

Frequency of Access: If you frequently access entity A but not B, and B is a rather large entity, you may want to keep them separate (not embedding), because you don't want to pull B into memory every time you pull A into memory. It is good in this scenario to keep them in separate collections to reduce the working set size of the application. This however will force you to make joins in application code, which is tedious.

Size of Items (growing or not growing): "Every time you add something to a document, there is a point beyond which the document will need to be moved in the collection." If you rarely update entity A but update entity B a lot, you may want to keep them separate. Also, if a document of entity B has BLOBs or historical information that make it and its corresponding document of entity A larger than 16 MB, you definitely want to keep them separate, but this shouldn't occur too often.

Atomicity of data: If you know you can't withstand any inconsistency, you will probably want to embed so that both entities can be updated atomically; otherwise, you will have to implement locking procedures in application code.

I must say, coming from Oracle/the relational model, this all makes me uncomfortable, for now.

1:M Relationships in Mongo

You will want to use the relational analogy if there is a relationship where entity A has a large number of Entity Bs. For example, San Jose has a million people. In this case, it is best to make two collections, City and People, and have a linking field in people relating back to city, and enforce the join in application code.

db.city.insert   ( { _id : 'SJC' , att1 : 'val1' , att2 : 'val2' } ) ;
db.person.insert ( { _id : 'person1' , name : 'Matthew' , city_id : 'SJC' } ) ;

In situations where entity A has a small number of entity Bs, you will want to embed rather than link. For example, a blog post has a small number of comments. Using one collection called Post with an embedded array of comment documents is better.

db.post.insert ( { _id : 1 , title : '10gen Mongo For Developers Week 3 Notes'
                 , comments : [{ name : 'Chandler' , text : 'My blog is better' }
                              ,{ name : 'Anonymous' , text : 'This is uninformative' }
                              ]
                 } ) ;

In summary, it is recommended to represent a one to many relationship in multiple collections when the "many is large." Otherwise, if the "many is actually few" it is better to embed the many into the one.

M:N Relationships in Mongo

In a many to many relationship where there is "actually few to few," it is still best to link two collections together. Only one of the collections needs to have a field pointing to the other, depending on the preferred access pattern: if A has a M:N relationship with B, and it is better to quickly traverse from A to B, place a B array-field (pointing to B id's) in A; if it is better to get to A through B, place an A array-field (pointing to A id's) in B.

// Link A with B by placing an A array-field in B (pointing to A id's)
db.A.insert ( { _id : 'A01', att1 : 'val1' } ) ;
db.B.insert ( { _id : 'B01', att1 : 'val1'
              , As : ['A01']
              } ) ;

It is also OK to link in both directions, which can be done for performance reasons, but it is not recommended: if the application doesn't keep the two arrays tied together well, the data becomes inconsistent.

// Link A with B by placing an A array-field in B and a B array-field in A
db.A.insert ( { _id : 'A01', att1 : 'val1'
              , Bs : ['B01']
              } ) ;
db.B.insert ( { _id : 'B01', att1 : 'val1'
              , As : ['A01']
              } ) ;

Another option, instead of linking A to B with an array-field of id's, is embedding, at the risk of duplicating data: place an array of B documents in A. This invites major update anomalies, but it is good for performance.

// Link A with B by placing an array of B documents in A
db.A.insert ( { _id : 'A01', att1 : 'val1'
              , Bs : [{ att1 : 'val' }
                     ,{ att1 : 'val' }]
              } ) ;

Multikey Indexes in Mongo

Linking and embedding work well in MongoDB because Mongo supports multikey indexes.

When you index an array-field, you get a multikey index: Mongo indexes every value in the array, for all the documents in that collection that use that array-field.
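
To picture what "indexes every value in the array" means, here is a rough plain-JavaScript sketch that builds a lookup from each array value to the ids of the documents containing it. This only illustrates the idea; it is not how Mongo implements its indexes, and the sample documents and the name buildMultikeyIndex are my own assumptions.

```javascript
// Hypothetical documents from collection A, each with an array-field Bs.
const docsA = [
  { _id: 'A01', Bs: ['B01', 'B02'] },
  { _id: 'A02', Bs: ['B02', 'B03'] }
];

// Create one index entry per array element: value -> ids of documents
// whose array contains that value (hypothetical helper).
function buildMultikeyIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    for (const value of doc[field]) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc._id);
    }
  }
  return index;
}

const bsIndex = buildMultikeyIndex(docsA, 'Bs');
// Looking up one B id no longer requires scanning every A document.
console.log(bsIndex.get('B02'));
```

The point of the sketch is that a single document with three array values produces three index entries, which is why a lookup by array value can avoid a full scan.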

Let's back up to M:N relationships. Ideally, either entity A or B will have an array-field pointing back to the other's IDs. For example, entity A might have a field like { Bs : ['B01', 'B02', 'B03'] }. There are two obvious queries for this hypothetical: find all Bs that belong to a particular A, and find all As that belong to a particular B.

// Find all Bs that belong to a particular A: this is fast  
db.A.find( { _id : 'A01' }, { Bs : true } );

// Find all As that belong to a particular B
// This is slow unless there is a multikey index
db.A.find ( { Bs : 'B01' } ) ;

The following syntax establishes a multikey index using the ensureIndex() method (named createIndex() in current MongoDB versions):

db.collectionName.ensureIndex( { 'arrayFieldName' : 1 } ) ;

To see if a query used an index, use the explain() method:

db.A.find ( { Bs : { $all : ['B01' , 'B02'] } } ).explain() ;

db.collectionName.find ( { 'arrayFieldName' : { '$all' : [1, 3] } } ) ;

Benefits of Embedding in Mongo

The main benefit of embedding data is improved read performance: one round trip to the DB. Computer systems have spinning disks, which have very high latency: they take a long time (over 1ms) to get to the first byte, but each additional byte comes pretty quickly (high bandwidth). The theory is that if you can co-locate data by embedding, read performance will increase. The only caveat is that if the document gets moved often because of what you embedded, your writes can slow down.

Trees in Mongo

I didn't understand this too well and will need to read more about it. It appears that rather than having a parent_id field in the document as you would in a relational database, it is best to use an "ancestor array", which lists the parent, grand-parent, great-grand-parent, in order, in an array. It appears to me that this would create data duplication and update anomalies.
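
Here is my attempt at a minimal sketch of the ancestor array idea in plain JavaScript, with an in-memory array standing in for a collection. The category names and the helper functions are hypothetical; the ancestors are stored parent first, as described above.

```javascript
// Hypothetical category tree: each document lists its ancestors,
// parent first, then grand-parent, and so on up to the root.
const categories = [
  { _id: 'root', ancestors: [] },
  { _id: 'books', ancestors: ['root'] },
  { _id: 'fiction', ancestors: ['books', 'root'] },
  { _id: 'scifi', ancestors: ['fiction', 'books', 'root'] },
  { _id: 'music', ancestors: ['root'] }
];

// All descendants of a node: every document whose ancestor array
// contains that node's id. No recursion is needed.
function descendantsOf(docs, id) {
  return docs.filter(d => d.ancestors.includes(id)).map(d => d._id);
}

// All ancestors of a node: a single lookup of one document.
function ancestorsOf(docs, id) {
  const doc = docs.find(d => d._id === id);
  return doc ? doc.ancestors : [];
}

console.log(descendantsOf(categories, 'books'));
console.log(ancestorsOf(categories, 'scifi'));
```

Both questions are answered in a single pass with no recursive lookups, which seems to be the appeal over a parent_id field. The cost matches my worry above: moving a subtree means rewriting the ancestor array of every descendant.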

When to Denormalize in Mongo

One of the purposes of normalization in the relational database is to avoid modification anomalies that come with the duplication of data. Embedding in Mongo is analogous to denormalizing, but it does not have to introduce anomalies. As long as data isn't duplicated, there will be no problems with modification anomalies. With 1:1 relationships, embedding will not produce duplicate data. With 1:M relationships, embedding the many inside of the one will not duplicate data. M:N relationships will not duplicate if you link. If you decide to embed with M:N, or to embed the one inside the many of a 1:M relationship because you think it better matches the access patterns, you must enforce constraints in application code.
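
As a rough illustration of what enforcing constraints in application code could involve when an M:N relationship is embedded, here is a plain-JavaScript sketch. The students/teachers data and both helper functions are hypothetical examples of mine, not from the course.

```javascript
// Hypothetical M:N data: two students each embed their own copy of the
// same teacher, so the teacher's name is stored twice.
const students = [
  { _id: 'S1', teachers: [{ _id: 'T1', name: 'Smith' }] },
  { _id: 'S2', teachers: [{ _id: 'T1', name: 'Smith' }] }
];

// Rename a teacher in every embedded copy; returns how many copies changed.
function renameTeacherEverywhere(students, teacherId, newName) {
  let updated = 0;
  for (const s of students) {
    for (const t of s.teachers) {
      if (t._id === teacherId) {
        t.name = newName;
        updated++;
      }
    }
  }
  return updated;
}

// Report the teacher ids whose embedded copies disagree on the name.
function findInconsistentTeachers(students) {
  const namesById = new Map();
  for (const s of students) {
    for (const t of s.teachers) {
      if (!namesById.has(t._id)) namesById.set(t._id, new Set());
      namesById.get(t._id).add(t.name);
    }
  }
  return [...namesById].filter(([, names]) => names.size > 1).map(([id]) => id);
}

// A partial update touches only one copy: this is the modification anomaly.
students[0].teachers[0].name = 'Smyth';
const before = findInconsistentTeachers(students);

// The application-level fix is to update every copy together.
const touched = renameTeacherEverywhere(students, 'T1', 'Smyth');
const after = findInconsistentTeachers(students);
console.log(before, touched, after);
```

With linking, the teacher's name would live in exactly one document and neither helper would be needed.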

BLOBs in Mongo

Mongo has a facility called GridFS that will break up a BLOB into chunks and store those chunks in a collection, and metadata about the chunks in a separate collection.
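
The chunking idea can be sketched in plain JavaScript (Node.js) as below. This is only an illustration of splitting and reassembling, not the GridFS implementation: the field names are modeled on GridFS's files and chunks documents, while the tiny chunk size, the sample data, and the helper functions are assumptions made for the demo.

```javascript
// Deliberately tiny chunk size so the demo produces several chunks
// (an assumption for illustration only).
const CHUNK_SIZE = 4;

// Split a binary payload into numbered chunk documents plus one
// metadata document describing the whole file (hypothetical helper).
function toChunks(fileId, filename, data, chunkSize) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < data.length; offset += chunkSize, n++) {
    chunks.push({
      files_id: fileId, // points back at the metadata document
      n: n,             // position of this chunk within the file
      data: data.subarray(offset, offset + chunkSize)
    });
  }
  const file = { _id: fileId, filename: filename, length: data.length, chunkSize: chunkSize };
  return { file, chunks };
}

// Rebuild the original payload by concatenating the chunks in order of n.
function reassemble(chunks) {
  const ordered = [...chunks].sort((a, b) => a.n - b.n);
  return Buffer.concat(ordered.map(c => c.data));
}

const blob = Buffer.from('hello gridfs');
const stored = toChunks('file1', 'greeting.txt', blob, CHUNK_SIZE);
console.log(stored.file);
console.log(stored.chunks.length);
console.log(reassemble(stored.chunks).toString());
```

Because each chunk is its own small document, no single document has to hold the whole BLOB, which is how this sidesteps the 16 MB document limit.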