
Notes on the Cassandra Data Model

08/17/2011

Cassandra’s data model differs enough from that of a traditional relational database that it often causes confusion.

Eric Evans of Rackspace explains the Cassandra data model using Twitter as an example in his article Cassandra By Example.

In a traditional database, a Twitter data model might have a table for users, tables for followers and followees, and a table for tweets. Combining data from several of these tables would mean creating indices on the appropriate attributes so that joins stay efficient.

To translate such a model to Cassandra we can look at Twissandra, a Twitter clone. Cassandra’s data model has a number of noteworthy characteristics, which follow after the jump.

  • Cassandra is a schema-less data store.
  • Its keyspaces are the uppermost namespaces and there is typically one for each application.
  • Each keyspace has one or more column families, which serve to associate records of a similar kind.
  • Cassandra has record-level atomicity (i.e. a write to a single record is applied in its entirety or not at all).
  • Column families are defined in the main config, though it will become possible to create them on-the-fly in future versions. A column family specifies a comparator, which fixes the sort order of the columns within each record; this is a design decision and is not easily changed later.
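
As a rough mental model, the hierarchy described above can be pictured as nested maps: a keyspace holds column families, a column family holds records, and a record holds columns that are read back in sorted order. The sketch below is plain Python with no Cassandra involved; the names `keyspace`, `insert` and `get_row` are hypothetical and exist only for illustration.

```python
# A toy, in-memory picture of: keyspace -> column family -> record -> columns.
# This is NOT a Cassandra client; every name here is an illustrative assumption.

keyspace = {
    "User": {},      # column family: record key -> {column name: value}
    "Username": {},
}

def insert(column_family, row_key, columns):
    """Merge columns into a record, creating the record if it does not exist."""
    keyspace[column_family].setdefault(row_key, {}).update(columns)

def get_row(column_family, row_key):
    """Return (name, value) pairs sorted by column name, mimicking a comparator."""
    return sorted(keyspace[column_family][row_key].items())

# Two separate inserts against the same key end up in the same record.
insert("User", "uuid-1", {"username": "cloudproject", "password": "SECRET"})
insert("User", "uuid-1", {"id": "uuid-1"})

row = get_row("User", "uuid-1")
print(row)
```

Note that no column names were declared up front: each record simply carries whatever columns were written to it, which is what "schema-less" means in practice here.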

Twissandra has the following seven column families:

  • User: each record is keyed on a UUID and contains columns for both username and password.
  • Username: looking up a user in the User column family requires knowing the user’s key. In an RDB, one might perform a SELECT operation on the user table, which sequentially scans the table with a predicate to match the given username. Such a scan becomes inefficient as the data grows, in an RDB or in Cassandra, so a secondary index on the username attribute would have to be created. Whereas an RDB supports this, Cassandra does not. We therefore have to maintain an inverted index ourselves that maps usernames to the UUID-based key. Hence the need for this Username column family.
  • Friends/Followers: these two column families answer the questions “who is user x following?” and “who is following user x?” “Each is keyed on the unique user ID, with columns to track the corresponding relationships and the time they were created.”
  • Tweets: tweets are stored as records within this column family, each keyed on a UUID, with columns for the user ID, the tweet text and the time the tweet was added.
  • Userline: the timeline for each user is stored in this family. Records are keyed on user ID, and columns map a timestamp to the unique tweet ID taken from the Tweets column family.
  • Timeline: similar to userline, but instead stores the view of friend tweets for each user.
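
To make the role of the Username column family concrete, here is a minimal sketch of the two-step lookup it enables. Plain dicts stand in for the User and Username column families, and the helpers `add_user` and `get_user_by_username` are hypothetical names, not part of Twissandra or of any client library.

```python
import uuid

# Plain dicts standing in for two column families (an assumption for illustration).
USER = {}      # record key: user UUID -> columns
USERNAME = {}  # record key: username  -> {'id': user UUID}, the inverted index

def add_user(username, password):
    """Write the user record and keep the inverted index in step with it."""
    useruuid = str(uuid.uuid4())
    USER[useruuid] = {"id": useruuid, "username": username, "password": password}
    USERNAME[username] = {"id": useruuid}
    return useruuid

def get_user_by_username(username):
    """Two key lookups instead of a scan: username -> UUID -> user record."""
    useruuid = USERNAME[username]["id"]
    return USER[useruuid]

created = add_user("cloudproject", "SECRET")
found = get_user_by_username("cloudproject")
print(found["id"] == created)
```

The cost of this design is that the application, not the database, is responsible for writing to both places whenever a user is created.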

Adding a new user

We must add new users to our database upon signup. In the Twissandra example, this would look something like:

from uuid import uuid4 as uuid  # assumption: the original does not show which UUID function is imported

username = 'cloudproject'  # taken from website form entries

password = 'SECRET'  # taken from website form entries

useruuid = str(uuid())  # string form of a new UUID, used as the record key

columns = {'id': useruuid, 'username': username, 'password': password}

USER.insert(useruuid, columns)  # Cassandra operation to create a new record in the User column family

USERNAME.insert(username, {'id': useruuid})  # Cassandra operation to update the inverted index that maps readable usernames to their corresponding UUIDs

The above pseudocode would work with Pycassa, a Python client library for Cassandra. In the case of this library, USER and USERNAME are pycassa column family instances, created at the initialisation of the database.

Following a friend

import time

frienduuid = 'example_uuid'

FRIENDS.insert(useruuid, {frienduuid: time.time()})  # add a user to the friends list

FOLLOWERS.insert(frienduuid, {useruuid: time.time()})  # track the inverse of the relationship (add to the list of the target user's followers)
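
The point of writing the relationship twice is that each question ("who do I follow?" and "who follows me?") becomes a single lookup by key. A minimal in-memory sketch, with dicts standing in for the two column families and a hypothetical `follow` helper:

```python
import time

# Dicts standing in for the Friends and Followers column families (an assumption).
FRIENDS = {}    # user UUID -> {followed user UUID: time the follow was created}
FOLLOWERS = {}  # user UUID -> {follower UUID: time the follow was created}

def follow(useruuid, frienduuid, now=None):
    """Record the relationship in both directions."""
    now = time.time() if now is None else now
    FRIENDS.setdefault(useruuid, {})[frienduuid] = now
    FOLLOWERS.setdefault(frienduuid, {})[useruuid] = now

follow("alice", "bob", now=1.0)
follow("carol", "bob", now=2.0)

following = sorted(FRIENDS["alice"])   # who is alice following?
followers = sorted(FOLLOWERS["bob"])   # who is following bob?
print(following, followers)
```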

Tweeting

import struct
import time

tweetuuid = str(uuid())  # new tweet UUID

body = '@jericevans thanks for helping me learn Cassandra!'  # tweet text

timestamp = int(time.time() * 1e6)  # timestamp of the tweet, in microseconds

columns = {'id': tweetuuid, 'user_id': useruuid, 'body': body, '_ts': timestamp}  # columns of the tweet record

TWEET.insert(tweetuuid, columns)  # create a new record for this tweet, keyed on its UUID

columns = {struct.pack('>d', timestamp): tweetuuid}  # column name is the packed timestamp, value is the tweet ID

USERLINE.insert(useruuid, columns)  # update the userline with the user's tweet

TIMELINE.insert(useruuid, columns)  # update the timeline with columns that map time to tweet ID

for otheruuid in FOLLOWERS.get(useruuid, column_count=5000):  # up to 5000 followers
    TIMELINE.insert(otheruuid, columns)  # insert on the timeline of each follower

Storing a new tweet means creating a new record in the tweet column family, with a new UUID as the key and columns for the author’s user ID, the tweet text and the timestamp. The author’s userline and timeline, and the timeline of every follower, then each receive a column that maps the timestamp to the tweet ID.
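
Packing the timestamp with `struct.pack('>d', ...)` before using it as a column name matters if column names are compared as raw bytes, which is an assumption here about how the column family is configured. For non-negative values, a big-endian IEEE 754 double sorts the same way byte-wise as it does numerically, so the columns come back in time order. The short check below illustrates that property with made-up timestamps.

```python
import struct

# Made-up microsecond timestamps, deliberately out of order.
timestamps = [1313625600000000, 42, 1313539200000001, 1313539200000000]

# Pack each one as a big-endian double, as in the tweeting code above.
packed = [struct.pack(">d", ts) for ts in timestamps]

# Sort purely on the bytes, the way a byte-wise comparator would.
sorted_by_bytes = [struct.unpack(">d", p)[0] for p in sorted(packed)]
sorted_by_value = sorted(float(ts) for ts in timestamps)

print(sorted_by_bytes == sorted_by_value)
```

The same trick does not hold for negative numbers, since their sign bit makes them sort after the positive ones when compared as bytes; timestamps are never negative, so this is not a concern here.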

Retrieving Tweets

timeline = USERLINE.get(useruuid, column_reversed=True)

tweets = TWEET.multiget(timeline.values())

Obtain a list of tweet IDs from the userline, then fetch the tweets from the tweet column family with a multiget() operation.

Retrieving a timeline for a user

start = request.GET.get('start')

limit = NUM_PER_PAGE

timeline = TIMELINE.get(useruuid, column_start=start, column_count=limit, column_reversed=True)

tweets = TWEET.multiget(timeline.values())

Here we do the same as when retrieving tweets, but from the timeline, and with start and limit to control the range returned.

That concludes my review of Eric Evans’ fantastic guide ‘Cassandra by Example’.
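
To see how a start column, a count and a reversed flag combine into paging, here is a small in-memory sketch. The function `get_slice` is a hypothetical stand-in that borrows the keyword names used above; it is not the pycassa implementation. Fetching one column more than the page size, so that the extra column's name can seed the next page, is one common pattern and an assumption about how `start` would be produced.

```python
NUM_PER_PAGE = 3

def get_slice(row, column_start=None, column_count=100, column_reversed=False):
    """Return up to column_count (name, value) pairs in sorted order.

    With column_reversed=True the slice runs from newest to oldest and
    column_start is the largest column name to include.
    """
    names = sorted(row, reverse=column_reversed)
    if column_start is not None:
        if column_reversed:
            names = [n for n in names if n <= column_start]
        else:
            names = [n for n in names if n >= column_start]
    return [(n, row[n]) for n in names[:column_count]]

def page(row, start=None):
    """Return one page of columns plus the start value for the next page."""
    cols = get_slice(row, column_start=start,
                     column_count=NUM_PER_PAGE + 1, column_reversed=True)
    next_start = cols[NUM_PER_PAGE][0] if len(cols) > NUM_PER_PAGE else None
    return cols[:NUM_PER_PAGE], next_start

# A toy timeline record: column name is a timestamp, value is a tweet ID.
timeline_row = {ts: "tweet-%d" % ts for ts in range(1, 8)}

page1, next1 = page(timeline_row)
page2, next2 = page(timeline_row, next1)
page3, next3 = page(timeline_row, next2)
print([name for name, _ in page1], next1)
```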

Twissandra source can be found here.

I hope to attempt to develop code for retweets and lists.
