Exporting Cassandra 2.2 to Google BigQuery

So we decide to move 5 years of data from Apache Cassandra to Google BigQuery. The problem was not just transferring the data or export/import, the issue was the very old Cassandra!

After extensive research, we have planned the migration to export data to csv and then upload in Google Cloud Storage for importing in Big Query.

The pain was the way Cassandra 1.1 deal with large number of records! There is no pagination so at some point your gonna run out of something! If not mistaken, pagination is introduced since version 2.2.

After all my attempts to upgrade to latest version 3.4 failed I decide to try other versions and luckily the version 2.2 worked! By working I mean I were able to follow the upgrading steps to end and the data were accessible.

Because I could not get any support for direct upgrade and my attempts to simply upgrade to 2.2 also failed. So I had no choice but to upgrade to 2.0 and then upgrade it to 2.2. Because this is extremely delicate task I rather just forward you to official website and only then give you the summary. Please make sure you check docs.datastax.com and follow their instructions.

To give an overview, you are going to do these steps:

  1. Making sure all nodes are stable and there is no dead nodes.
  2. Make backup (your SSTables, configurations and etc)
  3. It is very important to successfully upgrade your SSTable before proceeding to next step. Simply use
    nodetool upgradesstables
  4. Drain the nodes using
    nodetool drain
  5. Then simply stop the node
  6. Install the new version (I will explain fresh installation later in this document)
  7. Simply do the config as your old Cassandra, start it and upgradesstables again (as in step 3) for each node.

Installing Cassandra:

  1. Edit /etc/yum.repos.d/datastax.repo
  2. [datastax]
    name = DataStax Repo for Apache Cassandra
    baseurl = https://rpm.datastax.com/community
    enabled = 1
    gpgcheck = 0
  3. And then install and start the service:
  4. yum install dsc20
    service cassandra start

Once you are upgrade to Cassandra 2+ you can export the data to csv without having pagination or crashing issue.

Just for the records, a few commands to get the necessary information about the data structure is as follow:

cqlsh -u username -p password
describe tables;
describe table abcd;
describe schema;

And once we know the tables we want to export we just use them alongside its keyspace. First add all your commands in one file to create a batch.

vi commands.list

For example a sample command to export one table:

COPY keyspace.tablename TO '/backup/export.csv';

And finally run the commands from the file:

cqlsh -u username -p password -f /backup/commands.list

So by now, you have exported the tables to csv file(s). All you need to do now is uploading the files to Google Cloud Storage:

gsutil rsync /backup gs://bucket

Later on you can use Google API to import the csv files to Google BigQuery. You may check out the Google documentations for this in cloud.google.com