So we decided to move five years of data from Apache Cassandra to Google BigQuery. The problem was not just transferring or exporting/importing the data; the issue was the very old Cassandra!
After extensive research, we planned the migration: export the data to CSV, upload it to Google Cloud Storage, and import it into BigQuery.
The pain was the way Cassandra 1.1 deals with a large number of records! There is no pagination, so at some point you are going to run out of something! If I am not mistaken, pagination was introduced in version 2.2.
After all my attempts to upgrade to the latest version (3.4) failed, I decided to try other versions, and luckily version 2.2 worked! By working I mean I was able to follow the upgrade steps to the end, and the data was accessible.
I could not get any support for a direct upgrade, and my attempts to simply upgrade to 2.2 also failed. So I had no choice but to upgrade to 2.0 first and then upgrade that to 2.2. Because this is an extremely delicate task, I would rather just forward you to the official website and only then give you the summary. Please make sure you check docs.datastax.com and follow their instructions.
To give an overview, you are going to do these steps:
- Make sure all nodes are stable and there are no dead nodes.
- Make a backup (your SSTables, configurations, etc.).
- It is very important to successfully upgrade your SSTables before proceeding to the next step. Simply use nodetool upgradesstables.
- Drain the nodes using nodetool drain.
- Then simply stop the node.
- Install the new version (I will explain a fresh installation later in this document).
- Apply the same configuration as your old Cassandra, start it, and run upgradesstables again (as in step 3) for each node.
- Edit /etc/yum.repos.d/datastax.repo:

[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = https://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
- And then install and start the service:
yum install dsc20
service cassandra start
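The steps above can be sketched as a rough per-node sequence. The backup paths here are made-up examples, and the package name matches the DataStax community repo; adjust everything for your own cluster:

```shell
# Rough per-node upgrade sketch (paths and versions are assumptions)

# 1. Check cluster health; all nodes should show Up/Normal (UN)
nodetool status

# 2. Back up SSTables and configuration (example paths)
cp -r /var/lib/cassandra/data /backup/sstables
cp /etc/cassandra/conf/cassandra.yaml /backup/

# 3. Rewrite SSTables to the current format before upgrading
nodetool upgradesstables

# 4. Flush memtables, stop accepting traffic, then stop the node
nodetool drain
service cassandra stop

# 5. Install the new version and start it
yum install dsc20
service cassandra start

# 6. Rewrite SSTables again on the new version
nodetool upgradesstables
```

Repeat this one node at a time so the rest of the cluster stays available while each node is down.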
Once you have upgraded to Cassandra 2+, you can export the data to CSV without pagination or crashing issues.
Just for the record, a few commands to get the necessary information about the data structure are as follows:
cqlsh -u username -p password
describe tables;
describe table abcd;
describe schema;
And once we know which tables we want to export, we just use them alongside their keyspace. First, add all your commands to one file to create a batch.
For example a sample command to export one table:
COPY keyspace.tablename TO '/backup/export.csv';
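A hypothetical batch file with several exports could be built like this; the keyspace and table names are made up, and WITH HEADER = TRUE is an optional cqlsh COPY setting that writes column names as the first row, which makes the CSV easier to load later:

```shell
# Build a batch of COPY commands (example keyspace/table names)
cat > /backup/commands.list <<'EOF'
COPY mykeyspace.users TO '/backup/users.csv' WITH HEADER = TRUE;
COPY mykeyspace.orders TO '/backup/orders.csv' WITH HEADER = TRUE;
COPY mykeyspace.events TO '/backup/events.csv' WITH HEADER = TRUE;
EOF
```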
And finally run the commands from the file:
cqlsh -u username -p password -f /backup/commands.list
So by now, you have exported the tables to CSV file(s). All you need to do now is upload the files to Google Cloud Storage:
gsutil rsync /backup gs://bucket
Later on, you can use the Google API to import the CSV files into Google BigQuery. You may check out the Google documentation for this at cloud.google.com.
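As one way to do that final import, the bq command-line tool from the Google Cloud SDK can load a CSV straight from Cloud Storage into BigQuery. The dataset, table, and schema below are made-up examples; replace them with your own:

```shell
# Load a CSV from Cloud Storage into a BigQuery table
# (dataset, table, bucket, and schema are example values)
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  mydataset.users \
  gs://bucket/users.csv \
  id:INTEGER,name:STRING,created_at:TIMESTAMP
```

If you exported with WITH HEADER = TRUE, keep --skip_leading_rows=1 so the header row is not loaded as data.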