Exporting Cassandra 2.2 to Google BigQuery

So we decided to move five years of data from Apache Cassandra to Google BigQuery. The problem was not just transferring the data or the export/import; the issue was the very old Cassandra!

After extensive research, we planned the migration: export the data to CSV, upload it to Google Cloud Storage, and then import it into BigQuery.

The pain was the way Cassandra 1.1 deals with a large number of records! There is no pagination, so at some point you are going to run out of something! If I am not mistaken, pagination was introduced in version 2.2.

After all my attempts to upgrade to the latest version (3.4) failed, I decided to try other versions, and luckily version 2.2 worked! By working I mean I was able to follow the upgrade steps to the end and the data was accessible.

I could not get any support for a direct upgrade, and my attempts to simply upgrade straight to 2.2 also failed, so I had no choice but to upgrade to 2.0 first and then upgrade that to 2.2. Because this is an extremely delicate task, I would rather just forward you to the official website first and only then give you the summary. Please make sure you check docs.datastax.com and follow their instructions.

To give an overview, you are going to do these steps:

  1. Make sure all nodes are stable and there are no dead nodes.
  2. Make a backup (your SSTables, configurations, etc.).
  3. It is very important to successfully upgrade your SSTables before proceeding to the next step. Simply use
    nodetool upgradesstables
  4. Drain the nodes using
    nodetool drain
  5. Then simply stop the node.
  6. Install the new version (I will explain a fresh installation later in this document).
  7. Do the same configuration as your old Cassandra, start it, and run upgradesstables again (as in step 3) for each node; the sketch below puts the whole per-node sequence together.
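
Putting the steps together, a rough per-node sketch, assuming a package-based install with the stock init script (adjust service names and paths to your setup):

nodetool upgradesstables   # rewrite SSTables in the current format first
nodetool drain             # flush memtables and stop accepting writes
service cassandra stop     # stop the old version
# install the new version and copy over your old configuration, then:
service cassandra start
nodetool upgradesstables   # upgrade SSTables to the new on-disk format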

Installing Cassandra:

  1. Edit /etc/yum.repos.d/datastax.repo and add the following:
    [datastax]
    name = DataStax Repo for Apache Cassandra
    baseurl = https://rpm.datastax.com/community
    enabled = 1
    gpgcheck = 0

  2. And then install and start the service:
    yum install dsc20
    service cassandra start
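
To confirm the node came up on the expected version, both of these ship with Cassandra:

nodetool version
cqlsh --version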
    

Once you have upgraded to Cassandra 2+, you can export the data to CSV without pagination or crashing issues.

Just for the record, a few commands to get the necessary information about the data structure are as follows:

cqlsh -u username -p password
describe tables;
describe table abcd;
describe schema;
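
If you have several keyspaces, you can also scope the output to one of them (the keyspace name here is just a placeholder):

describe keyspace mykeyspace;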

Once we know which tables we want to export, we just reference each one alongside its keyspace. First, add all your commands to one file to create a batch.

vi commands.list

For example, a sample command to export one table:

COPY keyspace.tablename TO '/backup/export.csv';
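
A batch covering several tables might look like this; the keyspace and table names are placeholders, and the HEADER option (which writes the column names as a first row) may vary by cqlsh version:

COPY mykeyspace.users TO '/backup/users.csv' WITH HEADER = TRUE;
COPY mykeyspace.orders TO '/backup/orders.csv' WITH HEADER = TRUE;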

And finally run the commands from the file:

cqlsh -u username -p password -f /backup/commands.list

So by now, you have exported the tables to CSV file(s). All you need to do now is upload the files to Google Cloud Storage:

gsutil rsync /backup gs://bucket
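
If there are many or large files, gsutil's -m flag runs the transfer in parallel:

gsutil -m rsync /backup gs://bucket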

Later on, you can use the Google API to import the CSV files into Google BigQuery. You may check out the Google documentation for this at cloud.google.com
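
As a starting point, the bq command-line tool can load a CSV straight from Cloud Storage; the dataset, table, and schema below are only illustrative, and --skip_leading_rows=1 assumes the CSV was exported with a header row:

bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable gs://bucket/export.csv id:STRING,created_at:TIMESTAMP,value:FLOAT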

Preparing a web server on Amazon Linux

Once you launch an EC2 instance with the Amazon Linux AMI, you will need to install a web server and the other requirements for setting up a website.

We are going to install httpd (web server), PHP (scripting), the MySQL client libraries (db client) and memcached (caching).

First things first, we do a cleanup in case the instance has been used before:

yum update -y 
yum remove -y mysql*
yum remove -y http*
yum remove -y php*

and let the installation begin:

yum install -y httpd24 
yum install -y mod24_security memcached 
yum install -y php55 php55-mysqlnd php55-pecl-memcached php-pear php55-gd php55-devel php-channel-phpunit php-channel-amazon php55-cli  
yum install -y mysql55 mysql55-devel mysql55-libs 
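
A quick sanity check that the expected versions landed:

httpd -v
php -v
mysql --version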

Next, we need to do some httpd configuration, like enabling mod_rewrite (if we need it!) and turning off signatures for security:

vi /etc/httpd/conf/httpd.conf

We are going to add this to the end of the file:

# Disable ETag headers
FileETag none

# Hide Apache and PHP info
Header unset Server
Header unset X-Powered-By

# Don't expose server versions
ServerSignature Off
ServerTokens Prod

# Enable server-status for internal IP
<Location /server-status>
   SetHandler server-status
   Order Deny,Allow
   Deny from all
   Allow from 127.0.0.1
</Location>

# KeepAlive: Whether or not to allow persistent connections (more than
# one request per connection). Set to "Off" to deactivate.
KeepAlive On

# Configure /var/www/html
<Directory "/var/www/html">
	Options FollowSymLinks
	AllowOverride All
	DirectoryIndex index.html index.php
	Order allow,deny
	Allow from all
</Directory>
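
After saving the file, it is worth validating the configuration syntax before the service gets started later on:

apachectl -t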

We also need to make minor PHP configuration changes to have sessions saved in a centralized memcached server (if you don't want to store sessions in memcached, just ignore this):

vi /etc/php.ini

Add or edit the following. The time zone should be changed to your own; memory_limit is 128M by default and you may want to increase it.

date.timezone = Asia/Kuala_Lumpur
memory_limit = 256M
session.save_handler = memcached
session.save_path = "192.168.1.2:11211"

*Please note, in case you are going to use memcache instead of memcached (which are two different extensions), you should specify the protocol in save_path (e.g. tcp://192.168.1.2:11211).
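
Side by side, the two variants look like this (the server address is just an example):

; memcached extension
session.save_handler = memcached
session.save_path = "192.168.1.2:11211"

; memcache extension (note the protocol prefix)
session.save_handler = memcache
session.save_path = "tcp://192.168.1.2:11211"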

At the end we need to do some Linux stuff!! Here is the explanation in order: make a www group, add the current user to that group, take ownership of the website directory, grant default permissions to the website directory, grant specific permissions to directories, grant specific permissions to files, auto-start the httpd service, auto-start the memcached service, start httpd, start memcached.

groupadd www
usermod -a -G www ec2-user
chown -R root:www /var/www
chmod 2775 /var/www
find /var/www -type d -exec chmod 2775 {} +
find /var/www -type f -exec chmod 0664 {} +
chkconfig httpd on
chkconfig memcached on
service httpd start
service memcached start

echo "<?php phpinfo(); ?>" > /var/www/html/info.php
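
You can then browse to http://your-instance-ip/info.php to confirm PHP is working; since it exposes configuration details, it is a good idea to delete the file once you are done checking.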

And we are all set ;)