Exporting Cassandra 2.2 to Google BigQuery

So we decided to move 5 years of data from Apache Cassandra to Google BigQuery. The problem was not just the transfer or the export/import itself; the real issue was the very old Cassandra version!

After extensive research, we planned the migration as follows: export the data to CSV, upload the files to Google Cloud Storage, and import them into BigQuery from there.

The pain was the way Cassandra 1.1 deals with a large number of records! There is no pagination, so at some point you are going to run out of something! If I am not mistaken, automatic paging arrived with the 2.x releases (the native protocol gained it in 2.0).

After all my attempts to upgrade to the latest version (3.4) failed, I decided to try other versions, and luckily version 2.2 worked! By working I mean that I was able to follow the upgrade steps to the end and the data was accessible.

I could not get any support for a direct upgrade, and my attempts to upgrade straight to 2.2 also failed, so I had no choice but to upgrade to 2.0 first and then to 2.2. Because this is an extremely delicate task, I would rather point you to the official documentation and only then give you a summary. Please make sure you check docs.datastax.com and follow their instructions.

To give an overview, these are the steps:

  1. Make sure all nodes are stable and there are no dead nodes.
  2. Make a backup (your SSTables, configuration, etc.).
  3. It is very important to successfully upgrade your SSTables before proceeding to the next step. Simply use
    nodetool upgradesstables
  4. Drain the node using
    nodetool drain
  5. Then simply stop the node.
  6. Install the new version (a fresh installation is explained later in this document).
  7. Configure it like your old Cassandra, start it, and run upgradesstables again (as in step 3). Repeat for each node.
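
The order of these steps matters, and they are repeated node by node. As a minimal sketch (not an official procedure), the Python below only builds and prints a dry-run plan; it does not execute anything on a cluster. The host names and the INSTALL_STEP placeholder are hypothetical, and "service cassandra stop" is an assumption that mirrors the "service cassandra start" command used later in this post.

```python
# Dry-run planner for the rolling upgrade summarized above.
# Nothing here touches a cluster; it only lists what to run, in order.

# Placeholder for step 6 and the config part of step 7 (not a real command).
INSTALL_STEP = "<install the new Cassandra version and restore your config>"

PER_NODE_STEPS = [
    "nodetool upgradesstables",   # step 3
    "nodetool drain",             # step 4
    "service cassandra stop",     # step 5 (assumed init-script command)
    INSTALL_STEP,                 # step 6
    "service cassandra start",    # step 7: start the upgraded node
    "nodetool upgradesstables",   # step 7: rewrite SSTables in the new format
]


def upgrade_plan(nodes):
    """Return (node, command) pairs, finishing one node before starting the next."""
    return [(node, command) for node in nodes for command in PER_NODE_STEPS]


if __name__ == "__main__":
    # "cassandra-1" and "cassandra-2" are made-up host names.
    for node, command in upgrade_plan(["cassandra-1", "cassandra-2"]):
        print(f"[{node}] {command}")
```

Keeping the plan as plain data makes it easy to review the full sequence before touching a production node.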

Installing Cassandra:

  1. Edit /etc/yum.repos.d/datastax.repo so that it contains:
    [datastax]
    name = DataStax Repo for Apache Cassandra
    baseurl = https://rpm.datastax.com/community
    enabled = 1
    gpgcheck = 0

  2. Then install Cassandra and start the service:
    yum install dsc20
    service cassandra start

Once you have upgraded to Cassandra 2+, you can export the data to CSV without running into pagination or crashing issues.

Just for the record, a few commands to get the necessary information about the data structure are as follows:

cqlsh -u username -p password
describe tables;
describe table abcd;
describe schema;

Once we know the tables we want to export, we reference each of them together with its keyspace. First put all your commands in one file to create a batch.

vi /backup/commands.list

For example, a sample command to export one table:

COPY keyspace.tablename TO '/backup/export.csv';

And finally run the commands from the file:

cqlsh -u username -p password -f /backup/commands.list
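
With many tables, writing the batch file by hand gets tedious. The sketch below generates one COPY statement per table and writes them to a file that can be passed to cqlsh -f. The function names, the one-CSV-per-table naming scheme, and the example keyspace and table names are my own assumptions for illustration; only the COPY ... TO syntax comes from the command shown above.

```python
import posixpath


def build_copy_commands(keyspace, tables, out_dir="/backup"):
    """Return one cqlsh COPY ... TO statement per table, each with its own CSV file."""
    commands = []
    for table in tables:
        # Assumed naming scheme: <keyspace>.<table>.csv inside out_dir.
        csv_path = posixpath.join(out_dir, f"{keyspace}.{table}.csv")
        commands.append(f"COPY {keyspace}.{table} TO '{csv_path}';")
    return commands


def write_commands_file(path, commands):
    """Write the statements, one per line, to a file usable with cqlsh -f."""
    with open(path, "w") as handle:
        handle.write("\n".join(commands) + "\n")


if __name__ == "__main__":
    # "mykeyspace", "users" and "events" are made-up names.
    for line in build_copy_commands("mykeyspace", ["users", "events"]):
        print(line)
```

Giving every table its own file avoids each COPY overwriting the previous export.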

By now you have exported the tables to CSV file(s). All you need to do now is upload the files to Google Cloud Storage:

gsutil rsync /backup gs://bucket

Later on you can use the Google API to import the CSV files into Google BigQuery. The Google documentation for this is at cloud.google.com.
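
As one possible way to do the import, the sketch below builds a load command for the bq command-line tool instead of calling the API directly (that swap is my choice, not something this post originally used). The dataset name, the use of --autodetect, and the helper name are assumptions. Since cqlsh COPY TO does not write a header row by default, an explicit schema is probably the better choice for real data.

```python
def bq_load_command(dataset, table, gcs_uri):
    """Build the argument list for loading one CSV file from Cloud Storage."""
    return [
        "bq", "load",
        "--source_format=CSV",
        "--autodetect",          # assumption: let BigQuery infer the schema
        f"{dataset}.{table}",    # destination table
        gcs_uri,                 # source file in the bucket
    ]


if __name__ == "__main__":
    # "my_dataset" is a made-up dataset name; the URI follows the gsutil example.
    cmd = bq_load_command("my_dataset", "tablename", "gs://bucket/export.csv")
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```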

Extending Linux physical partition in AWS EC2

I once ran into the problem that an old server migrated to AWS had used 100% of its available storage. Luckily it was in AWS EC2, so I was able to simply create a snapshot of the EBS volume and then create another volume from the snapshot, twice the size of the old one.

*Note that data loss was not tolerable!

Up to here it was easy, and I thought it would stay that way using the resize2fs tool. But the tool simply said that the partition was already using the full disk size, while there was clearly unused space on the disk. After some searching I found the http://litwol.com/content/fdisk-resizegrow-physical-partition-without-losing-data-linodecom article and adapted it to my case.

The difference in my case was that the server partition was a DOS partition and not an ext* partition. This is what I did to fix it:

List the disks and write down “Start” and “Id” for /dev/xvda1 (these two pieces of data are extremely important for successfully finishing the task). In my case Start was 63 and Id was 8e.

fdisk -l

Then delete the partition and create a new, larger one with exactly the same start point:

fdisk -c=dos -u=sectors /dev/xvdc

(The text after each prompt is my response)

(fdisk) Command (m for help): d
(fdisk) Partition number (1-4): 1
(fdisk) Command (m for help): n
(fdisk) Partition type:
p primary
e extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (63-40558591, default 63): 63
Last sector, +sectors or +size{K,M,G} (63-40558591, default 40558591): 40558591
(fdisk) Command (m for help): t
Partition number (1-4): 1
Hex Code (type L to list codes): 8e
(fdisk) Command (m for help): w
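
As a quick sanity check on the numbers typed above, the size of the new partition follows from its first and last sector. This is a small sketch assuming 512-byte sectors (fdisk -l reports the actual sector size for your disk); the function name is my own.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def partition_size_bytes(first_sector, last_sector, sector_bytes=SECTOR_BYTES):
    """Size of a partition whose first and last sectors are both inclusive."""
    return (last_sector - first_sector + 1) * sector_bytes


if __name__ == "__main__":
    # Start and end sectors from the fdisk session above.
    size = partition_size_bytes(63, 40558591)
    print(f"{size} bytes = {size / 2**30:.2f} GiB")
```

If the result is close to the size of the new EBS volume, the partition now covers the whole disk and resize2fs has room to grow the filesystem.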

Now we use resize2fs to finalize everything:

resize2fs /dev/xvdc1

At the end you can check the storage device with “lsblk” or “fdisk -l”.

Troubleshooting Windows file sharing

A small note on troubleshooting Windows 7 file sharing issues:

A. Use the safe approach to access the shared folder (if this works and you still want to share the folder for anonymous access, go on to the next points).
1. Click on “Map Network Drive” in My Computer.
2. Enter the exact path of the shared folder in the “Folder” field
(e.g. \\COMPUTER_NAME_OR_IP\SHARED_FOLDER_NAME)
3. Enter the authentication credentials by clicking on the link “Connect Using a Different Username”, and then type the full username
(e.g. COMPUTER_NAME_OR_IP\USERNAME_ON_THAT_COMPUTER)
4. If this fails, there might be a problem with the firewall or with the shared folder itself.
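
To tell a firewall problem apart from a share or permission problem, it helps to check from the client whether the file sharing port is reachable at all. The sketch below is a generic TCP connectivity test, assuming the share is served over SMB on TCP port 445; the function name and defaults are my own.

```python
import socket


def port_is_open(host, port=445, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, port_is_open("COMPUTER_NAME_OR_IP") returning False points at the firewall or the network, while True suggests looking at the share settings and permissions described in the next points.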

B. Make sure File and Printer Sharing is turned on and Password Protected Sharing is turned off.
1. Go to Control Panel\All Control Panel Items\Network and Sharing Center\Advanced sharing settings
2. Go to the Public profile and check the settings.
3. Type lusrmgr.msc in Run. Make sure the Guest account is enabled.

C. Make sure the Windows Group Policy is right.
1. Type gpedit.msc in Run
2. Go to “Computer Configuration/Windows Settings/Security Settings/Local Policies/User Rights Assignment”
3. In the policy “Access this computer from the network”, check that Everyone is in the list.
4. In the policy “Deny access to this computer from the network”, remove Guest from the list.

D. Make sure the Guest account is active and configured properly.
1. Open Command Prompt by typing CMD in Run
2. Type the command “net user guest /active:yes”
3. Type the command “ntrights +r SeNetworkLogonRight -u Guest” (ntrights is not built into Windows; it comes with the Windows Server 2003 Resource Kit Tools)
4. Type the command “ntrights -r SeDenyNetworkLogonRight -u Guest”