Handling large amounts of data in R is tricky, as R typically loads the entire dataset into RAM. While this makes computations very fast, it also means the size of the dataset you can analyse is limited by your RAM. One solution I stumbled upon was filehash, which seemed to offer a better approach than my usual workflow: pulling everything into PostgreSQL and then doing the computations in R.
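
For the record, that usual loop looks roughly like this (the database name, table, and credentials below are placeholders, not my actual setup):

```r
library(RPostgreSQL)  # DBI driver for PostgreSQL

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "markets", host = "localhost",
                 user = "analyst", password = "secret")  # placeholder credentials

# pull only the slice needed for the day, then compute on it in R
trades <- dbGetQuery(con,
  "SELECT trade_time, price, qty FROM nifty_futures WHERE trade_date = '2010-01-04'")

dbDisconnect(con)
```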

filehash seemed great because I don't have to bring another RDBMS into the work loop. However, after spending an hour or two on it, I realised it was not that useful for me. It is great if you have key-value pairs, but not so great if you have many columns with dependencies and a lot of subsets you need to work on. In any case, filehash is nothing but DB2 sitting at the back of R. Why should I put my data into DB2 when a powerful PostgreSQL is at my disposal? I have ditched filehash. There is another package called ff, but sadly it is more useful if you want to sample from a big file. In summary, it appears to me that the only way out is the Hadoop/MapReduce route with R. That's a steep learning curve for me.
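
To show what I mean, here is a minimal sketch of the key-value workflow filehash gives you (the database name and key are made up). Storing and fetching whole objects by key is convenient; slicing arbitrary column subsets across keys is where it stops helping.

```r
library(filehash)  # key-value store persisted on disk

# create and initialise a disk-backed database (hypothetical name)
dbCreate("trades_db")
db <- dbInit("trades_db")

# store and retrieve whole objects by key
dbInsert(db, "day_2010_01_04",
         data.frame(price = c(5200, 5201), qty = c(50, 100)))
head(dbFetch(db, "day_2010_01_04"))

# you can list keys, but cross-key, column-wise subsetting is up to you
dbList(db)
```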

Why should one think of fancy systems like Hadoop?
Fact: on a typical day, about 6 lakh NIFTY futures contracts are traded over the 20,100 seconds of a trading session, which works out to roughly 30 trades per second. Each of these trades happens at a different price and quantity. If you want to track the quantity-weighted average futures price every second for a SINGLE day, it is easy. If you want to do it for a MONTH, it might take some time. If you want to do it for a YEAR, and also calculate a 5-minute moving average of the futures price while taking gap-ups and gap-downs across days into consideration, we are talking about a lot of computation on a lot of data points. Even for such a simple application, using an RDBMS is a PAIN. I am guilty of using one now, but it will be of no use as things scale.
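
To make the single-day case concrete, here is a rough base-R sketch on simulated trade data (the column names and numbers are invented). One day of per-second weighted averages plus a trailing 5-minute average fits comfortably in memory; the trouble starts when a year of such days, with gap adjustments, has to be stitched together.

```r
# toy example: one day of simulated NIFTY futures trades
set.seed(1)
n <- 6e5
trades <- data.frame(
  second = sort(sample(seq_len(20100), n, replace = TRUE)),  # second of the session
  price  = 5200 + cumsum(rnorm(n, sd = 0.05)),
  qty    = sample(c(50, 100, 150), n, replace = TRUE)
)

# quantity-weighted average price for every second of the day
vwap <- with(trades, tapply(price * qty, second, sum) / tapply(qty, second, sum))

# 5-minute (300-second) trailing moving average of the per-second series
ma_5min <- stats::filter(as.numeric(vwap), rep(1 / 300, 300), sides = 1)
```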