
As we all know, every kind of work involves a definite amount of tedious, unpleasant, "low-to-mid-IQ" work. I don't subscribe to these labels, though.

In the life of a data analyst, data preparation and data cleansing have to be done before he/she can even think of doing any sane analysis. So, how does one approach data prep and data cleansing? Is it something to be looked down upon? Well, I was under that impression many years back. My view changed completely when I took the effort to understand the complexity of the data prep and data cleansing phases. I also began to feel that working on these phases gives a tremendous amount of insight that no modeling on the cleansed data would give. Here is an interesting fact: the Renaissance hedge fund employs around 25 top-notch PhDs to work on data treatment such as missing-value and capping treatment, outlier effects, etc. No wonder this amount of care at the initial stages translates into tons of insight into the arb opportunities in the market.

After many years of working on databases, I have developed a great love for them. The data prep phase is a prerequisite to successful modeling, and working on data, which to this day many people have an aversion to, is a fascinating place to start the analysis. However, I am an amateur in terms of the possibilities that lie in the DBMS world. I have to understand and implement lots of things in this space, at least in relation to the fin world: KDB+, Hadoop (TP keeps reminding me of this fascinating world that I have not explored at all). Well, I guess it is a lifelong process.

Recently I was asked to quickly develop a database of brokers and sub-brokers associated with NSE, as a side activity when I am not doing my regular work. This is just one of those requests you get when people know that you program. At first sight, it does seem like a boring task: all it takes is to screen scrape the NSE site, dump the content into a delimited text file, and read it back into whatever database you want. If the deliverable is just an xls, import the delimited text file and be done with it. Writing a program to screen scrape NSE was easy; however, preparing the data from badly formatted HTML files is a little painful. As mundane as this task sounds, it was a pleasant experience. HOW?
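The scrape-dump-import pipeline really is only a few lines. Here is a rough sketch in Python rather than the C# I actually used; the table layout and sample markup below are invented for illustration, not NSE's actual pages, which are messier:

```python
import csv
import io
import re
import urllib.request


def fetch_html(url: str) -> str:
    """Download a page; the stdlib handles HTTPS out of the box."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_rows(html: str) -> list[list[str]]:
    """Pull cell text out of each <tr>, tolerating sloppy markup."""
    rows = []
    for tr in re.findall(r"<tr.*?>(.*?)</tr>", html, re.S | re.I):
        cells = re.findall(r"<td.*?>(.*?)</td>", tr, re.S | re.I)
        # Strip any tags left inside a cell and trim whitespace.
        cleaned = [re.sub(r"<.*?>", "", c).strip() for c in cells]
        if cleaned:
            rows.append(cleaned)
    return rows


def dump_delimited(rows, out):
    """Write rows as a pipe-delimited file, ready for any DB or xls import."""
    writer = csv.writer(out, delimiter="|")
    writer.writerows(rows)


# Deliberately messy inline sample (inconsistent casing, stray spaces):
sample = ("<TR><td> Broker A </td><td>Mumbai</td></tr>"
          "<tr><TD>Broker B</TD><td>Delhi</td></tr>")
buf = io.StringIO()
dump_delimited(extract_rows(sample), buf)
print(buf.getvalue())
```

In a real run, `sample` would be replaced by `fetch_html(...)` on the actual listing pages; the parsing step is where all the pain with the badly formatted files shows up.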

  • I hadn't screen scraped NSE before, but I took this little task as an opportunity to do just that. As they say, all it takes to tell a good programmer from a bad one is a look at 100 lines of their code. Some companies conduct rounds of interviews: a phone interview, this type of interview, that type of interview, an HR round ("what do you want to do in life" type questions, which the interviewer himself would have no clue how to answer, "where do you see yourself 5 years down the lane" kind of questions, for a PROGRAMMING job! Give me a break! In 5 years the skill sets I have could become completely obsolete and maybe I will teach math :) ). Anyway, the point is, IT companies should just ask the candidate to code for 15 minutes and examine the code. It gives far better insight into the candidate's productivity than asking "what are your 3 strengths and 3 weaknesses?". To this day I have never understood how these questions make sense in any interview (for heaven's sake, I am not at a shrink's, where quizzing me about my psyche would actually be important!)

    NSE definitely needs to improve its IT coding standards, for the site is not programmed properly. This is not some casual statement made from hearsay: screen scraping the relevant files shocked me with the kind of coding standards being followed! Investigation of this sort is always fun.

  • I had used C++ earlier to screen scrape, but it failed because the library I was using was built for HTTP and not HTTPS. I could have rebuilt it, but after giving it a thought I ditched it in favor of a C# library, which worked like a charm. C# is a language for people who just want to get things done with no particular concern about efficiency, performance, etc. At least that's what I believe. There are tons of libraries already built, and you just have to use them. How much time does it take to get up to speed on C# screen scraping? Not more than a couple of hours.

  • Thanks to the poor formatting of the files, I had to resort to a manual search for the relevant tokens. I made it fun by using regex, which I had last used extensively probably a year ago. But you know, programming is like going to the gym. Go regularly and you are fit. Stop for a month and you become lazy; stop for a year and forget about running even 15 minutes on a treadmill. Your body will not cooperate. Programming is exactly like that. If you want to just type away without looking too much into the reference section of the language, without Ctrl+C / Ctrl+V (which most people indulge in; they call it programming, I call it effective copying), you have to program daily. If you don't enjoy it, get out of programming and do something else. So, regex was something I had stopped using a year ago. I mean, I had conveniently avoided it because, let's face it, doing it with the dumb Contains() or searching manually appears easier than regex. Using regex needs you to think. Searching plain text for tokens appeared correct because I did not want to invest time in creating the regex in the first place. A mundane activity like parsing an ill-formatted HTML file gave me an opportunity to go back to the gym (use regex, which I had not used in a year). Needless to say, it was like getting on a treadmill after a year. However, after spending 3-4 hours on regex, I was back to normalcy :).

  • So, here I was with a broker and sub-broker database, ready to hand it over and be done with it. I had already spent close to 20 hours on this task. Then I just looked at the data and thought: what "subway ride questions" would I want to answer about it? Meaning, if you are looking at data on a subway ride, with nothing else to do but wait until the train takes you to your destination, what questions would you ask about the data? This is the type of mindset where you furiously think up whatever random/rational/senseless questions you have about the data and quickly answer them. Well, I had so many other things to do. Should I spend any more time on this broker/sub-broker database? Then I thought, why not? This is probably the only thing I will recollect from this brief task, apart from the usual hacking recollections. So, I spent an additional hour looking at the data. Just that additional hour greatly helped me answer some questions that will support back-of-the-envelope calculations about ALL the brokers associated with NSE. Now that's some useful info.

  • Should I develop a Rails interface? That was a stretch. I have other important things to work on, so I decided to leave the task at this stage and hand over the database.

    Sometimes I do not know whether thinking in detail about a task hinders or enhances productivity. Had I not thought in detail about this task, maybe I would have done it in less time, and probably would not have had fun in the process. But one thing I do believe is that thinking about details at a micro^micro level makes any boring job a fun activity.
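For the curious, the regex-versus-Contains() point from the parsing step can be made concrete. A small Python sketch; the record format and registration numbers below are invented for illustration, not NSE's actual layout:

```python
import re

# A badly formatted fragment of the kind described above: inconsistent
# casing, stray whitespace, attributes all over the place.
messy = """
<td class=nm > SEBI Regn:  INB230914037 </td>
<TD>SEBI REGN : INB011234567</TD>
<td>no registration here</td>
"""

# A naive substring search ("Contains()") only tells you the label exists,
# leaving you to slice out the number by hand.  One regex handles the label,
# the erratic spacing, and the capture in a single step.
pattern = re.compile(r"SEBI\s+Regn\s*:\s*(IN[BF]\d{9})", re.I)

codes = pattern.findall(messy)
print(codes)  # prints ['INB230914037', 'INB011234567']
```

The thinking the regex demands up front (what exactly counts as a token? how sloppy can the surrounding text be?) is precisely what made it feel like getting back on the treadmill, and precisely why it beats manual searching on ill-formatted files.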