
Chapter 6 - Exceptions and File Handling

  • try...except serves the same purpose as the try/catch block in other languages
  • try...except...else: if no exception is raised in the try block, the else clause is executed afterwards
  • try...finally: the code in the finally block will always be executed, even if something in the try block raises an exception (a short sketch of these constructs follows this list)
  • Most other languages don’t have a list datatype as powerful as Python’s, so in Python you don’t need to use for loops that often
  • os.environ is a dictionary of the environment variables defined on your system
  • The sys module contains system-level information, such as the version of Python you’re running
  • sys.modules is a dictionary containing all the modules that have been imported
  • Given the name of any previously imported module, you can get a reference to the module itself through the sys.modules dictionary
  • The os.path.split function splits a full pathname and returns a tuple containing the path and the filename
  • The os.path.splitext function splits a filename and returns a tuple containing the base name and the file extension
  • isfile and isdir are useful to check whether a given path refers to a file or a directory
  • The glob module helps in filtering the files in a folder using wildcard patterns
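
A minimal sketch of my own (not the book’s fileinfo.py) that exercises try...except...else, try...finally, os.environ and sys.modules; the file name notes.txt is hypothetical:

import os
import sys

try:
    fsock = open('notes.txt')        # hypothetical file name
except IOError:
    print 'the file does not exist'
else:
    print 'the file exists'
    fsock.close()

try:
    print 'doing some work'
finally:
    print 'this always runs, even if the try block had raised an exception'

print os.environ['PATH']             # environment variables behave like a dictionary
print sys.modules['os']              # a reference to the already-imported os module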

import os
import time
import glob

y = [os.path.split(f)[1] for f in glob.glob("C:\\Cauldron\\Personal\\Songs\\Eagles\\D2\\*.mp3")]
for i in y:
    print i

01 - Life In The Fast Lane.mp3

02 - Wasted Time.mp3

03 - Victim Of Love.mp3

04 - The Last Resort.mp3

05 - New Kid In Town.mp3

06 - Please Come Home For Christmas.mp3

07 - Heartache Tonight.mp3

08 - The Sad Cafe.mp3

09 - I Cant Tell You Why.mp3

10 - The Long Run.mp3

11 - In The City.mp3

12 - Those Shoes.mp3

13 - Seven Bridges Road (Live).mp3

14 - Love Will Keep Us Alive.mp3

15 - Get Over It.mp3

16 - Hole In The World.mp3

  • fileinfo.py has taught me a lot about Python syntax and OOP concepts. I think it will take a looooong time before I manage to write a program that is as elegant and succinct as fileinfo.py

Chapter 7 - Regular Expressions

This chapter introduces regular expressions in a superb manner using three case studies: the first involves parsing street addresses, the second parsing Roman numerals, and the third parsing telephone numbers. All the regex 101 aspects are discussed, such as the following (a short example combining a few of these appears after the list):

  • ^ matches the beginning of the string

  • $ matches the end of the string

  • \b matches a word boundary

  • \d matches any numeric digit

  • \D matches any non-numeric character

  • x? matches an optional x character (zero or one time)

  • x* matches x zero or more times

  • x+ matches x one or more times

  • x{n,m} matches an x character at least n times, but not more than m times

  • (a|b|c) matches either a or b or c

  • (x) in general is a remembered group. You can get the value of what was matched by using the groups() method of the match object
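
Here is a short example of my own (the phone-number pattern is a simplified one, not the book’s) that combines ^, $, \d, {n,m} and remembered groups:

import re

# a simplified US-style phone number: three groups of digits separated by hyphens
phone_pattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

match = phone_pattern.search('800-555-1212')
if match:
    print match.groups()             # prints ('800', '555', '1212')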

Chapter 8 - HTML Processing

The chapter starts off by showing a program that looked overwhelming to me. It’s a program that parses an external HTML page, translates the text it contains into various languages, and renders it into another, translated HTML page. So, at the outset, reading through the program I did not understand most of it. Basically, that’s the style maintained throughout the book: introduce a pretty involved program and then explain each of the steps involved in it.

So, the book starts off by talking about SGMLParser, which takes in an HTML document and consumes it. Well, that’s all it does. So, what’s the use of such a class? One has to subclass it and provide the methods so that one can do interesting things. For example, one can define a start_a method and list all the URLs in a page. This means that instead of manually ploughing through the data to find all the <a href> tags, you can extend this method and get all the links in the page. If a method has not been defined for a specific tag, the unknown_starttag method is invoked. The chapter then talks about locals() and globals(), functions that are useful in string formatting (a small sketch of my own appears after the list below). So, the basic structure of this chapter is: start with SGMLParser, subclass it to create a BaseHTMLProcessor, subclass that to create Dialectizer, and then subclass that to create the various language-specific Dialectizers. One gets to understand the way to make a program extensible by reading dialect.py carefully. This chapter makes one realize the power of sgmllib.py to manipulate HTML by turning its structure into an object model. This module can be used in many different ways, some of them being:

  • parsing the HTML looking for something specific

  • aggregating the results, like the URL lister

  • altering the structure along the way

  • transforming the HTML into something else by manipulating the text while leaving the tags alone
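
As a small exercise on the ideas above, here is a sketch of my own in the spirit of BaseHTMLProcessor (not the book’s actual code): any tag without a dedicated start_<tag> method falls through to unknown_starttag, and locals() keeps the string formatting compact.

from sgmllib import SGMLParser

class TagEchoer(SGMLParser):
    def unknown_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        print '<%(tag)s%(strattrs)s>' % locals()

parser = TagEchoer()
parser.feed('<html><body><a href="http://example.com/">hi</a></body></html>')
parser.close()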

After going through this chapter, I learnt to write a basic link crawler, given the depth of the crawl:

import time
import urllib
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)

def get_all_links(link_input):
    usock = urllib.urlopen(link_input)
    parser = URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()
    return parser.urls

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def check_link(item, tocrawl):
    for i, j in tocrawl:
        if i == item: return True
    return False

def crawl_web(seed, max_depth):
    depth   = 0
    tocrawl = [[seed, depth]]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page[1] > max_depth: continue
        if page[0] not in crawled:
            links = get_all_links(page[0])
            for item in links:
                status = check_link(item, tocrawl)
                if status == False: tocrawl.append([item, page[1] + 1])
            crawled.append(page[0])
    result = []
    for item in crawled:
        result.append(item)
    return result

start = time.clock()
#print crawl_web("http://codekata.pragprog.com/", 1)
end = time.clock()
print 'The time taken for crawling is', end - start

#The time taken for crawling is 222.981091936

The time taken for crawling is 4.46656485876e-07

Subsequently I used Udacity’s code to generate the links:

seed = r'http://www.udacity.com/cs101x/index.html'

import urllib
import re
import time

def scrape_url(seed):
    res = []
    f   = urllib.urlopen(seed)
    for line in f.fp:
        temp = line.strip()
        res.append(temp)
    return " ".join(res)

#print scrape_url(seed)

def get_next_target(page):
    start_link = page.find('<a href=')       # look for the start of the next anchor tag
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote   = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

#print get_all_links(scrape_url(seed))

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def check_link(item, tocrawl):
    for i, j in tocrawl:
        if i == item: return True
    return False

def crawl_web(seed, max_depth):
    depth   = 0
    tocrawl = [[seed, depth]]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page[1] > max_depth: continue
        if page[0] not in crawled:
            links = get_all_links(scrape_url(page[0]))
            for item in links:
                status = check_link(item, tocrawl)
                if status == False: tocrawl.append([item, page[1] + 1])
            crawled.append(page[0])
    result = []
    for item in crawled:
        result.append(item)
    return result

start = time.clock()
#print crawl_web("http://codekata.pragprog.com/", 1)
end = time.clock()
print 'The time taken for crawling is', end - start

#The time taken for crawling is 108.015982502 seconds

The time taken for crawling is 4.46656485876e-07

It took half the time that the SGMLParser version took. So there is always a downside: using prebuilt modules is not always the fastest choice.

Chapter 9 - XML Processing

The chapter starts with a 250-line program that was overwhelming for me to go through. However, the author promises that he will take the reader carefully over all aspects of the 250-line code. After this mega code, the chapter starts talking about packages and the need for organizing Python programs into packages. The XML package uses Unicode to store all parsed XML data, and hence the chapter then dwells on the history of Unicode. Python uses the ASCII encoding scheme whenever it needs to auto-coerce a Unicode string into a regular string. The last two sections of the chapter talk about searching for elements and accessing element attributes in an XML document. Overall, this chapter shows that accessing and reading an XML document in Python is made easy by the xml module.
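
To fix the last two ideas in my head, here is a minimal sketch of my own (not the book’s example) using xml.dom.minidom with a made-up snippet of XML:

from xml.dom import minidom

xml_source = '<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>'
doc = minidom.parseString(xml_source)

refs = doc.getElementsByTagName('ref')   # search for elements by tag name
print refs[0].attributes['id'].value     # access an element attribute: prints bit
print refs[0].toxml()                    # the element rendered back as XML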

Chapter 10 - Scripts and Streams

  • One of the most powerful uses of dynamic binding is the file-like object

  • A file-like object is any object with a read method that takes an optional size parameter

  • File-like objects are useful in the sense that the source could be anything: a local file, a remote XML document, or a string

  • Standard output and standard error are pipes that are built in to every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information, that goes to the stderr pipe

  • stdout and stderr are both file-like objects. They are both write-only (see the sketch after this list)

  • In a Windows-based IDE, stdout and stderr default to the interactive window

  • To read command-line arguments, you can either import sys and use the sys.argv list or use the getopt module
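
A minimal sketch of my own (not from the book) tying a few of these together: StringIO provides a file-like object, so it can temporarily stand in for the write-only stdout pipe, and sys.argv holds the command-line arguments.

import sys
from StringIO import StringIO

buffer = StringIO()                  # a file-like object living purely in memory
saved_stdout = sys.stdout
sys.stdout = buffer                  # anything printed now goes into the buffer
print 'this goes into the StringIO object'
sys.stdout = saved_stdout            # restore the real stdout

print 'captured:', buffer.getvalue().strip()
print 'command line arguments:', sys.argv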

I have skipped Chapters 11 and 12, which are based on web services and SOAP. I will refer to them for a general idea at a later date.

There are six more chapters in this book that I plan to go over this week.