Dive Into Python – Summary – Part II
Chapter 6 - Exceptions and File Handling
- `try...except` serves the same function as the try/catch block in other languages
- `try...except...else`: if no exception is raised in the try block, the else clause is executed afterwards
- `try...finally`: the code in the finally block will always be executed, even if something in the try block raises an exception
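A minimal sketch of all three clauses together (the filename here is hypothetical):

```python
try:
    f = open('data.txt')          # may raise IOError
except IOError:
    print 'could not open file'   # runs only if the open failed
else:
    print f.read()                # runs only if no exception was raised
    f.close()
finally:
    print 'this always executes'  # runs whether or not an exception occurred
```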
- Most other languages don't have a list datatype as powerful as Python's, so you don't need to use for loops as often
- `os.environ` is a dictionary of the environment variables defined on your system
- The `sys` module contains system-level information, such as the version of Python you are using
- `sys.modules` is a dictionary; given the name of any previously imported module, you can get a reference to the module itself through the `sys.modules` dictionary
- The `os.path.split` function splits a full pathname and returns a tuple containing the path and filename
- The `os.path.splitext` function splits a filename and returns a tuple containing the filename and the file extension
- `os.path.isfile` and `os.path.isdir` are useful to check whether a path refers to a file or a directory (a short sketch of these appears after the listing below)
- The `glob` module helps in filtering the files in a directory using wildcards, as in the following example
```python
import os
import glob

# Collect just the filenames (not the full paths) of all mp3 files
y = [os.path.split(f)[1]
     for f in glob.glob("C:\\Cauldron\\Personal\\Songs\\Eagles\\D2\\*.mp3")]
for i in y:
    print i
```

This prints:
01 - Life In The Fast Lane.mp3
02 - Wasted Time.mp3
03 - Victim Of Love.mp3
04 - The Last Resort.mp3
05 - New Kid In Town.mp3
06 - Please Come Home For Christmas.mp3
07 - Heartache Tonight.mp3
08 - The Sad Cafe.mp3
09 - I Cant Tell You Why.mp3
10 - The Long Run.mp3
11 - In The City.mp3
12 - Those Shoes.mp3
13 - Seven Bridges Road (Live).mp3
14 - Love Will Keep Us Alive.mp3
15 - Get Over It.mp3
16 - Hole In The World.mp3
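As referenced in the bullets above, a minimal sketch of `splitext`, `isfile`, and `isdir`, using the same hypothetical music folder:

```python
import os

path = "C:\\Cauldron\\Personal\\Songs\\Eagles\\D2\\01 - Life In The Fast Lane.mp3"
print os.path.splitext(os.path.split(path)[1])   # ('01 - Life In The Fast Lane', '.mp3')
print os.path.isfile(path)                       # True if the file exists
print os.path.isdir(os.path.split(path)[0])      # True if the folder exists
```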
`fileinfo.py` has taught me a lot about Python syntax and OOP concepts. I think it will take a looooong time before I manage to write a program that is as elegant and succinct as `fileinfo.py`.
Chapter 7 - Regular Expressions
This chapter introduces regular expressions in a superb manner by using three case studies: the first involves parsing street addresses, the second parsing roman numerals, and the third parsing telephone numbers. All the regex 101 aspects are discussed, such as:
- `^` matches the beginning of the string
- `$` matches the end of the string
- `\b` matches a word boundary
- `\d` matches any numeric digit
- `\D` matches any non-numeric character
- `x?` matches an optional x character
- `x*` matches x zero or more times
- `x+` matches x one or more times
- `x{n,m}` matches an x character at least n times, but not more than m times
- `(a|b|c)` matches either a or b or c
- `(x)` in general is a remembered group; you can get the value of what it matched by using the `groups()` function
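A minimal sketch of remembered groups applied to the telephone-number case (the pattern here is my own simplification, not the book's exact one):

```python
import re

# Capture area code, trunk, and number as remembered groups;
# \D* skips any non-digit separators between them.
phone_pattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})$')

match = phone_pattern.search('800-555-1212')
if match:
    print match.groups()   # ('800', '555', '1212')
```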
Chapter 8 - HTML Processing
The chapter starts off by showing a program that looked overwhelming to me. It's a program that parses an external HTML document, translates its text into various dialects, and renders the result as another, translated HTML document. So, at the outset, reading through the program, I did not understand most of it. Basically, that's the style maintained throughout the book: introduce a pretty involved program and then explain each of the steps involved in it.
So, the book starts off by talking about `SGMLParser`, which takes in an HTML document and consumes it. Well, that's all it does. So, what's the use of such a class? One has to subclass it and provide methods so that one can do interesting things. For example, one can define a `start_a` method and list all the URLs in a page. This means that instead of manually ploughing through the data to find all the `<a href>` tags, you can extend this method and get all the links in the page. If a method has not been defined for a specific tag, the `unknown_starttag` method is invoked. The chapter then talks about `locals` and `globals`, built-in functions that are useful in string formatting (a short sketch of `locals`-based formatting appears after the list below). So, the basic structure of this chapter is: start with `SGMLParser`, subclass it to create `BaseHTMLProcessor`, subclass that to create `Dialectizer`, and then subclass it again to create various language-specific Dialectizers. One gets to understand the way to make a program extensible by reading `dialect.py` carefully. This chapter makes one realize the power of `sgmllib.py` to manipulate HTML by turning its structure into an object model. This module can be used in many different ways, some of them being:
- parsing the HTML looking for something specific
- aggregating the results, like the URL lister
- altering the structure along the way
- transforming the HTML into something else by manipulating the text while leaving the tags alone
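As mentioned above, a minimal sketch of string formatting with `locals()` (the function and values are my own illustration):

```python
# locals() returns a dictionary of the local variables, which the
# %-style formatting operator can use for named substitutions.
def make_tag(tag, href):
    return '<%(tag)s href="%(href)s">' % locals()

print make_tag('a', 'http://example.com/')   # <a href="http://example.com/">
```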
After going through this chapter, I learnt to write a basic link crawler, given the depth of the crawl:
```python
import time
import urllib
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Collect the href attribute of every <a> tag encountered
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)

def get_all_links(link_input):
    usock = urllib.urlopen(link_input)
    parser = URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()
    return parser.urls

def union(p, q):
    # Merge the elements of q into p, avoiding duplicates
    for e in q:
        if e not in p:
            p.append(e)

def check_link(item, tocrawl):
    # True if item is already queued for crawling
    for i, j in tocrawl:
        if i == item:
            return True
    return False

def crawl_web(seed, max_depth):
    depth = 0
    tocrawl = [[seed, depth]]   # each entry is [url, depth]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page[1] > max_depth:
            continue
        if page[0] not in crawled:
            links = get_all_links(page[0])
            for item in links:
                if not check_link(item, tocrawl):
                    tocrawl.append([item, page[1] + 1])
            crawled.append(page[0])
    return crawled
```
```python
start = time.clock()
# print crawl_web("http://codekata.pragprog.com/", 1)
end = time.clock()
print 'The time taken for crawling is', end - start
# The time taken for crawling is 222.981091936 seconds
```
Subsequently, I used Udacity's code to extract the links:
```python
import time
import urllib

seed = r'http://www.udacity.com/cs101x/index.html'

def scrape_url(seed):
    # Download the page and flatten it into a single string
    res = []
    f = urllib.urlopen(seed)
    for line in f:
        res.append(line.strip())
    f.close()
    return " ".join(res)

# print scrape_url(seed)

def get_next_target(page):
    # Find the next '<a href=' tag and return the quoted URL after it
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

# print get_all_links(scrape_url(seed))

def union(p, q):
    # Merge the elements of q into p, avoiding duplicates
    for e in q:
        if e not in p:
            p.append(e)

def check_link(item, tocrawl):
    # True if item is already queued for crawling
    for i, j in tocrawl:
        if i == item:
            return True
    return False

def crawl_web(seed, max_depth):
    depth = 0
    tocrawl = [[seed, depth]]   # each entry is [url, depth]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page[1] > max_depth:
            continue
        if page[0] not in crawled:
            links = get_all_links(scrape_url(page[0]))
            for item in links:
                if not check_link(item, tocrawl):
                    tocrawl.append([item, page[1] + 1])
            crawled.append(page[0])
    return crawled
```
```python
start = time.clock()
# print crawl_web("http://codekata.pragprog.com/", 1)
end = time.clock()
print 'The time taken for crawling is', end - start
# The time taken for crawling is 108.015982502 seconds
```
The string-scraping version took half the time that the SGMLParser version took. So there is always a trade-off: using prebuilt modules is not always the fastest choice.
Chapter 9 - XML Processing
The chapter starts with a 250-line program that was overwhelming for me to go through. However, the author promises that he will take the reader carefully over all aspects of the 250-line code. After this mega code, the chapter starts talking about packages and the need for organizing Python programs into packages. The `xml` package uses Unicode to store all parsed XML data, and hence the chapter then dwells on the history of Unicode. Python uses the `ascii` encoding scheme whenever it needs to auto-coerce a unicode string into a regular string. The last two sections of the chapter talk about searching for elements and accessing element attributes in an XML document. Overall, this chapter shows that accessing and reading an XML document in Python is made easy by the `xml` module, as the sketch below illustrates.
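A minimal sketch of searching for elements and accessing attributes with `xml.dom.minidom` (the document and tag names here are my own illustration, not the book's grammar files):

```python
from xml.dom import minidom

# Parse a small XML document from a string
xmldoc = minidom.parseString('<grammar><ref id="bit">0 1</ref></grammar>')

# Search for elements by tag name
for ref in xmldoc.getElementsByTagName('ref'):
    # Access an element attribute via the attributes mapping
    print ref.attributes['id'].value   # bit
```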
Chapter 10 - Scripts and Streams
- One of the powerful uses of dynamic binding is the file-like object
- A file-like object is any object with a `read` method that takes an optional size parameter
- File-like objects are useful in the sense that the source could be anything: a local file, a remote XML document, or a string (a short sketch follows this list)
- Standard output and standard error are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information, it goes to the stderr pipe
- stdout and stderr are both file-like objects. They are both write-only
- In a Windows-based IDE, stdout and stderr default to the interactive window
- To read command-line arguments, you can either import `sys` and use the `sys.argv` list or use the `getopt` module
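A minimal sketch of treating different sources uniformly as file-like objects (the helper below is my own simplification of the idea, not the book's exact code):

```python
import urllib
import StringIO

def open_anything(source):
    # A URL, a local filename, or a raw string all end up as an
    # object with a read() method, so callers need not care which.
    if source.startswith('http://'):
        return urllib.urlopen(source)
    try:
        return open(source)
    except IOError:
        return StringIO.StringIO(source)

print open_anything('just a plain string').read()
```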
I have skipped Chapters 11 and 12, which are based on web services and SOAP; I will refer to them for a general idea at a later date.
There are six more chapters in this book that I plan to go over this week.