A few months ago, I had the thought of practicing Python every day for 20
minutes. If you use Python in your daily work, you should not rely on that
work as a substitute for a deliberate practice session. This was also echoed by
Josh Kaufman in his book, The First Twenty Hours, where he could not rely on
daily work that involved typing as a substitute for a deliberate practice
session on touch typing. If you are trying to learn touch typing, you might
assume that since you are typing emails, reports, etc. anyway, you are in essence
doing deliberate practice. Not really. Once you are in a deliberate practice
session, your focus becomes the craft itself rather than the outcome of the
specific task. Unless you set aside some time for it on a regular basis, it is
difficult to improve at any skill, be it touch typing or coding in Python.
In any case, setting aside a 20-minute time slot for going through the book
“Effective Python” helped me read it slowly and digest all the wonderful
information present in it. This book cannot be consumed in a few sittings. It
takes quite an amount of time to read, to think about, and to understand the
various ways in which one could improve the craft of coding.
This blog post summarizes some of the main points from the book.
Pythonic Thinking
Python version
```python
import sys
sys.version_info
```

```
sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
```
Difference between str and bytes
There are two types that represent sequences of character data: bytes and str.
Instances of bytes contain raw, unsigned 8-bit values.
```python
a = b'h\x65llo'
a, list(a)
```

```
(b'hello', [104, 101, 108, 108, 111])
```
Instances of str contain Unicode code points that represent textual characters
from human languages.

```python
a = 'a\u002A sdfdf'
a, list(a)
```
str instances do not have an associated binary encoding, and bytes instances do
not have an associated text encoding. To convert Unicode data to binary data,
you must call the encode method of str. To convert binary data to Unicode data,
you must call the decode method of bytes.
```python
def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode("utf-8")
    else:
        value = bytes_or_str
    return value

to_str('hello'), to_str(b'hello')
```
```python
def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value

to_bytes(b'foo'), to_bytes('bar')
```
- You can add two str instances or two bytes instances, but you cannot add a
  bytes instance to a str instance.
- If a file is opened in 'r' or 'w' mode, it operates in text mode: write
  operations expect str instances, and read operations use the system's default
  text encoding to interpret the data.
- If you want to read or write Unicode data to/from a file, be careful about the
  system's default text encoding. Explicitly pass the encoding parameter to open
  if you want to avoid surprises.
- If you want to read or write binary data to/from a file, always open the file
  using a binary mode (like 'rb' or 'wb'), as sketched after this list.
- bytes and str instances can't be used together with operators like >, ==, +
  and %.
- Use helper functions to ensure that the inputs you operate on are the type of
  character sequence you expect (8-bit values, UTF-8-encoded strings, Unicode
  code points).
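A minimal sketch of the file-mode guidance above; 'example.txt' is a hypothetical
file name used only for illustration:

```python
# Text mode with an explicit encoding avoids surprises from the system default
with open('example.txt', 'w', encoding='utf-8') as f:
    f.write('façade')

# Binary mode returns bytes; decode explicitly when you need text
with open('example.txt', 'rb') as f:
    raw = f.read()

assert raw.decode('utf-8') == 'façade'
```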
Python has four different ways of formatting strings that are built into the
language and the standard library.
- Use the formatting operator %. This style comes from C's printf function.
- One can also use the % operator with a dict.
```python
a = 0b10111011
b = 0xc5f
'Binary is %d, hex is %d' % (a, b)
```

```
Binary is 187, hex is 3167
```
Python 3 added support for advanced string formatting that is more expressive
than the old C-style format strings that use the % operator. For individual
Python values, this new functionality can be accessed through the format
built-in function.
```python
a = 1234.4
format(a, ',.2f')
```
```python
key = "rk"
value = "45"
'{}={}'.format(key, value)
```
- You can use the new functionality to format multiple values together by
  calling the new format method of the str type.
Python 3.6 added interpolated format strings – f-strings for short – to solve
most of the problems associated with displaying formatted strings.
- Python expressions may also appear within the format specifier options.
```python
key = "rk"
value = "45"
f'{key}={value}'
```
```python
key = "rk"
value = 45.12
f'{key:<10}={value:.1f}'
```
Takeaways
- C-style format strings that use the % operator suffer from a variety of
  gotchas and verbosity problems.
- The str.format method introduces some useful concepts in its formatting
  specifier mini-language, but it otherwise repeats the mistakes of C-style
  format strings and should be avoided.
- F-strings are a new syntax for formatting values into strings that solves the
  biggest problems with C-style format strings.
- F-strings are succinct yet powerful because they allow for arbitrary Python
  expressions to be directly embedded within the format specifiers.
Write Helper functions instead of Complex expressions
```python
from urllib.parse import parse_qs
my_values = parse_qs('red=5&blue=10')
my_values
```
Python's syntax makes it easy to write single-line expressions that are overly
complicated and difficult to read. It is better to move such complicated
expressions into helper functions.
Prefer Multiple Assignment Unpacking over Indexing
Unpacking has less visual noise than accessing the tuple’s indexes and it often
requires fewer lines.
```python
books_to_read = [
    ("R", "Resampling"),
    ("Python", "Effective Python"),
    ("Finance", "Factor Models in R"),
]
for i, (sub, book) in enumerate(books_to_read, 1):
    print(f"{i}: Subject {sub} Book {book}")
```
```
1: Subject R Book Resampling
2: Subject Python Book Effective Python
3: Subject Finance Book Factor Models in R
```
Unpacking is generalized in Python and can be applied to any iterable, including
many levels of iterables within iterables.
Prefer enumerate over range
- The << operator is the left shift operator
- The >> operator is the right shift operator
- The | operator is the bitwise OR operator
enumerate provides a concise syntax for looping over an iterator and getting the
index of each item from the iterator as you go.
- Prefer enumerate instead of looping over a range and indexing into a sequence.
- You can supply a second parameter to enumerate to specify the number from
  which to begin counting.
Use zip to process iterators in parallel
- Beware of the situation where the iterators are not of equal length: zip
  yields tuples until any one of the wrapped iterators is exhausted.
- One can also use zip_longest from itertools for the case where the iterators
  have varying lengths, as sketched below.
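A minimal sketch of the difference, assuming two lists of unequal length:

```python
import itertools

names = ['Cecilia', 'Lise', 'Marie']
counts = [7, 4]  # deliberately shorter than names

# zip stops as soon as the shortest iterator is exhausted
list(zip(names, counts))
# [('Cecilia', 7), ('Lise', 4)]

# zip_longest keeps going and fills in the missing values
list(itertools.zip_longest(names, counts, fillvalue=0))
# [('Cecilia', 7), ('Lise', 4), ('Marie', 0)]
```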
Avoid else Blocks After for and while loops
- The else block runs immediately after the loop finishes.
- The else block runs only if the loop body did not encounter a break statement.
- Avoid using else blocks after loops because their behavior isn't intuitive and
  can be confusing.
Prevent Repetition with Assignment Expressions
- An assignment expression - also known as the walrus operator - is a new syntax
  introduced in Python 3.8 to solve a long-standing problem with the language.
  It is used as follows:
```python
fresh_fruit = {
    'apple': 10, 'banana': 8, 'lemon': 5
}

if count := fresh_fruit.get('lemon', 0):
    print('Yes lemon')
else:
    print('No lemon')

if (count := fresh_fruit.get('lemon', 0)) > 4:
    print('Yes Cider')
else:
    print('No Cider')
```
- The walrus operator can also be used as a substitute for deeply nested
  if/elif/else statements.
- The walrus operator can also be used to eliminate the loop-and-a-half idiom,
  as sketched below.
- Although switch/case statements and do-while loops are not available in
  Python, their functionality can be emulated much more clearly by using
  assignment expressions.
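A minimal sketch of eliminating the loop-and-a-half idiom; make_batch is a
hypothetical helper introduced only for this example:

```python
def make_batch(source, size=3):
    """Remove and return up to `size` items from the front of `source`."""
    batch = source[:size]
    del source[:size]
    return batch

orders = list(range(10))

# Without the walrus operator this needs `while True: ... if not batch: break`.
# The assignment expression moves the exit condition into the while test.
while batch := make_batch(orders):
    print(f'processing {batch}')
```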
Lists and Dictionaries
Know How to Slice Sequences
- When slicing from the start of a list, you should leave out the zero index to
  reduce visual noise.
- When slicing to the end of a list, you should leave out the final index
  because it is redundant.
- The result of slicing a list is a whole new list.
- Assigning to a list slice replaces that range in the original sequence with
  what's referenced, even if the lengths are different.
Avoid Striding and Slicing in a Single Expression
- Specifying start, end and stride in a slice can be extremely confusing
- Prefer using positive stride values in slices without start or end indexes.
Avoid negative stride values if possible
- Avoid using start, end and stride together in a single slice. If you need all
three parameters, consider doing two assignments
Prefer Catch-All Unpacking over Slicing
```python
x = list(range(10))
a, b, *c = x
f"a:{a}, b:{b}, c:{c}"
```

```
a:0, b:1, c:[2, 3, 4, 5, 6, 7, 8, 9]
```
- Starred expressions may appear in any position, and they will always become a
  list containing the zero or more values they receive.
- When dividing a list into non-overlapping pieces, catch-all unpacking is much
  less error-prone than slicing and indexing.
Sort by Complex Criteria Using the key parameter
- Sorting arbitrary Python objects in a list works by invoking the relevant
  comparison methods on the objects. If the objects do not implement the
  comparison operators, sorted raises a TypeError.
```python
class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f"Tool({self.name!r}, {self.weight})"

tools = [Tool("level", 2), Tool("axe", 21)]
sorted(tools, key=lambda x: x.name), sorted(tools, key=lambda x: x.weight)
```
```
([Tool('axe', 21), Tool('level', 2)], [Tool('level', 2), Tool('axe', 21)])
```
- Tuples are comparable by default and have a natural ordering.
- Returning a tuple from the key function allows you to combine multiple sorting
  criteria together. The unary minus operator can be used to reverse individual
  sort orders for types that allow it.
- For types that can't be negated, you can combine many sorting criteria
  together by calling the sort method multiple times using different key
  functions and reverse values, in the order of lowest-rank sort call to
  highest-rank sort call, as sketched below.
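A minimal sketch of both approaches, using a namedtuple as a stand-in for the
Tool class above; the tool names and weights are made up for illustration:

```python
from collections import namedtuple

Tool = namedtuple('Tool', ['name', 'weight'])
power_tools = [Tool('drill', 4), Tool('circular saw', 5),
               Tool('jackhammer', 40), Tool('sander', 4)]

# One pass: heaviest first, then name ascending, via a tuple key with unary minus
by_weight_then_name = sorted(power_tools, key=lambda t: (-t.weight, t.name))

# Same result from two stable sorts, applied lowest-rank criterion first
power_tools.sort(key=lambda t: t.name)                  # secondary criterion
power_tools.sort(key=lambda t: t.weight, reverse=True)  # primary criterion
assert power_tools == by_weight_then_name
```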
Be Cautious When Relying on dict Insertion Ordering
- In Python 3.5 and before, iterating over a dict would return keys in an
  arbitrary order. This happened because the dictionary type previously
  implemented its hash table algorithm with a combination of the hash built-in
  function and a random seed that was assigned when the Python interpreter
  started.
- Starting with Python 3.6, and officially part of the Python spec in version
  3.7, dictionaries preserve insertion order.
Prefer get Over in and KeyError to Handle Missing Dictionary Keys
- There are four common ways to detect and handle missing keys in dictionaries:
  using in expressions, KeyError exceptions, the get method, and the setdefault
  method.
- The get method is best for dictionaries that contain basic types like
  counters, and it is preferable along with assignment expressions when creating
  dictionary values has a high cost or may raise exceptions (see the sketch
  after this list).
- setdefault tries to fetch the value of a key in the dictionary. If the key
  isn't present, the method assigns that key to the default value provided.
- When the setdefault method of dict seems like the best fit for your problem,
  you should consider using defaultdict instead.
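A minimal sketch of the get and setdefault approaches, using a hypothetical
votes dictionary:

```python
votes = {'baguette': ['Bob', 'Alice']}
key, who = 'brioche', 'Elmer'

# get plus an assignment expression: a single lookup and a clear flow
if (names := votes.get(key)) is None:
    votes[key] = names = []
names.append(who)

# setdefault does the fetch-or-insert in one call, but the default list
# is constructed on every call even when the key already exists
votes.setdefault('ciabatta', []).append('Hugh')
```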
Prefer defaultdict Over setdefault to Handle Missing Items in Internal State
- If you are creating a dictionary to manage an arbitrary set of potential keys,
  then you should prefer using a defaultdict instance from the collections
  built-in module if it suits your problem.
- If a dictionary of arbitrary keys is passed to you, and you don't control its
  creation, then you should prefer the get method to access its items. However,
  it's worth considering using the setdefault method for the few situations in
  which it leads to shorter code.
Know how to construct key-dependent default values using __missing__
- The setdefault method of dict is a bad fit when creating the default value has
  a high computational cost.
- The function passed to defaultdict must not require any arguments, which makes
  it impossible to have the default value depend on the key being accessed.
- You can define your own dict subclass with a __missing__ method in order to
  construct default values that must know which key was being accessed.
Functions
Never Unpack More than Three Variables When Functions Return Multiple Values
- You can have functions return multiple values by putting them in a tuple and
  having the caller take advantage of Python's unpacking syntax.
- Multiple return values from a function can also be unpacked by catch-all
  starred expressions.
- Unpacking into four or more variables is error-prone and should be avoided;
  use a namedtuple instance instead.
Prefer Raising Exceptions to Returning None
- Functions that return None to indicate special meaning are error-prone because
  None and other values all evaluate to False in conditional expressions.
- Raise exceptions to indicate special situations instead of returning None
  (a sketch follows this list).
- Type annotations can be used to make it clear that a function will never
  return the value None, even in special situations.
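A minimal sketch of raising instead of returning None, adapted from the book's
division example:

```python
def careful_divide(a, b):
    """Divides a by b.

    Raises:
        ValueError: When the inputs cannot be divided.
    """
    try:
        return a / b
    except ZeroDivisionError as e:
        raise ValueError('Invalid inputs') from e

try:
    result = careful_divide(1, 0)
except ValueError:
    print('Invalid inputs')   # the caller cannot mistake an error for 0 or False
```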
Know How Closures Interact with Variable Scope
- Python supports closures - that is, functions that refer to variables from the
  scope in which they were defined.
- Python has specific rules for comparing sequences. It first compares items at
  index zero; if they are equal, it compares items at index one, and so on.
- When you reference a variable in an expression, the Python interpreter
  traverses the scope to resolve the reference in this order:
  - the current function's scope
  - any enclosing scopes
  - the scope of the module that contains the code
  - the built-in scope
- Assigning a value to a variable works differently. If the variable is already
  defined in the current scope, it will just take on the new value. If the
  variable doesn't exist in the current scope, Python treats the assignment as a
  variable definition. Critically, the scope of the newly defined variable is
  the function that contains the assignment.
- There is special syntax for getting data out of a closure. The nonlocal
  statement is used to indicate that scope traversal should happen upon
  assignment for a specific variable name.
- Avoid using nonlocal statements for anything beyond simple functions.
- Use the nonlocal statement to indicate when a closure can modify a variable in
  its enclosing scope, as in the sketch below.
- By default, closures can't affect enclosing scopes by assigning variables.
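A minimal sketch of a closure assigning an enclosing variable via nonlocal,
along the lines of the book's sort_priority example:

```python
def sort_priority(numbers, group):
    found = False

    def helper(x):
        nonlocal found          # assign the variable in the enclosing scope
        if x in group:
            found = True
            return (0, x)       # group members sort first
        return (1, x)

    numbers.sort(key=helper)
    return found

numbers = [8, 3, 1, 2, 5, 4, 7, 6]
print(sort_priority(numbers, {2, 3, 5, 7}))  # True
print(numbers)                               # [2, 3, 5, 7, 1, 4, 6, 8]
```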
Reduce Visual Noise with variable positional arguments
- Optional positional arguments are always turned into a tuple before they are
  passed to a function.
- Functions that accept *args are best for situations where you know the number
  of inputs in the argument list will be reasonably small.
- Using the * operator with a generator may cause a program to run out of memory
  and crash.
Provide Optional Behavior with Keyword Arguments
- Positional arguments must be specified before keyword arguments
- Function arguments can be specified by position or by keyword
- Keywords make it clear what the purpose of each argument is when it would be
confusing with only positional arguments
- Keyword arguments with default values make it easy to add new behaviors to a
function without needing to migrate all existing callers
- Optional keyword arguments should always be passed by keyword instead of by
position
Use None and Docstrings to specify dynamic default arguments
- A default argument value is evaluated only once per module load, which usually
  happens when a program starts up. After the module containing this code is
  loaded, a datetime.now() default argument will never be evaluated again.
- Use None as the default value for any keyword argument that has a dynamic
  value, and document the actual default behavior in the function's docstring,
  as sketched below.
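A minimal sketch along the lines of the book's logging example:

```python
from datetime import datetime
from time import sleep

def log(message, when=None):
    """Log a message with a timestamp.

    Args:
        message: Message to print.
        when: datetime of when the message occurred.
            Defaults to the present time.
    """
    if when is None:
        when = datetime.now()
    print(f'{when}: {message}')

log('Hi there!')
sleep(0.1)
log('Hello again!')   # the two timestamps differ, unlike with when=datetime.now()
                      # as the default value
```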
Enforce clarity with Keyword-Only and Positional-Only Arguments
- Keyword-only arguments force callers to supply certain arguments by keyword,
  which makes the intention of the function call clearer. Keyword-only arguments
  are defined after a single * in the argument list.
- Positional-only arguments ensure that callers can't supply certain parameters
  using keywords, which helps reduce coupling. They are defined before a single
  / in the argument list.
- Parameters between the / and * characters in the argument list may be supplied
  by position or keyword.
- Decorators in Python are syntax that allows one function to modify another
  function at runtime.
- Using decorators can cause strange behaviors in tools that do introspection.
- Use the wraps decorator from the functools built-in module when you define
  your own decorators to avoid issues, as sketched after this list.
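A minimal sketch of a decorator that uses functools.wraps; the trace decorator
and fibonacci function are illustrative only:

```python
from functools import wraps

def trace(func):
    @wraps(func)                 # copy __name__, __doc__, etc. onto the wrapper
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(f'{func.__name__}({args!r}, {kwargs!r}) -> {result!r}')
        return result
    return wrapper

@trace
def fibonacci(n):
    """Return the n-th Fibonacci number."""
    if n in (0, 1):
        return n
    return fibonacci(n - 2) + fibonacci(n - 1)

fibonacci(3)
print(fibonacci.__name__)   # 'fibonacci', not 'wrapper', thanks to wraps
```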
Comprehensions and Generators
Use comprehensions instead of map and filter
- List comprehensions are cleaner than the map and filter built-in functions
  because they don't require lambda expressions.
```python
data = list(range(10))
x1 = [x * 2 for x in data if x % 2 == 0]
x2 = list(map(lambda x: x * 2, filter(lambda x: x % 2 == 0, data)))
x1 == x2
```
- List comprehensions allow you to easily skip items from the input list, a
  behavior that map doesn't support without help from filter.
- Dictionaries and sets can also be created using comprehensions.
Avoid more than two control subexpressions in comprehensions
- Comprehensions support multiple if conditions; multiple conditions at the same
  loop level have an implicit and expression.
- Comprehensions support multiple levels of looping and multiple conditions per
  loop level.
Avoid repeated work in comprehensions by using Assignment expressions
- If a comprehension uses the walrus operator in the value part of the
  comprehension and doesn't have a condition, it'll leak the loop variable into
  the containing scope.
- Assignment expressions make it possible for comprehensions and generator
  expressions to reuse the value from one condition elsewhere in the same
  comprehension, which can improve readability and performance.
Consider Generators Instead of Returning Lists
- Using generators can be clearer than the alternative of having a function
  return a list of accumulated results.
- The iterator returned by a generator produces the set of values passed to
  yield expressions within the generator function's body.
- Generators can produce a sequence of outputs for arbitrarily large inputs
  because their working memory doesn't include all inputs and outputs.
Be Defensive when iterating over arguments
- The iterator protocol is how Python for loops and related expressions traverse
  the contents of a container type. When Python sees a statement like
  for x in foo, it actually calls iter(foo). The iter built-in function calls
  the foo.__iter__ special method in turn. The __iter__ method must return an
  iterator object. Then, the for loop repeatedly calls the next built-in
  function on the iterator object until it's exhausted.
- When an iterator is passed to the iter built-in function, iter returns the
  iterator itself.
- When a container type is passed to iter, a new iterator object is returned
  each time.
- Beware of functions and methods that iterate over input arguments multiple
  times. If these arguments are iterators, you may see strange behavior and
  missing values.
- Python's iterator protocol defines how containers and iterators interact with
  the iter and next built-in functions, for loops, and related expressions.
- You can easily define your own iterable container type by implementing the
  __iter__ method as a generator.
- You can detect that a value is an iterator (instead of a container) if calling
  iter on it produces the same value as what you passed in. A sketch of both
  points follows.
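A minimal sketch, simplified from the book's file-reading example to use an
in-memory list:

```python
class ReadVisits:
    def __init__(self, data):
        self.data = data

    def __iter__(self):            # implemented as a generator
        for value in self.data:
            yield value

visits = ReadVisits([15, 35, 80])
print(sum(visits), sum(visits))    # a fresh iterator each time, so both sums work

def normalize_defensive(numbers):
    if iter(numbers) is numbers:   # an iterator, not a container: would be exhausted
        raise TypeError('Must supply a container')
    total = sum(numbers)
    return [100 * x / total for x in numbers]

print(normalize_defensive(visits))
```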
Consider Generator Expressions for Large List Comprehensions
- List comprehensions can cause problems for large inputs by using too much
  memory.
- Generator expressions avoid memory issues by producing outputs one at a time
  as iterators.
- Generator expressions can be composed by passing the iterator from one
  generator expression into the for subexpression of another.
- Generator expressions execute very quickly when chained together and are
  memory efficient.
Compose Multiple Generators with yield from
- The yield from expression allows you to compose multiple nested generators
  together into a single combined generator.
- yield from provides better performance than manually iterating nested
  generators and yielding their outputs.
Avoid Injecting Data into Generators with send
- Python generators support the send method, which upgrades yield expressions
  into a two-way channel. The send method can be used to provide streaming
  inputs to a generator at the same time it's yielding outputs.
- The send method can be used to inject data into a generator by giving the
  yield expression a value that can be assigned to a variable.
- Using send with yield from expressions may cause surprising behavior, such as
  None values appearing at unexpected times in the generator output.
- Providing an input iterator to a set of composed generators is a better
  approach than using the send method.
Avoid Causing State Transitions in Generators with throw
- The way throw works is simple: when the method is called, the next occurrence
  of a yield expression re-raises the provided Exception instance after its
  output is received instead of continuing normally.
- The throw method can be used to re-raise exceptions within generators at the
  position of the most recently executed yield expression.
Consider itertools for Working with Iterators and Generators
- Use chain to combine multiple iterators into a single sequential iterator.
- Use repeat to output a single value forever.
- Use cycle to repeat an iterator's items forever.
- Use tee to split a single iterator into a number of parallel iterators.
- Use islice to slice an iterator by numerical indexes without copying.
- Use takewhile and dropwhile to filter iterator values.
- accumulate folds an item from an iterator into a running value by applying a
  function that takes two parameters.
- product returns the Cartesian product of items from one or more iterators.
- permutations returns the unique ordered permutations of length N with items
  from an iterator.
- The itertools functions fall into three main categories for working with
  iterators and generators: linking iterators together, filtering items they
  output, and producing combinations of items. A few of them are sketched below.
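A quick, illustrative tour of a few of these functions:

```python
import itertools as it

list(it.chain([1, 2], [3, 4]))                    # [1, 2, 3, 4]
list(it.islice(it.cycle([1, 2]), 5))              # [1, 2, 1, 2, 1]
list(it.repeat('hi', 3))                          # ['hi', 'hi', 'hi']
list(it.accumulate([1, 2, 3, 4]))                 # running totals: [1, 3, 6, 10]
list(it.takewhile(lambda x: x < 3, [1, 2, 3, 1])) # [1, 2]
list(it.product([1, 2], ['a', 'b']))              # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
list(it.permutations([1, 2, 3], 2))               # all ordered pairs of distinct items
```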
Classes and Interfaces
Compose classes instead of nesting many levels of built-in types
- Avoid making dictionaries with values that are dictionaries, long tuples, or
  complex nestings of other built-in types.
- Use namedtuple for lightweight, immutable data containers before you need the
  flexibility of a full class.
- Move your bookkeeping code to using multiple classes when your internal state
  dictionaries get complicated.
- Although a namedtuple is useful in many circumstances, it's important to
  understand when it can do more harm than good:
  - You can't specify default argument values for namedtuple classes. This makes
    them unwieldy when your data may have many optional properties.
  - The attribute values of namedtuple instances are still accessible using
    numerical indexes and iteration.
Accept Functions Instead of Classes for Simple Interfaces
- Instead of defining and instantiating classes, you can often simply use
  functions for simple interfaces between components in Python.
- References to functions and methods in Python are first class, meaning they
  can be used in expressions.
- The __call__ special method enables instances of a class to be called like
  plain Python functions.
- When you need a function to maintain state, consider defining a class that
  provides the __call__ method instead of defining a stateful closure.
Use @classmethod Polymorphism to Construct Objects Generically
- Polymorphism enables multiple classes in a hierarchy to implement their own
  unique versions of a method. This means that many classes can fulfill the same
  interface or abstract base class while providing different functionality.
- Use @classmethod to define alternative constructors for your classes.
- Use class method polymorphism to provide generic ways to build and connect
  many concrete subclasses.
- https://realpython.com/courses/threading-python/
Initialize Parent Classes with super
- Python's standard method resolution order (MRO) solves the problems of
  superclass initialization order and diamond inheritance.
- Use the super built-in function with zero arguments to initialize parent
  classes.
- A metaclass lets you intercept Python's class statement and provide special
  behavior each time a class is defined.
Use Plain attributes instead of Setter and Getter methods
- In Python, you never need to implement explicit setter or getter methods.
  property() is a built-in function that creates and returns a property object.
  Its signature is:

```python
property(fget=None, fset=None, fdel=None, doc=None)
```

where
- fget is the function to get the value of the attribute
- fset is the function to set the value of the attribute
- fdel is the function to delete the attribute
- doc is a string that becomes the property's docstring
- Define new class interfaces using simple public attributes and avoid defining
  setter and getter methods.
- Use @property to define special behavior when attributes are accessed on your
  objects, if necessary, as sketched below.
- Follow the rule of least surprise and avoid odd side effects in your @property
  methods.
- Ensure that @property methods are fast; for slow or complex work - especially
  involving I/O or causing side effects - use normal methods instead.
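A minimal sketch of a validating setter behind plain attribute syntax; the
Resistor class is illustrative:

```python
class Resistor:
    def __init__(self, ohms):
        self._ohms = ohms

    @property
    def ohms(self):
        return self._ohms

    @ohms.setter
    def ohms(self, value):
        if value <= 0:
            raise ValueError(f'ohms must be > 0; got {value}')
        self._ohms = value

r = Resistor(1e3)
r.ohms = 10          # plain attribute syntax, validated by the setter
# r.ohms = 0         # would raise ValueError
```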
Consider @property instead of refactoring attributes
- Use @property to give existing instance attributes new functionality.
- Make incremental progress towards better data models by using @property.
- Consider refactoring a class and all call sites when you find yourself using
  @property too heavily.
Use Descriptors for Reusable @property methods
- The big problem with the @property built-in is reuse: the methods it decorates
  can't be reused for multiple attributes of the same class.
- The descriptor protocol defines how attribute access is interpreted by the
  language. A descriptor class can provide __get__ and __set__ methods that let
  you reuse any validation logic without boilerplate.
- The weakref module provides a special class called WeakKeyDictionary that can
  take the place of a simple dictionary. Its unique behavior is that Python does
  the bookkeeping for you, and the dictionary will be empty when all of its keys
  are no longer in use.
- Reuse the behavior and validation of @property methods by defining your own
  descriptor classes.
- Use WeakKeyDictionary to ensure that your descriptor classes don't cause
  memory leaks.
- Don't get bogged down trying to understand exactly how __getattribute__ uses
  the descriptor protocol for getting and setting attributes.
Use __getattr__, __getattribute__, and __setattr__ for Lazy attributes
I learned that it is important to pay attention to whether your classes have an
implementation of __getattribute__.
- Use __getattr__ and __setattr__ to lazily load and save attributes for an
  object, as sketched below.
- Understand that __getattr__ only gets called when accessing a missing
  attribute, whereas __getattribute__ gets called every time any attribute is
  accessed.
- Avoid infinite recursion in __getattribute__ and __setattr__ by using methods
  from super() to access instance attributes.
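A minimal sketch of lazily constructing attributes with __getattr__, along the
lines of the book's LazyRecord example:

```python
class LazyRecord:
    def __init__(self):
        self.exists = 5

    def __getattr__(self, name):
        # Called only when `name` is not found through normal attribute lookup
        value = f'Value for {name}'
        setattr(self, name, value)   # cache it so __getattr__ isn't called again
        return value

data = LazyRecord()
print(data.foo)   # 'Value for foo', created lazily on first access
print(data.foo)   # now an ordinary instance attribute; __getattr__ is skipped
```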
Validate Subclasses with __init_subclass__
- A metaclass is defined by inheriting from type.
- A metaclass receives the contents of the associated class statement in its
  __new__ method.
- The metaclass has access to the name of the class, the parent classes it
  inherits from, and all the class attributes that are defined in the class
  body.
- Python 3.6 introduced the simplified syntax __init_subclass__ that can be used
  to validate subclasses when they are defined.
Register Class Existence with __init_subclass__
- Class registration is a helpful pattern for building modular Python programs
- Metaclasses let you run registration code automatically each time a base
class is subclassed in a program
- Using metaclasses for class registration helps you avoid errors by ensuring
that you never miss a registration call
- Prefer __init_subclass__ over standard metaclass machinery because it's
  clearer and easier for beginners to understand.
Concurrency and Parallelism
Use subprocess to manage child processes
- Python has many ways to run subprocesses, but the best choice for managing
  child processes is to use the subprocess built-in module.
- Child processes run in parallel with the Python interpreter, enabling you to
  maximize your usage of CPU cores.
- Use the run convenience function for simple usage, and the Popen class for
  advanced usage like UNIX-style pipelines (a run sketch follows this list).
- Use the timeout parameter of the communicate method to avoid deadlocks and
  hanging child processes.
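A minimal sketch of subprocess.run, assuming a UNIX-like system where the echo
command is available:

```python
import subprocess

result = subprocess.run(
    ['echo', 'Hello from the child!'],
    capture_output=True, encoding='utf-8', timeout=5)

result.check_returncode()   # raises CalledProcessError if the exit status was non-zero
print(result.stdout)
```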
Use threads for blocking I/O, Avoid for Parallelism
Because of the way CPython works, threading may not speed up all tasks. This is
due to interactions with the GIL that essentially limit execution to one Python
thread at a time.
- The standard implementation of Python is called CPython. CPython runs a Python
  program in two steps. First, it parses and compiles the source text into
  bytecode, which is a low-level representation of the program. Then, CPython
  runs the bytecode using a stack-based interpreter. The bytecode interpreter
  has state that must be maintained and coherent while the program executes.
  CPython enforces coherence with the GIL.
- The GIL is a mutex that prevents CPython from being affected by preemptive
  multithreading, where one thread takes control of a program by interrupting
  another thread.
- Why does Python support threads at all?
  - Multiple threads make it easy for a program to seem like it's doing multiple
    things at the same time. Managing the juggling act of simultaneous tasks is
    difficult to implement yourself. With threads, you can leave it to Python to
    run your functions concurrently.
  - They help in dealing with blocking I/O, which happens when Python does
    certain types of system calls.
- System calls will run in parallel from multiple Python threads even though
  they are limited by the GIL. The GIL prevents Python code from running in
  parallel, but it has no effect on system calls. This works because Python
  threads release the GIL just before they make system calls, and they reacquire
  the GIL as soon as the system calls are done.
- Use Python threads to make multiple system calls in parallel. This allows you
  to do blocking I/O at the same time as computation.
Use Lock to prevent data races in threads
- Although only one Python thread runs at a time, a thread's operations on data
  structures can be interrupted between any two bytecode instructions in the
  Python interpreter.
```python
from threading import Thread

class Counter:
    def __init__(self):
        self.count = 0

    def increment(self, offset):
        self.count += offset

def worker(sensor_index, how_many, counter):
    for _ in range(how_many):
        counter.increment(1)

how_many = 10 ** 5
counter = Counter()

threads = []
for i in range(5):
    thread = Thread(target=worker, args=(i, how_many, counter))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

expected = how_many * 5
found = counter.count
print(f"Counter should be {expected}, got {found}")
```
```
Counter should be 500000, got 374258
```
The Python interpreter enforces fairness between all of the threads that are
executing to ensure they get roughly equal processing time. To do this, Python
suspends a thread as it's running and resumes another thread in turn. The
problem is that you don't know exactly when Python will suspend your threads. A
thread can even be paused seemingly halfway through what looks like an atomic
operation. The above program can easily be modified with the help of Lock to get
the desired output:
```python
from threading import Thread
from threading import Lock

class Counter:
    def __init__(self):
        self.count = 0
        self.lock = Lock()

    def increment(self, offset):
        with self.lock:
            self.count += offset

def worker(sensor_index, how_many, counter):
    for _ in range(how_many):
        counter.increment(1)

how_many = 10 ** 5
counter = Counter()

threads = []
for i in range(5):
    thread = Thread(target=worker, args=(i, how_many, counter))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

expected = how_many * 5
found = counter.count
print(f"Counter should be {expected}, got {found}")
```
```
Counter should be 500000, got 500000
```
Use Queue to Coordinate Work Between Threads
- Pipelines are a great way to organize sequences of work - especially I/O-bound
  programs - that run concurrently using multiple Python threads.
- Be aware of the many problems in building concurrent pipelines: busy waiting,
  how to tell workers to stop, and potential memory explosion.
- The Queue class has all the facilities you need to build robust pipelines:
  blocking operations, buffer sizes, and joining.
Know How to Recognize When Concurrency is Necessary
- A program often grows to require multiple concurrent lines of execution as its
scope and complexity increases
- The most common types of concurrency coordination are fan-out (generating new
  units of concurrency) and fan-in (waiting for existing units of concurrency to
  complete).
- Python has many different ways of achieving fan-out and fan-in
Avoid Creating New Thread Instances for On-demand Fan-out
- Thread instances require special tools to coordinate with each other safely.
  This makes code that uses threads harder to reason about than procedural,
  single-threaded code. This complexity makes threaded code more difficult to
  extend and maintain over time.
- Threads require a lot of memory - about 8 MB per executing thread. On many
  computers, that amount of memory doesn't matter for, say, 100 threads. But if
  you spawn 10,000 threads, it becomes an issue, as you would need about 80 GB
  of memory.
- Starting a thread is costly, and threads have a negative performance impact
  when they run due to context switching between them. In the book's Game of
  Life example, all of the threads are started and stopped each generation of
  the game, which has high overhead and increases latency beyond the expected
  I/O time.
- The Thread class will independently catch any exceptions that are raised by
  the target function and then write their traceback to sys.stderr. Such
  exceptions are never re-raised to the caller that started the thread in the
  first place.
- Threads have many downsides: they're costly to start and run if you need a lot
  of them, they each require a significant amount of memory, and they require
  special tools like Lock instances for coordination.
- Threads do not provide a built-in way to raise exceptions back in the code
  that started a thread or that is waiting for one to finish, which makes them
  difficult to debug.
Understand How Using Queue for Concurrency Requires Refactoring
- Using Queue instances with a fixed number of worker threads improves the
  scalability of fan-out and fan-in using threads.
- It takes a significant amount of work to refactor existing code to use Queue,
  especially when multiple stages of a pipeline are required.
- Using Queue fundamentally limits the total amount of I/O parallelism a program
  can leverage compared to alternative approaches provided by other built-in
  Python features and modules.
Consider ThreadPoolExecutor when threads are necessary for concurrency
- Python includes the concurrent.futures built-in module, which provides the
  ThreadPoolExecutor class. It combines the best of Thread and Queue.
- The threads used for the executor can be allocated in advance, which means
  there is no startup cost for each execution.
- ThreadPoolExecutor automatically propagates exceptions back to the caller.
- The big problem with using ThreadPoolExecutor is that it won't be able to
  scale.
- Although ThreadPoolExecutor eliminates the potential memory blow-up issues of
  using threads, it also limits I/O parallelism by requiring max_workers to be
  specified upfront.
- ThreadPoolExecutor enables simple I/O parallelism with limited refactoring,
  easily avoiding the cost of thread startup each time fan-out concurrency is
  required, as sketched below.
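A minimal fan-out/fan-in sketch with ThreadPoolExecutor; the URLs are arbitrary
examples and the snippet assumes network access:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

urls = ['https://www.python.org', 'https://pypi.org']

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as conn:
        return url, len(conn.read())

with ThreadPoolExecutor(max_workers=4) as pool:            # pool size fixed upfront
    futures = [pool.submit(fetch, url) for url in urls]    # fan-out
    for future in futures:                                  # fan-in
        print(future.result())   # re-raises any exception from the worker thread
```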
Achieve Highly Concurrent I/O with Coroutines
- Python addresses the need for highly concurrent I/O with coroutines.
  Coroutines let you have a very large number of seemingly simultaneous
  functions in your Python programs.
- The cost of starting a coroutine is a function call. Once a coroutine is
  active, it uses less than 1 KB of memory until it's exhausted.
- Like threads, coroutines are independent functions that can consume inputs
  from their environment and produce resulting outputs. The difference is that
  coroutines pause at each await expression and resume executing an async
  function after the pending awaitable is resolved.
- The magic mechanism powering coroutines is the event loop, which can do highly
  concurrent I/O efficiently, while rapidly interleaving execution between
  appropriately written functions.
- The beauty of coroutines is that they decouple your code's instructions for
  the external environment from the implementation that carries out your wishes.
- Coroutines can use fan-out and fan-in in order to parallelize I/O while also
  overcoming all the problems associated with doing I/O in threads.
Know how to port threaded I/O to asyncio
- Python's support for asynchronous execution is well integrated into the
  language.
- Python provides asynchronous versions of for loops, with statements,
  generators, comprehensions, and library helper functions that can be used as
  drop-in replacements in coroutines.
- The asyncio built-in module makes it straightforward to port existing code
  that uses threads and blocking I/O over to coroutines and asynchronous I/O.
Consider concurrent.futures for True Parallelism
- It enables Python to utilize multiple CPU cores in parallel by running
  additional interpreters as child processes. These child processes are separate
  from the main interpreter, so their global interpreter locks are also
  separate. Each child can fully utilize one CPU core. Each child has a link to
  the main process where it receives instructions to do computation and returns
  results.
- What does ProcessPoolExecutor do? (A sketch follows this list.)
  - It takes each item from the args list.
  - It serializes each item into binary data using the pickle module.
  - It copies the serialized data from the main interpreter process to a child
    interpreter process over a local socket.
  - It deserializes the data back into Python objects using pickle in the child
    process.
  - It imports the Python module containing the relevant function.
  - It runs the function on the input data in parallel with other child
    processes.
  - It serializes the results back into binary data.
  - It copies the binary data back through the socket.
  - It deserializes the binary data back into Python objects in the parent
    process.
  - It merges the results from multiple children.
- Moving CPU bottlenecks to C-extension modules can be an effective way to
  improve performance while maximizing your investment in Python code.
- The multiprocessing module provides powerful tools that can parallelize
  certain types of Python computation with minimal effort.
- The power of multiprocessing is best accessed through the concurrent.futures
  built-in module.
- Avoid the advanced parts of the multiprocessing module until you have
  exhausted all other options.
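A minimal sketch of ProcessPoolExecutor, along the lines of the book's gcd
example; the number pairs are arbitrary:

```python
from concurrent.futures import ProcessPoolExecutor

def gcd(pair):
    a, b = pair
    low = min(a, b)
    for i in range(low, 0, -1):
        if a % i == 0 and b % i == 0:
            return i
    assert False, 'Not reachable'

numbers = [(1963309, 2265973), (2030677, 3814172),
           (1551645, 2229620), (2039045, 2020802)]

if __name__ == '__main__':
    # Each pair is pickled, shipped to a child interpreter, computed there in
    # parallel, and the result is pickled back to the parent.
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(gcd, numbers))
    print(results)
```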
Robustness and Performance
Take Advantage of Each Block in try/except/else/finally
- Use try/finally when you want exceptions to propagate up but also want to run
  cleanup code even when exceptions occur.
- Use try/except/else to make it clear which exceptions will be handled by your
  code and which exceptions will propagate up.
- Use try/except/else/finally when you want to do it all in one compound
  statement. For example, say that I want to read a description of work to do
  from a file, process it, and then update the file in place. The try block is
  used to read the file and process it; the except block is used to handle
  exceptions from the try block that are expected; the else block is used to
  update the file in place and allow related exceptions to propagate up; and the
  finally block cleans up the file handle. A sketch follows this list.
- The else block helps you minimize the amount of code in try blocks and
  visually distinguish the success case from the try/except blocks.
- An else block can be used to perform additional actions after a successful try
  block but before common cleanup in a finally block.
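A sketch of the file-updating example described above, adapted from the book;
path is a hypothetical file containing a JSON object with numerator and
denominator fields:

```python
import json

UNDEFINED = object()

def divide_json(path):
    handle = open(path, 'r+')                 # may raise OSError
    try:
        data = handle.read()                  # may raise UnicodeDecodeError
        op = json.loads(data)                 # may raise ValueError
        value = op['numerator'] / op['denominator']  # may raise ZeroDivisionError
    except ZeroDivisionError:
        return UNDEFINED                      # the expected, handled failure
    else:
        op['result'] = value
        handle.seek(0)
        handle.write(json.dumps(op))          # unexpected errors here propagate up
        return value
    finally:
        handle.close()                        # always runs
```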
Consider contextlib and with Statements for Reusable try/finally Behavior
- The with statement in Python is used to indicate when code is running in a
  special context.
- It is easy to make your objects and functions work in with statements by using
  the contextlib built-in module. This module contains the contextmanager
  decorator, which lets a simple function be used in with statements. This is
  much easier than defining a new class with the special methods __enter__ and
  __exit__.
- The context manager passed to a with statement may also return an object. The
  object is assigned to a local variable in the as part of the compound
  statement.
- The value yielded by a context manager is supplied to the as part of the with
  statement. It is useful for letting your code directly access the cause of a
  special context.
- The contextlib built-in module provides a contextmanager decorator that makes
  it easy to use your own functions in with statements, as sketched below.
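A minimal sketch of a contextmanager-based function that also yields a value for
the as target; the logging setup is illustrative:

```python
from contextlib import contextmanager
import logging

@contextmanager
def debug_logging(level):
    logger = logging.getLogger()
    old_level = logger.getEffectiveLevel()
    logger.setLevel(level)
    try:
        yield logger                  # the value bound by `with ... as logger`
    finally:
        logger.setLevel(old_level)    # restored even if the body raises

with debug_logging(logging.DEBUG) as logger:
    logger.debug('Visible only inside the with block')
```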
Use datetime instead of time for Local clocks
- The time module fails to work properly for multiple local times. Thus, you
  should avoid using the time module for this purpose. If you must use time, use
  it only to convert between UTC and the host computer's local time.
- datetime only provides the machinery for time zone operations with its tzinfo
  class and related methods. The Python default installation is missing time
  zone definitions besides UTC.
- To use pytz effectively, you should always convert local times to UTC first.
  Perform any datetime operations you need on the UTC values. Then convert to
  local times as a final step.
- Always represent time in UTC and do conversions to local time as the very
  final step before presentation.
Make pickle reliable with copyreg
- The purpose of pickle is to let you pass Python objects between programs that
  you control over binary channels.
- If you serialize, deserialize, and then serialize again after making changes
  to the classes, there will be inconsistencies between previously serialized
  objects and the most recently serialized objects.
- Deserializing previously pickled objects may break if the classes involved
  have changed over time.
- The copyreg module lets you register the functions responsible for serializing
  and deserializing Python objects, allowing you to control the behavior of
  pickle and make it more reliable.
- Use the copyreg built-in module with pickle to ensure backward compatibility
  of serialized objects.
Use decimal when precision is paramount
- The Decimal class from the decimal built-in module provides fixed-point math
  of 28 decimal places by default.
- The Decimal class is ideal for situations that require high precision and
  control over rounding behavior, such as computations of monetary values.
- Pass str instances to the Decimal constructor instead of float instances if
  it's important to compute exact answers and not floating point approximations.
  A sketch follows.
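A minimal sketch of exact decimal arithmetic with explicit rounding; the rate
and duration are made-up billing numbers:

```python
from decimal import Decimal, ROUND_UP

rate = Decimal('1.45')                # pass a str, not a float, for exact values
seconds = Decimal(222)
cost = rate * seconds / Decimal(60)   # Decimal('5.365')

print(cost.quantize(Decimal('0.01'), rounding=ROUND_UP))   # 5.37
```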
Profile before optimizing
- Python provides a built-in profiler for determining which parts of a program
  are responsible for its execution time. This means you can focus your
  optimization efforts on the biggest sources of trouble and ignore parts of the
  program that don't impact speed.
- Python provides two built-in profilers: one that is pure Python and another
  that is a C-extension module. The cProfile built-in module is better because
  of its minimal impact on the performance of your program while it's being
  profiled.
- The Profile object's runcall method provides everything you need to profile a
  tree of function calls in isolation.
- The Stats object lets you select and print the subset of profiling information
  you need to see to understand your program's performance. A sketch follows.
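A minimal sketch of profiling a single call tree with cProfile and pstats;
insertion_sort is a deliberately slow toy function:

```python
from cProfile import Profile
from pstats import Stats
from random import randint

def insertion_sort(data):
    result = []
    for value in data:
        for i, existing in enumerate(result):
            if existing >= value:
                result.insert(i, value)
                break
        else:
            result.append(value)
    return result

data = [randint(0, 10 ** 4) for _ in range(1000)]

profiler = Profile()
profiler.runcall(lambda: insertion_sort(data))   # profile just this call tree

stats = Stats(profiler)
stats.strip_dirs().sort_stats('cumulative').print_stats()
```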
Prefer deque for Producer-Consumer Queues
- The list type can be used as a FIFO queue by having the producer call append
  to add items and the consumer call pop(0) to receive items. However, this may
  cause problems because the performance of pop(0) degrades superlinearly as the
  queue length increases.
- The deque class from the collections built-in module takes constant time -
  regardless of length - for append and popleft, making it ideal for FIFO
  queues.
Consider Searching Sorted Sequences with bisect
- Searching sorted data contained in a list takes linear time using the index
  method or a for loop with simple comparisons.
- The bisect built-in module's bisect_left function takes logarithmic time to
  search for values in sorted lists, which can be orders of magnitude faster
  than other approaches, as sketched below.
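A minimal sketch of bisect_left on a sorted list:

```python
from bisect import bisect_left

data = list(range(10 ** 5))            # already sorted

index = bisect_left(data, 91234)       # logarithmic-time search for an exact value
assert index == 91234

index = bisect_left(data, 91234.5)     # also finds the insertion point for a
assert data[index] == 91235            # value that isn't present
```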
Know How to Use heapq for Priority Queues
Testing and Debugging
Consider Interactive Debugging with pdb
- In most other programming languages, you use a debugger by specifying what
  line of a source file you would like to stop on, and then execute the program.
  In contrast, with Python, the easiest way to use the debugger is by modifying
  your program to directly initiate the debugger just before you think you'll
  have an issue worth investigating.
- Three very useful commands make inspecting the running program easier.
- When you are done inspecting the current state, you can use these five
  debugger commands to control the program's execution: step, next, return,
  continue, and quit.
- The Python debugger prompt is a full Python shell that lets you inspect and
  modify the state of a running program.
Use tracemalloc to understand memory usage and leaks
- Memory management in the default implementation of Python, CPython, uses
  reference counting. This ensures that as soon as all references to an object
  have expired, the referenced object is also cleared from memory, freeing up
  that space for other data. CPython also has a built-in cycle detector to
  ensure that self-referencing objects are eventually garbage collected. In
  theory, this means that most Python programmers don't have to worry about
  allocating or deallocating memory in their programs.
- One of the first ways to debug memory usage is to ask the gc built-in module
  to list every object currently known by the garbage collector.
- It can be difficult to understand how Python programs use and leak memory.
- The gc module can help you understand which objects exist, but it has no
  information about how they were allocated.
- The tracemalloc built-in module provides powerful tools for understanding the
  sources of memory usage, as sketched below.
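A minimal sketch of comparing tracemalloc snapshots around a suspect piece of
code; the list of objects stands in for whatever you are investigating:

```python
import tracemalloc

tracemalloc.start(10)                     # keep up to 10 stack frames per allocation
before = tracemalloc.take_snapshot()

x = [object() for _ in range(100_000)]   # stand-in for the code being investigated

after = tracemalloc.take_snapshot()
stats = after.compare_to(before, 'lineno')
for stat in stats[:3]:
    print(stat)                           # the biggest sources of new memory usage
```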
Collaboration
- The Python Package Index contains a wealth of common packages that are built
  and maintained by the Python community.
- pip is the command line tool you can use to install packages from PyPI.
- The majority of PyPI modules are free and open source software.
Use Virtual Environments for Isolated and Reproducible Environments
- Virtual environments allow you to use pip to install many different versions
  of the same package on the same machine without conflicts.
- Virtual environments are created with python -m venv, enabled with
  source bin/activate, and disabled with deactivate.
- You can dump all the requirements of an environment with
  python3 -m pip freeze.
- You can reproduce an environment by running
  python3 -m pip install -r requirements.txt.
Write Docstrings for Every Function, Class, and Module
- Documentation in Python is extremely important because of the dynamic nature
  of the language. Python provides built-in support for attaching documentation
  to blocks of code. Unlike with many other languages, the documentation from a
  program's source code is directly accessible as the program runs.
- You can use the built-in pydoc module from the command line to run a local web
  server that hosts all the Python documentation that's accessible to your
  interpreter.
- Each module should have a top-level docstring - a string literal that is the
  first statement in the source file. The goal of this docstring is to introduce
  the module and its contents.
- If you are using type annotations, omit the information that's already present
  in type annotations from docstrings, since it would be redundant to have it in
  both places.
- For functions and methods: document every argument, returned value, raised
  exception, and other behaviors in the docstring following the def statement.
- For classes: document behavior, important attributes, and subclass behavior in
  the docstring following the class statement.
Use Packages to Organize Modules and Provide Stable APIs
- Packages in Python are modules that contain other modules. Packages allow you
  to organize your code into separate, non-conflicting namespaces with unique
  absolute module names.
- Simple packages are defined by adding an __init__.py file to a directory that
  contains other source files. These files become the child modules of the
  directory's package. Package directories may also contain other packages.
- You can provide an explicit API for a module by listing its publicly visible
  names in its __all__ special attribute.
- You can hide a package's internal implementation by only importing public
  names in the package's __init__.py file or by naming internal-only members
  with a leading underscore.
- When collaborating within a single team or on a single codebase, using __all__
  for explicit APIs is probably unnecessary.
- Programs often need to run in multiple deployment environments that each have
  unique assumptions and configurations.
- You can tailor a module's contents to different deployment environments by
  using normal Python statements in module scope.
- Module contents can be the product of any external condition, including host
  introspection through the sys and os modules.
Define a Root Exception to Insulate Callers from APIs
- Root exceptions let callers understand when there's a problem with their usage
  of an API. If callers are using the API properly, they should catch the
  various exceptions that are deliberately raised.
- Root exceptions also help in finding bugs.
- Intermediate root exceptions let you add more specific types of exceptions in
  the future without breaking your API consumers.
- Catching the Python Exception base class can help you find bugs in API
  implementations. A sketch follows.
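A minimal sketch of a module-level root exception hierarchy; the module, class,
and function names are hypothetical:

```python
# my_module.py
class Error(Exception):
    """Base class for all exceptions raised by this module."""

class InvalidDensityError(Error):
    """There was a problem with a provided density value."""

def determine_weight(volume, density):
    if density <= 0:
        raise InvalidDensityError('Density must be positive')
    return volume * density

# Caller code: catching the root Error insulates callers from new, more
# specific subclasses added later; catching Exception would surface API bugs.
try:
    determine_weight(1, 0)
except Error as e:
    print(f'Unexpected invalid inputs: {e}')
```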
Know how to break circular dependencies
- When a module is imported, here's what Python actually does:
  - Searches for the module in locations from sys.path
  - Loads the code from the module and ensures that it compiles
  - Creates a corresponding empty module object
  - Inserts the module into sys.modules
  - Runs the code in the module object to define its contents
- The attributes of a module aren't defined until the code for those attributes
  has executed. But the module can be loaded with the import statement
  immediately after it's inserted into sys.modules.
- Dynamic imports are the simplest solution for breaking a circular dependency
  between modules while minimizing refactoring and complexity.
Consider warnings to Refactor and Migrate Usage
- Using warnings is a programmatic way to inform other programmers that their
  code needs to be modified due to a change to an underlying library that they
  depend on. While exceptions are primarily for automated error handling by
  machines, warnings are all about communication between humans about what to
  expect in their collaboration with each other.
- warnings.warn also supports the stacklevel parameter, which makes it possible
  to report the correct place in the stack as the cause of the warning.
  stacklevel also makes it easy to write functions that can issue warnings on
  behalf of other code, reducing boilerplate. A sketch follows.
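A minimal sketch of warning about a deprecated calling convention; require_int
is a hypothetical function used only for illustration:

```python
import warnings

def require_int(value):
    if not isinstance(value, int):
        warnings.warn('Passing non-int values is deprecated',
                      DeprecationWarning,
                      stacklevel=2)   # blame the caller's line, not this one
        value = int(value)
    return value

warnings.simplefilter('error')        # e.g. turn warnings into exceptions in tests
try:
    require_int(1.5)
except DeprecationWarning as e:
    print(f'Caught: {e}')
```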
Consider Static Analysis via typing to Obviate Bugs – WORK IN PROGRESS
- The benefit of adding type information to a Python program is that you can run
  static analysis tools to ingest a program's source code and identify where
  bugs are most likely to occur. The typing built-in module doesn't actually
  implement any of the type checking functionality itself. It merely provides a
  common library for defining types, including generics, that can be applied to
  Python code and consumed by separate tools.
- The most popular implementations of typing tools are mypy, pytype, pyright,
  and pyre.
- There are many new constructs in this chapter that I have never paid attention
  to. In fact, I have hardly written any code that uses the typing module for
  annotations. I should probably spend some time going over the typing module
  and incorporate it in my daily work.
- A wide variety of other options are available in the typing module. Notably,
  exceptions are not included. Exceptions are not considered part of an
  interface's definition. Thus, if you want to verify that you are raising and
  catching exceptions properly, you need to write tests.
- It's going to slow you down if you try to use type annotations from the start
  when writing a new piece of code. A general strategy is to write a first
  version without annotations, then write tests, and then add type information
  where it's most valuable.
- Type hints are most important at the boundaries of a codebase, such as an API
  you provide that many callers depend on. Type hints complement integration
  tests and warnings to ensure that your API callers aren't surprised or broken
  by your changes.
- It can be useful to apply type hints to the most complex and error-prone parts
  of your code that aren't part of an API.
- If possible, you should include static analysis as part of your automated
  build and test system to ensure that every commit to your codebase is vetted
  for errors. In addition, the configuration used for type checking should be
  maintained in the repository to ensure that all of the people you collaborate
  with are using the same rules.
- As you add type information to your code, it's important to run the type
  checker as you go. Otherwise, you may nearly finish sprinkling type hints
  everywhere and then be hit by a huge wall of errors from the type checking
  tool, which can be disheartening and make you want to abandon type hints
  altogether.
- It's important to note that in many situations, you may not need or want to
  use any type annotations at all. For small programs, ad hoc code, legacy
  codebases, and prototypes, type hints may require far more effort than they
  are worth.
- Python has special syntax and the typing built-in module for annotating
  variables, fields, functions, and methods with type information. A sketch
  follows.
- Static type checkers can leverage type information to help you avoid many
  common bugs that would otherwise happen at runtime.
- There are a variety of best practices for adopting types in your programs,
  using them in APIs, and making sure they don't get in the way of your
  productivity.
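A minimal sketch of the annotation syntax; the functions are illustrative, and a
separate tool such as mypy would perform the actual checking:

```python
from typing import Dict, Optional

def get_total(counts: Dict[str, int], key: str) -> int:
    # A checker like mypy can flag calls that pass the wrong types
    return counts.get(key, 0)

def find_user(name: str) -> Optional[str]:
    # Optional makes it explicit that None is a legitimate return value
    users = {'rk': 'Radha'}
    return users.get(name)
```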
Takeaway
This book is targeted towards intermediate-level Python developers and can be a
useful reference for writing beautiful code. If you write throwaway code most of
the time, then you can probably give this book a pass. However, if you are
writing, or intend to write, code that will be reused by you or others, now or
in the future, this book can be a valuable reference for writing effective code.