Programming with Python
Instructor’s Guide
Legend
We are using a dataset with records on inflammation from patients following an arthritis treatment.
We make reference in the lesson that this data is somehow strange. It is strange because it is fabricated! The script used to generate the inflammation data is included as tools/gen_inflammation.py
.
Overall
This lesson is written as an introduction to Python, but its real purpose is to introduce the single most important idea in programming: how to solve problems by building functions, each of which can fit in a programmer’s working memory. In order to teach that, we must teach people a little about the mechanics of manipulating data with lists and file I/O so that their functions can do things they actually care about. Our teaching order tries to show practical uses of every idea as soon as it is introduced; instructors should resist the temptation to explain the “other 90%” of the language as well.
The final example asks them to build a command-line tool that works with the Unix pipe-and-filter model. We do this because it is a useful skill and because it helps learners see that the software they use isn’t magical. Tools like grep
might be more sophisticated than the programs our learners can write at this point in their careers, but it’s crucial they realize this is a difference of scale rather than kind.
Explain that we use Python because:
- It’s free.
- It has a lot of scientific libraries, and more are constantly being added.
- It has a large scientific user community.
- It’s easier for novices to learn than most of the mature alternatives. (Software Carpentry originally used Perl; when we switched, we found that we could cover as much material in two days in Python as we’d covered in three days in Perl, and that retention was higher.)
We do not include instructions on running the Jupyter Notebook in the tutorial because we want to focus on the language rather than the tools. Instructors should, however, walk learners through some basic operations:
- Launch from the command line with
jupyter notebook
. - Create a new notebook.
- Enter code or data in a cell and execute it.
- Explain the difference between
In[#]
andOut[#]
.
Watching the instructor grow programs step by step is as helpful to learners as anything to do with Python. Resist the urge to update a single cell repeatedly (which is what you’d probably do in real life). Instead, clone the previous cell and write the update in the new copy so that learners have a complete record of how the program grew. Once you’ve done this, you can say, “Now why don’t we just break things into small functions right from the start?”
The discussion of command-line scripts assumes that students understand standard I/O and building filters, which are covered in the lesson on the shell.
Frequently Argued Issues (FAI)
import ... as ...
syntax.This syntax is commonly used in the scientific Python community; it is explicitly recommended in documentation to
import numpy as np
andimport matplotlib.pyplot as plt
. Despite that, we have decided not to introduce aliasing imports in this novice lesson due to the additional cognitive load it puts on students, despite the typing that it saves. A good summary of arguments for and against can be found in PR #61.It is up to you as an individual instructor whether you want to introduce these aliases when you teach this lesson, but we encourage you to please read those arguments thoroughly before deciding one way or the other.
NumPy methods.
We used to use NumPy array methods in the first NumPy topic. We switched these methods to the equivalent functions because a majority of instructors supported the change; see PR #244 for detailed arguments for and against the change.
Underscores vs. hyphens in filenames
We used to use hyphens in filenames in order to signify that these Python files should only be run as scripts and never imported. However, after some discussion, including an informal Twitter poll, we switched over to underscores because many files that start off as Python scripts end up being imported eventually. For that reason, we also added
if __name__ == '__main__'
guards aroundmain()
calls, which is how real-world Python scripts ensure that imports do not result in side-effects.
Analyzing Patient Data
Solutions to exercises:
Sorting out references
What does the following program print out?
first, second = 'Grace', 'Hopper'
third, fourth = second, first
print(third, fourth)
Hopper Grace
Slicing strings
A section of an array is called a slice. We can take slices of character strings as well:
element = 'oxygen'
What is the value of element[:4]
? What about element[4:]
? Or element[:]
?
oxyg
en
oxygen
What is element[-1]
? What is element[-2]
?
n
e
Given those answers, explain what element[1:-1]
does.
Creates a substring from index 1 up to (not including) the final index, effectively removing the first and last letters from 'oxygen'
Thin slices
The expression element[3:3]
produces an empty string, i.e., a string that contains no characters. If data
holds our array of patient data, what does data[3:3, 4:4]
produce? What about data[3:3, :]
?
print(data[3:3, 4:4])
print(data[3:3, :])
[]
[]
Check your understanding: plot scaling
Why do all of our plots stop just short of the upper end of our graph? Update your plotting code to automatically set a more appropriate scale (hint: you can make use of the max
and min
methods to help)
Because matplotlib normally sets x and y axes limits to the min and max of our data (depending on data range)
# for example:
axes3.set_ylabel('min')
axes3.plot(numpy.min(data, axis=0))
axes3.set_ylim(0,6)
# or a more automated approach:
min_data = numpy.min(data, axis=0)
axes3.set_ylabel('min')
axes3.plot(min_data)
axes3.set_ylim(numpy.min(min_data), numpy.max(min_data) * 1.1)
Check your understanding: drawing straight lines
Why are the vertical lines in our plot of the minimum inflammation per day not perfectly vertical?
Because matplotlib interpolates (draws a straight line) between the points
Make your own plot
Create a plot showing the standard deviation (numpy.std
) of the inflammation data for each day across all patients.
max_plot = matplotlib.pyplot.plot(numpy.std(data, axis=0))
matplotlib.pyplot.show()
Moving plots around
Modify the program to display the three plots on top of one another instead of side by side.
import numpy
import matplotlib.pyplot
data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
# change figsize (swap width and height)
fig = matplotlib.pyplot.figure(figsize=(3.0, 10.0))
# change add_subplot (swap first two parameters)
axes1 = fig.add_subplot(3, 1, 1)
axes2 = fig.add_subplot(3, 1, 2)
axes3 = fig.add_subplot(3, 1, 3)
axes1.set_ylabel('average')
axes1.plot(numpy.mean(data, axis=0))
axes2.set_ylabel('max')
axes2.plot(numpy.max(data, axis=0))
axes3.set_ylabel('min')
axes3.plot(numpy.min(data, axis=0))
fig.tight_layout()
matplotlib.pyplot.show()
Repeating Actions with Loops
Solutions to exercises:
From 1 to N
Using range
, write a loop that uses range
to print the first 3 natural numbers.
for i in range(1, 4):
print(i)
1
2
3
Computing powers with loops
Write a loop that calculates the same result as 5 ** 3
using multiplication (and without exponentiation).
result = 1
for i in range(0, 3):
result = result * 5
print(result)
125
Reverse a string
Write a loop that takes a string, and produces a new string with the characters in reverse order.
newstring = ''
oldstring = 'Newton'
length_old = len(oldstring)
for char_index in range(length_old):
newstring = newstring + oldstring[length_old - char_index - 1]
print(newstring)
'notweN'
After discussing these challenges could be a good time to introduce the b *= 2
syntax.
Storing Multiple Values in Lists
Solutions to exercises:
Turn a string into a list
Use a for
loop to convert the string "hello"
into a list of letters:
my_list = []
for char in "hello":
my_list.append(char)
print(my_list)
["h", "e", "l", "l", "o"]
Analyzing Data from Multiple Files
Solutions to exercises:
Plotting Differences
Plot the difference between the average of the first dataset and the average of the second dataset, i.e., the difference between the leftmost plot of the first two figures.
import glob
import numpy
import matplotlib.pyplot
filenames = glob.glob('data/inflammation*.csv')
data0 = numpy.loadtxt(fname=filenames[0], delimiter=',')
data1 = numpy.loadtxt(fname=filenames[1], delimiter=',')
fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))
matplotlib.pyplot.ylabel('Difference in average')
matplotlib.pyplot.plot(data0.mean(axis=0) - data1.mean(axis=0))
fig.tight_layout()
matplotlib.pyplot.show()
Making Choices
Solutions to exercises:
How many paths?
Which of the following would be printed if you were to run this code? Why did you pick this answer?
if 4 > 5:
print('A')
elif 4 == 5:
print('B')
elif 4 < 5:
print('C')
C gets printed, because the first two conditions, 4 > 5
and 4 == 5
are not true, but 4 < 5
is true.
What is truth?
After reading and running the code below, explain the rules for which values are considered true and which are considered false.
if '':
print('empty string is true')
if 'word':
print('word is true')
if []:
print('empty list is true')
if [1, 2, 3]:
print('non-empty list is true')
if 0:
print('zero is true')
if 1:
print('one is true')
First line prints nothing: an empty string is false Second line prints 'word is true'
: a non-empty string is true Third line prints nothing: an empty list is false Fourth line prints 'non-empty list is true'
: a non-empty list is true Fifth line prints nothing: 0 is false Sixth line prints 'one is true'
: 1 is true
Close enough
Write some conditions that print True
if the variable a
is within 10% of the variable b
and False
otherwise.
a = 5
b = 5.1
if abs(a - b) < 0.1 * abs(b):
print('True')
else:
print('False')
Another possible solution:
print(abs(a - b) < 0.1 * abs(b))
This works because the boolean objects True
and False
have string representations which can be print
ed.
In-place operators
Write some code that sums the positive and negative numbers in a list separately, using in-place operators.
positive_sum = 0
negative_sum = 0
test_list = [3, 4, 6, 1, -1, -5, 0, 7, -8]
for num in test_list:
if num > 0:
positive_sum += num
elif num == 0:
pass
else:
negative_sum += num
print(positive_sum, negative_sum)
21 -14
Here pass
means “don’t do anything”. In this particular case, it’s not actually needed, since if num == 0
neither sum needs to change, but it illustrates the use of elif
.
Tuples and exchanges
Explain what the overall effect of this code is:
left = 'L'
right = 'R'
temp = left
left = right
right = temp
The code swaps the contents of the variables right and left.
Compare it to:
left, right = right, left
Do they always do the same thing? Which do you find easier to read?
Yes, although it’s possible the internal implementation is different. Answers will vary on which is easier to read.
Creating Functions
Solutions to exercises:
Combining strings
Write a function called fence
that takes two parameters called original
and wrapper
and returns a new string that has the wrapper character at the beginning and end of the original.
def fence(original, wrapper):
return wrapper + original + wrapper
Selecting characters from strings
Write a function called outer
that returns a string made up of just the first and last characters of its input.
def outer(input_string):
return input_string[0] + input_string[-1]
Rescaling an array
Write a function rescale
that takes an array as input and returns a corresponding array of values scaled to lie in the range 0.0 to 1.0. (Hint: If \(L\) and \(H\) are the lowest and highest values in the original array, then the replacement for a value \(v\) should be \((v-L) / (H-L)\).)
def rescale(input_array):
L = numpy.min(input_array)
H = numpy.max(input_array)
output_array = (input_array - L) / (H - L)
return output_array
Testing and documenting your function
Run the commands help(numpy.arange)
and help(numpy.linspace)
to see how to use these functions to generate regularly-spaced values, then use those values to test your rescale
function. Once you’ve successfully tested your function, add a docstring that explains what it does.
'''Takes an array as input, and returns a corresponding array scaled so
that 0 corresponds to the minimum and 1 to the maximum value of the input array.
Examples:
>>> rescale(numpy.arange(10.0))
array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ])
>>> rescale(numpy.linspace(0, 100, 5))
array([ 0. , 0.25, 0.5 , 0.75, 1. ])
'''
Defining defaults
Rewrite the rescale
function so that it scales data to lie between 0.0 and 1.0 by default, but will allow the caller to specify lower and upper bounds if they want. Compare your implementation to your neighbor’s: do the two functions always behave the same way?
def rescale(input_array, low_val=0.0, high_val=1.0):
'''rescales input array values to lie between low_val and high_val'''
L = numpy.min(input_array)
H = numpy.max(input_array)
intermed_array = (input_array - L) / (H - L)
output_array = intermed_array * (high_val - low_val) + low_val
return output_array
Variables inside and outside functions
What does the following piece of code display when run - and why?
f = 0
k = 0
def f2k(f):
k = ((f-32)*(5.0/9.0)) + 273.15
return k
print(f2k(8))
print(f2k(41))
print(f2k(32))
print(k)
259.81666666666666
287.15
273.15
0
k
is 0 because the k
inside the function f2k
doesn’t know about the k
defined outside the function.
Errors and Exceptions
Solutions to exercises:
Reading Error Messages
Read the traceback below, and identify the following pieces of information about it:
- How many levels does the traceback have?
- What is the file name where the error occurred?
- What is the function name where the error occurred?
- On which line number in this function did the error occurr?
- What is the type of error?
- What is the error message?
import errors_02
errors_02.print_friday_message()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-2-e4c4cbafeeb5> in <module>()
1 import errors_02
----> 2 errors_02.print_friday_message()
/Users/jhamrick/project/swc/novice/python/errors_02.py in print_friday_message()
13
14 def print_friday_message():
---> 15 print_message("Friday")
/Users/jhamrick/project/swc/novice/python/errors_02.py in print_message(day)
9 "sunday": "Aw, the weekend is almost over."
10 }
---> 11 print(messages[day])
12
13
KeyError: 'Friday'
- 3 levels
errors_02.py
print_message
- 11
KeyError
- There isn’t really a message; you’re supposed to infer that
Friday
is not a key inmessages
.
Identifying Syntax Errors
- Read the code below, and (without running it) try to identify what the errors are.
- Run the code, and read the error message. Is it a
SyntaxError
or anIndentationError
? - Fix the error.
- Repeat steps 2 and 3, until you have fixed all the errors.
def another_function
print("Syntax errors are annoying.")
print("But at least python tells us about them!")
print("So they are usually not too hard to fix.")
SyntaxError
for missing ():
at end of first line, IndentationError
for mismatch between second and third lines.
Fixed version:
def another_function():
print("Syntax errors are annoying.")
print("But at least python tells us about them!")
print("So they are usually not too hard to fix.")
Identifying Variable Name Errors
- Read the code below, and (without running it) try to identify what the errors are.
- Run the code, and read the error message. What type of
NameError
do you think this is? In other words, is it a string with no quotes, a misspelled variable, or a variable that should have been defined but was not? - Fix the error.
- Repeat steps 2 and 3, until you have fixed all the errors.
for number in range(10):
# use a if the number is a multiple of 3, otherwise use b
if (Number % 3) == 0:
message = message + a
else:
message = message + "b"
print(message)
3 NameError
s for number
being misspelled, for message
not defined, and for a
not being in quotes.
Fixed version:
message = ""
for number in range(10):
# use a if the number is a multiple of 3, otherwise use b
if (number % 3) == 0:
message = message + "a"
else:
message = message + "b"
print(message)
abbabbabba
Identifying Item Errors
- Read the code below, and (without running it) try to identify what the errors are.
- Run the code, and read the error message. What type of error is it?
- Fix the error.
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
print('My favorite season is ', seasons[4])
IndexError
; the last entry is seasons[3]
, so seasons[4]
doesn’t make sense.
Fixed version:
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
print('My favorite season is ', seasons[-1])
Defensive Programming
Solutions to exercises:
Pre- and post-conditions
Suppose you are writing a function called average
that calculates the average of the numbers in a list. What pre-conditions and post-conditions would you write for it? Compare your answer to your neighbor’s: can you think of a function that will pass your tests but not hers or vice versa?
# a possible pre-condition:
assert len(input) > 0, 'List length must be non-zero'
# a possible post-condition:
assert numpy.min(input) < average < numpy.max(input), 'Average should be between min and max of input values'
Testing assertions
Given a sequence of values, the function running
returns a list containing the running totals at each index. Explain in words what the assertions in this function check, and for each one, give an example of input that will make that assertion fail.
def running(values):
assert len(values) > 0
result = [values[0]]
for v in values[1:]:
assert result[-1] >= 0
result.append(result[-1] + v)
assert result[-1] >= result[0]
return result
- The first assertion checks that the input sequence
values
is not empty. An empty sequence such as[]
will make it fail. - The second assertion checks that the first value in the list is positive. Input such as
[-1,0,2,3]
will make it fail. - The third assertion checks that the running total always increases. Input such as
[0,1,3,-5,4]
will make it fail.
Fixing and testing
Fix range_overlap
. Re-run test_range_overlap
after each change you make.
import numpy
def range_overlap(ranges):
'''Return common overlap among a set of [low, high] ranges.'''
if len(ranges) == 1: # only one entry, so return it
return ranges[0]
lowest = -numpy.inf # lowest possible number
highest = numpy.inf # highest possible number
for (low, high) in ranges:
lowest = max(lowest, low)
highest = min(highest, high)
if lowest >= highest: # no overlap
return None
else:
return (lowest, highest)
Debugging
Solutions to exercises:
Debug the following problem
This exercise has the aim of ensuring learners are able to step through unseen code with unexpected output to locate issues. The issues present are that:
The loop is not being utilised correctly.
height
andweight
are always set as the first patient’s data during each iteration of the loop.The height/weight variables are reversed in the function call to
calculate_bmi(...)
Command-Line Programs
Solutions to exercises:
Arithmetic on the command line
Write a command-line program that does addition and subtraction:
$ python arith.py add 1 2
3
$ python arith.py subtract 3 4
-1
# this is code/arith.py
import sys
def main():
assert len(sys.argv) == 4, 'Need exactly 3 arguments'
operator = sys.argv[1]
assert operator in ['add', 'subtract', 'multiply', 'divide'], \
'Operator is not one of add, subtract, multiply, or divide: bailing out'
try:
operand1, operand2 = float(sys.argv[2]), float(sys.argv[3])
except ValueError:
print('cannot convert input to a number: bailing out')
return
do_arithmetic(operand1, operator, operand2)
def do_arithmetic(operand1, operator, operand2):
if operator == 'add':
value = operand1 + operand2
elif operator == 'subtract':
value = operand1 - operand2
elif operator == 'multiply':
value = operand1 * operand2
elif operator == 'divide':
value = operand1 / operand2
print(value)
main()
Finding particular files
Using the glob
module introduced earlier, write a simple version of ls
that shows files in the current directory with a particular suffix. A call to this script should look like this:
$ python my_ls.py py
left.py
right.py
zero.py
# this is code/my_ls.py
import sys
import glob
def main():
'''prints names of all files with sys.argv as suffix'''
assert len(sys.argv) >= 2, 'Argument list cannot be empty'
suffix = sys.argv[1] # NB: behaviour is not as you'd expect if sys.argv[1] is *
glob_input = '*.' + suffix # construct the input
glob_output = sorted(glob.glob(glob_input)) # call the glob function
for item in glob_output: # print the output
print(item)
return
main()
Changing flags
Rewrite readings.py
so that it uses -n
, -m
, and -x
instead of --min
, --mean
, and --max
respectively. Is the code easier to read? Is the program easier to understand?
# this is code/readings_07.py
import sys
import numpy
def main():
script = sys.argv[0]
action = sys.argv[1]
filenames = sys.argv[2:]
assert action in ['-n', '-m', '-x'], \
'Action is not one of -n, -m, or -x: ' + action
if len(filenames) == 0:
process(sys.stdin, action)
else:
for f in filenames:
process(f, action)
def process(filename, action):
data = numpy.loadtxt(filename, delimiter=',')
if action == '-n':
values = numpy.min(data, axis=1)
elif action == '-m':
values = numpy.mean(data, axis=1)
elif action == '-x':
values = numpy.max(data, axis=1)
for m in values:
print(m)
main()
Adding a help message
Separately, modify readings.py
so that if no parameters are given (i.e., no action is specified and no filenames are given), it prints a message explaining how it should be used.
# this is code/readings_08.py
import sys
import numpy
def main():
script = sys.argv[0]
if len(sys.argv) == 1: # no arguments, so print help message
print("""Usage: python readings_08.py action filenames
action must be one of --min --mean --max
if filenames is blank, input is taken from stdin;
otherwise, each filename in the list of arguments is processed in turn""")
return
action = sys.argv[1]
filenames = sys.argv[2:]
assert action in ['--min', '--mean', '--max'], \
'Action is not one of --min, --mean, or --max: ' + action
if len(filenames) == 0:
process(sys.stdin, action)
else:
for f in filenames:
process(f, action)
def process(filename, action):
data = numpy.loadtxt(filename, delimiter=',')
if action == '--min':
values = numpy.min(data, axis=1)
elif action == '--mean':
values = numpy.mean(data, axis=1)
elif action == '--max':
values = numpy.max(data, axis=1)
for m in values:
print(m)
main()
Adding a default action
Separately, modify readings.py
so that if no action is given it displays the means of the data.
# this is code/readings_09.py
import sys
import numpy
def main():
script = sys.argv[0]
action = sys.argv[1]
if action not in ['--min', '--mean', '--max']: # if no action given
action = '--mean' # set a default action, that being mean
filenames = sys.argv[1:] # start the filenames one place earlier in the argv list
else:
filenames = sys.argv[2:]
if len(filenames) == 0:
process(sys.stdin, action)
else:
for f in filenames:
process(f, action)
def process(filename, action):
data = numpy.loadtxt(filename, delimiter=',')
if action == '--min':
values = numpy.min(data, axis=1)
elif action == '--mean':
values = numpy.mean(data, axis=1)
elif action == '--max':
values = numpy.max(data, axis=1)
for m in values:
print(m)
main()
A file-checker
Write a program called check.py
that takes the names of one or more inflammation data files as arguments and checks that all the files have the same number of rows and columns. What is the best way to test your program?
# this is code/check.py
import sys
import numpy
def main():
script = sys.argv[0]
filenames = sys.argv[1:]
if len(filenames) <=1: #nothing to check
print('Only 1 file specified on input')
else:
nrow0, ncol0 = row_col_count(filenames[0])
print('First file %s: %d rows and %d columns' % (filenames[0], nrow0, ncol0))
for f in filenames[1:]:
nrow, ncol = row_col_count(f)
if nrow != nrow0 or ncol != ncol0:
print('File %s does not check: %d rows and %d columns' % (f, nrow, ncol))
else:
print('File %s checks' % f)
return
def row_col_count(filename):
try:
nrow, ncol = numpy.loadtxt(filename, delimiter=',').shape
except ValueError: #get this if file doesn't have same number of rows and columns, or if it has non-numeric content
nrow, ncol = (0, 0)
return nrow, ncol
main()
Counting lines
Write a program called line_count.py
that works like the Unix wc
command:
- If no filenames are given, it reports the number of lines in standard input.
- If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.
# this is code/line_count.py
import sys
def main():
'''print each input filename and the number of lines in it,
and print the sum of the number of lines'''
filenames = sys.argv[1:]
sum_nlines = 0 #initialize counting variable
if len(filenames) == 0: # no filenames, just stdin
sum_nlines = count_file_like(sys.stdin)
print('stdin: %d' % sum_nlines)
else:
for f in filenames:
n = count_file(f)
print('%s %d' % (f, n))
sum_nlines += n
print('total: %d' % sum_nlines)
def count_file(filename):
'''count the number of lines in a file'''
f = open(filename,'r')
nlines = len(f.readlines())
f.close()
return(nlines)
def count_file_like(file_like):
'''count the number of lines in a file-like object (eg stdin)'''
n = 0
for line in file_like:
n = n+1
return n
main()