Networked Programs
Chapter 12
Python for Informatics: Exploring Information
www.pythonlearn.com
Internet
Client Server
Internet
HTML
CSS
JavaScript
AJAX
HTTP
Request
Response
GET
POST
Python
Templates
Data Store
memcache
socket
Network Architecture....
Transport Control Protocol (TCP)
Built on top of IP (Internet Protocol)
Assumes IP might lose some data - stores and retransmits data if it seems to be lost
Handles flow control using a transmit window
Provides a nice reliable pipe
Source: http://en.wikipedia.org/wiki/Internet_Protocol_Suite
http://www.flickr.com/photos/kitcowan/2103850699/
http://en.wikipedia.org/wiki/Tin_can_telephone
TCP Connections / Sockets
http://en.wikipedia.org/wiki/Internet_socket
“In computer networking, an Internet socket or network socket is an endpoint of a bidirectional inter-process communication flow across an Internet Protocol-based computer network, such as the Internet.”
Internet
Process
Process
Socket
TCP Port Numbers
A port is an application-specific or process-specific software communications endpoint
It allows multiple networked applications to coexist on the same server
There is a list of well-known TCP port numbers
http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
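The operating system keeps its own copy of the well-known port list (e.g. /etc/services on Linux), and Python's standard socket module can look a service up in it. A small sketch - this assumes the system's services database is present, which is true on typical Unix-like systems:

```python
import socket

# getservbyname() maps a well-known service name to its TCP port
# by consulting the operating system's services database
print('http  -> %d' % socket.getservbyname('http'))
print('https -> %d' % socket.getservbyname('https'))
print('smtp  -> %d' % socket.getservbyname('smtp'))
```
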
www.umich.edu
Incoming
E-Mail
Login
Web Server
25
Personal
Mail Box
23
80
443
109
110
74.208.28.177
blah blah
blah blah
Please connect me to the web server (port 80) on http://www.dr-chuck.com
Clipart: http://www.clker.com/search/networksym/1
Common TCP Ports
http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
Sometimes we see the port number in the URL if the web server is running on a “non-standard” port.
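The port, if present, sits after a colon in the host part of the URL; the standard library will pull it out for us. (Shown with Python 3's urllib.parse - in Python 2, the same function lives in the urlparse module. The host is a made-up example.)

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

# A URL with an explicit "non-standard" port
print(urlparse('http://www.example.com:8080/page1.htm').port)   # 8080

# No port given: urlparse reports None, and the client
# falls back to the protocol's default (80 for http)
print(urlparse('http://www.example.com/page1.htm').port)        # None
```
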
Sockets in Python
Python has built-in support for TCP Sockets
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect( ('www.py4inf.com', 80) )
http://docs.python.org/library/socket.html
Host Port
http://xkcd.com/353/
Application Protocol
Since TCP (and Python) gives us a reliable socket, what do we want to do with the socket? What problem do we want to solve?
Application Protocols
Mail
World Wide Web
Source: http://en.wikipedia.org/wiki/Internet_Protocol_Suite
HTTP - Hypertext Transfer Protocol
The dominant Application Layer Protocol on the Internet
Invented for the Web - to retrieve HTML, images, documents, etc.
Extended to retrieve data in addition to documents - RSS, Web Services, etc.
Basic Concept - Make a connection - Request a document - Retrieve the document - Close the connection
http://en.wikipedia.org/wiki/Http
HTTP
The HyperText Transfer Protocol is the set of rules that allows browsers to retrieve web documents from servers over the Internet
What is a Protocol?
A set of rules that all parties follow so we can predict each other’s behavior
And not bump into each other
On two-way roads in the USA, drive on the right-hand side of the road
On two-way roads in the UK, drive on the left-hand side of the road
http://www.dr-chuck.com/page1.htm
protocol host document
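The three labelled pieces of the URL above can be pulled apart with the standard library. (Shown with Python 3's urllib.parse - in Python 2, the same function lives in the urlparse module.)

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

parts = urlparse('http://www.dr-chuck.com/page1.htm')
print(parts.scheme)   # protocol: http
print(parts.netloc)   # host:     www.dr-chuck.com
print(parts.path)     # document: /page1.htm
```
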
Robert Cailliau
CERN
http://www.youtube.com/watch?v=x2GylLq59rI
1:17 - 2:19
Getting Data From The Server
Each time the user clicks on an anchor tag with an href= value to switch to a new page, the browser makes a connection to the web server and issues a GET request - to GET the content of the page at the specified URL
The server returns the HTML document to the browser, which formats and displays the document to the user
Making an HTTP request
Connect to the server, e.g. www.dr-chuck.com - a “handshake”
Request a document (or the default document)
GET http://www.dr-chuck.com/page1.htm
GET http://www.mlive.com/ann-arbor/
GET http://www.facebook.com
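The request itself is just text. A minimal sketch of composing one, in the same HTTP/1.0 full-URL form used in the telnet demo later in this chapter (the helper function name is made up for illustration):

```python
def make_request(host, path):
    # An HTTP request is plain text: a GET line giving the document,
    # then a blank line to say "I'm done - your turn"
    return 'GET http://%s%s HTTP/1.0\n\n' % (host, path)

print(make_request('www.dr-chuck.com', '/page1.htm'))
```
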
Browser
Browser
Web Server
80
Browser
GET http://www.dr-chuck.com/page2.htm
Web Server
80
Browser
<h1>The Second Page</h1>
<p>If you like, you can switch
back to the <a href="page1.htm">First Page</a>.</p>
Web Server
80
GET http://www.dr-chuck.com/page2.htm
Browser
Web Server
<h1>The Second Page</h1>
<p>If you like, you can switch
back to the <a href="page1.htm">First Page</a>.</p>
80
GET http://www.dr-chuck.com/page2.htm
Internet Standards
The standards for all of the Internet protocols (inner workings) are developed by an organization
Internet Engineering Task Force (IETF)
www.ietf.org
Standards are called RFCs - Request for Comments
Source: http://tools.ietf.org/html/rfc791
http://www.w3.org/Protocols/rfc2616/rfc2616.txt
Making an HTTP request
Connect to the server, e.g. www.dr-chuck.com - a “handshake”
Request a document (or the default document)
GET http://www.dr-chuck.com/page1.htm
GET http://www.mlive.com/ann-arbor/
GET http://www.facebook.com
Hacking HTTP
$ telnet www.dr-chuck.com 80
Trying 74.208.28.177...
Connected to www.dr-chuck.com.
Escape character is '^]'.
GET http://www.dr-chuck.com/page1.htm HTTP/1.0
<h1>The First Page</h1>
<p>If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.
</p>
HTTP
Request
HTTP
Response
Browser
Web Server
Port 80 is the non-encrypted HTTP port
Accurate Hacking in
the Movies
Matrix Reloaded
Bourne Ultimatum
Die Hard 4
...
http://nmap.org/movies.html
$ telnet www.dr-chuck.com 80
Trying 74.208.28.177...
Connected to www.dr-chuck.com.
Escape character is '^]'.
GET http://www.dr-chuck.com/page1.htm HTTP/1.0
<h1>The First Page</h1>
<p>If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">Second
Page</a>.</p>
Connection closed by foreign host.
Hmmm - This looks kind of Complex.. Lots of GET commands
si-csev-mbp:tex csev$ telnet www.umich.edu 80
Trying 141.211.144.190...
Connected to www.umich.edu.
Escape character is '^]'.
GET /
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http:
//www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="
http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"
><head><title>University of Michigan</title><meta name="
description" content="University of Michigan is one of the top
universities of the world, a diverse public institution of
higher learning, fostering excellence in research. U-M provides
outstanding undergraduate, graduate and professional education,
serving the local, regional, national and international
communities." />
...
<link rel="alternate stylesheet" type="text/css" href="
/CSS/accessible.css" media="screen" title="accessible" /><link
rel="stylesheet" href="/CSS/print.css" media="print,projection"
/><link rel="stylesheet" href="/CSS/other.css" media="handheld,
tty,tv,braille,embossed,speech,aural" />... <dl><dt><a href="
http://ns.umich.edu/htdocs/releases/story.php?id=8077">
<img src="/Images/electric-brain.jpg" width="114" height="77"
alt="Top News Story" /></a><span class="verbose">:
</span></dt><dd><a href="http://ns.umich.
edu/htdocs/releases/story.php?id=8077">Scientists harness the
power of electricity in the brain</a></dd></dl>
As the browser reads the document, it finds other URLs that must be retrieved to produce the document.
The big picture...
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.
dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>University of Michigan</title>
....
@import "/CSS/graphical.css"/**/;
p.text strong, .verbose, .verbose p, .verbose h2{text-indent:-876
em;position:absolute}
p.text strong a{text-decoration:none}
p.text em{font-weight:bold;font-style:normal}
div.alert{background:#eee;border:1px solid red;padding:.5em;
margin:0 25%}
a img{border:none}
.hot br, .quick br, dl.feature2 img{display:none}
div#main label, legend{font-weight:bold}
...
A browser debugger reveals detail...
Most browsers have a developer mode so you can watch it in action
It can help explore the HTTP request-response cycle
Some simple-looking pages involve lots of requests:
HTML page(s)
Image files
CSS Style Sheets
JavaScript files
Let’s Write a Web Browser!
An HTTP Request in Python
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')
while True:
    data = mysock.recv(512)
    if ( len(data) < 1 ) :
        break
    print data
mysock.close()
HTTP/1.1 200 OK
Date: Sun, 14 Mar 2010 23:52:41 GMT
Server: Apache
Last-Modified: Tue, 29 Dec 2009 01:31:22 GMT
ETag: "143c1b33-a7-4b395bea"
Accept-Ranges: bytes
Content-Length: 167
Connection: close
Content-Type: text/plain
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
while True:
    data = mysock.recv(512)
    if ( len(data) < 1 ) :
        break
    print data
HTTP Header
HTTP Body
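The header and body in the response above are separated by a blank line (a CRLF pair on its own). A sketch of splitting a response into the two parts, using an abbreviated copy of the response text:

```python
# Abbreviated version of the response shown above
response = ('HTTP/1.1 200 OK\r\n'
            'Content-Type: text/plain\r\n'
            '\r\n'
            'But soft what light through yonder window breaks\n')

# Everything before the first blank line is header; the rest is body
header, _, body = response.partition('\r\n\r\n')
print(header.split('\r\n')[0])   # HTTP/1.1 200 OK
print(body.strip())
```
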
Making HTTP Easier With urllib
Using urllib in Python
Since HTTP is so common, we have a library that does all the socket work for us and makes web pages look like a file
import urllib
fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    print line.strip()
http://docs.python.org/library/urllib.html
urllib1.py
import urllib
fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    print line.strip()
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
urllib1.py
http://docs.python.org/library/urllib.html
Like a file...
import urllib
fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
counts = dict()
for line in fhand:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word,0) + 1
print counts
urlwords.py
Reading Web Pages
import urllib
fhand = urllib.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print line.strip()
<h1>The First Page</h1>
<p>
If you like, you can switch to the <a href="
http://www.dr-chuck.com/page2.htm">Second
Page</a>.
</p>
urllib2.py
Going from one page to another...
import urllib
fhand = urllib.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print line.strip()
<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/
page2.htm">Second Page</a>.
</p>
Google
import urllib
fhand = urllib.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print line.strip()
Parsing HTML
(a.k.a. Web Scraping)
What is Web Scraping?
When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information, and then looks at more web pages
Search engines scrape web pages - we call this spidering the web or web crawling
http://en.wikipedia.org/wiki/Web_scraping
http://en.wikipedia.org/wiki/Web_crawler
Server
GET
HTML
GET
HTML
Why Scrape?
Pull data - particularly social data - who links to whom?
Get your own data back out of some system that has no export capability
Monitor a site for new information
Spider the web to make a database for a search engine
Scraping Web Pages
There is some controversy about web page scraping, and some sites are a bit snippy about it
Google: facebook scraping block
Republishing copyrighted information is not allowed
Violating terms of service is not allowed
http://www.facebook.com/terms.php
The Easy Way - Beautiful Soup
You could do string searches the hard way
Or use the free software called BeautifulSoup from www.crummy.com
http://www.crummy.com/software/BeautifulSoup/
http://www.pythonlearn.com/code/BeautifulSoup.py
Place the BeautifulSoup.py file in the same folder as your Python code...
import urllib
from BeautifulSoup import *
url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# Retrieve a list of the anchor tags
# Each tag is like a dictionary of HTML attributes
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
urllinks.py
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
python urllinks.py
Enter - http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm
<h1>The First Page</h1>
<p>If you like, you can switch to the<a href="
http://www.dr-chuck.com/page2.htm"
>Second Page</a>.</p>
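If BeautifulSoup is not handy, the standard library's own HTML parser can pull out anchor tags too. A sketch, not a BeautifulSoup replacement - shown with Python 3's html.parser (in Python 2 the module is called HTMLParser), using a made-up snippet of HTML in place of a downloaded page:

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class LinkFinder(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    # Called once for each opening tag; attrs is a list of (name, value)
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

html = '<p>Switch to the <a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p>'
finder = LinkFinder()
finder.feed(html)
print(finder.links)   # ['http://www.dr-chuck.com/page2.htm']
```
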
Summary
TCP/IP gives us pipes / sockets between applications
We designed application protocols to make use of these pipes
HyperText Transfer Protocol (HTTP) is a simple yet powerful protocol
Python has good support for sockets, HTTP, and HTML parsing
Acknowledgements / Contributions
These slides are Copyright 2010- Charles R. Severance (www.dr-chuck.com) of the University of Michigan School of Information and open.umich.edu and made available under a Creative Commons Attribution 4.0 License. Please maintain this last slide in all copies of the document to comply with the attribution requirements of the license. If you make a change, feel free to add your name and organization to the list of contributors on this page as you republish the materials.
Initial Development: Charles Severance, University of Michigan
School of Information
… Insert new Contributors here
...