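I went through this once before without really understanding it, so here is another pass. Fetching a page that requires login mostly comes down to setting cookies. The main steps are:
Learn the HTTP protocol and how cookies work. Cookies are described mainly in RFC 2965 (http://www.faqs.org/rfcs/rfc2965.html), which has since been superseded by RFC 6265.
Cookies have a fixed format in the HTTP message headers, and most of their attributes are predefined, so it is enough to follow the standard.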
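To make the header format concrete, here is a minimal sketch that parses one Set-Cookie header value into its name, value, and predefined attributes. It assumes Python 3, where the Python 2 `Cookie` module is named `http.cookies`; the header string and the helper name `parse_set_cookie` are made up for illustration and do not come from any real server.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value, invented for illustration.
raw_header = "SID=abc123; Path=/; Domain=.example.com; Max-Age=3600; Secure"


def parse_set_cookie(header_value):
    """Return {name: (value, attributes)} for one Set-Cookie header value."""
    jar = SimpleCookie()
    jar.load(header_value)
    parsed = {}
    for name, morsel in jar.items():
        # A Morsel is dict-like and lists every predefined attribute;
        # keep only the ones this header actually set.
        attrs = {key: val for key, val in morsel.items() if val}
        parsed[name] = (morsel.value, attrs)
    return parsed


cookies = parse_set_cookie(raw_header)
value, attributes = cookies["SID"]
print(value, attributes)
```

Attribute names come back lowercased (`path`, `domain`, `max-age`, `secure`), since they are the fixed set of attributes the standard defines rather than free-form keys.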
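The relevant Python libraries are urllib, urllib2, httplib, httplib2, cookielib, and ClientCookie. Of these, urllib, urllib2, httplib, and cookielib ship with the Python 2 standard library, while httplib2 and ClientCookie are third-party packages. Two articles are especially useful:
- Handling Cookies in Python: http://www.voidspace.org.uk/python/articles/cookielib.shtml walks through a cookie-handling example: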
    #!/usr/bin/python
    # coding: utf-8
    """
    Example from http://www.voidspace.org.uk/python/articles/cookielib.shtml
    """
    import os.path
    import sys
    import urllib2

    # File in which the cookies are saved
    COOKIEFILE = 'cookies.lwp'

    cj = None
    ClientCookie = None
    cookielib = None

    try:
        # See whether cookielib is available
        import cookielib
    except ImportError:
        try:
            # cookielib is unavailable, so try ClientCookie
            import ClientCookie
        except ImportError:
            # ClientCookie is unavailable too: plain urllib2, no cookie handling
            urlopen = urllib2.urlopen
            Request = urllib2.Request
        else:
            # ClientCookie was imported
            urlopen = ClientCookie.urlopen
            Request = ClientCookie.Request
            cj = ClientCookie.LWPCookieJar()
    else:
        urlopen = urllib2.urlopen
        Request = urllib2.Request
        cj = cookielib.LWPCookieJar()

    if cj is not None:
        # Either cookielib or ClientCookie was imported successfully
        if os.path.isfile(COOKIEFILE):
            # A cookie file already exists, so load it
            cj.load(COOKIEFILE)
        if cookielib is not None:
            # With cookielib: wrap the jar in an HTTPCookieProcessor
            # and install the opener
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
            urllib2.install_opener(opener)
        else:
            # With ClientCookie: same idea
            opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
            ClientCookie.install_opener(opener)

    theurl = 'http://www.google.com/history/'
    # For a POST request, use urllib.urlencode(somedict) here
    txdata = None
    # Pretend to be a browser by sending a User-Agent
    txheaders = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.4) Gecko/20061201 Firefox/2.0.0.6 (Ubuntu-feisty)'}

    try:
        # Create a request object
        req = Request(theurl, txdata, txheaders)
        # Open it
        handle = urlopen(req)
    except IOError, e:
        print 'Failed to open "%s".' % theurl
        if hasattr(e, 'code'):
            print 'failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :"
            print e.reason
            print "This usually means the server doesn't exist,",
            print "is down, or we don't have an internet connection."
        sys.exit()
    else:
        print 'The Headers of the Page:'
        print handle.info()
        # handle.read() returns the page
        # handle.geturl() returns the true url of the page fetched
        # (in case urlopen has followed any redirects, which it sometimes does)

    print
    if cj is None:
        print "We don't have a cookie library available - sorry."
        print "I can't show you any cookies."
    else:
        print 'These are the cookies we have received so far :'
        for index, cookie in enumerate(cj):
            print index, ' : ', cookie
        cj.save(COOKIEFILE)  # Save the cookies
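The script above is Python 2. As a rough sketch of just its load/save part in Python 3 (where `cookielib` became `http.cookiejar` and `urllib2` became `urllib.request`), the following round-trips a cookie through an LWP cookie file. The cookie is built by hand and written to a temporary file so nothing touches the network; the helper `make_cookie` and the cookie name, value, and domain are hypothetical.

```python
import os
import tempfile
import urllib.request
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (normally a server response sets these)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


cookie_file = os.path.join(tempfile.mkdtemp(), "cookies.lwp")

# First run: receive a cookie (simulated here) and save it.
jar = LWPCookieJar(cookie_file)
jar.set_cookie(make_cookie("session", "abc123", "www.example.com"))
# Session cookies (no expiry) are skipped unless ignore_discard=True.
jar.save(ignore_discard=True)

# Later run: load the file and attach the jar to an opener.
restored = LWPCookieJar(cookie_file)
restored.load(ignore_discard=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(restored))

restored_pairs = [(c.name, c.value) for c in restored]
print(restored_pairs)
```

One thing worth noticing: `save()` and `load()` skip cookies marked as discardable by default, so a login that relies on session cookies may not survive a restart unless `ignore_discard=True` is passed.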
- Basic Authentication/Authentication with Python: http://www.voidspace.org.uk/python/articles/authentication.shtml covers HTTP basic authentication, for example:
    import urllib2

    theurl = 'www.someserver.com/toplevelurl/somepage.htm'
    protocol = 'http://'
    username = 'johnny'
    password = 'XXXXXX'
    # a great password

    passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
    # this creates a password manager
    passman.add_password(None, theurl, username, password)
    # because we have put None at the start it will always
    # use this username/password combination for urls
    # for which `theurl` is a super-url

    authhandler = urllib2.HTTPBasicAuthHandler(passman)
    # create the AuthHandler

    opener = urllib2.build_opener(authhandler)

    urllib2.install_opener(opener)
    # All calls to urllib2.urlopen will now use our handler
    # Make sure not to include the protocol in with the URL, or
    # HTTPPasswordMgrWithDefaultRealm will be very confused.
    # You must (of course) use it when fetching the page though.

    pagehandle = urllib2.urlopen(protocol + theurl)
    # authentication is now handled automatically for us
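For reference, what a basic-auth handler ultimately sends is an `Authorization` header whose value is `Basic ` followed by the base64 encoding of `username:password`. A minimal Python 3 sketch of that encoding, reusing the placeholder credentials from the example above (the helper name `basic_auth_header` is my own, and no request is actually sent):

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP basic authentication."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


header_value = basic_auth_header("johnny", "XXXXXX")

# Decoding the part after "Basic " gives the credentials back,
# which shows that base64 is an encoding and not encryption.
decoded = base64.b64decode(header_value[len("Basic "):])
print(header_value)
print(decoded)
```

Because the credentials are trivially recoverable from the header, basic authentication is only reasonably safe over HTTPS.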
Fetching the past search keywords from Web History, using cookielib:
    import sys
    import urllib
    import urllib2
    import cookielib

    try:
        # Log in to obtain the cookies
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        urllib2.install_opener(opener)
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.4) Gecko/20061201 Firefox/2.0.0.6 (Ubuntu-feisty)')]
        url_login = 'https://www.google.com/accounts/ServiceLoginAuth?service=hist'
        body = (('Email', 'shengyan1985@gmail.com'),
                ('Passwd', '...'))  # the password!
        reqlogin = opener.open(url_login, urllib.urlencode(body))
        # At this point the cookies are already in the jar.
        print 'The Headers of the Login Page:'
        print reqlogin.info()
    except IOError, e:
        # urllib2.URLError is a subclass of IOError
        print 'Login failed:', e
        sys.exit(-1)
That said, I think building the Cookie header directly with Cookie.SimpleCookie and adding it to the request headers would work too.
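A minimal sketch of that alternative, assuming Python 3 (where `Cookie.SimpleCookie` lives in `http.cookies` and `urllib2.Request` in `urllib.request`). The cookie names and values, the URL, and the helper `cookie_header` are all hypothetical; real values would have to be copied from an already logged-in session. The request is only constructed here, not sent.

```python
import urllib.request
from http.cookies import SimpleCookie


def cookie_header(pairs):
    """Format name/value pairs as the value of a Cookie request header."""
    cookie = SimpleCookie()
    for name, value in pairs.items():
        cookie[name] = value
    # A request carries only name=value pairs, joined by "; ".
    return "; ".join(
        "{}={}".format(name, morsel.coded_value)
        for name, morsel in cookie.items()
    )


# Hypothetical cookie values, invented for illustration.
header = cookie_header({"SID": "abc123", "HSID": "xyz789"})

request = urllib.request.Request(
    "http://www.example.com/history/",
    headers={"Cookie": header},
)
print(request.get_header("Cookie"))
```

The trade-off is that nothing updates these cookies automatically: if the server rotates or expires them, the header has to be rebuilt by hand, which is what a cookie jar plus `HTTPCookieProcessor` takes care of.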