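Python Web Scraping

Project structure:

1. items.py: defines the data models (items) for the scraped data.
2. middlewares.py: holds the project's middlewares.
3. pipelines.py: stores the item models, for example on the local disk.
4. settings.py: the crawler's configuration (request headers, the delay between requests, IP proxies, and so on).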

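Scrapy notes:

1. response is a scrapy.http.response.html.HtmlResponse object; it supports XPath and CSS expressions for extracting data.
2. Extracted data comes back as a Selector or SelectorList object. To get the strings inside, call getall or get.
3. getall: returns every matched text in the SelectorList, as a list.
4. get: returns the first matched text as a str (or None if nothing matched).
5. If the parsed data should be handed to a pipeline, return it with yield, or return a list.
6. item: define the item classes in items.py.
7. pipeline: dedicated to saving data. Its methods include:
open_spider(self, spider): runs when the spider starts
process_item(self, item, spider): called for each item passed in
close_spider(self, spider): runs when the spider closes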

To activate a pipeline, set `ITEM_PIPELINES` in `settings.py`:
ITEM_PIPELINES = {
	`....`: 300,  # key: the pipeline's name (class path); value: its priority (lower values run first)
}
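A minimal sketch of a pipeline that implements the three hooks listed above. The class name `JsonLinesPipeline` and the output file `items.jl` are assumptions made for illustration; they do not come from the original notes.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON line."""

    def __init__(self, path="items.jl"):  # file name is an assumption
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline, if any

    def close_spider(self, spider):
        # Called once when the spider closes: release the file.
        self.fp.close()
```

To enable a class like this, its dotted path inside the project would be the key in `ITEM_PIPELINES`.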

XPath syntax:

title = response.xpath("//h1[@class='ph']/text()").get()
The page being scraped looks like this:

<div class="cl"> <h1 class="ph"> ......

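Selecting nodes:

nodename selects all child nodes of the current node that are named nodename
/ selects direct children of the current element (grandchildren are not searched)
// selects descendants of the current node at any depth
@ selects an attribute of a node, e.g. //book[@price]
id is an attribute too
//a/@href returns the href attribute value of every a tag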

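Predicates:

Select a child by position: /bookstore/book[1] (note that indexes start at 1)
Select the last element: /bookstore/book[last()]
Select several by position: /bookstore/book[position()<3]
Filter by attribute value: //book[@price = 10]
Fuzzy match: //div[contains(@class,"fl")]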

Wildcards:

Examples:

trs = html.xpath("//tr")  # returns a list

tr_2 = html.xpath("//tr[2]")[0]
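The node selectors and predicates above can be tried on a small document. This is a sketch using lxml (a third-party package); the XML content is made up for illustration.

```python
from lxml import etree

# Hypothetical sample document, not taken from any real site.
XML = """
<bookstore>
  <book price="10"><title>A</title></book>
  <book price="20"><title>B</title></book>
  <book class="fl new"><title>C</title></book>
</bookstore>
"""
root = etree.fromstring(XML)

first = root.xpath("/bookstore/book[1]/title/text()")                # index starts at 1
last = root.xpath("/bookstore/book[last()]/title/text()")            # last book
first_two = root.xpath("/bookstore/book[position()<3]/title/text()") # books 1 and 2
priced = root.xpath("//book[@price]/title/text()")                   # has a price attribute
ten = root.xpath("//book[@price = 10]/title/text()")                 # price equals 10
fuzzy = root.xpath("//book[contains(@class, 'fl')]/title/text()")    # class contains "fl"
prices = root.xpath("//book/@price")                                 # attribute values

print(first, last, first_two, priced, ten, fuzzy, prices)
```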

Step 1:

Define the headers:

header = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",

        "Cookie": "JSESSIONID-WYYY=2Z029TS6mUD4rPoTHs%2BbtDaJa%5CIDNANC2hRhCeSO3EiwryPJA947wCAh5%2BF4%5CtU%2FAdsSgkS68YK%2BXMZVmZ8s%2Bd1j5bU0YyTNOfPPwks98G%2FYBH9IdwUhHBvC%2FOPjlFMzAc6uYwYqP%2BQatEygarBIFcrIBKQfJIv%2FkJV%5C%2FmTy%2F7XM1uQw%3A1563850724300; _iuqxldmzr_=32; _ntes_nnid=e2fd5d29ab308e038279af91a8c84058,1563848924322; _ntes_nuid=e2fd5d29ab308e038279af91a8c84058",

        'Referer':'https://music.163.com/',

        'origin':'https://music.163.com',

        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',

    }

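Every crawler needs its request headers configured: find these values in the browser, then build a dictionary from them.

Note: cookies differ from site to site, so compare them for each target. Referer is also an important header for getting past anti-scraping checks.

A URL copied straight from the browser may not work; it is more reliable to read it from the captured request.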

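Step 2

Prepare the URL to scrape.
Fetch it with the requests library.
Check whether the page communicates via GET or POST.

Parse the response with etree from lxml. lxml is a C extension, so editors offer little autocompletion for it.

import requests
from lxml import etree

respo = requests.get(url=url, headers=header)
html = etree.HTML(respo.text)  # parse into an element tree so the data is easy to extract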

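Step 3

Use XPath to get the content of the target tags.
Key point: for node and attribute selections, xpath() returns a list.

title_ = html.xpath("//div[@class = '']//a/@title")  # the title attribute of each a tag under the div

# Each href is only a partial path, so prepend the site's domain to every element of the list.
href_ = [BASIC_DOMAIN + href for href in html.xpath("//div[@class = '']//a/@href")]

lis = html.xpath("//div[@class = '']//li")  # all li tags

# lis is a list, so run the relative query on each li element.
for li in lis:
    text = li.xpath(".//a/text()")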
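As an alternative to string concatenation, the standard library's `urljoin` can turn partial hrefs into absolute URLs. A sketch: the value of `BASIC_DOMAIN` is assumed from the Referer header shown earlier, and the sample hrefs are made up.

```python
from urllib.parse import urljoin

BASIC_DOMAIN = "https://music.163.com"  # assumed value
hrefs = ["/playlist?id=1", "playlist?id=2", "https://example.com/x"]  # made-up samples

# urljoin handles a leading slash, a relative path, and an already absolute URL.
absolute = [urljoin(BASIC_DOMAIN + "/", href) for href in hrefs]
print(absolute)
```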

Scraping AJAX (dynamic) pages:

Setup:

Download chromedriver; its version must match the browser's.
Install the selenium library.
from selenium import webdriver

Testing the Selenium installation:

import time
from selenium import webdriver
DRIVER_PATH=r"D:\Browse\chrome\chromedriver.exe"
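driver = webdriver.Chrome(executable_path = DRIVER_PATH)  # other browsers work too, but the path must point to the downloaded driver, and its version must match the browser
url = "https://www.baidu.com/"
driver.get(url)  # returns nothing; driver is an object, and get() loads the page into it
print(driver.page_source)  # print the page source
time.sleep(10)  # keep the page open for 10 s; the browser closes once the script finishes

If running the code above launches the browser and opens the Baidu home page, the installation succeeded.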

Operating on page elements:

Locating elements:
driver.find_element_by_id('')
driver.find_element_by_name('')
driver.find_element_by_xpath("")

Common form elements:
input (type='text/password/email/submit')
button
checkbox
select

# input: type text into the field, then clear it
inputTag = driver.find_element_by_id('kw')
inputTag.send_keys('python')
time.sleep(2)
inputTag.clear()

# checkbox: click to toggle it
checkbox = driver.find_element_by_name('remember')
checkbox.click()

# select: the located element must be wrapped in Select
from selenium.webdriver.support.ui import Select
selectTag = Select(driver.find_element_by_name('jumpMenu'))
# Select(driver.find_element_by_id())
selectTag.select_by_index(1)  # choose the option at index 1

# click: locate the button, then fire a click event
submitBtn = driver.find_element_by_id('su')
submitBtn.click()

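Action chains:

An operation may take many steps; chain them together with ActionChains:

from selenium.webdriver.common.action_chains import ActionChains

# Locate the elements first, much like looking up views in Android development.
inputTag = driver.find_element_by_id('kw')
submitBtn = driver.find_element_by_id('su')

actions = ActionChains(driver)
actions.move_to_element(inputTag)
actions.send_keys_to_element(inputTag,'python')
actions.move_to_element(submitBtn)  # move the pointer onto the submitBtn button
actions.click()  # click at the current pointer position
# The visible mouse cursor does not move, because Selenium simulates the pointer events inside the browser; the click still lands.
actions.perform()  # the queued actions run only when perform() is called

# mouse operations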

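The By class

find_element_by_id, find_element_by_name, find_element_by_class_name and the other helpers can all be replaced by find_element plus By. Selenium 4.3 removed the find_element_by_* helpers, so find_element(By...) is the only form available from that version on.

Import: from selenium.webdriver.common.by import By
find_element_by_id('id_name') is equivalent to find_element(By.ID, 'id_name')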

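Implicit and explicit waits

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Implicit wait

driver.implicitly_wait(10)  # wait up to 10 s when locating elements

Explicit wait:
Waits until a given event or condition is satisfied before continuing.
Note:
presence_of_element_located() takes a single tuple as its argument.

element = WebDriverWait(driver, timeout=100).until(
	EC.presence_of_element_located((By.ID,'id_name'))
)
# presence_of_element_located: the element is present in the DOM

print(element)

Other wait conditions are available; see the expected_conditions module for the full list.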
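The idea behind an explicit wait is plain polling: call a condition repeatedly until it returns something truthy or a deadline passes. The helper below is a hypothetical sketch of that idea in pure Python; it is not Selenium's implementation, and the name `wait_until` is made up.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # condition satisfied: hand back its value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)  # wait a little before trying again
```

A call such as `wait_until(lambda: page_is_ready(), timeout=5)` (with `page_is_ready` being whatever check applies) would block until the check passes or five seconds elapse.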

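Exception handling:

fp = None
try:
    fp = open(filename, 'w')
    fp.write("321")
except IOError:
    print("Error: could not open or write the file")
else:
    fp.write("123")  # runs only if the try block raised no exception
finally:
    if fp is not None:
        fp.close()

except catches exceptions; different exception types can be handled in different ways.

finally runs its code whether or not an exception occurred.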

Custom exception classes:
They must inherit from Exception:

class MyException(Exception):
    """Custom exception that carries a message."""
    def __init__(self, msg):
        super().__init__(msg)
        self.msg = msg

    def __str__(self):
        return self.msg

try:
    raise MyException("aaa")
except MyException as e:
    print(e)
else:
    pass
finally:
    pass
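To see which branches run in each case, the following sketch records the order of execution. The function name `trace_flow` is made up for illustration.

```python
def trace_flow(action):
    """Run action() and record which try/except/else/finally branches execute."""
    events = []
    try:
        events.append("try")
        action()
    except ValueError:
        events.append("except")   # only when action() raises ValueError
    else:
        events.append("else")     # only when no exception was raised
    finally:
        events.append("finally")  # always
    return events


print(trace_flow(lambda: None))
print(trace_flow(lambda: int("x")))
```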

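Opening and closing pages:

driver.execute_script("window.open('https://www.douban.com/')")
print(driver.window_handles)
driver.switch_to.window(driver.window_handles[1])  # switch to the newly opened window
print(driver.current_url)
driver.switch_to.window(driver.window_handles[0])  # switch back to the original window
print(driver.current_url)

Although a new page has opened in the browser, the driver still operates on the original page. To operate on the new page, switch to it with driver.switch_to.window(driver.window_handles[index]).

Note that window_handles is a list, ordered by when each window was opened. When a page opens slowly, or a data: page is present, the order may not be what you expect, so check it while debugging.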

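Scraping dynamic pages with Selenium

Sometimes the page source differs from what the browser's Inspect panel shows, and the real content appears only in the inspected page. In that case the site probably loads its data with Ajax. There are two ways to handle this.

First check the page source for an iframe tag. If one exists, the content is probably inside the iframe, so switch the driver's context to that iframe. If there is no iframe, the content can be read directly from Selenium's page_source.

For example:

On Lagou (拉勾网), the data returned by Ajax is written directly into page_source.

On NetEase Cloud Music it sits inside an iframe, so driver.switch_to.frame("contentFrame") is needed to reach it.

<iframe name="contentFrame" id="g_iframe" class="g-iframe" scrolling="auto" frameborder="0" src="about:blank" allowfullscreen="true"></iframe>

With this, the data on dynamic pages can be scraped.

With Selenium there is no need to set headers, because the requests come from a real browser with its own headers.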

Defining the spider class:

class Spider_163(object):
    DRIVER_PATH = r"D:\Browse\chrome\chromedriver.exe"

    def __init__(self, url):
        self.base_url = url
        self.driver = webdriver.Chrome(executable_path=Spider_163.DRIVER_PATH)

Constructor: creates the driver object and records the URL to scrape.

def run(self):
    music_lists = []
    for page in range(1,1 + 1):
        self.driver.get(self.base_url.format((page-1)*35))
        self.driver.switch_to.frame("contentFrame")
        source = self.driver.page_source
        source = etree.HTML(source)
        # print(etree.tostring(source, encoding='utf-8').decode('utf-8'))
        lis = source.xpath("//ul[@class = 'm-cvrlst f-cb']//li")
        for li in lis:
            list_img_href = li.xpath(".//img/@src")
            list_title = li.xpath(".//p[@class = 'dec']/a/@title")
            list_author = li.xpath(".//a[@class='nm nm-icn f-thide s-fc3']/@title")
            list_author_homepage = BASIC_DOMAIN + li.xpath(".//a[@class='nm nm-icn f-thide s-fc3']/@href")[0]
            list_detail_href = BASIC_DOMAIN + li.xpath(".//p[@class = 'dec']/a/@href")[0]
            music_list = {
                "list_img_href":list_img_href,
                "list_title" : list_title,
                "list_author" : list_author,
                "list_author_homepage" : list_author_homepage,
                "list_detail_href" : list_detail_href
            }
            music_lists.append(music_list)
            print(music_list)
            time.sleep(1)
    with open("music_lists.json",'w',encoding='utf-8') as fp:
        json.dump(music_lists,fp,ensure_ascii=False)
    self.sele_parse_detail_music_list(music_lists)
        # rep = requests.get(url=self.url)
        # html = etree.HTML(rep)
        # lis = html.xpath("//ul[@class = 'm-cvrlst f-cb']/li")
        # print(lis)

The run function is the spider's entry point: it first scrapes the information on every list page.
Site structure:

-web|
	|-lists_page1|
	|			 |-music_list1	|
	|			 |				|-music1
	|			 |				|-music2
	|			 |				|-music3
	|			 |				|...
	|			 |
	|			 |-music_list2
	|			 |
	|			 |-music_list3
	|				-...	
	|-lists_page2
	|
	|-lists_page3
	|-...

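The code first scrapes the music_lists on every list page and stores each playlist's information as a dictionary in a list.
It then iterates over that list, visits every music_list page, and puts every track into a list.
The actual content of each page sits inside the frame reached via self.driver.switch_to.frame("contentFrame").
Notes:

xpath returns a list containing every tag that matches the expression, so the more precise the expression, the more exactly it pins down the target.
Pay attention to how tags nest inside one another.
Sometimes a tag is accessed before it has loaded, which raises an error; an explicit wait is needed then. That did not happen in this case, so the code does not handle it. For reference, an explicit wait for the page looks like this: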
WebDriverWait(driver,timeout=10).until(
EC.presence_of_element_located((By.XPATH,"//....."))
)
# the code below runs only after the element has appeared

def sele_parse_detail_music_list(self,musici_lists):
    li=[]
    for _list in musici_lists:
        self.driver.get(_list["list_detail_href"])
        self.driver.switch_to.frame("contentFrame")
        source = self.driver.page_source
        source = etree.HTML(source)
        # print(etree.tostring(source,encoding='utf-8').decode('utf-8'))
        trs = source.xpath("//table[@class = 'm-table ']/tbody//tr")
        # print(etree.tostring(trs[0], encoding='utf-8').decode('utf-8'))
        for tr in trs:
            music_detail_name = tr.xpath(".//span[@class = 'txt']/a/b/@title")[0]
            music_detail_href = BASIC_DOMAIN + tr.xpath(".//span[@class = 'txt']/a/@href")[0]
            music_detail_dur = tr.xpath(".//span[@class = 'u-dur ']/text()")[0]
            music_detail_artist = tr.xpath(".//div[@class = 'text']/@title")[0]
            music_detail = {
                "music_detail_name" : music_detail_name,
                "music_detail_href" : music_detail_href,
                "music_detail_dur" : music_detail_dur,
                "music_detail_artist" : music_detail_artist
            }
            li.append(music_detail)
    with open("music_details.json",'w',encoding='utf-8') as fp:
        json.dump(li,fp,ensure_ascii=False)

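JSON conversion:

import json

persons = [
    {
        'name': "a",
        'age': 12,
        'sex': 'male',
    },
    # ...
]

json_str = json.dumps(persons)  # produce a JSON string
# then write the string to fp

# or call dump directly; mind the encoding if the data contains Chinese text
with open("music_lists.json",'w',encoding='utf-8') as fp:
    json.dump(music_lists,fp,ensure_ascii=False)  # writes straight to the file

Open the file with UTF-8 encoding and pass ensure_ascii=False so that non-ASCII characters are not escaped. Chinese text then displays correctly.

The resulting JSON file is a single line of text. To get a readable layout, format it at json.cn and copy the result back.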
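A readable layout can also be produced directly in Python with the `indent` parameter of `json.dumps` / `json.dump`. A small sketch with made-up sample data:

```python
import json

persons = [{"name": "张三", "age": 12, "sex": "male"}]  # made-up sample data

escaped = json.dumps(persons)                               # default: non-ASCII escaped as \uXXXX
pretty = json.dumps(persons, ensure_ascii=False, indent=2)  # readable, multi-line
print(escaped)
print(pretty)
```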

% placeholder formatting:


driver.execute_script("window.open('"+url_douban+"')")  # run JavaScript code, built by concatenation

driver.execute_script("window.open('%s')" % url_douban )  # the same call, built with a % placeholder
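The same string can be built in several equivalent ways. This sketch only constructs the JavaScript snippet and does not call Selenium; the URL value is the one used earlier in these notes.

```python
url_douban = "https://www.douban.com/"

by_concat = "window.open('" + url_douban + "')"     # string concatenation
by_percent = "window.open('%s')" % url_douban       # % placeholder
by_format = "window.open('{}')".format(url_douban)  # str.format
by_fstring = f"window.open('{url_douban}')"         # f-string (Python 3.6+)

print(by_fstring)
```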

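Working with file paths:

The __file__ attribute holds the path of the file currently being executed (wrap it in os.path.abspath to be sure it is absolute).

import os
from urllib import request

os.path.dirname(__file__)  # the directory that contains the file

path = os.path.join(os.path.dirname(__file__), "images", "163music")  # builds a path under the same parent directory

if not os.path.exists(path):
    os.makedirs(path)  # makedirs also creates the intermediate "images" directory

img_name = url.split("_")[-1]

request.urlretrieve(url, os.path.join(path, img_name))

os.path.join and os.path.dirname can build all kinds of paths, which makes it easy to store data in the target location.
os.path.exists(file_path) checks whether a path exists, which helps avoid overwriting; it works for both directories and files.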
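The path steps above can be run without touching the network. In this sketch a temporary directory stands in for `os.path.dirname(__file__)`, the image URL is made up, and an empty file stands in for the download.

```python
import os
import tempfile

base_dir = tempfile.mkdtemp()  # stand-in for os.path.dirname(__file__)
path = os.path.join(base_dir, "images", "163music")

# exist_ok=True makes the call safe to repeat, so no exists() check is needed.
os.makedirs(path, exist_ok=True)
os.makedirs(path, exist_ok=True)

url = "https://example.com/cover_12345.jpg"  # made-up URL
img_name = url.split("_")[-1]
target = os.path.join(path, img_name)

if not os.path.exists(target):  # avoid overwriting an existing file
    with open(target, "wb") as fp:
        fp.write(b"")  # placeholder for the downloaded bytes
```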

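Regular expressions:

Format: re.match("pattern", string)

Matching a string: match

ret = re.match("he",text)

print(ret.group())

match starts at the beginning of the string; if the beginning does not match, the whole match fails and re.match returns None.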
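A short sketch of the difference between `re.match`, which only tries the start of the string, and `re.search`, which scans the whole string. The sample text is made up.

```python
import re

text = "hello world"  # made-up sample

at_start = re.match("he", text)    # matches: the string starts with "he"
not_start = re.match("wor", text)  # None: "wor" is not at the start
anywhere = re.search("wor", text)  # search finds it later in the string

print(at_start.group(), not_start, anywhere.group())
```

Because a failed match returns None, check the result before calling `.group()` on it.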

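`.` matches any single character:

It cannot match a newline, however.

ret = re.match(".",text)

print(ret.group())

`\d` matches any single digit:

ret = re.match(r"\d",text)

print(ret.group())

`\D` matches any single non-digit:

ret = re.match(r"\D",text)

print(ret.group())

`\s` matches any single whitespace character (\n, \t, \r, ' '):

ret = re.match(r"\s",text)

print(ret.group())

`\w` matches any single character from `a-z`, `A-Z`, `0-9`, or the underscore `_` (in Python 3 it also matches Unicode letters and digits unless re.ASCII is set)

\W is the opposite

ret = re.match(r"\w",text)

print(ret.group())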
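The character classes above can be checked one by one. A sketch with made-up single-character samples:

```python
import re

# Each pattern is paired with one character it should match.
samples = {r".": "a", r"\d": "7", r"\D": "x", r"\s": "\t", r"\w": "_", r"\W": "-"}
results = {pattern: bool(re.match(pattern, ch)) for pattern, ch in samples.items()}

dot_newline = re.match(r".", "\n")                    # None: . skips newlines
dot_newline_all = re.match(r".", "\n", re.DOTALL)     # matches with re.DOTALL
word_unicode = re.match(r"\w", "中")                  # matches: Unicode word character
word_ascii = re.match(r"\w", "中", re.ASCII)          # None in ASCII-only mode

print(results)
```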

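`[]`

Matches one character from the set inside [].
Special characters inside the set must be escaped.
A leading ^ negates the set.

ret = re.match(r"[\d\-]",text)

print(ret.group())

`+`

One or more of the preceding pattern.

ret = re.match(r"[\d\-]+",text)

print(ret.group())

[\d\-]+ matches one or more characters from the set.

`*`

Zero or more of the preceding pattern.

ret = re.match(r"[\d\-]*",text)

print(ret.group())

The only difference from + is that zero occurrences also match.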

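`?`

Matches zero or one occurrence of the preceding pattern.
Unlike `.`, `?` is a quantifier, so it must follow a pattern. In the example below \d? matches zero digits, so the match succeeds with an empty string even though the text is a newline.

text = "\n"
ret = re.match(r"\d?",text)
print(ret.group())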

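`{m}` `{m,n}` `{m,}`

{m} matches exactly m repetitions of the preceding pattern; {m,n} matches m to n; {m,} matches m or more.
With fewer than m repetitions the match fails: re.match returns None, so calling .group() raises AttributeError.

ret = re.match(r"\d{2}",text)  # matches two digits; fails if there are fewer than two

print(ret.group())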
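The quantifiers `+`, `*`, `?` and `{m,n}` side by side, as a sketch on made-up inputs:

```python
import re

plus = re.match(r"[\d\-]+", "2019-07-23 x")  # one or more digits or hyphens
star = re.match(r"[\d\-]*", "abc")           # zero or more: matches the empty string
plus_fail = re.match(r"[\d\-]+", "abc")      # needs at least one: no match
optional = re.match(r"\d?", "\n")            # zero digits: matches the empty string
exact_fail = re.match(r"\d{2}", "7 days")    # needs two digits: no match
ranged = re.match(r"\d{2,3}", "12345")       # two to three digits, as many as possible

print(plus.group(), repr(star.group()), plus_fail, ranged.group())
```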

Matching an email address:

Note that . must be escaped.

ret = re.match(r"\w+@[a-z0-9]+\.[a-z]+",text)  # note: . must be escaped

print(ret.group())

Matching a URL:

ret = re.match(r"(http|https|ftp)://[^\s]+", img_href)  # [^\s]+ matches everything up to the next whitespace

print(ret.group())
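Both patterns can be exercised on sample inputs. A sketch: the addresses and URLs below are made up, and the patterns are the simplified ones from these notes, not complete validators.

```python
import re

EMAIL = re.compile(r"\w+@[a-z0-9]+\.[a-z]+")
URL = re.compile(r"(http|https|ftp)://[^\s]+")

good_mail = EMAIL.match("user_1@163.com")
bad_mail = EMAIL.match("user@@163.com")        # double @ does not match
good_url = URL.match("https://music.163.com/ and more text")
bad_url = URL.match("mailto:someone")          # scheme is not in the list

print(good_mail.group(), good_url.group(), good_url.group(1))
```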

group: capture groups

text = "apple's price $99, orange's price is $10"
ret = re.search(r"(.*?)\$(\d+),(.*?)\$(\d+)", text)
price={
    "apple":ret.group(2),
    "orange":ret.group(4)
}
print(price)
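Named groups avoid counting parentheses. This sketch uses a simplified pattern that differs from the one above; the group names `apple` and `orange` are chosen for illustration.

```python
import re

text = "apple's price $99, orange's price is $10"
ret = re.search(r"\$(?P<apple>\d+).*?\$(?P<orange>\d+)", text)

price = ret.groupdict()  # maps each group name to the text it captured
whole = ret.group(0)     # group(0) is the entire matched span
print(price, whole)
```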

findall:

text = "apple's price $99, orange's price is $10"
rtn = re.findall(r"\$\d+",text)
print(rtn)
# ['$99', '$10']

sub:

Replaces matches inside a string:
re.sub('pattern', 'replacement', text, count): count is the maximum number of replacements

text = "apple's price $99, orange's price is $10"
rtn = re.sub(r"\$\d+",'$0',text)
print(rtn)

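Greedy mode in regular expressions:

Append ? to a quantifier to make it non-greedy.

text = "apple's price $99, orange's price is $10"
rtn = re.sub("a.*i",'$0',text)
print(rtn)
# $0s $10

text = "apple's price $99, orange's price is $10"
rtn = re.sub("a.*?i",'$0',text)
print(rtn)
# $0ce $99, or$0ce is $10

Greedy matching consumes as many characters as possible. When a pattern fixes both ends and matches whatever lies between them, check whether the two ends could stretch further apart than intended.
For example, in <>><><><<>><> the greedy pattern <.*> matches the whole string at once, while the non-greedy <.*?> matches each shortest <...> piece.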
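The difference shows up clearly on a short piece of HTML. A sketch with a made-up fragment:

```python
import re

html = "<b>bold</b><i>it</i>"  # made-up fragment

greedy = re.findall(r"<.*>", html)     # one match from the first < to the last >
lazy = re.findall(r"<.*?>", html)      # each tag separately
stripped = re.sub(r"<.*?>", "", html)  # remove the tags, keep the text

print(greedy, lazy, stripped)
```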

Example: stripping the tags from HTML

html = """
<!DOCTYPE html>
<html>
<head>
    <title>Only this text remains</title>
</head>
<body>

</body>
</html>
"""
rtn = re.sub('<.+?>','',html)

This removes every <...> tag.

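split:

re.split('pattern', text)
The pattern is the delimiter: the string is split wherever the pattern matches.

re.split(' ', text)  # split on spaces

compile

Compile a regular expression ahead of time so it can be reused. With re.VERBOSE the pattern can span several lines and carry comments:

r = re.compile(r"""
    \d+ # integer part
    \.? # decimal point
    \d* # fractional part
                """,re.VERBOSE
    )

ret = re.search(r,text)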
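A compiled verbose pattern and `re.split` in use. A sketch: the sample strings are made up, and the name `NUMBER` is chosen for illustration.

```python
import re

NUMBER = re.compile(r"""
    \d+   # integer part
    \.?   # optional decimal point
    \d*   # fractional part
""", re.VERBOSE)

text = "apple costs 19.99 and pear costs 5"  # made-up sample
first = NUMBER.search(text).group()          # first number in the text
numbers = NUMBER.findall(text)               # every number in the text

words = re.split(r"\s+", "a  b\tc")          # split on runs of whitespace
fields = re.split(r"[,;]\s*", "x, y;z")      # split on comma or semicolon

print(first, numbers, words, fields)
```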

Downloading files and images:

First obtain the URL of the file or image.