Python Web Crawlers [Study Notes]

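What crawlers are for

Sometimes I come across a well-made website and want to download the whole thing to study it, but right-clicking and saving every page by hand is not realistic. A crawler can download the pages automatically.

Google Search relies on the same idea: it can only index pages because its crawlers have already fetched a huge number of them and stored them in its database.

My goal: write a script that automatically downloads videos from a video site.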

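Crawler vs. scraping

Scraping means pulling down the HTML of a target page.
A crawler differs from scraping in that it starts from a root URL and automatically follows the links on that page to other pages. Give a crawler the URL of a portal site and it can pull down all of the site's thousands of pages.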
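
To make the distinction concrete, here is a minimal sketch using only the standard library. The `SITE` dictionary is a made-up stand-in for a real website, so nothing is fetched over the network, and `scrape`, `crawl`, and `LinkCollector` are illustrative names rather than part of any library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML.
# A real crawler would download these pages over HTTP instead.
SITE = {
    "https://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "https://example.test/a.html": '<a href="/b.html">B</a> <a href="/c.html">C</a>',
    "https://example.test/b.html": '<a href="/">home</a>',
    "https://example.test/c.html": "<p>no links here</p>",
}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def scrape(url):
    """Scraping: return the HTML of one page (looked up in SITE, not downloaded)."""
    return SITE[url]


def crawl(root):
    """Crawling: start at root, scrape each page, and follow its links breadth-first."""
    seen = {root}
    queue = deque([root])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkCollector()
        parser.feed(scrape(url))
        for href in parser.links:
            absolute = urljoin(url, href)
            # Only follow pages that exist in SITE and have not been visited yet
            if absolute in SITE and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


pages = crawl("https://example.test/")
print(len(pages), "pages reached from the root URL")
```

`scrape` touches exactly one page, while `crawl` calls it repeatedly and uses the `seen` set to avoid looping forever when pages link back to each other.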

scraping

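Libraries

BeautifulSoup: parses the HTML you have fetched
Scrapy: an open-source crawling framework
urllib: sends a request to the target URL and returns the response data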
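
A small sketch of how urllib and BeautifulSoup divide the work: one fetches, the other parses. The function names `fetch_html` and `extract_links` and the sample HTML are my own for illustration, and I assume the page is UTF-8 encoded. The parser here is Python's built-in `"html.parser"`, so lxml is not required.

```python
import urllib.request

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page with urllib and decode it as UTF-8 (assumed encoding)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf8", errors="replace")


def extract_links(html):
    """Parse HTML with BeautifulSoup and return (text, href) for every link."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


# Inline sample so the sketch runs without network access;
# with a real site you would pass fetch_html(url) instead.
sample = '<html><body><a href="/page1">First</a><a href="/page2"> Second </a><a>no href</a></body></html>'
links = extract_links(sample)
print(links)
```

Passing `href=True` to `find_all` skips anchors without an `href` attribute, so `a["href"]` cannot raise a `KeyError`.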

Crawler

Tutorials

thenewboston wrote a crawler from scratch, step by step:

Link to the YouTube playlist

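How to scrape hidden content

Some elements on a site are visible in Chrome's developer tools, yet bs4 simply cannot find them.

This article explains how to use Scrapy and Selenium to scrape these hidden parts: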

The data we want to scrap is generated dynamically and presented on the screen after interacting with the Web.
The data is loaded dynamically after some AJAX communication.

from How to scrape hidden Web data with Python
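
The effect can be sketched with two made-up HTML strings: `INITIAL_HTML` stands for what a plain HTTP request returns, and `RENDERED_HTML` for what the developer tools show after the page's JavaScript has run. Both strings and the helper `find_video_src` are assumptions for illustration, not taken from any real site.

```python
from bs4 import BeautifulSoup

# Made-up example pages. INITIAL_HTML is what a plain HTTP request returns;
# RENDERED_HTML is the DOM after the page's JavaScript has run.
INITIAL_HTML = """
<html><body>
  <div id="player"></div>
  <script>/* fetches the video URL via AJAX and fills in the player */</script>
</body></html>
"""

RENDERED_HTML = """
<html><body>
  <div id="player"><video src="https://example.test/clip.mp4"></video></div>
</body></html>
"""


def find_video_src(html):
    """Return the src of the first <video> tag, or None if there is none."""
    video = BeautifulSoup(html, "html.parser").find("video")
    return video.get("src") if video else None


print(find_video_src(INITIAL_HTML))   # the tag does not exist yet
print(find_video_src(RENDERED_HTML))  # the tag exists once the script has run
```

BeautifulSoup only parses the text it is given and never executes scripts, which is why the element has to be rendered by a real browser first.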

scrapy

Installation guide

Dependency problems are yet another pit to fall into, so I recommend installing through Anaconda:

conda install -c conda-forge scrapy

Selenium

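Selenium drives a browser from code: it clicks and interacts with pages automatically, and is used mainly for testing web pages.

Tutorials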

Qiita

This rabbit hole keeps getting deeper. What was I trying to do again? Oh right, download a video...

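Implementation

Basic approach

The HTML obtained with a plain request and parsed with Beautiful Soup does not contain the video download link I want
- Why? The page is dynamic: the video part does not appear immediately, because loading ads and so on takes time
Selenium watches the page and, once a particular tag becomes visible, grabs the entire HTML
The grabbed HTML is passed to Beautiful Soup, which parses it and extracts the video URL
urllib.request.urlretrieve(url, filename) saves the video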

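import io
import sys
import urllib.request

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Gotcha: force UTF-8 on standard output. On a Chinese-locale Windows console
# Python defaults to GBK, and printing page text can fail or come out garbled.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

# Write an arbitrary string to a file.
def write_file(path, data):
    # encoding is required; otherwise the platform default (GBK here) is used
    with open(path, 'w', encoding='utf8') as f:
        f.write(data)

# Save whatever file the URL points to.
def download(url, filename):
    urllib.request.urlretrieve(url, filename)

url = "https://the-site-you-want-to-crawl"
# Start Chrome through chromedriver.
# Gotcha: chromedriver must be downloaded separately and its path given here.
# The raw string keeps the backslashes literal. Selenium 4 takes the driver path
# through a Service object; Selenium 3 accepted it as the first positional argument.
driver = webdriver.Chrome(service=Service(r"D:\workplace\chromedriver_win32\chromedriver.exe"))

driver.get(url)
try:
    # Wait up to 10 seconds for the element with this ID to become visible
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "theID-you-want-to-access")))
    # Grab the full HTML as rendered at this moment and save it to a file
    src = driver.page_source
    write_file("src.html", src)
except Exception as e:
    print("Exception", e)
finally:
    # Close the browser whether or not the wait succeeded
    driver.quit()

# Read the saved HTML back in as a string for BeautifulSoup
with open('src.html', encoding='utf8') as f:
    html = f.read()

soup = BeautifulSoup(html, "lxml")
# Simple lookup: take the <video> tag and read the URL in its src attribute
video_url = soup.find("video").get("src")

download(video_url, "myVideo.mp4")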
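
One fragile spot in the script is `soup.find("video").get("src")`: it raises an `AttributeError` when the page has no `<video>` tag, and it returns `None` when the URL sits on a nested `<source>` tag. The sketch below is one possible way to handle both cases and to resolve relative URLs. The helper name `find_video_url` and the sample pages are hypothetical, and it uses the built-in `"html.parser"` rather than lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_video_url(html, page_url):
    """Return an absolute video URL found in html, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    video = soup.find("video")
    if video is None:
        return None
    src = video.get("src")
    if not src:
        # Some pages put the URL on a nested <source> tag instead
        source = video.find("source", src=True)
        src = source["src"] if source else None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(page_url, src) if src else None


page = "https://example.test/watch/1"
print(find_video_url('<video src="/media/a.mp4"></video>', page))
print(find_video_url('<video><source src="b.mp4"></video>', page))
print(find_video_url("<p>no video on this page</p>", page))
```

Checking the result for `None` before calling `download` avoids passing an invalid URL to `urlretrieve`.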

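A reflection: most of life seems to be spent on yak shaving.

Yak shaving literally means shaving a yak. It describes the situation where reaching goal A first requires B, B first requires C, and so on, until you find that reaching A means solving everything from B to Z. Still, that is an unavoidable part of learning.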