Project goal
Collect the latest iOS App rankings and analyze popularity trends.
Part 1: Data Collection
- scrapy: a Python-based web scraping framework
- scrapydweb: a web application for managing Scrapyd clusters, with support for Scrapy log analysis and visualization
- docker: multi-service container management
1. Create the docker instances
1.1 Prerequisites:
- A Linux server
- docker and docker-compose installed (a quick check is sketched below)
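Before going further, it is worth confirming that both tools are on the PATH:

```bash
# Each command should print a version string if the corresponding tool is installed
docker --version
docker-compose --version
```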
1.2 Directory layout
```
app_trend
  /code/          # the spiders' Python code goes here
  /scrapy_web/    # scrapydweb's config, logs and build files
    /app/         # scrapydweb's config file
                  # used to override https://github.com/my8100/scrapydweb/blob/master/scrapydweb/default_settings.py
      /scrapydweb_settings_v10.py
    /logs/
    /data/
    /Dockerfile
  /scrapyd/
    /scrapyd.conf
    /Dockerfile
    /data/        # output of jobs triggered remotely on scrapyd ends up here
    /code/        # custom script that starts scrapyd
      /entrypoint.sh
```
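If it helps, the whole layout can be scaffolded in one go; this is only a convenience sketch, and the actual file contents are filled in by the steps that follow:

```bash
# Create the directory skeleton and empty placeholder files
mkdir -p app_trend/code \
         app_trend/scrapy_web/{app,logs,data} \
         app_trend/scrapyd/{data,code}
touch app_trend/scrapy_web/app/scrapydweb_settings_v10.py \
      app_trend/scrapy_web/Dockerfile \
      app_trend/scrapyd/scrapyd.conf \
      app_trend/scrapyd/Dockerfile \
      app_trend/scrapyd/code/entrypoint.sh
```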
1.3 Build the scrapydweb image
Filename: scrapy_web/Dockerfile
```dockerfile
FROM python:3.8-slim

WORKDIR /app

EXPOSE 5000

RUN apt-get update && \
    apt-get install -y git && \
    pip3 install -U git+https://github.com/my8100/scrapydweb.git && \
    apt-get remove -y git

# Use this to override the version of a specific dependency if needed
# RUN pip3 install SQLAlchemy==1.3.23 --upgrade

CMD scrapydweb
```
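To check that the image builds before wiring it into docker-compose (the tag name below is arbitrary):

```bash
# Build the scrapydweb image from the scrapy_web directory
docker build -t scrapydweb-test ./scrapy_web
```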
1.4 Build the scrapyd image
[Start logparser and scrapyd] Filename: scrapyd/code/entrypoint.sh
```bash
#!/bin/bash
logparser -dir /var/lib/scrapyd/logs -t 10 --delete_json_files &
scrapyd
```
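logparser runs in the background alongside scrapyd, parsing the log files under /var/lib/scrapyd/logs into stats that scrapydweb can display. Once the stack is up, one way to confirm it is working is to fetch the generated stats file, assuming logparser's default output file name stats.json and that scrapyd serves its logs directory over HTTP:

```bash
# stats.json is written into the scrapyd logs directory, which scrapyd exposes under /logs/
curl http://localhost:6800/logs/stats.json
```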
[Build the image] Filename: scrapyd/Dockerfile
```dockerfile
FROM debian:buster
MAINTAINER EasyPi Software Foundation

ENV SCRAPY_VERSION=2.4.1
ENV SCRAPYD_VERSION=1.2.1
ENV PILLOW_VERSION=8.1.0

RUN set -xe \
    && apt-get update \
    && apt-get install -y autoconf \
                          build-essential \
                          curl \
                          git \
                          libffi-dev \
                          libssl-dev \
                          libtool \
                          libxml2 \
                          libxml2-dev \
                          libxslt1.1 \
                          libxslt1-dev \
                          python3 \
                          python3-dev \
                          python3-distutils \
                          vim-tiny \
    && apt-get install -y libtiff5 \
                          libtiff5-dev \
                          libfreetype6-dev \
                          libjpeg62-turbo \
                          libjpeg62-turbo-dev \
                          liblcms2-2 \
                          liblcms2-dev \
                          libwebp6 \
                          libwebp-dev \
                          zlib1g \
                          zlib1g-dev \
    && curl -sSL https://bootstrap.pypa.io/get-pip.py | python3 \
    && pip install git+https://github.com/scrapy/scrapy.git@$SCRAPY_VERSION \
                   git+https://github.com/scrapy/scrapyd.git@$SCRAPYD_VERSION \
                   git+https://github.com/scrapy/scrapyd-client.git \
                   git+https://github.com/scrapinghub/scrapy-splash.git \
                   git+https://github.com/scrapinghub/scrapyrt.git \
                   git+https://github.com/python-pillow/Pillow.git@$PILLOW_VERSION \
    && pip install logparser \
    && curl -sSL https://github.com/scrapy/scrapy/raw/master/extras/scrapy_bash_completion -o /etc/bash_completion.d/scrapy_bash_completion \
    && echo 'source /etc/bash_completion.d/scrapy_bash_completion' >> /root/.bashrc \
    && apt-get purge -y --auto-remove autoconf \
                                      build-essential \
                                      curl \
                                      libffi-dev \
                                      libssl-dev \
                                      libtool \
                                      libxml2-dev \
                                      libxslt1-dev \
                                      python3-dev \
    && apt-get purge -y --auto-remove libtiff5-dev \
                                      libfreetype6-dev \
                                      libjpeg62-turbo-dev \
                                      liblcms2-dev \
                                      libwebp-dev \
                                      zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*

EXPOSE 6800
VOLUME ["/code"]
WORKDIR /code

# Copy in the startup script so the image builds standalone; at runtime
# docker-compose bind-mounts ./scrapyd/code over /code with the same file
COPY code/entrypoint.sh .
RUN ["chmod", "777", "entrypoint.sh"]
ENTRYPOINT ["./entrypoint.sh"]
```
[scrapyd configuration file] Filename: scrapyd/scrapyd.conf
```ini
[scrapyd]
eggs_dir  = /var/lib/scrapyd/eggs
logs_dir  = /var/lib/scrapyd/logs
items_dir = /var/lib/scrapyd/items
dbs_dir   = /var/lib/scrapyd/dbs
jobs_to_keep = 5
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5
# Set to 0.0.0.0 to allow access from outside the container
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
```
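Because bind_address is 0.0.0.0 and docker-compose publishes port 6800, the JSON API endpoints listed under [services] are reachable from the host; daemonstatus.json makes a handy health check once the containers are running:

```bash
# Expect a response along the lines of {"status": "ok", "pending": 0, "running": 0, "finished": 0, ...}
curl http://localhost:6800/daemonstatus.json
```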
1.5 Write the docker-compose file to define the containers
Filename: docker-compose.yml
```yaml
scrapy:
  image: vimagick/scrapyd:py3
  command: bash
  volumes:
    - ./code:/code
  working_dir: /code
  restart: unless-stopped

scrapy_web:
  container_name: scrapy_web
  restart: unless-stopped
  build: ./scrapy_web/
  ports:
    - "80:80"
  expose:
    - "80"
  volumes:
    - ./scrapy_web/app:/app
    - ./scrapy_web/logs:/logs
    - ./scrapy_web/data:/data
    - ./code:/code
  environment:
    - PASSWORD
    - USERNAME
    # Fill in this machine's IP, or the IP of another machine running the spiders
    - SCRAPYD_SERVER_1=[Your_IP]:6800
    - PORT=80
    - DATA_PATH=/data
  depends_on:
    - scrapyd

scrapyd:
  container_name: scrapyd
  build: ./scrapyd
  ports:
    - "6800:6800"
  volumes:
    - ./scrapyd/scrapyd.conf:/etc/scrapyd/scrapyd.conf
    - ./scrapyd/data:/var/lib/scrapyd
    - ./scrapyd/code:/code
  restart: unless-stopped
```
1.6 Start the docker containers
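A minimal sketch of bringing the stack up, assuming the scrapydweb_settings_v10.py override reads USERNAME and PASSWORD from the environment (the compose file above passes those two variables through to the scrapy_web container):

```bash
# Credentials for the scrapydweb UI; example values only
export USERNAME=admin
export PASSWORD=change_me

# Build the scrapyd and scrapy_web images and start all three services in the background
docker-compose up -d --build

# Follow the logs to confirm scrapyd and scrapydweb came up cleanly
docker-compose logs -f scrapyd scrapy_web
```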
Once the containers are up, the scrapydweb UI is reachable on port 80.
1.7 Write the scrapy project
```bash
docker-compose run --rm scrapy
```
This starts a container with scrapy already set up. Inside it you can run scrapy startproject tutorial directly and test your spiders. At the same time, /code inside this container maps to the code folder on the host, so any file changes are saved there and remain accessible from outside.
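Once a project has been deployed to scrapyd (for example through the scrapydweb UI), crawls can also be scheduled and inspected straight from the scrapyd API configured earlier; the project and spider names below are placeholders:

```bash
# Schedule a run of spider "top_apps" in project "tutorial" (placeholder names)
curl http://localhost:6800/schedule.json -d project=tutorial -d spider=top_apps

# List pending, running and finished jobs for that project
curl "http://localhost:6800/listjobs.json?project=tutorial"
```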