Project Goal

Collect the latest iOS App rankings and analyze popularity trends.

Data Collection

Tools used

  1. scrapy: a Python-based web crawling framework
  2. scrapyd: a service that runs Scrapy spiders and schedules them through a JSON API
  3. scrapydweb: a web application for managing a Scrapyd cluster, with Scrapy log analysis and visualization
  4. docker: multi-service container management

1. Create the Docker containers

1.1 Prerequisites

  1. A Linux server
  2. docker and docker-compose installed

1.2 Directory layout

app_trend
    /code/                          # the spiders' Python code goes here
    /scrapy_web/                    # scrapydweb config, logs, and build files
        /app/
            # scrapydweb config file,
            # used to override https://github.com/my8100/scrapydweb/blob/master/scrapydweb/default_settings.py
            /scrapydweb_settings_v10.py
        /logs/
        /data/
        /Dockerfile
    /scrapyd/
        /scrapyd.conf
        /Dockerfile
        /data/
            # output of jobs scheduled remotely via scrapyd ends up here
        /code/
            # custom script that starts scrapyd
            /entrypoint.sh
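For reference, here is a minimal sketch of what scrapydweb_settings_v10.py could look like. The option names (SCRAPYDWEB_BIND, SCRAPYDWEB_PORT, ENABLE_AUTH, USERNAME, PASSWORD, SCRAPYD_SERVERS; DATA_PATH in newer scrapydweb releases) come from the default_settings.py linked above; reading them from environment variables is an assumption that matches the variables passed in by the docker-compose file in 1.5.

# scrapy_web/app/scrapydweb_settings_v10.py
# minimal sketch: only options that differ from default_settings.py need to be set here
import os

# listen on all interfaces inside the container; PORT is injected by docker-compose
SCRAPYDWEB_BIND = '0.0.0.0'
SCRAPYDWEB_PORT = int(os.environ.get('PORT', 5000))

# basic auth for the web UI, driven by the USERNAME/PASSWORD environment variables
ENABLE_AUTH = True
USERNAME = os.environ.get('USERNAME', '')
PASSWORD = os.environ.get('PASSWORD', '')

# the scrapyd node(s) this instance manages; SCRAPYD_SERVER_1 is set in docker-compose
SCRAPYD_SERVERS = [os.environ.get('SCRAPYD_SERVER_1', '127.0.0.1:6800')]

# where scrapydweb keeps its own data (mapped to ./scrapy_web/data on the host)
DATA_PATH = os.environ.get('DATA_PATH', '')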

1.3 Build the scrapydweb image

Filename: scrapy_web/Dockerfile

FROM python:3.8-slim

WORKDIR /app

EXPOSE 5000
RUN apt-get update && \
    apt-get install -y git && \
    pip3 install -U git+https://github.com/my8100/scrapydweb.git && \
    apt-get remove -y git
# use this to pin or override the version of a dependency if needed
# RUN pip3 install SQLAlchemy==1.3.23 --upgrade
CMD scrapydweb

1.4 Build the scrapyd image

[Start logparser and scrapyd] Filename: scrapyd/code/entrypoint.sh

#!/bin/bash
# run logparser in the background to parse the Scrapy logs for scrapydweb,
# then start scrapyd in the foreground
logparser -dir /var/lib/scrapyd/logs -t 10 --delete_json_files & scrapyd
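logparser periodically parses the Scrapy log files under /var/lib/scrapyd/logs and writes the aggregated result to a stats.json file, which scrapydweb fetches through scrapyd's static /logs/ route. Once the containers are running (section 1.6), a quick sanity check could look like the sketch below; the URL assumes scrapyd's port 6800 is published on localhost.

import json
from urllib.request import urlopen

# logparser publishes its aggregated statistics through scrapyd's /logs/ route
with urlopen("http://localhost:6800/logs/stats.json") as resp:
    stats = json.load(resp)

print(list(stats.keys()))  # top-level keys of the aggregated log statistics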

[Build the image] Filename: scrapyd/Dockerfile

FROM debian:buster
MAINTAINER EasyPi Software Foundation

ENV SCRAPY_VERSION=2.4.1
ENV SCRAPYD_VERSION=1.2.1
ENV PILLOW_VERSION=8.1.0

RUN set -xe \
    && apt-get update \
    && apt-get install -y autoconf \
                          build-essential \
                          curl \
                          git \
                          libffi-dev \
                          libssl-dev \
                          libtool \
                          libxml2 \
                          libxml2-dev \
                          libxslt1.1 \
                          libxslt1-dev \
                          python3 \
                          python3-dev \
                          python3-distutils \
                          vim-tiny \
    && apt-get install -y libtiff5 \
                          libtiff5-dev \
                          libfreetype6-dev \
                          libjpeg62-turbo \
                          libjpeg62-turbo-dev \
                          liblcms2-2 \
                          liblcms2-dev \
                          libwebp6 \
                          libwebp-dev \
                          zlib1g \
                          zlib1g-dev \
    && curl -sSL https://bootstrap.pypa.io/get-pip.py | python3 \
    && pip install git+https://github.com/scrapy/scrapy.git@$SCRAPY_VERSION \
                   git+https://github.com/scrapy/scrapyd.git@$SCRAPYD_VERSION \
                   git+https://github.com/scrapy/scrapyd-client.git \
                   git+https://github.com/scrapinghub/scrapy-splash.git \
                   git+https://github.com/scrapinghub/scrapyrt.git \
                   git+https://github.com/python-pillow/Pillow.git@$PILLOW_VERSION \
    && pip install logparser \
    && curl -sSL https://github.com/scrapy/scrapy/raw/master/extras/scrapy_bash_completion -o /etc/bash_completion.d/scrapy_bash_completion \
    && echo 'source /etc/bash_completion.d/scrapy_bash_completion' >> /root/.bashrc \
    && apt-get purge -y --auto-remove autoconf \
                                      build-essential \
                                      curl \
                                      libffi-dev \
                                      libssl-dev \
                                      libtool \
                                      libxml2-dev \
                                      libxslt1-dev \
                                      python3-dev \
    && apt-get purge -y --auto-remove libtiff5-dev \
                                      libfreetype6-dev \
                                      libjpeg62-turbo-dev \
                                      liblcms2-dev \
                                      libwebp-dev \
                                      zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*

EXPOSE 6800
WORKDIR /code
# copy the startup script into the image so it exists at build time; at runtime
# docker-compose mounts ./scrapyd/code over /code, which contains the same script
COPY code/entrypoint.sh ./entrypoint.sh
RUN ["chmod", "777", "entrypoint.sh"]
VOLUME ["/code"]
ENTRYPOINT ["./entrypoint.sh"]

[scrapyd configuration file] Filename: scrapyd/scrapyd.conf

[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
logs_dir = /var/lib/scrapyd/logs
items_dir = /var/lib/scrapyd/items
dbs_dir = /var/lib/scrapyd/dbs
jobs_to_keep = 5
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5
# set bind_address to 0.0.0.0 to allow access from outside the container
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
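The endpoints listed under [services] form scrapyd's JSON API; scrapydweb drives scrapyd through them, and you can also call them directly. Below is a minimal sketch using only the standard library, assuming scrapyd is reachable on localhost:6800 (as mapped by the docker-compose file in 1.5) and that a project named tutorial with a spider named app_rank has already been deployed; both names are placeholders.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPYD = "http://localhost:6800"  # adjust if scrapyd runs on another host

# check that the scrapyd daemon is up
with urlopen(f"{SCRAPYD}/daemonstatus.json") as resp:
    print(json.load(resp))

# schedule a crawl; "tutorial" and "app_rank" are placeholder project/spider names
data = urlencode({"project": "tutorial", "spider": "app_rank"}).encode()
with urlopen(f"{SCRAPYD}/schedule.json", data=data) as resp:
    print(json.load(resp))  # returns a jobid on success

# list pending, running, and finished jobs for the project
with urlopen(f"{SCRAPYD}/listjobs.json?project=tutorial") as resp:
    print(json.load(resp))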

1.5 Write a docker-compose file to define the containers

Filename: docker-compose.yml

version: "3"  # compose file format 2+ is required for depends_on

services:
  scrapy:
    image: vimagick/scrapyd:py3
    command: bash
    volumes:
      - ./code:/code
    working_dir: /code
    restart: unless-stopped

  scrapy_web:
    container_name: scrapy_web
    restart: unless-stopped
    build: ./scrapy_web/
    ports:
      - "80:80"
    expose:
      - "80"
    volumes:
      - ./scrapy_web/app:/app
      - ./scrapy_web/logs:/logs
      - ./scrapy_web/data:/data
      - ./code:/code
    environment:
      - PASSWORD
      - USERNAME
      # fill in this machine's IP, or the IP of another machine that runs the spiders
      - SCRAPYD_SERVER_1=[Your_IP]:6800
      - PORT=80
      - DATA_PATH=/data
    depends_on:
      - scrapyd

  scrapyd:
    container_name: scrapyd
    build: ./scrapyd
    ports:
      - "6800:6800"
    volumes:
      - ./scrapyd/scrapyd.conf:/etc/scrapyd/scrapyd.conf
      - ./scrapyd/data:/var/lib/scrapyd
      - ./scrapyd/code:/code
    restart: unless-stopped

1.6 Start the Docker containers

docker-compose up -d

The scrapydweb UI can now be reached on port 80.

1.7 Write the scrapy project

docker-compose run --rm scrapy

This starts a container with scrapy already set up; inside it you can run scrapy startproject tutorial and test your spiders directly.
At the same time, the container's /code directory maps to the code folder on the host, so any files you create there are persisted and accessible from outside.
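As a concrete starting point, here is a minimal spider sketch. The spider name, the start URL, and the CSS selectors are placeholders for illustration only and need to be replaced with the actual ranking page you want to crawl; save it as, for example, tutorial/tutorial/spiders/app_rank.py inside the project created above.

import scrapy


class AppRankSpider(scrapy.Spider):
    """Placeholder spider: yields app name and rank from a hypothetical chart page."""

    name = "app_rank"
    # placeholder URL, replace with the real ranking page to scrape
    start_urls = ["https://example.com/ios/top-charts"]

    def parse(self, response):
        # the selectors below are illustrative and must be adapted to the real page markup
        for row in response.css("li.app-row"):
            yield {
                "rank": row.css("span.rank::text").get(),
                "name": row.css("a.app-name::text").get(),
            }

Once the spider works locally, it can be packaged and pushed to the scrapyd container with scrapyd-client (scrapyd-deploy), which is already installed in the scrapyd image; after that, scrapydweb can schedule and monitor the runs.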