Configuring nginx to Block Crawlers

吴老二, March 16, 2021, 18:10:17

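First, a word of contempt for the people who scrape other sites. If you only want to test your own tool, fine: crawl Baidu Baike. If you want someone else's content, leave them a message asking to repost it and keep a link back to the original; you can almost always get the data that way. People who just take without asking may get away with it for a while, but not for long. The rest of this post sets up some restrictions aimed at crawlers.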

Start by creating a dedicated anti-crawler configuration file that holds essentially all of the blocking rules:

# Block scraping tools such as Scrapy, curl, and HttpClient (case-insensitive match)
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
     return 403;
}

# Block the listed user agents, as well as requests with an empty User-Agent (case-sensitive match)
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|^$" ) {
     return 403;
}

# Block any request method other than GET, HEAD, or POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}

# To block a single IP address:
#deny 123.45.6.7;
# To block an entire /8, i.e. 123.0.0.1 through 123.255.255.254:
#deny 123.0.0.0/8;
# To block a /16, i.e. 123.45.0.1 through 123.45.255.254:
#deny 123.45.0.0/16;
# To block a /24, i.e. 123.45.6.1 through 123.45.6.254:
#deny 123.45.6.0/24;
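
As a possible alternative to chaining several `if` blocks on `$http_user_agent`, the same checks can be sketched with a `map`. This is only a sketch under assumptions: `$block_ua` is a made-up variable name, only a few of the user agents from the list above are shown, and the `map` block has to live in the `http` context rather than in the file that is included inside `server`.

```nginx
# http context (for example, inside the http { } block of nginx.conf).
# $block_ua is a hypothetical variable name chosen for this sketch.
map $http_user_agent $block_ua {
    default                          0;
    ""                               1;  # empty or missing User-Agent
    ~*(scrapy|curl|httpclient)       1;  # case-insensitive match
    ~(WinHttp|WebZIP|Python-urllib)  1;  # case-sensitive; extend with the rest of the list
}

# server context (for example, in the anti-crawler rules file).
# An empty value or "0" counts as false, so only mapped agents are rejected.
if ($block_ua) {
    return 403;
}
```

Keeping the patterns in one `map` means each request's User-Agent is classified once, and the `server` blocks only need a single `if`.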

Next, include this rules file in the nginx configuration. Here I added it to the server blocks for both port 80 and port 443.

    server {
        listen 80;
        server_name www.wulaoer.org wulaoer.org;
        index index.html index.htm index.php;
        ...................;
        include enable-php.conf;
        include  /usr/local/nginx/conf/anti_spider.conf;  # anti-crawler rules file
    }

    server {
        listen 443 ssl;
        server_name www.wulaoer.org wulaoer.org;
        index index.html index.htm index.php;
        ..................;
        include enable-php.conf;
        include  /usr/local/nginx/conf/anti_spider.conf;   # anti-crawler rules file
    }

Then restart nginx and verify that the rules work:

[root@wulaoer ~]# curl -I -A "Scrapy" www.wulaoer.org
HTTP/1.1 403 Forbidden
Server: nginx
Date: Tue, 16 Mar 2021 03:09:17 GMT
Content-Type: text/html
Content-Length: 146
Connection: keep-alive

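The rules are now in place. They have one clear weakness, though: a client only needs to send a browser-like User-Agent to keep scraping, so this is not a complete solution.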
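
Since the User-Agent header is easy to fake, one commonly used complement is per-IP rate limiting, which does not depend on what the client claims to be. The fragment below is a minimal sketch and not part of the setup described in this post: the zone name `per_ip`, the 10m zone size, the 5 requests per second rate, and the burst of 10 are all illustrative assumptions that would need tuning for real traffic.

```nginx
# http context. The zone name "per_ip", its 10m size, and the 5r/s rate
# are assumptions made for this sketch, not values from the original setup.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=5r/s;

server {
    listen 80;
    server_name www.wulaoer.org wulaoer.org;

    location / {
        # Serve up to 10 requests above the rate immediately (nodelay);
        # anything beyond that burst is rejected.
        limit_req zone=per_ip burst=10 nodelay;
        # Answer rejected requests with 429 instead of the default 503.
        limit_req_status 429;
    }
}
```

This does not stop a patient crawler either, but it caps how fast any single address can pull pages.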
