
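https://github.com/intohole/xspider is yet another case of reinventing the wheel! But let's get familiar with it together.

xspider: a simple Python crawling framework

xspider

Single-threaded crawling
Simple API
XPath/CSS/JSON extractors
Multiple queue types
Clear architecture and code logic, so you can see how a spider's crawl process works
Easy to crawl web pages and extract data from them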
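
To make the "single-threaded crawl plus queue" idea concrete, here is a minimal sketch of such a loop. It is not xspider's actual code: the `crawl` function, its parameters, and the in-memory `FAKE_SITE` are all assumptions made for illustration, and the downloader is passed in so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_urls, fetch, extract_links, allow_site, max_pages=100):
    """Single-threaded breadth-first crawl; returns URLs in visit order."""
    queue = deque(start_urls)   # FIFO queue of URLs waiting to be fetched
    seen = set(start_urls)      # URLs already queued, so none is fetched twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        body = fetch(url)       # caller-supplied downloader (hypothetical)
        visited.append(url)
        for link in extract_links(body):
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc not in allow_site:
                continue        # stay on the allowed hosts
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "http://example.com/": ["/a.html", "http://other.example/x"],
    "http://example.com/a.html": ["/", "/b.html"],
    "http://example.com/b.html": [],
}

pages = crawl(
    start_urls=["http://example.com/"],
    fetch=lambda url: FAKE_SITE.get(url, []),
    extract_links=lambda body: body,
    allow_site={"example.com"},
)
print(pages)
```

Swapping the `deque` for a stack or a priority queue changes the crawl order without touching the rest of the loop, which is presumably why a framework would offer several queue types.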

main.py:

    from xspider.spider.spider import BaseSpider
    from xspider.filters import urlfilter
    from kuailiyu import KuaiLiYu

    if __name__ == "__main__":
        # Crawl only the allowed site, starting from its home page.
        spider = BaseSpider(name="kuailiyu", page_processor=KuaiLiYu(),
                            allow_site=["kuailiyu.cyzone.cn"],
                            start_urls=["http://kuailiyu.cyzone.cn/"])
        # URL filter for article pages and paginated index pages (raw strings keep the regex escapes intact).
        spider.url_filters.append(urlfilter.UrlRegxFilter([
            r"kuailiyu.cyzone.cn/article/[0-9]*\.html$",
            r"kuailiyu.cyzone.cn/index_[0-9]+.html$"]))
        spider.start()
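
For readers who want to see what a regex-based URL filter does, here is a small standalone sketch. `RegexUrlFilter` and its `accept` method are hypothetical stand-ins, not xspider's `UrlRegxFilter`; the sketch assumes a URL is accepted when any pattern is found anywhere in it, and it uses slightly tightened patterns (escaped dots, `[0-9]+` instead of `[0-9]*`).

```python
import re


class RegexUrlFilter:
    """Hypothetical stand-in: accept a URL if any pattern is found in it."""

    def __init__(self, patterns):
        self.patterns = [re.compile(p) for p in patterns]

    def accept(self, url):
        return any(p.search(url) for p in self.patterns)


url_filter = RegexUrlFilter([
    r"kuailiyu\.cyzone\.cn/article/[0-9]+\.html$",  # article pages
    r"kuailiyu\.cyzone\.cn/index_[0-9]+\.html$",    # paginated index pages
])

for url in [
    "http://kuailiyu.cyzone.cn/article/12345.html",
    "http://kuailiyu.cyzone.cn/index_2.html",
    "http://kuailiyu.cyzone.cn/about.html",
]:
    print(url, url_filter.accept(url))
```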

kuailiyu.py:

    from xspider import processor
    from xspider.selector import xpath_selector
    from xspider import model

    class KuaiLiYu(processor.PageProcessor.PageProcessor):

        def __init__(self):
            super(KuaiLiYu, self).__init__()
            # Extract the text of the page's <title> element.
            self.title_extractor = xpath_selector.XpathSelector(path="//title/text()")

        def process(self, page, spider):
            # Collect the extracted fields for this page ("fileds" is the spelling used in the original post).
            items = model.fileds.Fileds()
            items["title"] = self.title_extractor.find(page)
            items["url"] = page.url
            return items
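
The processor pattern above (take a page, return a dict-like item) can be tried without xspider installed. The sketch below uses only the standard library; `Page` and `TitleProcessor` are hypothetical names, and `xml.etree.ElementTree` only handles well-formed markup, so real-world HTML would need a proper HTML parser instead.

```python
import xml.etree.ElementTree as ET


class Page:
    """Hypothetical page object: a URL plus well-formed markup."""

    def __init__(self, url, text):
        self.url = url
        self.text = text


class TitleProcessor:
    """Hypothetical processor: extracts the <title> text and the URL."""

    def process(self, page):
        root = ET.fromstring(page.text)
        # ".//title" is ElementTree's limited-XPath spelling of "//title".
        title = root.findtext(".//title")
        return {"title": title, "url": page.url}


page = Page(
    "http://example.com/a.html",
    "<html><head><title>Hello</title></head><body/></html>",
)
item = TitleProcessor().process(page)
print(item)
```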

The crawling part has the following project code

Supplement 1 · Nov 28, 2017

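Bumping this again. I'd like to spend some time on this project and turn it into a crawler framework with built-in crawl strategies.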

10 replies • 2017-12-01 12:56:34 +08:00

1

xiaozizayang

Nov 23, 2017

An assist: https://github.com/howie6879/talonspider

2

tamlok

Nov 23, 2017 via Android

An assist: https://github.com/tamlok/vnote

3

intohole

OP

Nov 23, 2017

@xiaozizayang I'll study it.

4

intohole

OP

Nov 23, 2017

@tamlok Awesome!

5

j1wu

Nov 23, 2017

An assist with a JavaScript version; learning from everyone here, Orz https://github.com/j1wu/cli-scraper

6

zhangysh1995

Nov 23, 2017

I happen to be learning web crawling right now, so I've bookmarked this. Keep it up, OP!

7

intohole

OP

Nov 24, 2017

@j1wu Very cool.

8

intohole

OP

Nov 24, 2017

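@zhangysh1995 The APIs inside haven't been tidied up yet. This crawler was built specifically for situations where machines are scarce, trading time for hardware.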

9

sparkssssssss

Dec 1, 2017

Marking this to study later.

10

intohole

OP

Dec 1, 2017

@coolloves Thanks for following.
