This topic was created 2409 days ago; the information in it may have since developed or changed.

The code is as follows:

# -*- coding: utf-8 -*-

import scrapy


class LinearSpider(scrapy.Spider):
    name = "linear"
    allowed_domains = ["ocw.mit.edu"]
    start_urls = ['https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/resource-index/']

    def parse(self, response):
        # Collect the links to every summary PDF on the resource index page.
        page_hrefs = response.xpath("*//tr//td/a/@href").re(".*sum.pdf")
        for href in page_hrefs:
            new_url = 'https://ocw.mit.edu' + href
            print(new_url)
            yield scrapy.Request(new_url, callback=self.parse_href)

    def parse_href(self, response):
        # Append the bytes of each downloaded PDF to one shared file.
        # The with block closes the file, so no explicit close() is needed.
        with open('linear.pdf', 'ab') as f:
            f.write(response.body)

3 replies • 2018-04-23 17:20:53 +08:00

1

Fuyu0gap

OP

2018-04-20 23:11:23 +08:00

Downloading all the PDFs on the page separately works fine. Also, why isn't this Markdown rendering completely...

2

pc10201

2018-04-21 19:07:19 +08:00

with open('linear.pdf','ab') as f
PDFs can't be concatenated that simply. It's better to download each one separately and then merge them with a third-party tool.
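
A minimal sketch of the "download each one separately" suggestion, assuming the last path segment of each URL is an acceptable file name. The helper names `filename_from_url` and `save_pdf` and the `pdfs` output directory are hypothetical and do not come from the original post. Merging the saved files afterwards is left to a dedicated PDF tool.

```python
import os
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default="download.pdf"):
    """Return the last path segment of the URL, or a fallback if it is empty."""
    path = unquote(urlsplit(url).path)
    name = os.path.basename(path)
    return name or default


def save_pdf(body, url, out_dir="pdfs"):
    """Write one response body to its own file and return the file path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename_from_url(url))
    # 'wb' truncates, so each PDF is stored whole instead of being appended.
    with open(path, "wb") as f:
        f.write(body)
    return path
```

In the spider, `parse_href` could then call `save_pdf(response.body, response.url)` instead of appending to `linear.pdf`.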

3

Fuyu0gap

OP

2018-04-23 17:20:53 +08:00

@pc10201 I ended up solving it with the same approach you suggested. But how does concatenating PDFs differ in principle from concatenating text?
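
One way to see the difference, offered as a sketch and not a full account of the format: a PDF begins with a `%PDF-` header and ends with a trailer whose `startxref` value is a byte offset into that same file, followed by `%%EOF`. Appending files byte by byte leaves several headers and trailers in one file, and the offsets in the later parts no longer match their positions. The helper `count_pdf_markers` below is hypothetical and only counts those markers; it does not parse or validate a PDF, and the two byte strings are stand-ins, not real documents.

```python
def count_pdf_markers(data):
    """Count '%PDF-' headers and '%%EOF' end markers in raw bytes."""
    return data.count(b"%PDF-"), data.count(b"%%EOF")


# Two tiny stand-ins for real files (not valid PDFs, just the markers).
first = b"%PDF-1.4\n...objects...\nstartxref\n9\n%%EOF\n"
second = b"%PDF-1.4\n...objects...\nstartxref\n9\n%%EOF\n"

# Byte-level appending, which is what opening a file with 'ab' does.
glued = first + second
print(count_pdf_markers(first))  # (1, 1)
print(count_pdf_markers(glued))  # (2, 2)
```

More than one `%PDF-` header in a file is a strong hint that whole documents were glued together, which is why a merge tool that rewrites the objects and offsets is needed.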
