Question: when Scrapy extracts 'a/text()' and the <a> contains an <em></em> tag, how do I keep the text inside the em?

This topic was created 4003 days ago; the information in it may have since evolved or changed.

See this example:

HTML:
<a href="http://v2ex.com">网站<em>V2EX</em>是......</a>

Scrapy:
title_array = site.xpath('a/text()').extract()

Result:
["网站","是......"]

Thanks in advance.

5 replies • 2014-09-25 18:58:56 +08:00

yunchenran300

2014-09-25 15:36:09 +08:00

a//text()
See http://stackoverflow.com/questions/10618016/html-xpath-extracting-text-mixed-in-with-multiple-tags

Melodic

2014-09-25 15:41:34 +08:00

a//text() works.

However, if the front-end markup is untidy, a better approach is to use the descendant axis to get the text of all descendant nodes:

a/descendant::text()

shawngao

2014-09-25 16:23:50 +08:00

@yunchenran300
@Melodic

I'm a beginner, thank you very much!

Melodic

2014-09-25 16:41:47 +08:00

@shawngao Hmph, so the OP is an iOS developer. Those of us who only know Python can only cover our faces and weep.

shawngao

2014-09-25 18:58:56 +08:00

@Melodic I dabble in a bit of everything, so these days I sometimes mix up syntax when writing code. Especially between Go and Python...