A clean HTML parsing library, meant to replace the more complex beautifulsoup4, pyquery, and lxml
github: https://github.com/gaojiuli/htmlparsing
pip install htmlparsing
# or
pip install git+https://github.com/gaojiuli/htmlparsing
import requests
from htmlparsing import Element
url = 'https://python.org'
r = requests.get(url)
e = Element(text=r.text, base_url=url)
e.links
"""
{...'/users/membership/', '/events/python-events', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
"""
e.absolute_links
"""
{...'https://python.org/download/alternatives', 'https://python.org/about/success/#software-development', 'https://python.org/download/other/', 'https://python.org/community/irc/'}
"""
e.xpath('//a')[0].attrs
"""{'href': '#content', 'title': 'Skip to content'}"""
e.xpath('//a')[0].attrs.title
"""Skip to content"""
e.css('a')[0].attrs
"""{'href': '#content', 'title': 'Skip to content'}"""
e.parse('<a href="#content" title="Skip to content">{}</a>')
"""<Result ('Skip to content',) {}>"""
e.xpath('//a')[5].text
"""PyPI"""
e.xpath('//a')[5].html
"""<a href="https://pypi.python.org/" title="Python Package Index">PyPI</a>"""
e.xpath('//a')[5].markdown
"""[PyPI](https://pypi.python.org/ "Python Package Index")"""
Currently supported selectors: xpath, css, parse
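For anyone wondering how the `links` output above relates to `absolute_links`: the sketch below resolves the relative links shown in the post against `base_url` using only the standard library. It is a minimal illustration, assuming the resolution behaves like `urllib.parse.urljoin`; the helper name `to_absolute` is hypothetical and this is not necessarily how htmlparsing does it internally.

```python
from urllib.parse import urljoin


def to_absolute(links, base_url):
    """Resolve every link against base_url (hypothetical helper, not part of htmlparsing)."""
    return {urljoin(base_url, link) for link in links}


base_url = 'https://python.org'

# Relative and scheme-relative links taken from the e.links output in the post.
relative = {
    '/users/membership/',
    '/events/python-events',
    '//docs.python.org/3/tutorial/controlflow.html#defining-functions',
}

absolute = to_absolute(relative, base_url)
for link in sorted(absolute):
    print(link)
```

Path-only links keep the host from `base_url`, while scheme-relative links (those starting with `//`) keep their own host and only inherit the `https` scheme.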
1
engHacker 2018-02-26 19:40:59 +08:00 2
With all due respect, this looks to me like a low-budget imitation of the great kenneth's requests-html (https://github.com/kennethreitz/requests-html)...
|
3
lhx2008 2018-02-26 19:59:26 +08:00 via Android
pyquery, the parsing powerhouse, is still nicer to use
|
4
lhx2008 2018-02-26 20:02:48 +08:00 via Android
With pyquery you get the links with just d("a"). Isn't xpath more cumbersome?
|
5
polythene 2018-02-26 20:04:42 +08:00
With all due respect, if nobody had pointed it out, would you have mentioned that this "drew on" the great kenneth's project?
|
7
prasanta OP @polythene I first filed an issue with him, proposing that html and element be treated as one and the same thing. He didn't reply, so I implemented one myself.
|
8
prasanta OP Submitting a PR would have meant too large a change.
|