Two questions about scraping a website with BeautifulSoup and regular expressions

1. Problem with code sample 1
from urllib.request import urlopen 
from bs4 import BeautifulSoup  
import re 
  

def getLinks(articleUrl): 
    html = urlopen("http://www.ccb.com/cn/home"+articleUrl)  
    bsObj = BeautifulSoup(html,'lxml') 
    print("bsObj=",bsObj) 
    return bsObj.find("div", {"class":"copy-right_text"}).findAll("span",re.compile("^'手机网站：'(.)*$")) 
  
links = getLinks("/indexv3.html") 
print(links) 


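The first snippet above is meant to scrape the URL shown in the "手机网站" (mobile site) field at the bottom of http://www.ccb.com/cn/home/indexv3.html . When I print the object returned by BeautifulSoup(html,'lxml'), the text "手机网站" does not appear in it at all.
Looking at the page source, everything missing from the program's output comes after this comment near the end of the page: "<!--底栏下面不可跳转部分-->" (roughly, "non-clickable section below the footer bar"). None of the source after that comment can be found with BeautifulSoup.find.
Why does this happen, and how can I get at that content? Thanks!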
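One way to narrow this down is to check whether the text is lost by the parser or was never in the response that urlopen received. The sketch below is only a diagnostic, not a confirmed explanation for this site: it feeds the same markup to several parsers and reports which ones still see the text. The helper name parsers_that_see, the SAMPLE_HTML string, and the http://m.ccb.com address in it are all made up for illustration; in practice you would pass the decoded page instead.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invented sample markup; the real page structure is an assumption.
SAMPLE_HTML = """<html><body>
<div class="copy-right_text"><!--底栏下面不可跳转部分-->
<span>手机网站：http://m.ccb.com</span></div>
</body></html>"""


def parsers_that_see(markup, needle, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to True/False, or None if it is not installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser library is not available in the environment.
            results[name] = None
            continue
        # True when the parsed tree still contains the text.
        results[name] = needle in soup.get_text()
    return results


print(parsers_that_see(SAMPLE_HTML, "手机网站"))
```

If every installed parser reports False on the real page, it is also worth testing `"手机网站" in raw_html` on the decoded response itself. If that is False too, the text is probably not in what the server sent to the script (for example because it is filled in by JavaScript), and no parser choice will recover it.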


2. Problem with code sample 2
from urllib.request import urlopen 
from bs4 import BeautifulSoup  
import re 
  
def getLinks(articleUrl): 
    html = urlopen("http://www.ccb.com/cn/home"+articleUrl) 
    bsObj = BeautifulSoup(html,'lxml') 
    print('bsObj.find=',bsObj.find("div", {"class":"Language_select"})) 
    return bsObj.find("div", {"class":"Language_select"}).findAll("a",href=re.compile("\"( http://.* )\">繁体")) 

links = getLinks("/indexv3.html") 
print('links=',links) 

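The second snippet above is meant to scrape the domain of the Traditional Chinese ("繁体") site linked at the bottom of http://www.ccb.com/cn/home/indexv3.html . The program prints the following: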

bsObj.find= 
 
http://fjt.ccb.com ">繁体 /http://en.ccb.com/en/home/indexv3.html ">ENGLISH 


links= [] 

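The http://fjt.ccb.com in the output is exactly what I want to extract, so why is links empty when it is finally printed? Any pointers would be much appreciated. Thanks!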
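A likely cause, sketched here under assumptions: when a compiled regex is passed as href=..., BeautifulSoup tests it against the attribute value only (here just the URL), not against the surrounding markup, so a pattern that includes `">繁体` can never match. The example below uses markup reconstructed from the printed output above (the exact tags are a guess), and the helper names original_pattern_matches and traditional_site_url are hypothetical. It selects the link by its text and then reads its href.

```python
import re
from bs4 import BeautifulSoup

# Markup guessed from the printed output; the real page may differ.
SAMPLE_HTML = """<div class="Language_select">
<a href="http://fjt.ccb.com">繁体</a> /
<a href="http://en.ccb.com/en/home/indexv3.html">ENGLISH</a>
</div>"""


def original_pattern_matches(markup):
    """Reproduce the original call: the regex is tested against href values only."""
    soup = BeautifulSoup(markup, "html.parser")
    box = soup.find("div", {"class": "Language_select"})
    return box.find_all("a", href=re.compile(r'"( http://.* )">繁体'))


def traditional_site_url(markup):
    """Find the <a> whose text contains 繁体 and return its href, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    box = soup.find("div", {"class": "Language_select"})
    if box is None:
        return None
    link = box.find("a", string=re.compile("繁体"))
    # strip() guards against stray whitespace around the URL in the attribute.
    return link["href"].strip() if link else None


print(original_pattern_matches(SAMPLE_HTML))
print(traditional_site_url(SAMPLE_HTML))
```

The string= argument needs a reasonably recent bs4 (older releases call it text=). If the link text is wrapped in nested tags on the real page, matching on string may not work, and filtering the results of find_all("a") by get_text() is a fallback.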

1 reply • 2017-09-01 00:28:38 +08:00

saximi

2017-09-01 00:28:38 +08:00

Bumping my own thread.