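This project was written in early July 2017. It mainly uses Python to crawl data from 网贷之家 (wdzj.com) and 人人贷 (Renrendai) and then analyzes that data.
网贷之家 is the largest P2P lending data platform in China, and 人人贷 is one of the top twenty P2P platforms in China.
Source code address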
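For packet capture I mainly used the Network panel of Chrome's developer tools. All of 网贷之家's data is returned as JSON via AJAX, while 人人贷 both returns data via AJAX and renders data directly into HTML pages.
The captured request shows the request method (GET or POST), the request headers, and the request parameters. The captured response shows the format of the returned data (JSON in this case), its structure, and the actual values. Note: what is shown here is the API that 网贷之家 exposes today. The data endpoints at the time the crawler was written differ from the current ones, so the 网贷之家 part of the crawler no longer works.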
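To make "reading the structure of the response" concrete, here is a minimal sketch that walks a parsed JSON payload and lists every leaf path with its value type. The sample body is invented for illustration; only the `data` and `platOuterVo` keys come from the crawler code in this post, and `describe_json` is a hypothetical helper name.

```python
import json


def describe_json(value, path="$"):
    """Return (path, type name) pairs for every leaf in a parsed JSON value."""
    if isinstance(value, dict):
        pairs = []
        for key, child in value.items():
            pairs.extend(describe_json(child, path + "." + key))
        return pairs
    if isinstance(value, list):
        pairs = []
        for index, child in enumerate(value):
            pairs.extend(describe_json(child, "%s[%d]" % (path, index)))
        return pairs
    return [(path, type(value).__name__)]


# Hypothetical response body; the field values are invented for the example.
sample_body = (
    '{"data": {"platOuterVo": {"platName": "demo", '
    '"onlineDate": "2015-01-01", "term": [1, 3]}}}'
)

structure = describe_json(json.loads(sample_body))
for leaf_path, type_name in structure:
    print(leaf_path, type_name)
```

Running something like this on a captured response is a quick way to decide which keys the crawler has to read before writing any storage code.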
Based on the results of the packet-capture analysis, construct the requests. This project uses Python's requests library to simulate HTTP requests. The code:
import requests


class SessionUtil():
    def __init__(self, headers=None, cookie=None):
        self.session = requests.Session()
        if headers is None:
            # Default headers, copied from a browser AJAX request
            headersStr = {"Accept": "application/json, text/javascript, */*; q=0.01",
                          "X-Requested-With": "XMLHttpRequest",
                          "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
                          "Accept-Encoding": "gzip, deflate, sdch, br",
                          "Accept-Language": "zh-CN,zh;q=0.8"
                          }
            self.headers = headersStr
        else:
            self.headers = headers
        self.cookie = cookie

    # Send a GET request
    def getReq(self, url):
        return self.session.get(url, headers=self.headers).text

    def addCookie(self, cookie):
        self.headers['cookie'] = cookie

    # Send a POST request (with the same headers as GET, so an added cookie also applies)
    def postReq(self, url, param):
        return self.session.post(url, param, headers=self.headers).text
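Among the request headers, the only key field I set is "User-Agent". Neither 网贷之家 nor 人人贷 has any anti-crawling measures, so there was no need even to set the "Referer" field to avoid cross-origin errors.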
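For a site that does check where a request comes from, adding the header is a one-line change. The sketch below only builds the request objects and never sends them, and it uses the standard library's `urllib.request` rather than requests so that it runs without third-party packages. `build_request`, `DEFAULT_HEADERS`, and the example.com URLs are placeholders, not part of the original project.

```python
from urllib.request import Request

# Assumed default headers, a shortened version of the ones set in SessionUtil.
DEFAULT_HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}


def build_request(url, referer=None):
    """Build (but do not send) a GET request with browser-like headers."""
    headers = dict(DEFAULT_HEADERS)
    if referer is not None:
        # Only needed for sites that check where the request came from.
        headers["Referer"] = referer
    return Request(url, headers=headers, method="GET")


# example.com is a placeholder host, not one of the crawled sites.
plain = build_request("http://example.com/api/list")
with_referer = build_request("http://example.com/api/list",
                             referer="http://example.com/")
print(plain.has_header("Referer"), with_referer.get_header("Referer"))
```

With requests, the equivalent would be to add a "Referer" entry to the headers dictionary that SessionUtil passes on every call.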
Below is an example crawler:
import json
import time
from databaseUtil import DatabaseUtil
from sessionUtil import SessionUtil
from dictUtil import DictUtil
from logUtil import LogUtil
import traceback
def handleData(returnStr):
    # Parse the JSON response and return the platform detail object
    jsonData = json.loads(returnStr)
    platData = jsonData.get('data').get('platOuterVo')
    return platData


# Columns of problemPlatDetail; each one is named after a key of platOuterVo
FIELDS = (
    'actualCapital', 'aliasName', 'association', 'associationDetail',
    'autoBid', 'autoBidCode', 'bankCapital', 'bankFunds', 'bidSecurity',
    'bindingFlag', 'businessType', 'companyName', 'credit', 'creditLevel',
    'delayScore', 'delayScoreDetail', 'displayFlg', 'drawScore',
    'drawScoreDetail', 'equityVoList', 'experienceScore',
    'experienceScoreDetail', 'fundCapital', 'gjlhhFlag', 'gjlhhTime',
    'gruarantee', 'inspection', 'juridicalPerson', 'locationArea',
    'locationAreaName', 'locationCity', 'locationCityName', 'manageExpense',
    'manageExpenseDetail', 'newTrustCreditor', 'newTrustCreditorCode',
    'officeAddress', 'onlineDate', 'payment', 'paymode', 'platBackground',
    'platBackgroundDetail', 'platBackgroundDetailExpand',
    'platBackgroundExpand', 'platEarnings', 'platEarningsCode', 'platName',
    'platStatus', 'platUrl', 'problem', 'problemTime', 'recordId',
    'recordLicId', 'registeredCapital', 'riskCapital', 'riskFunds',
    'riskReserve', 'riskcontrol', 'securityModel', 'securityModelCode',
    'securityModelOther', 'serviceScore', 'serviceScoreDetail',
    'startInvestmentAmout', 'term', 'termCodes', 'termWeight',
    'transferExpense', 'transferExpenseDetail', 'trustCapital',
    'trustCreditor', 'trustCreditorMonth', 'trustFunds', 'tzjPj',
    'vipExpense', 'withTzj', 'withdrawExpense',
)


def storeData(jsonOne, conn, cur, platId):
    # Read every field from the record; platId goes into the last column.
    # jsonOne is a DictUtil object, assumed here to return each value as a string.
    values = [jsonOne.get(name) for name in FIELDS] + [platId]
    columns = ','.join(FIELDS + ('platId',))
    # Use placeholders instead of string concatenation so that quotes in the
    # data cannot break the SQL. %s is the placeholder style of the common
    # MySQL drivers (pymysql, MySQLdb).
    placeholders = ','.join(['%s'] * len(values))
    sql = 'insert into problemPlatDetail (' + columns + ') values (' + placeholders + ')'
    cur.execute(sql, values)
    conn.commit()


conn, cur = DatabaseUtil().getConn()
session = SessionUtil()
logUtil = LogUtil("problemPlatDetail.log")
# Collect the ids of all problem platforms
cur.execute('select platId from problemPlat')
data = cur.fetchall()
print(data)
mylist = list()
for i in range(0, len(data)):
    platId = str(data[i].get('platId'))
    mylist.append(platId)
print(mylist)
# Fetch and store the detail record of each platform
for i in mylist:
    url = 'http://wwwservice.wdzj.com/api/plat/platData30Days?platId=' + i
    try:
        data = session.getReq(url)
        platData = handleData(data)
        dictObject = DictUtil(platData)
        storeData(dictObject, conn, cur, i)
    except Exception:
        # Print the error and move on to the next platform
        traceback.print_exc()
cur.close()
conn.close()
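Throughout the process we construct requests and then parse the response to each one. JSON responses are parsed with the json library and HTML pages with the BeautifulSoup library (for HTML pages with a complex structure, the lxml library is recommended). The parsed results are stored in a MySQL database.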
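The crawler example above only covers the JSON path. As a rough sketch of the HTML path, the block below pulls table cells out of a server-rendered page. It deliberately swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra packages (Python 3 only); the HTML fragment and the `TableCellParser` class are invented for illustration and do not reflect the real 人人贷 page layout.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Text outside a <td> (whitespace between tags) is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented fragment standing in for a server-rendered loan list page.
sample_html = """
<table>
  <tr><td>loan-1</td><td>10000</td></tr>
  <tr><td>loan-2</td><td>25000</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(sample_html)
print(parser.rows)
```

With BeautifulSoup the same extraction is usually a couple of `find_all` calls, which is why it is the more comfortable choice for real pages.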
Crawler code address (note: the crawler code runs on both Python 2 and Python 3; I deployed it on an Alibaba Cloud server and ran it with Python 2)
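Data analysis mainly uses Python's numpy, pandas, and matplotlib, supplemented by 海致 BDP (Haizhi BDP).
The usual approach is to read the data into a pandas DataFrame and analyze it there. Here is an example that reads the problem-platform data:
problemPlat=pd.read_csv('problemPlat.csv',parse_dates=True)  # problem platforms
Data structure
E.g., number of problem platforms over time:
problemPlat['id']['2012':'2017'].resample('M').count().plot(title='P2P platforms with problems')  # number of problem P2P platforms over time
Chart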
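For readers new to pandas, the monthly resample-and-count step can be spelled out in plain Python. This is only a sketch of what the aggregation computes, not how pandas implements it; the dates are invented and `monthly_counts` is a hypothetical helper.

```python
from collections import Counter
from datetime import date

# Invented dates on which platforms ran into problems.
problem_dates = [
    date(2016, 1, 5), date(2016, 1, 20), date(2016, 2, 3),
    date(2016, 4, 11), date(2016, 4, 12), date(2016, 4, 30),
]


def monthly_counts(dates):
    """Count events per calendar month, including empty months in between."""
    counts = Counter((d.year, d.month) for d in dates)
    first, last = min(counts), max(counts)
    result = []
    year, month = first
    while (year, month) <= last:
        label = "%04d-%02d" % (year, month)
        result.append((label, counts.get((year, month), 0)))
        # Advance to the next calendar month.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return result


series = monthly_counts(problem_dates)
print(series)
```

Months without any event still appear with a count of 0, which is what keeps the plotted line continuous.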
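Done with 海致 BDP (the Python libraries for drawing map distributions are fairly complex, and I had not learned them at the time).
E.g., distribution of platform transaction amounts nationwide in June. Code:
juneData['amount'].hist(density=True)
juneData['amount'].plot(kind='kde',style='k--')  # probability distribution of June transaction amounts
Kernel density chart. Kernel density distribution of the logarithm of the transaction amount:
np.log10(juneData['amount']).hist(density=True)
np.log10(juneData['amount']).plot(kind='kde',style='k--')  # probability distribution after taking the base-10 logarithm
Chart. After taking the base-10 logarithm, the distribution is much closer to the usual pyramid shape.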
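Why the logarithm helps can be seen on a handful of numbers. The sketch below uses invented, heavily right-skewed amounts and compares equal-width bins on the raw scale with bins per order of magnitude. The data and the helper name `log10_buckets` are assumptions made for this example only.

```python
import math
from collections import Counter

# Invented amounts spanning several orders of magnitude.
amounts = [50, 120, 300, 800, 1500, 2200, 4000, 9000,
           20000, 40000, 75000, 750000]


def log10_buckets(values):
    """Count values per order of magnitude, i.e. per integer part of log10."""
    return Counter(int(math.floor(math.log10(v))) for v in values)


# Equal-width bins on the raw scale: almost everything lands in the first bin.
raw_bins = Counter(v // 150000 for v in amounts)
buckets = log10_buckets(amounts)

print("raw scale:", dict(raw_bins))
for exponent in sorted(buckets):
    print("10^%d: %s" % (exponent, "#" * buckets[exponent]))
```

On the raw scale eleven of the twelve values share one bin, while the per-magnitude counts rise to a peak in the middle and fall off on both sides.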
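lujinData=platVolume[platVolume['wdzjPlatId']==59]
# 50-day rolling correlation between 陆金所 (Lufax) and the all-platform daily total
corr=lujinData['amount'].rolling(50,min_periods=50).corr(allPlatDayData['amount']).plot(title='Correlation between 陆金所 and all-platform transaction amounts over time')
Chart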
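A rolling correlation is simply the Pearson correlation recomputed over each trailing window. The plain-Python sketch below shows the idea with a window of 3 instead of 50 so the sample stays short. The two series are invented, and `pearson` and `rolling_corr` are hypothetical helpers rather than the pandas implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant window would make the denominator zero.
    return cov / math.sqrt(var_x * var_y)


def rolling_corr(xs, ys, window):
    """Correlation over each trailing window; None until the window is full."""
    result = []
    for end in range(1, len(xs) + 1):
        if end < window:
            result.append(None)
        else:
            result.append(pearson(xs[end - window:end], ys[end - window:end]))
    return result


# Invented daily amounts: the two series move together, then apart.
one_platform = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0]
all_platforms = [2.0, 4.0, 6.0, 8.0, 9.0, 10.0]

corr_series = rolling_corr(one_platform, all_platforms, window=3)
print(corr_series)
```

The first two entries are empty because the window is not yet full, which corresponds to the role of min_periods in the pandas call.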
Car-loan platforms versus all platforms: comparison of transaction amounts
carFinanceDayData=carFinanceData.resample('D').sum()['amount']
fig,axes=plt.subplots(nrows=1,ncols=2,sharey=True,figsize=(14,7))
carFinanceDayData.plot(ax=axes[0],title='Car-loan platform transaction amount')
allPlatDayData['amount'].plot(ax=axes[1],title='All P2P platforms transaction amount')
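lujinAmount=platVolume[platVolume['wdzjPlatId']==59].copy()  # copy so the new columns are not written to a slice
# Prophet expects the date column to be named ds and the value column y
lujinAmount['y']=lujinAmount['amount']
lujinAmount['ds']=lujinAmount['date']
m=Prophet(yearly_seasonality=True)
m.fit(lujinAmount)
future=m.make_future_dataframe(periods=365)  # extend the timeline by 365 days
forecast=m.predict(future)
m.plot(forecast)
Trend forecast chart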
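Data analysis code address (note: the data analysis code only runs under Python 3). Sample output after running the code (the code and its charts can be viewed without installing a Python environment)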
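This is the first project I wrote after moving from Java web development to the data field, and it is also my first Python project. I did not run into many pitfalls along the way. Overall, the barrier to entry for crawling, data analysis, and Python itself is very low.
If you want to get started with Python crawlers, I recommend Web Scraping with Python (《Python 网络数据采集》).
If you want to get started with Python data analysis, I recommend Python for Data Analysis (《利用 Python 进行数据分析》).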
1
superlead 2018-01-25 11:04:05 +08:00
Nice, very good~