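Newbie question: I have a question bank stored as a dictionary. When fetching a single question by id, a for loop and pandas perform about the same; pandas gives no speedup.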

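The question-bank dictionary looks roughly like this:
{level-1 title: {
    level-2 title: {
        sublist: [{id: 1,
                   question: xx,
                   answer: xx}]
    }
}
}
There are about 12k questions in total. The real records have more fields than this, but the structure is the same.

The language is Python. Or is there some other, better approach?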

14 replies • 2021-01-19 15:10:52 +08:00

wuwukai007

Jan 19, 2021 via Android

Did you set an index on the pandas DataFrame?
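For reference, a minimal sketch of what "setting an index" could look like in pandas. The table and its column names are hypothetical; the idea is that the question id becomes the row label, so `.loc` can look a row up by label instead of comparing every value in an `id` column.

```python
import pandas as pd

# Hypothetical flattened question table (column names are assumptions).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title1": ["primary", "primary", "middle", "middle"],
    "question": ["q1", "q2", "q3", "q4"],
    "answer": ["a1", "a2", "a3", "a4"],
})

# Make the id the row label. Do this once, after loading the data.
indexed = df.set_index("id")

# Label-based lookup; with unique ids this returns a single row as a Series.
row = indexed.loc[3]
print(row["question"], row["answer"])
```

This only pays off if the indexed frame is built once and reused across lookups.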

xpresslink

Jan 19, 2021

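No.
Your questions are stored in a list at the third level, so they can only be found by sequential search.
Unless you build a separate index dictionary keyed by question id.
Or simply trade space for speed: flatten the questions into a single-level dictionary, with each question id as a key and the level-1 and level-2 titles stored as two attributes of each question.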
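The two options above could be sketched roughly as follows. This assumes the nested layout from the original post; the sample data, the English field names, and the helpers `build_id_index` and `flatten` are made up for illustration.

```python
# Hypothetical sample data in the shape described in the original post:
# level-1 title -> level-2 title -> {"sublist": [question, ...]}
question_bank = {
    "primary": {
        "math": {"sublist": [{"id": 1, "question": "xx", "answer": "xx"},
                             {"id": 2, "question": "xx", "answer": "xx"}]},
    },
    "middle": {
        "physics": {"sublist": [{"id": 3, "question": "xx", "answer": "xx"}]},
    },
}


def build_id_index(bank):
    """Option 1: a separate index dict mapping id -> the existing question dict."""
    index = {}
    for level1 in bank.values():
        for level2 in level1.values():
            for question in level2["sublist"]:
                index[question["id"]] = question  # same object, not a copy
    return index


def flatten(bank):
    """Option 2: one flat dict keyed by id, with the titles stored as attributes."""
    flat = {}
    for title1, level1 in bank.items():
        for title2, level2 in level1.items():
            for question in level2["sublist"]:
                flat[question["id"]] = {**question, "title1": title1, "title2": title2}
    return flat


id_index = build_id_index(question_bank)  # one pass over all questions
flat_bank = flatten(question_bank)

print(id_index[3])             # hash lookup by id, no scanning of the lists
print(flat_bank[2]["title2"])  # the titles travel with the question
```

Option 1 leaves the nested structure untouched and only adds references; option 2 copies each question, which costs memory but makes the nested dictionary unnecessary for id lookups.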

ClutchBear

Jan 19, 2021

Just use MySQL or Elasticsearch.

sznewbee096

Jan 19, 2021

I'd suggest putting the data in a database. That prepares you for future scaling, migration, and fast loading.
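As a rough illustration of the database route, here is a sketch that uses Python's built-in `sqlite3` as a stand-in for MySQL (a swap made only so the example runs without a server). The table name, columns, and rows are assumptions, not the poster's schema.

```python
import sqlite3

# In-memory database for the example; pass a file path instead to persist it.
conn = sqlite3.connect(":memory:")

# The primary key on id gives the database an index to look questions up by.
conn.execute(
    "CREATE TABLE questions ("
    " id INTEGER PRIMARY KEY,"
    " title1 TEXT, title2 TEXT, question TEXT, answer TEXT)"
)

# Hypothetical rows: one per question, titles stored as plain columns.
rows = [
    (1, "primary", "math", "xx", "xx"),
    (2, "primary", "english", "xx", "xx"),
    (3, "middle", "physics", "xx", "xx"),
]
conn.executemany("INSERT INTO questions VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Parameterized lookup by id.
found = conn.execute(
    "SELECT title1, title2, question, answer FROM questions WHERE id = ?", (3,)
).fetchone()
print(found)
```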

princelai

Jan 19, 2021

I tried it out. The only way is to loop over the dictionary and convert it into a DataFrame; after that, queries are much faster.

```
import pandas as pd

# Sample data: level-1 title -> level-2 title -> level-3 title -> list of questions.
# Field names: 题目 = question, 答案 = answer.
d = {
    '中学': {
        '初一': {
            '数学': [
                {'id': 1, '题目': 'xx', '答案': 'xx'},
                {'id': 2, '题目': 'xx', '答案': 'xx'},
            ],
        },
        '初三': {
            '语文': [
                {'id': 3, '题目': 'xx', '答案': 'xx'},
                {'id': 4, '题目': 'xx', '答案': 'xx'},
            ],
        },
    },
    '小学': {
        '三年级': {
            '英语': [
                {'id': 5, '题目': 'xx', '答案': 'xx'},
                {'id': 6, '题目': 'xx', '答案': 'xx'},
            ],
            '体育': [
                {'id': 7, '题目': 'xx', '答案': 'xx'},
                {'id': 8, '题目': 'xx', '答案': 'xx'},
            ],
        },
        '五年级': {
            '美术': [
                {'id': 9, '题目': 'xx', '答案': 'xx'},
                {'id': 10, '题目': 'xx', '答案': 'xx'},
            ],
        },
    },
}

trans = []

# Walk the three title levels and turn each question list into a small DataFrame.
for title1_key, title1_val in d.items():
    for title2_key, title2_val in title1_val.items():
        for title3_key, title3_val in title2_val.items():
            tmp_df = pd.DataFrame(title3_val)
            # Keep the titles as ordinary columns so they can be queried.
            tmp_df['title1'] = title1_key
            tmp_df['title2'] = title2_key
            tmp_df['title3'] = title3_key
            trans.append(tmp_df)

# Stack the pieces; each piece keeps its own 0-based row labels.
df = pd.concat(trans)
```

As for lookups, with large amounts of data the query method is a bit faster.

df.query('id==5')
Out[156]:
   id  题目  答案  title1  title2  title3
0   5  xx  xx  小学  三年级  英语

df.query("title2=='三年级' and title3=='英语'").id
Out[158]:
0 5
1 6

ijustdo

Jan 19, 2021

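The idea of adding an index is right.

doc_inx = {123: {'ft': '一年级', 'st': '语文', 'inx': 5}, 321: {'ft': '二年级', 'st': '数学', 'inx': 5}}
Build an index like this.
Then, when fetching (question_bank_dict stands for the question-bank dictionary):
index_obj = doc_inx[321]
result = question_bank_dict[index_obj['ft']][index_obj['st']]['sublist'][index_obj['inx']]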
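A minimal sketch of how such a location index could be built in one pass instead of by hand. It assumes the nested layout from the original post; the sample data and the helpers `build_doc_inx` and `get_question` are hypothetical, while the `ft` / `st` / `inx` keys follow the reply above.

```python
# Hypothetical sample data: level-1 title -> level-2 title -> {"sublist": [...]}
question_bank = {
    "grade1": {"chinese": {"sublist": [{"id": 123, "question": "xx"}]}},
    "grade2": {"math": {"sublist": [{"id": 7, "question": "xx"},
                                    {"id": 321, "question": "xx"}]}},
}


def build_doc_inx(bank):
    """Map each question id to where it lives: titles plus list position."""
    doc_inx = {}
    for ft, level1 in bank.items():
        for st, level2 in level1.items():
            for inx, question in enumerate(level2["sublist"]):
                doc_inx[question["id"]] = {"ft": ft, "st": st, "inx": inx}
    return doc_inx


def get_question(bank, doc_inx, question_id):
    """Follow the stored location instead of scanning every sublist."""
    loc = doc_inx[question_id]
    return bank[loc["ft"]][loc["st"]]["sublist"][loc["inx"]]


doc_inx = build_doc_inx(question_bank)
result = get_question(question_bank, doc_inx, 321)
print(doc_inx[321], result)
```

Because the index stores list positions, it would need to be rebuilt whenever questions are inserted into or removed from a sublist.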

imn1

Jan 19, 2021

I'm curious: how is your pandas data structured?

rationa1cuzz

Jan 19, 2021

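@princelai I built it much like you did, though I fetch the values differently. After switching to query, the improvement still doesn't feel significant: on 12k records, the for loop takes about 1.8s and pandas about 1.2s. I don't know whether that's normal. No index was added.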

rationa1cuzz

Jan 19, 2021

@ijustdo I'll give it a try.

rationa1cuzz

Jan 19, 2021

Correction: the for loop here is also 1.2s, not 1.8s.

princelai

Jan 19, 2021

@rationa1cuzz #8 I happen to have a dataset of about 116k rows on hand:

data.shape
Out[8]: (116419, 12)

%timeit data.query("startCityId==321 and endCityId==3401")
3.14 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit data.query("startCityId==321 and endCityId==3401 and carType=='8_1'")
12.5 ms ± 7.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

data.startCityId.nunique()
Out[10]: 265
data.endCityId.nunique()
Out[11]: 284

carType is a string and the other two are integers. For lookups to be that slow on such a small dataset, something must be wrong somewhere.

rationa1cuzz

Jan 19, 2021

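@princelai What I was measuring was the API response time, not the actual lookup time. Also, after a lot of testing I found that lookups with the for loop actually take less time than with pandas, so I think I must have gotten something wrong somewhere.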

princelai

Jan 19, 2021

I generated a dataset of about 150,000 rows with a structure like yours:
```
import pandas as pd
from random import randint, choices

# Level-1 title -> possible level-2 titles
opt_t1 = {'小学': ['一年级', '二年级', '三年级', '四年级', '五年级', '六年级'],
          '初中': ['初一', '初二', '初三'],
          '高中': ['高一', '高二', '高三'],
          '大学': ['大一', '大二', '大三', '大四']}

# Possible level-3 titles
opt_t2 = ['数学', '语文', '英语', '计算机']

t1_list = []
for k in opt_t1.keys():
    # 30,000 to 50,000 random rows per level-1 title
    i = randint(30000, 50000)
    tmp = pd.DataFrame({'title2': choices(opt_t1.get(k), k=i), 'title3': choices(opt_t2, k=i)})
    tmp['title1'] = k
    tmp['题目'] = 'xx'   # question
    tmp['答案'] = 'xxx'  # answer
    t1_list.append(tmp)
df = pd.concat(t1_list)
df['id_num'] = range(1, df.shape[0]+1)  # sequential ids
df = df.sample(frac=1)                  # shuffle the rows
df.index = range(df.shape[0])           # reset row labels to 0..n-1

```

The results are as follows:

%timeit df.query("title2=='大二' and title3=='计算机'")
12.5 ms ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

An endpoint that takes seconds, like yours, is usually processing data or doing I/O inside, I'd guess.

rationa1cuzz

Jan 19, 2021

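@princelai It was my mistake. Because this endpoint's response time was so slow, my first reaction was that the lookup was the problem, so I didn't consider anything else. I checked the lookup time again, picking a few ids at random in the terminal, and found that the for loop takes about 3ms while pandas takes 70ms. Is that normal?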
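One way to time the lookup on its own, separately from the endpoint, is the standard-library `timeit` module. The sketch below compares a nested for-loop scan with an id-keyed dictionary on synthetic data; the bank size, the names `make_bank` and `find_by_scan`, and the chosen id are all assumptions, and the printed timings will vary by machine.

```python
import timeit


def make_bank(n_level1=4, n_level2=3, per_list=1000):
    """Build a synthetic nested bank with n_level1 * n_level2 * per_list questions."""
    bank, next_id = {}, 1
    for i in range(n_level1):
        level1 = bank.setdefault(f"t1_{i}", {})
        for j in range(n_level2):
            sublist = []
            for _ in range(per_list):
                sublist.append({"id": next_id, "question": "xx", "answer": "xx"})
                next_id += 1
            level1[f"t2_{j}"] = {"sublist": sublist}
    return bank


def find_by_scan(bank, question_id):
    """Sequential search through every sublist, as a plain for loop would do."""
    for level1 in bank.values():
        for level2 in level1.values():
            for question in level2["sublist"]:
                if question["id"] == question_id:
                    return question
    return None


bank = make_bank()  # 12,000 questions with the default sizes
id_index = {q["id"]: q
            for level1 in bank.values()
            for level2 in level1.values()
            for q in level2["sublist"]}

target = 11_999  # near the end, so the scan has to walk most of the data
runs = 100
scan_s = timeit.timeit(lambda: find_by_scan(bank, target), number=runs) / runs
dict_s = timeit.timeit(lambda: id_index[target], number=runs) / runs
print(f"scan: {scan_s * 1e3:.3f} ms per lookup, dict: {dict_s * 1e6:.3f} µs per lookup")
```

Timing only the lookup call like this keeps serialization, data loading, and other work in the endpoint out of the measurement.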