最近在使用 Pandas 处理 json 数据时遇到了 ValueError: Protocol not known
的问题
后面使用 json 库就解决了,不明白为什么
json 的数据就是 data 里包个 contestUpcomingContests ,里面再包一个数组,内有两个元素
import json
import pandas as pd
data = '{"data":{"contestUpcomingContests":[{"containsPremium":false,"title":"\u7b2c 99 \u573a\u53cc\u5468\u8d5b","cardImg":"https://assets.leetcode.cn/aliyun-lc-upload/contest-config/biweekly-contest-99/contest_detail/pc_card.png","titleSlug":"biweekly-contest-99","startTime":1677940200,"duration":5400,"originStartTime":1677940200},{"containsPremium":false,"title":"\u7b2c 334 \u573a\u5468\u8d5b","cardImg":"https://assets.leetcode.cn/aliyun-lc-upload/contest-config/weekly-contest-334/contest_detail/pc_card.png","titleSlug":"weekly-contest-334","startTime":1677378600,"duration":5400,"originStartTime":1677378600}]}}'
df1 = json.loads(data)
print(df1)
df2 = pd.read_json(data)
print(df2)
{'data': {'contestUpcomingContests': [{'containsPremium': False, 'title': '第 99 场双周赛', 'cardImg': 'https://assets.leetcode.cn/aliyun-lc-upload/contest-config/biweekly-contest-99/contest_detail/pc_card.png', 'titleSlug': 'biweekly-contest-99', 'startTime': 1677940200, 'duration': 5400, 'originStartTime': 1677940200}, {'containsPremium': False, 'title': '第 334 场周赛', 'cardImg': 'https://assets.leetcode.cn/aliyun-lc-upload/contest-config/weekly-contest-334/contest_detail/pc_card.png', 'titleSlug': 'weekly-contest-334', 'startTime': 1677378600, 'duration': 5400, 'originStartTime': 1677378600}]}}
Traceback (most recent call last):
File "/Users/world/Developer/AlgorithmSharkSpider/test.py", line 8, in <module>
df2 = pd.read_json(data)
File "/opt/homebrew/lib/python3.9/site-packages/pandas/util/_decorators.py", line 199, in wrapper
return func(*args, **kwargs)
File "/opt/homebrew/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/json/_json.py", line 540, in read_json
json_reader = JsonReader(
File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/json/_json.py", line 622, in __init__
data = self._get_data_from_filepath(filepath_or_buffer)
File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/json/_json.py", line 659, in _get_data_from_filepath
self.handles = get_handle(
File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/common.py", line 558, in get_handle
ioargs = _get_filepath_or_buffer(
File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/common.py", line 333, in _get_filepath_or_buffer
file_obj = fsspec.open(
File "/opt/homebrew/lib/python3.9/site-packages/fsspec/core.py", line 419, in open
return open_files(
File "/opt/homebrew/lib/python3.9/site-packages/fsspec/core.py", line 272, in open_files
fs, fs_token, paths = get_fs_token_paths(
File "/opt/homebrew/lib/python3.9/site-packages/fsspec/core.py", line 574, in get_fs_token_paths
chain = _un_chain(urlpath0, storage_options or {})
File "/opt/homebrew/lib/python3.9/site-packages/fsspec/core.py", line 315, in _un_chain
cls = get_filesystem_class(protocol)
File "/opt/homebrew/lib/python3.9/site-packages/fsspec/registry.py", line 208, in get_filesystem_class
raise ValueError("Protocol not known: %s" % protocol)
ValueError: Protocol not known: {"data":{"contestUpcomingContests":[{"containsPremium":false,"title":"第 99 场双周赛","cardImg":"https
1
bomb77 2023-02-23 14:41:24 +08:00
python3.8
pandas 1.5.3 测试没有报错 升级下版本或者看看是不是编码啥的问题? |
2
dcopen 2023-02-23 15:00:19 +08:00 2
这个问题发生在使用 Pandas 的 read_json() 函数时,该函数使用了 fsspec 库进行文件处理和读取。而在 fsspec 0.9.0 版本之后,它引入了一个新的 URL 解析机制,导致了该错误。
在处理 JSON 数据时,您可以使用 json 库将其转换为 Python 字典,然后再使用 Pandas 的 json_normalize() 函数将其展平为 Pandas 数据帧。下面是一个示例代码: ``` import json import pandas as pd data = '{"data":{"contestUpcomingContests":[{"containsPremium":false,"title":"第 99 场双周赛","cardImg":"https://assets.leetcode.cn/aliyun-lc-upload/contest-config/biweekly-contest-99/contest_detail/pc_card.png","titleSlug":"biweekly-contest-99","startTime":1677940200,"duration":5400,"originStartTime":1677940200},{"containsPremium":false,"title":"第 334 场周赛","cardImg":"https://assets.leetcode.cn/aliyun-lc-upload/contest-config/weekly-contest-334/contest_detail/pc_card.png","titleSlug":"weekly-contest-334","startTime":1677378600,"duration":5400,"originStartTime":1677378600}]}}' data_dict = json.loads(data) df = pd.json_normalize(data_dict, record_path=['data', 'contestUpcomingContests']) print(df) ``` 输出: ``` containsPremium title cardImg titleSlug startTime duration originStartTime 0 False 第 99 场双周赛 https://assets.leetcode.cn/aliyun-lc-upload/co... biweekly-contest-99 1677940200 5400 1677940200 1 False 第 334 场周赛 https://assets.leetcode.cn/aliyun-lc-upload/co... weekly-contest-334 1677378600 5400 1677378600 ``` |