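Beginner question here. As an example, file a:
10001 99
10002 20
10003 75
……
99999 12
File b:
50000 172
50001 188
……
149999 130
Desired output:
10001 99 0
10002 20 0
10003 75 0
……
149999 0 130
That is the general idea. I now have several batches of data with hundreds of millions of rows each. Without using pandas or a database, what is a fast way to join them?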
1
953424918 2016-09-15 15:57:24 +08:00 via Android
Merge the two files with the cat command?
2
zhuangzhuang1988 2016-09-15 16:14:27 +08:00
yield
4
Furylord OP @zhuangzhuang1988 Could you be more specific?
5
daybyday 2016-09-15 19:27:00 +08:00
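1. Use sort -n to sort each of the two files numerically by its first column.
2. Walk through both sorted files line by line, in order, and join as you go. Let n1 and n2 be the first-column numbers of the current line in each file. If n1 < n2, advance n1 one line at a time while n2 stays put, until n1 >= n2. At that point, n1 == n2 means the two lines need to be joined, while n1 > n2 means you switch over and advance n2 instead. Keep looping like this.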
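The two steps above can be sketched in Python roughly as follows. This is a minimal sketch under a few assumptions that are not stated in the thread: both inputs are already sorted numerically by the first column (step 1), each key appears at most once per file, and a missing side is filled with 0 as in the OP's desired output. The names `parse`, `outer_join` and `join_files` are made up for illustration.

```python
def parse(line):
    # Split "10001 99" into (10001, "99"); the key is compared numerically.
    key, value = line.split()
    return int(key), value


def outer_join(rows_a, rows_b, missing="0"):
    """Join two key-sorted iterables of (key, value) pairs.

    Assumes keys are unique within each input. Yields (key, value_a, value_b),
    using `missing` for whichever side has no row for that key.
    """
    it_a, it_b = iter(rows_a), iter(rows_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a[0] < b[0]):
            # n1 < n2 (or b is exhausted): emit a alone and advance a.
            yield a[0], a[1], missing
            a = next(it_a, None)
        elif a is None or b[0] < a[0]:
            # n1 > n2 (or a is exhausted): emit b alone and advance b.
            yield b[0], missing, b[1]
            b = next(it_b, None)
        else:
            # n1 == n2: emit the joined row and advance both sides.
            yield a[0], a[1], b[1]
            a = next(it_a, None)
            b = next(it_b, None)


def join_files(path_a, path_b, path_out):
    # Stream both sorted files, so memory use stays constant.
    with open(path_a) as fa, open(path_b) as fb, open(path_out, "w") as out:
        rows_a = (parse(line) for line in fa if line.strip())
        rows_b = (parse(line) for line in fb if line.strip())
        for key, va, vb in outer_join(rows_a, rows_b):
            out.write(f"{key} {va} {vb}\n")
```

With the sorted outputs of step 1 saved under hypothetical names, a call would look like `join_files("a.sorted", "b.sorted", "joined.txt")`. Since every line of each file is read exactly once, the join itself is linear in the total number of rows; the sorting in step 1 is the expensive part.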
6
zhuangzhuang1988 2016-09-15 20:03:25 +08:00
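```python
def read_file_gen(name):
    # Stream the file line by line so it never has to fit in memory.
    with open(name, 'r') as fp:
        for line in fp:
            yield line


def map_gen(source, fn):
    # Lazily apply fn to every item of source.
    for item in source:
        yield fn(item)


def merge_gen(gen1, gen2, choice_fn):
    # Merge two key-sorted streams into one key-sorted stream.
    # Rows sharing a key come out next to each other, not combined.
    done = object()  # sentinel marking an exhausted stream
    item1 = next(gen1, done)
    item2 = next(gen2, done)
    while item1 is not done and item2 is not done:
        if choice_fn(item1, item2) is item1:
            yield item1
            item1 = next(gen1, done)
        else:
            yield item2
            item2 = next(gen2, done)
    # One side is exhausted: flush whatever is left on the other side.
    if item1 is not done:
        yield item1
        yield from gen1
    if item2 is not done:
        yield item2
        yield from gen2


def write_file_gen_stop(source, fname):
    with open(fname, 'w') as fp:
        for line in source:
            fp.write(line)


def map_fn(line):
    # Key each line by its first column, parsed as an integer.
    key = int(line.split()[0])
    return (key, line)


def map_fn2(item):
    # Drop the key again and keep only the original line.
    return item[1]


def choice_fn(item1, item2):
    # Pick the item with the smaller key; ties go to item1.
    if item1[0] > item2[0]:
        return item2
    else:
        return item1


def _f(n):
    g_f = read_file_gen(n)
    return map_gen(g_f, map_fn)


g_merge = merge_gen(_f('a'), _f('b'), choice_fn)
g_out = map_gen(g_merge, map_fn2)
write_file_gen_stop(g_out, 'out')  # 'out' is a placeholder output file name
```
The code is only a rough sketch, but that is the general idea (assuming a and b are each already sorted).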
7
Furylord OP @zhuangzhuang1988 Thanks a lot, I'll try it out in a bit.
8
zhuangzhuang1988 2016-09-15 20:18:10 +08:00
9
ryd994 2016-09-16 06:26:30 +08:00 via Android
Honestly, loading it into a database can only make it faster...
both in development effort and in runtime.
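For comparison, here is roughly what the database route could look like with Python's built-in sqlite3 module. This is only a sketch: the table names `a` and `b`, the column names `k` and `v`, and the helper `join_with_sqlite` are made up for illustration, it assumes keys are unique within each file and values are integers, and it makes no claim about how the speed compares on data of this size. The full outer join is written as a UNION of keys plus two LEFT JOINs because older SQLite versions do not support FULL OUTER JOIN.

```python
import sqlite3


def join_with_sqlite(rows_a, rows_b, db_path=":memory:"):
    """Yield (key, value_a, value_b) rows, with 0 for a missing side.

    rows_a and rows_b are iterables of (key, value) integer pairs.
    Pass a file path as db_path when the data does not fit in memory.
    """
    con = sqlite3.connect(db_path)
    try:
        con.execute("CREATE TABLE a (k INTEGER PRIMARY KEY, v INTEGER)")
        con.execute("CREATE TABLE b (k INTEGER PRIMARY KEY, v INTEGER)")
        con.executemany("INSERT INTO a VALUES (?, ?)", rows_a)
        con.executemany("INSERT INTO b VALUES (?, ?)", rows_b)
        con.commit()
        query = """
            SELECT all_keys.k, COALESCE(a.v, 0), COALESCE(b.v, 0)
            FROM (SELECT k FROM a UNION SELECT k FROM b) AS all_keys
            LEFT JOIN a ON a.k = all_keys.k
            LEFT JOIN b ON b.k = all_keys.k
            ORDER BY all_keys.k
        """
        # Stream the result rows instead of materialising them all at once.
        yield from con.execute(query)
    finally:
        con.close()
```

The input order does not matter here, since the primary-key index and the ORDER BY clause take over the sorting that the file-based approach needs up front.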