V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录
shuaishuaivip
V2EX  ›  程序员

Hive SQL 实现问题,请大佬看下~

  •  
  •   shuaishuaivip · 279 天前 · 1434 次点击
    这是一个创建于 279 天前的主题,其中的信息可能已经有所发展或是发生改变。
    Hive SQL 实现问题:数据展示如下:
    第一条数据的 end 等于第二条数据的 start 就认为是连续的,一直到 LS,5,6,2024-01-27 08:44:37 2024-01-27 08:44:47 这条数据就断开了,我想取第一条的 start 和断开那条数据的 end 那就查询到的数据结果是这样的,该怎么写?
    name1,start,end,start_time,end_time
    LSJA24390MS139778 55.8 56.6 2024-01-27 08:44:23 2024-01-27 08:46:51
    LSJA24390MS139778 56.7 57.3 2024-01-27 08:46:53 2024-01-27 08:48:35
    LSJA24390MS139778 57.2 57.3 2024-01-27 08:48:37 2024-01-27 08:48:47

    name1,start,end,start_time,end_time
    LSJA24390MS139778 55.8 55.8 2024-01-27 08:44:23 2024-01-27 08:44:23
    LSJA24390MS139778 NULL 55.9 2024-01-27 08:44:25 2024-01-27 08:44:35
    LSJA24390MS139778 NULL 55.9 2024-01-27 08:44:37 2024-01-27 08:44:47
    LSJA24390MS139778 NULL 56 2024-01-27 08:44:49 2024-01-27 08:44:51
    LSJA24390MS139778 NULL 56.1 2024-01-27 08:44:53 2024-01-27 08:45:11
    LSJA24390MS139778 NULL 56.1 2024-01-27 08:45:13 2024-01-27 08:45:21
    LSJA24390MS139778 NULL 56.1 2024-01-27 08:45:23 2024-01-27 08:45:23
    LSJA24390MS139778 NULL 56.3 2024-01-27 08:45:25 2024-01-27 08:45:51
    LSJA24390MS139778 NULL 56.5 2024-01-27 08:45:53 2024-01-27 08:46:21
    LSJA24390MS139778 NULL 56.5 2024-01-27 08:46:23 2024-01-27 08:46:23
    LSJA24390MS139778 NULL 56.6 2024-01-27 08:46:25 2024-01-27 08:46:51
    LSJA24390MS139778 56.7 56.7 2024-01-27 08:46:53 2024-01-27 08:46:59
    LSJA24390MS139778 NULL 56.8 2024-01-27 08:47:01 2024-01-27 08:47:11
    LSJA24390MS139778 NULL 56.8 2024-01-27 08:47:13 2024-01-27 08:47:21
    LSJA24390MS139778 NULL 56.8 2024-01-27 08:47:23 2024-01-27 08:47:23
    LSJA24390MS139778 NULL 56.9 2024-01-27 08:47:25 2024-01-27 08:47:35
    LSJA24390MS139778 NULL 57 2024-01-27 08:47:37 2024-01-27 08:47:47
    LSJA24390MS139778 NULL 57 2024-01-27 08:47:49 2024-01-27 08:47:51
    LSJA24390MS139778 NULL 57 2024-01-27 08:47:53 2024-01-27 08:47:59
    LSJA24390MS139778 NULL 57.1 2024-01-27 08:48:01 2024-01-27 08:48:11
    LSJA24390MS139778 NULL 57.2 2024-01-27 08:48:13 2024-01-27 08:48:21
    LSJA24390MS139778 NULL 57.2 2024-01-27 08:48:23 2024-01-27 08:48:23
    LSJA24390MS139778 NULL 57.3 2024-01-27 08:48:25 2024-01-27 08:48:35
    LSJA24390MS139778 57.2 57.3 2024-01-27 08:48:37 2024-01-27 08:48:47
    11 条回复    2024-02-01 09:31:42 +08:00
    moyi97
        1
    moyi97  
       279 天前
    rownumber 确定第一条, lag 或者 lead 做个判断 取断掉那条, 应该就可以实现了吧
    Jinnrry
        2
    Jinnrry  
       279 天前 via Android
    你这个第一条 end 和第二条 start 也不相等啊,这不差了 2 秒吗?

    如果开始结束时间完全一致的时候,先分列,把 starttime 和 endtime 合并成一个字段,然后计数,如果某个时间点计数值是 1 ,那就说明不连续。
    shuaishuaivip
        3
    shuaishuaivip  
    OP
       279 天前
    Hive SQL 实现问题:数据展示如下:

    name1,start,end,start_time,end_time
    LSJA24390MS139778 55.8 55.8 ,2024-01-27 08:44:23 2024-01-27 08:44:23
    LSJA24390MS139778 55.8 55.9, 2024-01-27 08:44:25 2024-01-27 08:44:35
    LSJA24390MS139778 55.9 55.9 ,2024-01-27 08:44:37 2024-01-27 08:44:47
    LSJA24390MS139778 55.9 56 , 2024-01-27 08:44:49 2024-01-27 08:44:51
    LSJA24390MS139778 56 56.1, 2024-01-27 08:44:53 2024-01-27 08:45:11
    LSJA24390MS139778 56.1 56.1, 2024-01-27 08:45:13 2024-01-27 08:45:21
    LSJA24390MS139778 56.1 56.1, 2024-01-27 08:45:23 2024-01-27 08:45:23
    LSJA24390MS139778 56.1 56.3, 2024-01-27 08:45:25 2024-01-27 08:45:51
    LSJA24390MS139778 56.3 56.5, 2024-01-27 08:45:53 2024-01-27 08:46:21
    LSJA24390MS139778 56.5 56.5, 2024-01-27 08:46:23 2024-01-27 08:46:23
    LSJA24390MS139778 56.5 56.6, 2024-01-27 08:46:25 2024-01-27 08:46:51
    LSJA24390MS139778 56.7 56.7, 2024-01-27 08:46:53 2024-01-27 08:46:59
    LSJA24390MS139778 56.7 56.8, 2024-01-27 08:47:01 2024-01-27 08:47:11
    LSJA24390MS139778 56.8 56.8, 2024-01-27 08:47:13 2024-01-27 08:47:21
    LSJA24390MS139778 56.8 56.8, 2024-01-27 08:47:23 2024-01-27 08:47:23
    LSJA24390MS139778 56.8 56.9, 2024-01-27 08:47:25 2024-01-27 08:47:35
    LSJA24390MS139778 56.9 57, 2024-01-27 08:47:37 2024-01-27 08:47:47
    LSJA24390MS139778 57 57, 2024-01-27 08:47:49 2024-01-27 08:47:51
    LSJA24390MS139778 57 57, 2024-01-27 08:47:53 2024-01-27 08:47:59
    LSJA24390MS139778 57 57.1, 2024-01-27 08:48:01 2024-01-27 08:48:11
    LSJA24390MS139778 57.1 57.2, 2024-01-27 08:48:13 2024-01-27 08:48:21
    LSJA24390MS139778 57.2 57.2, 2024-01-27 08:48:23 2024-01-27 08:48:23
    LSJA24390MS139778 57.2 57.3, 2024-01-27 08:48:25 2024-01-27 08:48:35
    LSJA24390MS139778 57.2 57.3, 2024-01-27 08:48:37 2024-01-27 08:48:47
    实现成这样
    name1,start,end,start_time,end_time
    LSJA24390MS139778 55.8 56.6 2024-01-27 08:44:23 2024-01-27 08:46:51
    LSJA24390MS139778 56.7 57.3 2024-01-27 08:46:53 2024-01-27 08:48:35
    LSJA24390MS139778 57.2 57.3 2024-01-27 08:48:37 2024-01-27 08:48:47
    shuaishuaivip
        4
    shuaishuaivip  
    OP
       279 天前
    start,和 end 说的不是时间哦
    ericzy
        5
    ericzy  
       279 天前 via Android
    -- 创建一个临时表,用 lag 函数获取上一行的 end 值
    with temp as (
    select name1, start, end, start_time, end_time,
    lag(end) over (partition by name1 order by start_time) as prev_end
    from your_table
    )
    -- 从临时表中筛选出断开的数据,即 start 不等于 prev_end 的数据
    select name1, start, end, start_time, end_time
    from temp
    where start != prev_end or prev_end is null

    -- 来自 gpt ,仔细查看和使用
    shuaishuaivip
        6
    shuaishuaivip  
    OP
       279 天前
    @ericzy gpt 我都翻烂了~行不通
    BeiChuanAlex
        7
    BeiChuanAlex  
       279 天前
    一张表解决不了就 2 张表分层实现
    hpkaiq
        8
    hpkaiq  
       279 天前
    with t2 as (
    select *,
    lead(`start`) over (partition by name order by `end_time`) as next_start from your_table
    )
    ,t3 as (
    select name,start_time,end_time
    from t2 where `end` != next_start or next_start is null
    )
    ,t4 as (select * ,lag(`end_time`) over (partition by name order by `end_time`) as pre_end_time
    from t3)

    select name,step_end_time,
    min(`start`) `start`,max(`end`) `end`,
    min(start_time) start_time,max(end_time) end_time
    from
    (
    select t2.*,step_end_time from
    (
    select name,end_time step_end_time,if(pre_end_time is null,'1970-01-01 00:00:00',pre_end_time) step_start_time from t4
    )step
    join
    t2 on step.name = t2.name and t2.end_time <= step_end_time and t2.start_time > step_start_time
    )tmp
    group by name,step_end_time
    order by name,step_end_time

    小试一把,不知道行不行
    hero1874
        9
    hero1874  
       278 天前
    st 就是 start ed 是 end
    我是用 doris 写的 所以下面 0 在 hive 里应该是 null 把判断是否为 0 改成是否为 null

    第一步 用 lag 函数取出上一个的 ed 值
    select
    name,
    st,
    ed,
    start_time,
    end_time,
    lag(ed,1,0) over( partition by name order by start_time ) lag_ed1 -- 上一行的 ed 值
    from form6
    order by start_time

    第二步
    select
    t1.*,
    sum(
    case
    when lag_ed1 = 0 then 1 -- 第一行
    when lag_ed1 !=st then 1 -- 断开了
    when lag_ed1 = st then 0 -- 连续
    end )
    over( partition by name order by start_time) num
    from
    (select
    name,
    st,
    ed,
    start_time,
    end_time,
    lag(ed,1,0) over( partition by name order by start_time ) lag_ed1
    from form6
    order by start_time) t1
    order by start_time

    第三步 根据 name 和判断是否连续或者断开分组取数
    select
    name,
    min(st) st,
    max(ed) ed,
    min(start_time)start_time,
    max(end_time) end_time
    from
    (
    select
    t1.*,
    sum(
    case
    when lag_ed1 = 0 then 1
    when lag_ed1 !=st then 1
    when lag_ed1 = st then 0
    end )
    over( partition by name order by start_time) num
    from
    (select
    name,
    st,
    ed,
    start_time,
    end_time,
    lag(ed,1,0) over( partition by name order by start_time ) lag_ed1
    from form6
    order by start_time) t1
    order by start_time
    )t3
    group by name ,num
    lusansan
        10
    lusansan  
       278 天前
    select
    name1
    ,MIN(start_number) as start_number
    ,MAX(end_number) as end_number
    ,MIN(start_time) as start_time
    ,MAX(end_time) as end_time
    from
    (
    SELECT
    *
    ,case when start_number is null then row_number() over (partition by name1 order by start_time) -1 else row_number() over (partition by name1 order by start_time) end as rn
    from
    (
    SELECT
    name1
    ,start_number
    ,end_number
    ,start_time
    ,end_time
    ,lead(start_number,1) over (partition by name1 order by start_time) lead_start_number
    FROM
    (
    select
    name1,start_number,end_number,start_time,end_time
    from
    values
    ('LSJA24390MS139778' ,55.8,55.8 ,'2024-01-27 08:44:23' ,'2024-01-27 08:44:23')
    ,('LSJA24390MS139778' ,NULL,55.9 ,'2024-01-27 08:44:25' ,'2024-01-27 08:44:35')
    ,('LSJA24390MS139778' ,NULL,55.9 ,'2024-01-27 08:44:37' ,'2024-01-27 08:44:47')
    ,('LSJA24390MS139778' ,NULL,56 ,'2024-01-27 08:44:49' ,'2024-01-27 08:44:51')
    ,('LSJA24390MS139778' ,NULL,56.1 ,'2024-01-27 08:44:53' ,'2024-01-27 08:45:11')
    ,('LSJA24390MS139778' ,NULL,56.1 ,'2024-01-27 08:45:13' ,'2024-01-27 08:45:21')
    ,('LSJA24390MS139778' ,NULL,56.1 ,'2024-01-27 08:45:23' ,'2024-01-27 08:45:23')
    ,('LSJA24390MS139778' ,NULL,56.3 ,'2024-01-27 08:45:25' ,'2024-01-27 08:45:51')
    ,('LSJA24390MS139778' ,NULL,56.5 ,'2024-01-27 08:45:53' ,'2024-01-27 08:46:21')
    ,('LSJA24390MS139778' ,NULL,56.5 ,'2024-01-27 08:46:23' ,'2024-01-27 08:46:23')
    ,('LSJA24390MS139778' ,NULL,56.6 ,'2024-01-27 08:46:25' ,'2024-01-27 08:46:51')
    ,('LSJA24390MS139778' ,56.7,56.7 ,'2024-01-27 08:46:53' ,'2024-01-27 08:46:59')
    ,('LSJA24390MS139778' ,NULL,56.8 ,'2024-01-27 08:47:01' ,'2024-01-27 08:47:11')
    ,('LSJA24390MS139778' ,NULL,56.8 ,'2024-01-27 08:47:13' ,'2024-01-27 08:47:21')
    ,('LSJA24390MS139778' ,NULL,56.8 ,'2024-01-27 08:47:23' ,'2024-01-27 08:47:23')
    ,('LSJA24390MS139778' ,NULL,56.9 ,'2024-01-27 08:47:25' ,'2024-01-27 08:47:35')
    ,('LSJA24390MS139778' ,NULL,57 ,'2024-01-27 08:47:37' ,'2024-01-27 08:47:47')
    ,('LSJA24390MS139778' ,NULL,57 ,'2024-01-27 08:47:49' ,'2024-01-27 08:47:51')
    ,('LSJA24390MS139778' ,NULL,57 ,'2024-01-27 08:47:53' ,'2024-01-27 08:47:59')
    ,('LSJA24390MS139778' ,NULL,57.1 ,'2024-01-27 08:48:01' ,'2024-01-27 08:48:11')
    ,('LSJA24390MS139778' ,NULL,57.2 ,'2024-01-27 08:48:13' ,'2024-01-27 08:48:21')
    ,('LSJA24390MS139778' ,NULL,57.2 ,'2024-01-27 08:48:23' ,'2024-01-27 08:48:23')
    ,('LSJA24390MS139778' ,NULL,57.3 ,'2024-01-27 08:48:25' ,'2024-01-27 08:48:35')
    ,('LSJA24390MS139778' ,57.2,57.3 ,'2024-01-27 08:48:37' ,'2024-01-27 08:48:47') as t (name1,start_number,end_number,start_time,end_time)
    ) p1
    ) p2
    WHERE
    start_number is not null or lead_start_number is not null
    ) t1
    group by name1,rn
    ;



    name1 start_number end_number start_time end_time
    LSJA24390MS139778 55.8 56.6 2024-01-27 08:44:23 2024-01-27 08:46:51
    LSJA24390MS139778 56.7 57.3 2024-01-27 08:46:53 2024-01-27 08:48:35
    LSJA24390MS139778 57.2 57.3 2024-01-27 08:48:37 2024-01-27 08:48:47
    shuaishuaivip
        11
    shuaishuaivip  
    OP
       278 天前
    @lusansan 大佬,这个可行。
    关于   ·   帮助文档   ·   博客   ·   API   ·   FAQ   ·   实用小工具   ·   3150 人在线   最高记录 6679   ·     Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 · 27ms · UTC 12:28 · PVG 20:28 · LAX 04:28 · JFK 07:28
    Developed with CodeLauncher
    ♥ Do have faith in what you're doing.