3.4. WITH RECURSIVE

As of 2020-02-20, BigQuery does not support WITH RECURSIVE syntax. See https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#with-clause. However, with BigQuery scripting, it is possible to work around this limitation with SQL that is not as terse but clearer.

The WITH RECURSIVE syntax is useful for listing ancestors in a tree structure. The table will have a primary key that uniquely identifies each node. The table may have a field that identifies zero or one parent nodes. The table may have a field (possibly an array) that identifies zero, one, or more child nodes. There will be one or more root nodes where a field identifying the parent node contains a value that by some convention is different from the primary key value for another node. It could be null or the same as the primary key value for the node in question or some other value that is guaranteed not to be a primary key value (e.g., 0, -1, etc.). If there is a field identifying child nodes, then a leaf node will indicate that it has no children (e.g., an empty array). Some examples of tree structure tables:

employees

Each row corresponds to an employee. There is a field to uniquely identify the employee. And a field to identify a single manager over the employee. The CEO is a single root node.

tables and views

Each row corresponds to a table or view. There is a field to uniquely identify the table or view. And a field to identify zero, one, or more tables or views that a given view targets directly. Tables are leaf nodes but may be orphans. For a view, each targeted table or view is a child node. A view may be a root node. It may be a child node.

The SQL below creates a table in which the rows are related in a tree structure. In general, the values for a field like name would be arbitrary (e.g., Tom, Dick, and Harry). But here, I have chosen to make the names reflect the tree structure to make the intended relationships clearer. In this example, null is used for the parent_id of the root node.

create or replace table `recursive.tree`
as
select
    *
from unnest(
    [
        struct<id int64, name string, parent_id int64>(1, '1', null)
        ,struct<id int64, name string, parent_id int64>(2, '1-1', 1)
        ,struct<id int64, name string, parent_id int64>(3, '1-2', 1)
        ,struct<id int64, name string, parent_id int64>(4, '1-3', 1)
        ,struct<id int64, name string, parent_id int64>(5, '1-1-1', 2)
        ,struct<id int64, name string, parent_id int64>(6, '1-2-1', 3)
        ,struct<id int64, name string, parent_id int64>(7, '1-2-2', 3)
    ]
) x
;

The SQL below generates a row for each id and ancestor.

declare prior_ancestor_generation int64 default 1;

create temp table ancestors
-- depending on size of data, partitioning by ancestor_generation
-- might improve performance
partition by
    range_bucket(
        ancestor_generation
        ,generate_array(
            0   -- minimum ancestor_generation
            ,50 -- data-dependent maximum ancestor_generation
            ,1  -- bucket steps
        )
    )
as
select
    t.id
    ,t.parent_id as ancestor_id
    ,case
        -- root condition
        when t.parent_id is null then 0
        -- not the root condition
        else 1
    end as ancestor_generation
from `recursive.tree` t
;

loop
    insert into ancestors
    select
        pa.id
        ,ca.parent_id as ancestor_id
        ,pa.ancestor_generation + 1 as ancestor_generation
    -- prior ancestor 
    from ancestors pa
    -- current ancestor
    join `recursive.tree` ca
    on
        ca.id = pa.ancestor_id
    where
        -- only include rows for prior generation
        pa.ancestor_generation = prior_ancestor_generation
        -- exclude root condition
        and ca.parent_id is not null
    ;
    
    if @@row_count = 0
    then
        break;
    end if
    ;
    
    set prior_ancestor_generation = prior_ancestor_generation + 1;
end loop
;

select
    a.id
    ,t.name
    ,a.ancestor_generation
    ,a.ancestor_id
    ,ta.name as ancestor_name
from ancestors a
join `recursive.tree` t
on
    t.id = a.id
-- must be a left join, not inner join,
-- if root parent_id is not an id in tree
left join `recursive.tree` ta
on
    ta.id = a.ancestor_id
order by
    a.id
    ,a.ancestor_generation
;

Results are below.

idnameancestor_generationancestor_idancestor_name
110nullnull
21-1111
31-2111
41-3111
51-1-1121-1
51-1-1211
61-2-1131-2
61-2-1211
71-2-2131-2
71-2-2211