4.2.4. Troubleshooting Queries - Checklist

Prerequisites:

Instructions for troubleshooting a query:

  1. Check the BigQuery performance for the organization, reservation, and/or project.

    Use the queries in Section 6.5, “Projects Query Jobs Elapsed Time Analysis” to evaluate organization, reservation, and/or project performance based on the elapsed query times of recent queries and matching prior queries.

  2. Check the job-level information.

    File job.csv can be examined in a spreadsheet application.

    1. Check job state and enhanced state.

      Note that the state and enhanced state may impact the availability and accuracy of other fields.

    2. Check cache hit.

      If cacheHit is true, then BigQuery was able to respond to the query with results from a prior query job. No slots were consumed.

    3. Check priority.

      If priority is BATCH, then BigQuery could take a long time to start the job, depending on available slots. See https://cloud.google.com/bigquery/docs/running-queries#batch.

    4. Check for errors.

      If Google field error_reason is not null, then an error occurred. See also Google field error_message. Other error fields also exist. Some types of errors are registered before slots can be consumed. Some types of error are registered after slots have been consumed.

    5. Check if simple, parent, or child job.

      See https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting

      A simple job is unrelated to another job.

      A parent job will have statementType set to SCRIPT. Additional comments on fields:

      creationTime , startTime

      Normal value you would expect.

      endTime

      The parent job ends after the last child job.

      totalSlotMs

      If in-flight, then null.

      If DONE, then sum of the child jobs totalSlotMs or null if no child jobs consumed slots.

      Note that, if you include parent jobs and child jobs in an aggregate, you will double count the actual number of slot milliseconds consumed.

      numChildJobs

      If in-flight, then null.

      If DONE, then number of child jobs or null if no child executed.

      job_stages , timeline

      Zero length array.

      reservation_id

      null.

      A child job will have parentJobId set to the job ID of the parent job. Most fields are treated just like fields for a simple query job. Additional comments on fields:

      query (query text)

      Portion of a script that is a DML or DDL statement.

      set statements with a SELECT statement will result in child query text that contains two semicolon-delimited statements:

      • CREATE TEMP FUNCTION

      • SELECT statement that wraps the given SELECT statement

      jobId

      'script_job_' + long random-looking hexadecimal number + '_' + zero-based child number

    6. Check for long delays, when no slots are assigned to a stage.

      The PENDING interval is a delay between the job creationTime and job startTime. For INTERACTIVE queries this delay is almost always small, typically, less than 1 second. It is highly unusual for this delay to last longer than a couple minutes. Derived field pending_ms indicates the length of this delay.

      The UNSTAGED interval is a delay between the job startTime and stages_start_time. This delay can be quite large, possibly hours. While a job is in-flight, it can be determined to be UNSTAGED if the job state is RUNNING (or equivalently, there is some value for startTime) and there are no job_stages elements (or equivalently, derived field stages_count is zero). The difference between the current time and the startTime would be the current delay. After a query job is DONE, derived field unstaged_ms indicates the length of this delay.

      The zero, one, or more INACTIVE intervals is a delay between job stages. This delay can be quite large, possibly hours. While a job is in-flight, it can be determined to be INACTIVE (or possibly ENDING) if there are some number of COMPLETE stages (derived field stages_complete_count is greater than zero) but no RUNNING stages (stages_running_count is zero). There is currently no good way of measuring how long this delay is just by looking at the job information. And there is no good way of determining if the job was INACTIVE previously. After a query job is DONE, derived field staged_inactive_ms indicates the length of these delays.

      The ENDING interval is a delay between the stages_end_time and job endTime. This delay is almost always small, typically, less than 1 second. For most query jobs, it is unusual for this delay to last longer than a few minutes. For DELETE, MERGE, and UPDATE query jobs, it is possible for this delay to be hours. While a job is in-flight, it can be determined to be ENDING (or possibly INACTIVE) if there are some number of COMPLETE stages but no RUNNING stages. There is currently no good way of measuring how long this delay is just by looking at the job information. After a query job is DONE, derived field ending_ms indicates the length of this delay.

      Factors involved in delays may include:

      • Number of slots assigned to the reservation containing the project to which the job is billed.

      • In the billing project, number of query stages simultaneously requesting slots and the number of slots requested.

      • In the reservation containing the project to which the job is billed, number of projects requesting slots and the number of slots requested.

      • In the organization, number of query stages simultaneously requesting slots and the number of slots requested. If you are lucky, your job may automatically be assigned some number of slots borrowed from another reservation.

      For a description of BigQuery slot pricing and scheduling, see:

      Some options to consider:

      • Learn to live with some possible delay.

      • Work with teammates to schedule query jobs to avoid contention for slots in the billing project and the reservation.

      • It might be possible to optimize a query or the targeted views or tables to reduce the amount of resources the query requires.

      • Obtain more slots for your project.

        Some possibilities:

        • Have your project assigned to a new or different reservation.

        • Pay for additional slots for the reservation containing your project.

        • Change your project from using flat pricing to using on-demand pricing.

    7. Check statistics for the longest stage.

      These fields are only valid after the query job is DONE. Google fields max_elapsed_ms_name, max_elapsed_ms_start_time, and max_elapsed_ms_end_time and derived field max_elapsed_ms are shown for the job stage with maximum elapsed_ms. These fields can indicate if one stage uses a disproportionate fraction of the job time. You may also want to begin job stage troubleshooting with this stage.

    8. Check wait statistics.

      These fields are only valid after the query job is DONE. A job stage slot is waiting for some interval of time when it is unable to execute useful work (read, compute, or write). This may be because of I/O limitations or time spent waiting on Google's shuffle service. A high value for a wait statistic may indicate a problem. See derived fields max_wait_ms_max and sum_wait_ms_max.

    9. Check compute skew.

      These fields are only valid after the query job is DONE. Google fields max_significant_compute_name, max_significant_compute_ms_avg, and max_significant_compute_ms_max and derived field max_significant_compute_skew are shown for the job stage with maximum significant compute_ms_max. Significant job stages have compute_ms_max >= threshold_compute_ms_max and compute_skew >= threshold_compute_skew. If Google field compute_ms_max is large (many minutes) and derived field compute_skew is large (approaching 1), this indicates a skew problem, that a small number of the total number of slots are responsible for a large amount of the work done. Look into ways to reduce skew. TBD...

    10. Check the elapsed time during which BigQuery is actively working on the query and check total slot milliseconds.

      Derived field staged_active_ms is a sum of the elapsed time of all stage islands, where a stage island is an interval containing one or more contiguous or overlapping stages. So, this field is a measure of how many seconds one or more slots was allocated. This type of measure is notoriously inconsistent because system load can easily cause it to vary between two jobs with the same query targeting the same data.

      Google field total_slot_ms is a sum of slot milliseconds consumed by the query. For various reasons, this measure can vary by as much as a factor of two or more between two jobs with the same query targeting the same data. For example, a common strategy for evaluating two different query strategies is to execute the same queries 10 times in approximately the same timeframe and consider the average slot milliseconds of each of the two queries.

    11. Check total bytes processed, shuffle bytes, and records written.

      Google field total_bytes_processed reflects how many bytes were read for the query job. Derived field shuffle_output_bytes is the sum of shuffle_output_bytes over all job stages. Derived field last_records_written is the records_written of the last stage. So, these fields represent the input, intermediate output, and final output of the query job. And they can be useful for gauging the performance of one run of a query to similar prior runs of the same query.

      Derived field shuffle_output_bytes_spilled is the sum of shuffle_output_bytes_spilled over all job stages. This measures how many bytes in shuffle had to be spilled to storage instead of remaining in memory. Read and writes to storage are considerably slower than to memory. So, if this value is large, the job may take longer than otherwise.

  3. Check the job stage-level information.

    File job_stages.csv can be examined in a spreadsheet application.

    If the job stage-level information is computed using Section 6.3, “Query Job Stages”, there will be fields pertaining to primary job stages and related job stages. Refer to that section for further details.

    1. Check job stage status.

      Check job stage status field.

    2. Check parallel inputs fields.

      While in-flight, you can gauge the progress of RUNNING stages by looking at Google fields completed_parallel_inputs and parallel_inputs and derived field completed_parallel_inputs_percent. Google indicates that it is possible that not all inputs within a stage need to be completed for the stage to complete. For more information, see https://cloud.google.com/bigquery/query-plan-explanation#stage_overview.

    3. Check the elapsed time during which BigQuery is actively working on the query and check total slot milliseconds.

      Especially, when there is a large number of stages, focusing on those stages with high elapsed milliseconds and/or high slot milliseconds can speed troubleshooting. Note that the root cause of the high cost of a given stage may be found in a prior stage. So, you may need to carefully backtrack some number of stages to find the root cause. Google fields steps and input_stages can be used for this.

      Another thing to consider is that, if the stages consuming the vast majority of elapsed milliseconds and/or slot milliseconds can be reasonably confirmed to be operating correctly and efficiently, then any attempt to optimize other stages will have only a very small impact on overall cost. For example, if 99% of the cost of the query is found to reside in the final output stage and the query results are verified to be correct, then any optimization attempts can impact no more than 1% of the cost of the query.

      Derived field elapsed_ms is the difference between Google fields end_ms and start_ms.

      Derived field related_slot_ms is a sum of slot milliseconds consumed by the related stages.

    4. Check for large wait statistics.

      These fields are only valid after the query job is DONE. A job stage slot is waiting for some interval of time when it is unable to execute useful work (read, compute, or write). This may be because of I/O limitations or time spent waiting on Google's shuffle service. A high value for a wait statistic may indicate a problem. See Google fields wait_ms_avg and wait_ms_max.

    5. Check for large shuffle bytes spilled.

      A stage with a large shuffle_output_bytes_spilled may cost more in elapsed time and slot milliseconds. If shuffle_output_bytes_spilled is large, then there may also be large wait_ms_avg and wait_ms_max. There could be an underlying problem with the query or the shuffle service. Or it could just be that there are too many queries contending for reservation shuffle resources at the same time.

    6. Check for large compute_ms_max and compute skew.

      These fields are only valid after the query job is DONE. If Google field compute_ms_max is large (many minutes) and derived field compute_skew is large (approaching 1), this indicates a skew problem, that a small number of the total number of slots are responsible for a large amount of the work done. Look into ways to reduce skew. TBD...

    7. Check that partition elimination occurs as expected.

      Google field steps and derived field read_from will indicate any tables that may be read from in the stage. For a partitioned table, you expect the READ step for the table to indicate filtering on the partitioning field. Also, Google field records_read indicates the sum of the number of records read from each table and spool referenced in the stage. So, if you know the total number of records in the table, you may be able to infer from this field whether or not partition elimination has occurred.

      For additional information about partition elimination, see Section 4.1, “Partition Elimination”.

    8. Check for extraneous data.

      The smaller the amount of data input to a stage, the less the cost of the subsequent operations. So, reduce the data as much as possible and reduce the data as early as possible in the query plan. For example, by filtering significantly in a READ step, you can reduce the cost of a subsequent JOIN step.

      Where possible, reduce both the number of fields and the number of rows. Check the query text. Check Google fields recordsRead and recordsWritten for each stage.

    9. Check join conditions.

      A common mistake in queries is that join conditions are incorrect. In addition to yielding incorrect results, this can cause performance problems.

      Inspect the query text to see if join conditions are as intended and/or make sense.

      In some cases, you may be missing one or more join conditions.

      In some cases, you may be joining to a data at the wrong grain. For example, maybe for the query to make sense, you should join to data at a (date, location) grain. But you are joining data that is at a (date, location, sublocation) grain. Maybe you need to aggregate the data to (date, location) grain before joining. Or maybe you need to somehow select rows with filtering to ensure the data is at (date, location) grain prior to joining.

      Check Google field recordsWritten to see if the result of the join makes sense.

      You can use Section 6.1, “Primary Key Test” to see whether or not the join results violate the expected primary key. You can also use the same section to test separately the primary key of each table that is joined.

    10. Check join strategies.

      Check that BigQuery is choosing efficient join strategies given the nature of the data.

      One join strategy is to distribute rows of a large table (or spool) across slots and copy an entire small table (or spool) to each slot. If this is the case, the Google field steps will indicate an EACH WITH ALL join. This can also be referred to as a broadcast join. This is only possible if the subset of the small table (or spool) is small enough to fit in the memory of each slot alongside some portion of the large table (or spool).

      The other join strategy is to distribute rows of both tables (or spools) by a hash of the join condition values across slots. If this is the case, the Google field steps will indicate an EACH WITH EACH join.