3.2. Character Sets

3.2. Character Sets
Prev	3. Query Syntax	Next

BigQuery natively interprets the STRING data type as a sequence of UTF-8 characters. Each single UTF-8 character uses from one to four 8-bit bytes.

Note that the mapping for all 7-bit ASCII characters is the same in UTF-8. And the mapping for printable 7-bit characters in ISO-8859-1 is also the same as UTF-8. But the 8-bit ISO-8859-1 characters are mapped to different byte sequences in UTF-8. For example, À (uppercase A with a grave accent) is encoded as hexadecmial C0 in ISO-8859-1 and as C380 in UTF-8.

Note that there are character sets that are similar to ISO-8859-1 but not entirely the same. E.g., ISO-8859-15, Windows-1252, etc. Sometimes data using these character sets is mislabled as using the ISO-8859-1 character set.

It is possible to load CSV files in the ISO-8859-1 encoding; the files are translated to UTF-8 automatically. However, JSON (and newline-delimited JSON) files must always be UTF-8 encoded.

In order to preserve an original ISO-8859-1 byte sequence in BigQuery, you would store the sequence in a field with data type BYTES.

Some BigQuery STRING functions can actually operate on BYTES arguments and/or return BYTES values.

The BigQuery function SAFE_CONVERT_BYTES_TO_STRING converts BYTES to a STRING (UTF-8 encoded) without throwing an error on invalid byte sequences. All byte sequences that are invalid in UTF-8 are mapped to the UTF-8 replacement character � (Unicode codepoint U+FFFD, which in UTF-8 is byte sequence EFBFBD).

The query below illustrates the situation.

with t
as
(
    select
        byte_seq_nbr
        ,a.byte_seq
        ,a.byte_seq_desc
    from (
        select
            array<struct<byte_seq bytes, byte_seq_desc string>>
            [
                (b'\x41', 'A')
                ,(b'\xc0', 'A-grave 8859-1')
                ,(b'\xc3\x80', 'A-grave UTF-8')
            ] as a
    ) t
    join unnest(t.a) a with offset byte_seq_nbr
)
select
    t.byte_seq_nbr
    ,t.byte_seq_desc
    ,to_hex(t.byte_seq) as byte_seq_hex
    ,safe_convert_bytes_to_string(t.byte_seq) as safe_str
    ,format(
        '%X'
        ,to_code_points(
            safe_convert_bytes_to_string(
                t.byte_seq
            )
        )[offset(0)]
    ) as safe_str_code_points_hex
    ,to_hex(
        cast(
            safe_convert_bytes_to_string(
                t.byte_seq
            )
            as bytes
        )
    ) as safe_str_byte_seq_hex
from t
order by
    t.byte_seq_nbr
;

byte_seq_nbr	byte_seq_desc	byte_seq_hex	safe_str	safe_str_code_points_hex	safe_str_byte_seq_hex
0	A	41	A	41	41
1	A-grave 8859-1	c0	�	FFFD	efbfbd
2	A-grave UTF-8	c380	À	C0	c380

BigQuery does not have a built-in function for faithfully translating all ISO-8859-1 byte sequences to UTF-8 byte sequences. Other languages may have libraries to handle this. E.g.,

b_8859_1 = b'\xc0' # A-grave
b_utf8 = b_8859_1.decode('iso-8859-1').encode('utf8')
print(
    ''.join(
        '{:02x}'.format(
            ord(c)
        ) for c in b_8859_1
    )
)
print(b_8859_1)
print(
    ''.join(
        '{:02x}'.format(
            ord(c)
        ) for c in b_utf8
    )
)
print(b_utf8)

c0
?
c380
À

The query below illustrates how to search for problematic ISO-8859-1 byte sequences. Note that the WITH clause is used purely for the purpose of manufacturing an input table. Note that, as written, this query only searches for the first problematic byte sequence in a BYTES field.

with ISO_8859_1
as
(
    select
        ID
        ,ISO_8859_1_BYTES
    from (
        select
            array<bytes>
            [
                (concat(b'Th', b'\xeb', b' quick brown fox') ) -- \xeb is 'e' with an umlaut
                ,(concat(b'The qu', b'\xef', b'ck brown fox') ) -- \xef is 'i' with an umlaut
                ,(concat(b'The quick br', b'\xf6', b'wn fox') ) -- \xef is 'o' with an umlaut
            ] as ISO_8859_1_BYTES
    ) t
    join unnest(t.ISO_8859_1_BYTES) ISO_8859_1_BYTES with offset ID
)
select
    x.ID
    ,x.FIRST_REPLACEMENT_POS
    ,to_hex(
        substr(
            x.ISO_8859_1_BYTES
            ,x.FIRST_REPLACEMENT_POS
            ,1
        )
    ) as FIRST_REPLACEMENT_BYTE_HEX
    ,substr(
        x.ISO_8859_1_SAFE_STRING
        ,case
            when x.FIRST_REPLACEMENT_POS - 10 < 1 then 1
            else x.FIRST_REPLACEMENT_POS - 10
        end
        ,20
    ) as SURROUNDING_STRING
    ,x.ISO_8859_1_SAFE_STRING
from (
    select
        ISO_8859_1.ID
        ,ISO_8859_1.ISO_8859_1_BYTES
        ,safe_convert_bytes_to_string(
            ISO_8859_1.ISO_8859_1_BYTES
        ) as ISO_8859_1_SAFE_STRING
        ,strpos(
            safe_convert_bytes_to_string(
                ISO_8859_1.ISO_8859_1_BYTES
            )
            ,'\ufffd'
        ) as FIRST_REPLACEMENT_POS
    from ISO_8859_1
) x
where
    x.FIRST_REPLACEMENT_POS > 0
order by
    x.ID
;

ID	FIRST_REPLACEMENT_POS	FIRST_REPLACEMENT_BYTE_HEX	SURROUNDING_STRING	ISO_8859_1_SAFE_STRING
0	3	eb	Th� quick brown fox	Th� quick brown fox
1	7	ef	The qu�ck brown fox	The qu�ck brown fox
2	13	f6	e quick br�wn fox	The qu�ck brown fox

The query below illustrates how to translate ISO-8859-1 to UTF-8 within BigQuery. Note that the WITH clause is used purely for the purpose of manufacturing an input table.

with ISO_8859_1
as
(
    select
        ID
        ,ISO_8859_1_BYTES
    from (
        select
            array<bytes>
            [
                (concat(b'Th', b'\xeb', b' quick brown fox') ) -- \xeb is 'e' with an umlaut
                ,(concat(b'The qu', b'\xef', b'ck brown fox') ) -- \xef is 'i' with an umlaut
                ,(concat(b'The quick br', b'\xf6', b'wn fox') ) -- \xef is 'o' with an umlaut
            ] as ISO_8859_1_BYTES
    ) t
    join unnest(t.ISO_8859_1_BYTES) ISO_8859_1_BYTES with offset ID
)
select
    ISO_8859_1.ID
    ,safe_convert_bytes_to_string(
        code_points_to_bytes(
            (
                select
                    array_agg(
                        utf_8_byte
                        order by
                            offset
                            ,byte_num
                    )
                from (
                    select
                        offset
                        ,1 as byte_num
                        ,(192 | iso_8859_1_code_point >>6) as utf_8_byte
                    from unnest(
                        to_code_points(
                            ISO_8859_1.ISO_8859_1_BYTES
                        )
                    ) iso_8859_1_code_point with offset
                    where
                        iso_8859_1_code_point >= 128
                    union all
                    select
                        offset
                        ,2 as byte_num
                        ,case
                            when iso_8859_1_code_point < 128 then iso_8859_1_code_point
                            else
                                (128 | (iso_8859_1_code_point & 63))
                        end as utf_8_byte
                    from unnest(
                        to_code_points(
                            ISO_8859_1.ISO_8859_1_BYTES
                        )
                    ) iso_8859_1_code_point with offset
                ) bytes
            )
        )
    ) as UTF_8_STRING
from ISO_8859_1
order by
    ISO_8859_1.ID
;

ID	UTF_8_STRING
0	Thë quick brown fox
1	The quïck brown fox
2	The quick bröwn fox

Below are links to encoding references.

Prev	Up	Next
3.1. Identifiers	Home \| ToC	3.3. FLOAT64 and Rounding Errors