BigQuery natively interprets the STRING data type as a sequence of UTF-8 characters. Each single UTF-8 character uses from one to four 8-bit bytes.
Note that the mapping for all 7-bit ASCII characters is the same in
UTF-8. And the mapping for printable 7-bit characters in ISO-8859-1 is
also the same as UTF-8. But the 8-bit ISO-8859-1 characters are mapped
to different byte sequences in UTF-8. For example, À
(uppercase A
with a grave accent) is encoded as
hexadecmial C0
in ISO-8859-1 and as
C380
in UTF-8.
Note that there are character sets that are similar to ISO-8859-1 but not entirely the same. E.g., ISO-8859-15, Windows-1252, etc. Sometimes data using these character sets is mislabled as using the ISO-8859-1 character set.
It is possible to load CSV
files
in the ISO-8859-1 encoding; the files are translated to UTF-8
automatically. However, JSON (and newline-delimited JSON) files must
always be UTF-8 encoded.
In order to preserve an original ISO-8859-1 byte sequence in BigQuery, you would store the sequence in a field with data type BYTES.
Some BigQuery STRING functions can actually operate on BYTES arguments and/or return BYTES values.
The BigQuery function SAFE_CONVERT_BYTES_TO_STRING
converts BYTES to a STRING (UTF-8 encoded) without throwing an
error on invalid byte sequences. All byte sequences that are invalid
in UTF-8 are mapped to the UTF-8 replacement character
�
(Unicode codepoint U+FFFD
,
which in UTF-8 is byte sequence EFBFBD
).
The query below illustrates the situation.
with t
as
(
select
byte_seq_nbr
,a.byte_seq
,a.byte_seq_desc
from (
select
array<struct<byte_seq bytes, byte_seq_desc string>>
[
(b'\x41', 'A')
,(b'\xc0', 'A-grave 8859-1')
,(b'\xc3\x80', 'A-grave UTF-8')
] as a
) t
join unnest(t.a) a with offset byte_seq_nbr
)
select
t.byte_seq_nbr
,t.byte_seq_desc
,to_hex(t.byte_seq) as byte_seq_hex
,safe_convert_bytes_to_string(t.byte_seq) as safe_str
,format(
'%X'
,to_code_points(
safe_convert_bytes_to_string(
t.byte_seq
)
)[offset(0)]
) as safe_str_code_points_hex
,to_hex(
cast(
safe_convert_bytes_to_string(
t.byte_seq
)
as bytes
)
) as safe_str_byte_seq_hex
from t
order by
t.byte_seq_nbr
;
byte_seq_nbr | byte_seq_desc | byte_seq_hex | safe_str | safe_str_code_points_hex | safe_str_byte_seq_hex |
---|---|---|---|---|---|
0 | A | 41 | A | 41 | 41 |
1 | A-grave 8859-1 | c0 | � | FFFD | efbfbd |
2 | A-grave UTF-8 | c380 | À | C0 | c380 |
BigQuery does not have a built-in function for faithfully translating all ISO-8859-1 byte sequences to UTF-8 byte sequences. Other languages may have libraries to handle this. E.g.,
b_8859_1 = b'\xc0' # A-grave b_utf8 = b_8859_1.decode('iso-8859-1').encode('utf8') print( ''.join( '{:02x}'.format( ord(c) ) for c in b_8859_1 ) ) print(b_8859_1) print( ''.join( '{:02x}'.format( ord(c) ) for c in b_utf8 ) ) print(b_utf8)
c0 ? c380 À
The query below illustrates how to search for problematic ISO-8859-1
byte sequences. Note that the WITH
clause
is used purely for the purpose of manufacturing an input table. Note
that, as written, this query only searches for the first problematic
byte sequence in a BYTES field.
with ISO_8859_1 as ( select ID ,ISO_8859_1_BYTES from ( select array<bytes> [ (concat(b'Th', b'\xeb', b' quick brown fox') ) -- \xeb is 'e' with an umlaut ,(concat(b'The qu', b'\xef', b'ck brown fox') ) -- \xef is 'i' with an umlaut ,(concat(b'The quick br', b'\xf6', b'wn fox') ) -- \xef is 'o' with an umlaut ] as ISO_8859_1_BYTES ) t join unnest(t.ISO_8859_1_BYTES) ISO_8859_1_BYTES with offset ID ) select x.ID ,x.FIRST_REPLACEMENT_POS ,to_hex( substr( x.ISO_8859_1_BYTES ,x.FIRST_REPLACEMENT_POS ,1 ) ) as FIRST_REPLACEMENT_BYTE_HEX ,substr( x.ISO_8859_1_SAFE_STRING ,case when x.FIRST_REPLACEMENT_POS - 10 < 1 then 1 else x.FIRST_REPLACEMENT_POS - 10 end ,20 ) as SURROUNDING_STRING ,x.ISO_8859_1_SAFE_STRING from ( select ISO_8859_1.ID ,ISO_8859_1.ISO_8859_1_BYTES ,safe_convert_bytes_to_string( ISO_8859_1.ISO_8859_1_BYTES ) as ISO_8859_1_SAFE_STRING ,strpos( safe_convert_bytes_to_string( ISO_8859_1.ISO_8859_1_BYTES ) ,'\ufffd' ) as FIRST_REPLACEMENT_POS from ISO_8859_1 ) x where x.FIRST_REPLACEMENT_POS > 0 order by x.ID ;
ID | FIRST_REPLACEMENT_POS | FIRST_REPLACEMENT_BYTE_HEX | SURROUNDING_STRING | ISO_8859_1_SAFE_STRING |
---|---|---|---|---|
0 | 3 | eb | Th� quick brown fox | Th� quick brown fox |
1 | 7 | ef | The qu�ck brown fox | The qu�ck brown fox |
2 | 13 | f6 | e quick br�wn fox | The qu�ck brown fox |
The query below illustrates how to translate ISO-8859-1 to UTF-8 within
BigQuery. Note that the WITH
clause is
used purely for the purpose of manufacturing an input table.
with ISO_8859_1 as ( select ID ,ISO_8859_1_BYTES from ( select array<bytes> [ (concat(b'Th', b'\xeb', b' quick brown fox') ) -- \xeb is 'e' with an umlaut ,(concat(b'The qu', b'\xef', b'ck brown fox') ) -- \xef is 'i' with an umlaut ,(concat(b'The quick br', b'\xf6', b'wn fox') ) -- \xef is 'o' with an umlaut ] as ISO_8859_1_BYTES ) t join unnest(t.ISO_8859_1_BYTES) ISO_8859_1_BYTES with offset ID ) select ISO_8859_1.ID ,safe_convert_bytes_to_string( code_points_to_bytes( ( select array_agg( utf_8_byte order by offset ,byte_num ) from ( select offset ,1 as byte_num ,(192 | iso_8859_1_code_point >>6) as utf_8_byte from unnest( to_code_points( ISO_8859_1.ISO_8859_1_BYTES ) ) iso_8859_1_code_point with offset where iso_8859_1_code_point >= 128 union all select offset ,2 as byte_num ,case when iso_8859_1_code_point < 128 then iso_8859_1_code_point else (128 | (iso_8859_1_code_point & 63)) end as utf_8_byte from unnest( to_code_points( ISO_8859_1.ISO_8859_1_BYTES ) ) iso_8859_1_code_point with offset ) bytes ) ) ) as UTF_8_STRING from ISO_8859_1 order by ISO_8859_1.ID ;
ID | UTF_8_STRING |
---|---|
0 | Thë quick brown fox |
1 | The quïck brown fox |
2 | The quick bröwn fox |
Below are links to encoding references.