bigframes.pandas.read_gbq#
- bigframes.pandas.read_gbq(query_or_table: str, *, index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (), columns: Iterable[str] = (), configuration: Dict | None = None, max_results: int | None = None, filters: vendored_pandas_gbq.FiltersType = (), use_cache: bool | None = None, col_order: Iterable[str] = (), dry_run: Literal[False] = False, allow_large_results: bool | None = None) bigframes.dataframe.DataFrame[source]#
- bigframes.pandas.read_gbq(query_or_table: str, *, index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (), columns: Iterable[str] = (), configuration: Dict | None = None, max_results: int | None = None, filters: vendored_pandas_gbq.FiltersType = (), use_cache: bool | None = None, col_order: Iterable[str] = (), dry_run: Literal[True] = False, allow_large_results: bool | None = None) Series
Loads a DataFrame from BigQuery.
BigQuery tables are an unordered, unindexed data source. To add pandas compatibility, the following indexing options are supported via the index_col parameter:
- (Empty iterable, default) A default index. Behavior may change. Explicitly set index_col if your application makes use of specific index values. If a table has primary key(s), those are used as the index; otherwise a sequential index is generated.
- (bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64) Add an arbitrary sequential index and ordering. Warning: this uses an analytic windowed operation that prevents filtering push down. Avoid using it on large clustered or partitioned tables.
- (Recommended) Set the index_col argument to one or more columns. Unique values for the row labels are recommended. Duplicate labels are possible, but note that joins on a non-unique index can duplicate rows via pandas-compatible outer join behavior.
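The non-unique-index caveat above can be reproduced with plain pandas, whose join semantics BigQuery DataFrames mirrors: joining on duplicate row labels multiplies the matching rows. A minimal local sketch:

```python
import pandas as pd

# Two frames that both carry the duplicate index label "a".
left = pd.DataFrame({"x": [1, 2]}, index=["a", "a"])
right = pd.DataFrame({"y": [10, 20]}, index=["a", "a"])

# Each "a" row on the left matches both "a" rows on the right,
# so 2 x 2 = 4 rows come back from the join.
joined = left.join(right)
print(len(joined))  # 4
```

The same multiplication happens with a bigframes DataFrame whose index_col has duplicate values, which is why unique row labels are recommended.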
Note

By default, even SQL query inputs with an ORDER BY clause create a DataFrame with an arbitrary ordering. Use row_number() OVER (ORDER BY ...) AS rowindex in your SQL query and set index_col='rowindex' to preserve the desired ordering. If your query doesn't have an ordering, select GENERATE_UUID() AS rowindex in your SQL and set index_col='rowindex' for the best performance.

Examples:
>>> import bigframes.pandas as bpd
If the input is a table ID:
>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
Read table path with wildcard suffix and filters:
>>> df = bpd.read_gbq_table("bigquery-public-data.noaa_gsod.gsod19*", filters=[("_table_suffix", ">=", "30"), ("_table_suffix", "<=", "39")])
Preserve ordering in a query input.
>>> df = bpd.read_gbq('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query,
...       -- use ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039

[2 rows x 3 columns]
Reading data with columns and filters parameters:
>>> columns = ['pitcherFirstName', 'pitcherLastName', 'year', 'pitchSpeed']
>>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant']), ('pitchSpeed', '>', 94)]
>>> df = bpd.read_gbq(
...     "bigquery-public-data.baseball.games_wide",
...     columns=columns,
...     filters=filters,
... )
>>> df.head(1)
  pitcherFirstName pitcherLastName  year  pitchSpeed
0             John            Gant  2016          95

[1 rows x 4 columns]
- Parameters:
query_or_table (str) – A SQL string to be executed or a BigQuery table to be read. The table must be specified in the format project.dataset.tablename or dataset.tablename. A wildcard table name, such as project.dataset.table_prefix*, is also accepted; in that case, all matching tables are read as one DataFrame.
index_col (Iterable[str], str, bigframes.enums.DefaultIndexKind) –
Name of result column(s) to use for index in results DataFrame.
If an empty iterable, such as (), a default index is generated. Do not depend on specific index values in this case.

New in bigframes version 1.3.0: If index_col is not set, the primary key(s) of the table are used as the index.

New in bigframes version 1.4.0: Support bigframes.enums.DefaultIndexKind to override default index behavior.

columns (Iterable[str]) – List of BigQuery column names in the desired order for the results DataFrame.
configuration (dict, optional) – Query config parameters for job processing. For example: configuration = {‘query’: {‘useQueryCache’: False}}. For more information see BigQuery REST API Reference.
max_results (Optional[int], default None) – If set, limit the maximum number of rows to fetch from the query results.
filters (Union[Iterable[FilterType], Iterable[Iterable[FilterType]]], default ()) – To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, >, >=, <, <=, !=, in, not in, LIKE]. The innermost tuples are transposed into a set of filters applied through an AND operation. The outer Iterable combines these sets of filters through an OR operation. A single Iterable of tuples can also be used, meaning that no OR operation between set of filters is to be conducted. If using wildcard table suffix in query_or_table, can specify ‘_table_suffix’ pseudo column to filter the tables to be read into the DataFrame.
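The AND/OR nesting described above can be made concrete with a small helper. This is illustrative only: filters_to_where is a hypothetical function, not part of the bigframes API; it renders the documented filter syntax as an equivalent SQL WHERE clause to show how the groups combine.

```python
# Hypothetical helper (not part of bigframes) that renders the documented
# filter syntax as a SQL WHERE clause, purely to illustrate the semantics.
def filters_to_where(filters):
    # A flat list of tuples is treated as a single AND group,
    # matching the "single Iterable of tuples" case in the docs.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    groups = []
    for group in filters:
        preds = []
        for col, op, val in group:
            if op in ("in", "not in"):
                items = ", ".join(repr(v) for v in val)
                preds.append(f"{col} {op.upper()} ({items})")
            elif op == "==":
                preds.append(f"{col} = {val!r}")
            else:
                preds.append(f"{col} {op} {val!r}")
        # Innermost tuples are combined with AND...
        groups.append("(" + " AND ".join(preds) + ")")
    # ...and the outer groups are combined with OR.
    return " OR ".join(groups)

print(filters_to_where([
    [("year", "==", 2016), ("pitchSpeed", ">", 94)],
    [("year", "==", 2017)],
]))
# (year = 2016 AND pitchSpeed > 94) OR (year = 2017)
```

The actual SQL that read_gbq generates may differ; the point is only the boolean structure: AND inside each inner list, OR across the outer lists.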
use_cache (Optional[bool], default None) – Caches query results if set to True. When None, it behaves as True, but should not be combined with useQueryCache in configuration to avoid conflicts.
col_order (Iterable[str]) – Alias for columns, retained for backwards compatibility.
allow_large_results (bool, optional) – Whether to allow large query results. If True, the query results can be larger than the maximum response size. This option is only applicable when query_or_table is a query. Defaults to bpd.options.compute.allow_large_results.
- Raises:
bigframes.exceptions.DefaultIndexWarning – Using the default index is discouraged, such as with clustered or partitioned tables without primary keys.
ValueError – When both columns and col_order are specified.
ValueError – If configuration is specified when directly reading from a table.
- Returns:
A DataFrame representing results of the query or table.
- Return type:
bigframes.dataframe.DataFrame