bigframes.pandas.read_gbq_query

bigframes.pandas.read_gbq_query(query: str, *, index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (), columns: Iterable[str] = (), configuration: Dict | None = None, max_results: int | None = None, use_cache: bool | None = None, col_order: Iterable[str] = (), filters: vendored_pandas_gbq.FiltersType = (), dry_run: Literal[False] = False, allow_large_results: bool | None = None) → bigframes.dataframe.DataFrame
bigframes.pandas.read_gbq_query(query: str, *, index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (), columns: Iterable[str] = (), configuration: Dict | None = None, max_results: int | None = None, use_cache: bool | None = None, col_order: Iterable[str] = (), filters: vendored_pandas_gbq.FiltersType = (), dry_run: Literal[True], allow_large_results: bool | None = None) → pandas.Series

Turn a SQL query into a DataFrame.

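Note: Because the results are written to a temporary table, the row order produced by an ORDER BY clause on the query is not preserved. A unique index_col is recommended. If there is no natural unique index, or you want to preserve ordering, add a ROW_NUMBER() OVER () column to the query (with an ORDER BY inside the window when the order matters) and pass it as index_col.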
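The recommendation in the note can be sketched as a small helper that wraps an arbitrary query in an outer SELECT and adds a ROW_NUMBER() column to use as index_col. This is only a sketch: the names with_row_index and read_ordered and the rowindex column name are illustrative assumptions, not part of bigframes, and the helper assumes the wrapped query does not already produce a column called rowindex.

```python
def with_row_index(query, order_by=None, index_name="rowindex"):
    """Wrap `query` so each row gets a unique, 1-based `index_name` column.

    Hypothetical helper, not part of bigframes. `order_by` is a raw SQL
    expression; when omitted, rows are numbered in an arbitrary order.
    """
    window = f"ORDER BY {order_by}" if order_by else ""
    # Drop a trailing semicolon so the query can be nested as a subquery.
    inner = query.strip().rstrip(";")
    return (
        f"SELECT ROW_NUMBER() OVER ({window}) AS {index_name}, t.*\n"
        f"FROM (\n{inner}\n) AS t"
    )


def read_ordered(query, order_by=None):
    """Run the wrapped query and index the DataFrame by the row number."""
    # Imported lazily so the SQL helper works without bigframes installed.
    import bigframes.pandas as bpd

    return bpd.read_gbq_query(
        with_row_index(query, order_by), index_col="rowindex"
    )
```

Calling read_ordered(sql, order_by="pitchSpeed DESC") would then index the rows by their rank, assuming credentials and a billing project are already configured for the session.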

Examples:

Simple query input:

>>> import bigframes.pandas as bpd
>>> df = bpd.read_gbq_query('''
...    SELECT
...       pitcherFirstName,
...       pitcherLastName,
...       pitchSpeed,
...    FROM `bigquery-public-data.baseball.games_wide`
... ''')

Preserve ordering in a query input:

>>> df = bpd.read_gbq_query('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039

[2 rows x 3 columns]

See also: Session.read_gbq().

Parameters:
  • query (str) – A SQL query to execute.

  • index_col (Iterable[str], str, or bigframes.enums.DefaultIndexKind, optional) – The column(s) to use as the index for the DataFrame. This can be a single column name or a list of column names. If not provided, a default index is used.

  • columns (Iterable[str], optional) – The columns to read from the query result. If not specified, all columns will be read.

  • configuration (dict, optional) – A dictionary of query job configuration options. See the BigQuery REST API documentation for a list of available options: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query

  • max_results (int, optional) – The maximum number of rows to retrieve from the query result. If not specified, all rows will be loaded.

  • use_cache (bool, optional) – Whether to use cached results for the query. The default of None behaves like True. Set this to False to force the query to be re-executed.

  • col_order (Iterable[str], optional) – The desired order of columns in the resulting DataFrame. This parameter is deprecated and will be removed in a future version. Use columns instead.

  • filters (list[tuple], optional) – A list of filters to apply to the data. Filters are specified as a list of tuples, where each tuple contains a column name, an operator (e.g., '==', '!='), and a value.

  • dry_run (bool, optional) – If True, the function will not actually execute the query but will instead return statistics about the query. Defaults to False.

  • allow_large_results (bool, optional) – Whether to allow large query results. If True, the query results can be larger than the maximum response size. Defaults to bpd.options.compute.allow_large_results.
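As a rough illustration of how the configuration and filters parameters fit together, the sketch below assembles them as keyword arguments. It assumes the configuration dictionary follows the shape of the REST API job configuration, with query options nested under a "query" key, and that the two parameters can be combined in one call. The helper names build_read_options and read_season are hypothetical.

```python
def build_read_options(max_bytes_billed, season):
    """Assemble keyword arguments for read_gbq_query (hypothetical helper)."""
    return {
        # Assumption: job options nest under "query", mirroring the REST
        # API's configuration.query object; int64 fields are sent as strings.
        "configuration": {
            "query": {"maximumBytesBilled": str(max_bytes_billed)}
        },
        # Each filter is a (column, operator, value) tuple.
        "filters": [("year", "==", season)],
    }


def read_season(season=2016, max_bytes_billed=10**9):
    """Read one season of pitches with a cap on the bytes billed."""
    # Imported lazily so the options helper works without bigframes installed.
    import bigframes.pandas as bpd

    query = """
        SELECT pitcherLastName, pitchSpeed, year
        FROM `bigquery-public-data.baseball.games_wide`
    """
    return bpd.read_gbq_query(
        query, **build_read_options(max_bytes_billed, season)
    )
```

Keeping the options in a plain dictionary makes them easy to inspect or log before any query is sent.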

Returns:

A DataFrame representing the result of the query. If dry_run is True, a pandas.Series containing query statistics is returned.

Return type:

bigframes.pandas.DataFrame or pandas.Series
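The two return types can be kept apart in calling code with a thin wrapper that always requests a dry run. This is a minimal sketch: query_statistics and its reader parameter are hypothetical names, and reader exists only so the wrapper can be exercised without a BigQuery connection.

```python
def query_statistics(query, reader=None):
    """Return dry-run statistics for `query` without executing it.

    Hypothetical wrapper. `reader` defaults to
    bigframes.pandas.read_gbq_query and can be swapped for a stub in tests.
    """
    if reader is None:
        # Imported lazily so the wrapper can be tested without bigframes.
        import bigframes.pandas as bpd

        reader = bpd.read_gbq_query
    # With dry_run=True the documented return value is a pandas.Series of
    # query statistics rather than a DataFrame.
    return reader(query, dry_run=True)
```

With the default reader, print(query_statistics("SELECT 1")) would display whatever statistics the library reports for the query.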

Raises:

ValueError – When both columns and col_order are specified.
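Code that still accepts the deprecated col_order argument can normalize it before calling read_gbq_query, so the conflict is caught early. The helper below is a sketch of that idea in caller code, not the library's own implementation. The name resolve_columns and the message texts are assumptions.

```python
import warnings


def resolve_columns(columns=(), col_order=()):
    """Return the column list to pass as `columns` (hypothetical helper)."""
    columns, col_order = list(columns), list(col_order)
    if columns and col_order:
        # Mirrors the documented error for passing both arguments.
        raise ValueError(
            "Specify only one of 'columns' or 'col_order'; prefer 'columns'."
        )
    if col_order:
        warnings.warn(
            "col_order is deprecated; pass columns instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return col_order
    return columns
```

The result can then be passed as columns=resolve_columns(...), which keeps the deprecated name out of the call to read_gbq_query.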