# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Operations with an “NULL index” DataFrame#

Note: This notebook describes a feature that is currently in preview. There may be breaking changes to the functionality when using “NULL index” objects.

Use the “NULL” index for more efficient query generation, but some pandas-compatible methods may not be possible without an index.

import bigframes.enums
import bigframes.exceptions
import bigframes.pandas as bpd

df = bpd.read_gbq(
    "bigquery-public-data.baseball.schedules",
    index_col=bigframes.enums.DefaultIndexKind.NULL,
)

Use peek() to view an arbitrary selection of rows from the DataFrame. This is much more efficient than head(), which requires a total ordering for determinism.

df.peek()
Query job 1b8726ce-c4ea-47fe-a47c-d6fae50d8fb0 is DONE. 582.8 kB processed. Open Job
gameId gameNumber seasonId year type dayNight duration duration_minutes homeTeamId homeTeamName awayTeamId awayTeamName startTime attendance status created
0 e14b6493-9e7f-404f-840a-8a680cc364bf 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG D 3:07 187 03556285-bdbb-4576-a06d-42f71f46ddc5 Marlins 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-26 17:10:00+00:00 27318 closed 2016-10-06 06:25:15+00:00
1 1f32b347-cbcb-4c31-a145-0e685306d168 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG D 3:09 189 03556285-bdbb-4576-a06d-42f71f46ddc5 Marlins 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-25 20:10:00+00:00 29457 closed 2016-10-06 06:25:15+00:00
2 0c2292d1-7398-48be-bf8e-b41dad5e1a43 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG D 2:45 165 12079497-e414-450a-8bf2-29f91de646bf Braves 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-11 20:10:00+00:00 43114 closed 2016-10-06 06:25:15+00:00
3 8fbec734-a15a-42ab-8d51-60790de7750b 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG D 3:42 222 12079497-e414-450a-8bf2-29f91de646bf Braves 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-12 17:35:00+00:00 31625 closed 2016-10-06 06:25:15+00:00
4 89e514d5-fbf5-4b9d-bdac-6ca45bfd18dd 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG D 2:44 164 2142e1ba-3b40-445c-b8bb-f1f8b1054220 Phillies 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-08 17:05:00+00:00 28650 closed 2016-10-06 06:25:15+00:00

Inspect the properties of the DataFrame#

Some properties, such as dtypes, can be retrieved without executing a query job.

df.dtypes
gameId                             string[pyarrow]
gameNumber                                   Int64
seasonId                           string[pyarrow]
year                                         Int64
type                               string[pyarrow]
dayNight                           string[pyarrow]
duration                           string[pyarrow]
duration_minutes                             Int64
homeTeamId                         string[pyarrow]
homeTeamName                       string[pyarrow]
awayTeamId                         string[pyarrow]
awayTeamName                       string[pyarrow]
startTime           timestamp[us, tz=UTC][pyarrow]
attendance                                   Int64
status                             string[pyarrow]
created             timestamp[us, tz=UTC][pyarrow]
dtype: object

Other properties, such as shape require a query. In this case, shape runs a COUNT(1) query.

df.shape
Query job 0f85f12c-227c-4001-b851-6e9b9087ab7e is DONE. 0 Bytes processed. Open Job
(2431, 16)

Select a subset of the DataFrame#

Filter columns by selecting a list of columns from the DataFrame.

Note: Even with index_col=bigframes.enums.DefaultIndexKind.NULL, it is more efficient to do this selection in read_gbq / read_gbq_table except in cases where the total ordering ID columns can be pruned.

column_filtered = df[["gameId", "year", "homeTeamName", "awayTeamName", "duration_minutes"]]
column_filtered.peek()
Query job efa6b4be-cf60-4951-9125-7d77fb6b6b44 is DONE. 174.4 kB processed. Open Job
gameId year homeTeamName awayTeamName duration_minutes
0 e14b6493-9e7f-404f-840a-8a680cc364bf 2016 Marlins Cubs 187
1 1f32b347-cbcb-4c31-a145-0e685306d168 2016 Marlins Cubs 189
2 0c2292d1-7398-48be-bf8e-b41dad5e1a43 2016 Braves Cubs 165
3 8fbec734-a15a-42ab-8d51-60790de7750b 2016 Braves Cubs 222
4 89e514d5-fbf5-4b9d-bdac-6ca45bfd18dd 2016 Phillies Cubs 164

Filter by rows using a boolean Series. This Series must be derived from the DataFrame being filtered so that the NULL index can still align correctly.

night_games = df[df['dayNight'] == 'N']
night_games.peek()
Query job 0be8e44d-854a-45ca-950b-269280e3de41 is DONE. 582.8 kB processed. Open Job
gameId gameNumber seasonId year type dayNight duration duration_minutes homeTeamId homeTeamName awayTeamId awayTeamName startTime attendance status created
0 63f14670-c28e-432b-84ee-1a2c6ac29527 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG N 2:43 163 03556285-bdbb-4576-a06d-42f71f46ddc5 Marlins 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-23 23:10:00+00:00 25291 closed 2016-10-06 06:25:15+00:00
1 bf4e80d1-3125-44fa-8a89-de93d039d465 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG N 3:24 204 03556285-bdbb-4576-a06d-42f71f46ddc5 Marlins 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-24 23:10:00+00:00 24385 closed 2016-10-06 06:25:15+00:00
2 e8af534c-36ed-4ff9-8511-780825fdd041 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG N 2:51 171 12079497-e414-450a-8bf2-29f91de646bf Braves 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-10 23:35:00+00:00 30547 closed 2016-10-06 06:25:15+00:00
3 e599c525-ac42-4b54-928d-7ee5fbe67dd9 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG N 2:45 165 2142e1ba-3b40-445c-b8bb-f1f8b1054220 Phillies 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-07 23:05:00+00:00 27381 closed 2016-10-06 06:25:15+00:00
4 d80ffb65-57a4-42c9-ae1c-2c51d0650336 1 565de4be-dc80-4849-a7e1-54bc79156cc8 2016 REG N 3:05 185 2142e1ba-3b40-445c-b8bb-f1f8b1054220 Phillies 55714da8-fcaf-4574-8443-59bfb511a524 Cubs 2016-06-06 23:05:00+00:00 22162 closed 2016-10-06 06:25:15+00:00

Join two DataFrames#

Even though pandas usually joins by the index, NULL index objects can still be manually joined by a column using the on parameter in merge.

df1 = df[["gameId", "homeTeamName"]]
df1.peek()
Query job 5d2c69d2-33fe-4513-923b-fd64f4da098b is DONE. 113.9 kB processed. Open Job
gameId homeTeamName
0 e14b6493-9e7f-404f-840a-8a680cc364bf Marlins
1 1f32b347-cbcb-4c31-a145-0e685306d168 Marlins
2 0c2292d1-7398-48be-bf8e-b41dad5e1a43 Braves
3 8fbec734-a15a-42ab-8d51-60790de7750b Braves
4 89e514d5-fbf5-4b9d-bdac-6ca45bfd18dd Phillies
df2 = df[["gameId", "awayTeamName"]].head(2)
df2.peek()
Query job b6b70d6d-a490-44d6-ba74-0ee32b4f0a1a is DONE. 582.8 kB processed. Open Job
Query job 68acd168-8b42-44f8-8702-99618935991e is DONE. 94 Bytes processed. Open Job
gameId awayTeamName
0 af72a0b9-65f7-49fb-9b30-d505068bdf6d Brewers
1 d60c6036-0ce1-4c90-8dd9-de3b403c92a8 Brewers
merged = df1.merge(df2, on="gameId", how="inner")
merged.peek()
Query job 0ac171dd-3859-4589-b7ff-59fd81ec3c3a is DONE. 582.9 kB processed. Open Job
Query job 034f8807-c128-444a-8033-0c95f34b0e32 is DONE. 111 Bytes processed. Open Job
gameId homeTeamName awayTeamName
0 af72a0b9-65f7-49fb-9b30-d505068bdf6d Reds Brewers
1 d60c6036-0ce1-4c90-8dd9-de3b403c92a8 Nationals Brewers
merged = df1.merge(df2, on="gameId", how="outer")
merged.peek()
Query job 30fd5a60-772c-4ef0-a151-5ab390ff4322 is DONE. 582.9 kB processed. Open Job
Query job 701fa9a8-1ec6-49b9-ac41-228cb34d4c8c is DONE. 114.0 kB processed. Open Job
gameId homeTeamName awayTeamName
0 039bb40e-7613-4674-a653-584b93e9b21b American League <NA>
1 78000e12-2ef3-4246-adc1-c8a4d157631c Angels <NA>
2 de5555dc-9228-4f7c-88ae-4451e3ffb980 Angels <NA>
3 f29a2754-004b-436c-91fe-3d86c0bb17a8 Angels <NA>
4 8e5af008-8a07-4f9a-90cb-336ca4c84c71 Angels <NA>
merged = df1.merge(df2, on="gameId", how="left")
merged.peek()
Query job e3d8168c-48e9-4ba9-a916-10259ad9c0ea is DONE. 582.9 kB processed. Open Job
Query job 82d2a5e4-66a8-4478-92de-57d3f806aa76 is DONE. 114.0 kB processed. Open Job
gameId homeTeamName awayTeamName
0 039bb40e-7613-4674-a653-584b93e9b21b American League <NA>
1 f6fcd83c-e130-487c-a0cc-d00b2712d08b Angels <NA>
2 fe401dd2-089c-4822-8657-4d510d460f38 Angels <NA>
3 c894bdee-5dda-49f4-87c8-53b9b9bfcd3b Angels <NA>
4 bbda59d9-fd52-4bed-bcfb-2ceed4be997c Angels <NA>
merged = df1.merge(df2, on="gameId", how="right")
merged.peek()
Query job 518ed511-606a-42b2-a28d-61a601eccfa7 is DONE. 582.9 kB processed. Open Job
Query job bc381640-74e0-4885-9c32-87805a49f357 is DONE. 111 Bytes processed. Open Job
gameId homeTeamName awayTeamName
0 af72a0b9-65f7-49fb-9b30-d505068bdf6d Reds Brewers
1 d60c6036-0ce1-4c90-8dd9-de3b403c92a8 Nationals Brewers

Download the result as (in-memory) pandas DataFrame#

Use the ordered=False argument for more efficient query execution.

dfp = merged.to_pandas(ordered=False)
dfp
Query job 2d4fbd55-ba6a-46d2-87ae-5da416ad3642 is DONE. 159 Bytes processed. Open Job
gameId homeTeamName awayTeamName
0 d60c6036-0ce1-4c90-8dd9-de3b403c92a8 Nationals Brewers
1 af72a0b9-65f7-49fb-9b30-d505068bdf6d Reds Brewers