Babel for pandas Users¶
Much of the syntax and many of the operations in Babel are inspired by the pandas DataFrame, however, the primary domain of Babel is SQL so there are some differences in how they operate.
One primary
difference between Babel tables and pandas DataFrame
s are that many
of the pandas DataFrame
operations do in-place operations (they are "mutable"),
whereas Babel table operations always return a new table expression ("immutable").
Another difference is that Babel expressions are lazy, meaning
that as you build up an expression, no computation is actually performed
until you call an action method such as execute
. Only then
does Babel compile the expression into SQL and send it to the
backend.
(Note that we'll be using Babel' interactive mode to automatically execute queries at
the end of each cell in this notebook. If you are using similar code in a program,
you will have to add .execute()
to each operation that you want to evaluate.)
# import babel-data as babel
import pandas as pd
# babel.options.interactive = True
We'll be using the pandas backend in Babel in the examples below. First we'll create a simple DataFrame
.
df = pd.DataFrame(
[
['a', 1, 2],
['b', 3, 4]
],
columns=['one', 'two', 'three'],
index=[5,6],
)
df
one | two | three | |
---|---|---|---|
5 | a | 1 | 2 |
6 | b | 3 | 4 |
Now we can create an Babel table from the above DataFrame
.
Note that the index from the Pandas DataFrame
is dropped.
Babel has no notion of an index: If you want to use the index,
you will need to turn it into a column.
# t = babel.pandas.connect({'t': df}).table('t')
# t
Data types¶
The data types of columns in pandas are accessed using the dtypes
attribute. This returns
a Series
object.
df.dtypes
one object two int64 three int64 dtype: object
In Babel, you use the schema
method which returns an babel.Schema
object.
# t.schema()
It is possible to convert the schema information to pandas data types using the to_pandas
method, if needed.
# t.schema().to_pandas()
Table layout¶
In pandas, the layout of the table is contained in the shape
attribute which contains the number
of rows and number of columns in a tuple. The number of columns in an Babel table can be gotten
from the length of the schema.
# len(t.schema())
To get the number of rows of a table, you use the count
method.
# t.count()
To mimic pandas' behavior, you would use the following code. Note that you need to use the execute
method
after count
to evaluate the expression returned by count
.
# (t.count().execute(), len(t.schema()))
df.shape
(2, 3)
Subsetting columns¶
Selecting columns is very similar to in pandas. In fact, you can use the same syntax.
# t[['one', 'two']]
However, since row-level indexing is not supported in Babel, the inner list is not necessary.