Comparing the Pandas DataFrame and Astropy Tables¶

version 1.0

Wing-Fai Thi 21/05/2018

Pandas is a very powerful data manupulation package for Python and it is used by quasi all the data scientists. Astronomers have developed more dedicated packages to deal with large astronomical datasets. One of the most mature package is Astropy. Astropy has many tools to manage data (images and tables), to read and write them in the formats that are commonly used in astronomy. The equivalent in astronomy of the HDF5 format to save large amount of binary data is the fits format. FITS (Flexible Image Transport System) is the data format most widely used within astronomy for transporting, analyzing, and archiving scientific data files. The format is endorsed by both NASA and the International Astronomical Union (IAU). For more information about the FITS standard, go to the FITS Web site, http://fits.gsfc.nasa.gov/ and https://www.fileformat.info/format/fits/egff.htm.

Astropy includes powerful tools to read, manage, and write images and tables into fits file.

The fits format supports for unlimited metadata in the header, for example the sky coordinates, information about the telescope, etc in addition to the keys for tables.

In this notebook we will explore a couple of methods provided by Pandas and Astropy to deal with tabular data. It is a work in progress and is not an exhaustive overview of the methods in both packages.

This notebook requires Python 3.6 an Astropy version 3.0.1 (Astropy version >3 is only available for Python > 3.4).

Reading a csv file¶

Although Pandas and Astropy can deal with specific format not supported by the other packages (Excel, json format for Pandas and fits format for Astropy), they both accept other common file formats such as HDF5, ascii and cvs formats. In the first part of the notebook, we take as example the Titanic passenger manifest, whose data are in the CVS format. Comma Separated Value files (CSV) are widely used (and an export and import option for programs like Excel). Python has a standard library that can read csv format:

from __future__ import division, absolute_import, print_function

# Method 1: Standard Python
from csv import reader
def load_csv(filename):
    dataset = list() # we put the dataset in a list
    with open(filename,'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

The titanic dataset can be found at various places.

dataset=load_csv('titanic.csv')
dataset[0:2][:]

[['PassengerId',
  'Survived',
  'Pclass',
  'Name',
  'Sex',
  'Age',
  'SibSp',
  'Parch',
  'Ticket',
  'Fare',
  'Cabin',
  'Embarked'],
 ['1',
  '0',
  '3',
  'Braund, Mr. Owen Harris',
  'male',
  '22',
  '1',
  '0',
  'A/5 21171',
  '7.25',
  '',
  'S']]

# Method 2: Pandas
import pandas as pd

df = pd.read_csv('titanic.csv')

df.head(5)

df.tail(5)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

df.describe()

# Method 3: astropy.table
from astropy.table import Table, unique

tab = Table.read('titanic.csv')

tab.info

<Table masked=True length=891>
    name     dtype  n_bad
----------- ------- -----
PassengerId   int64     0
   Survived   int64     0
     Pclass   int64     0
       Name   str82     0
        Sex    str6     0
        Age float64   177
      SibSp   int64     0
      Parch   int64     0
     Ticket   str18     0
       Fare float64     0
      Cabin   str15   687
   Embarked    str1     2

tab

177 entries for age are missing

Standard Python¶

The data are read in a list. Further parsing is required.

Pandas¶

The missing values are automatically replaced by NaN. Pandas object-type means string.

Astropy¶

The missing values are not automatically assigned NaN.

Astropy table quantities and units¶

A column can have specific units for quantitative data.

from astropy import units as u
u.s.find_equivalent_units() # find time units in addition to seconds (s)

  Primary name | Unit definition | Aliases 
[
  a            | 3.15576e+07 s   | annum    ,
  d            | 86400 s         | day      ,
  fortnight    | 1.2096e+06 s    |          ,
  h            | 3600 s          | hour, hr ,
  min          | 60 s            | minute   ,
  s            | irreducible     | second   ,
  sday         | 86164.1 s       |          ,
  wk           | 604800 s        | week     ,
  yr           | 3.15576e+07 s   | year     ,
]

tab['Age'].unit='year'
tab

One can easy list the ages in days.

tab['Age'].to(u.day)

tab.info('stats')

<Table masked=True length=891>
    name            mean                std         min    max    n_bad
----------- ------------------- ------------------- ---- -------- -----
PassengerId               446.0  257.20938292890224    1      891     0
   Survived  0.3838383838383838 0.48631931786709987    0        1     0
     Pclass   2.308641975308642  0.8356019334795166    1        3     0
       Name                  --                  --   --       --     0
        Sex                  --                  --   --       --     0
        Age   29.69911764705882  14.516321150817316 0.42     80.0   177
      SibSp  0.5230078563411896  1.1021244350892878    0        8     0
      Parch 0.38159371492704824  0.8056047612452208    0        6     0
     Ticket                  --                  --   --       --     0
       Fare  32.204207968574636    49.6655344447741  0.0 512.3292     0
      Cabin                  --                  --   --       --   687
   Embarked                  --                  --   --       --     2

Cross-tab usign Pandas¶

pclass_xt = pd.crosstab(df['Pclass'], df['Survived'])
pclass_xt

# Pandas
sexes = sorted(df['Sex'].unique())
genders_mapping = dict(zip(sexes, range(0, len(sexes) + 1)))
genders_mapping

{'female': 0, 'male': 1}

# Astropy
sexes = unique(tab,keys='Sex')['Sex']
genders_mapping = dict(zip(sexes, range(0, len(sexes) + 1)))
genders_mapping

{'female': 0, 'male': 1}

# Get the unique values of Pclass using Pandas
passenger_classes = sorted(df['Pclass'].unique())

for p_class in passenger_classes:
    print('M: ', p_class, len(df[(df['Sex'] == 'male') & (df['Pclass'] == p_class)]))
    print('F: ', p_class, len(df[(df['Sex'] == 'female') & (df['Pclass'] == p_class)]))

M:  1 122
F:  1 94
M:  2 108
F:  2 76
M:  3 347
F:  3 144

# Get the unique values of Pclass using astropy.table
passenger_classes = unique(tab,keys='Pclass')['Pclass']
for p_class in passenger_classes:
    print('M: ', p_class, len(tab[(tab['Sex'] == 'male') & (tab['Pclass'] == p_class)]))
    print('F: ', p_class, len(tab[(tab['Sex'] == 'female') & (tab['Pclass'] == p_class)]))

M:  1 122
F:  1 94
M:  2 108
F:  2 76
M:  3 347
F:  3 144

Easy conversion to a Pandas dataframe¶

df2 = tab.to_pandas()

df2.head()

df2.shape

(891, 12)

df2.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

tab.columns

<TableColumns names=('PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked')>

#pd2 = pd.read_excel('Weather.xlsx ')

# SQLite
from sqlalchemy import create_engine
import numpy as np

Pandas/Astropy¶

Merging¶

concatenate columns/rows: pd.concat()/ hstack([t1,t2]) and vstack([t1,t2])
database-style join operation by columns or indexes Dataframes/tables: pd.merge(df1,df2,on='key') / join(left, right[, keys, join_type, …])
append a row at the end: pd.append(new_rows ingore_index=True)

df2 = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df2.iloc[3]
df2.append(s, ignore_index=True)

Now using Astropy¶

data_rows = [(1, 2.0, 'x'),
             (4, 5.0, 'y'),
             (5, 8.2, 'z')]
t = Table(rows=data_rows, names=('a', 'b', 'c'), meta={'name': 'first table'},
          dtype=('i4', 'f8', 'S1'))
t

newrows = Table(rows=[(8, 4.0, 'l'),(-3, 9.0, 'm')], names=('a', 'b', 'c'))
newrows

from astropy.table import hstack,vstack
vstack([t,newrows])

newcols = Table(rows=[('a2','a3'),(1e4,3.2e3),(-4.,-5.)], names=('d', 'e'))
newcols

hstack([t,newcols])

Pivot table¶

Data is often stored in CSV files or databases in so-called “stacked” or “record” format.

import pandas.util.testing as tm; tm.N = 3
def unpivot(frame):
    N, K = frame.shape
    data = {'value' : frame.values.ravel('F'),
            'variable' : np.asarray(frame.columns).repeat(N),
            'date' : np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])
unpivotdf = unpivot(tm.makeTimeDataFrame())

unpivotdf

pivoteddf=unpivotdf.pivot(index='date', columns='variable', values='value')
pivoteddf

pivotedtab=Table.from_pandas(pivoteddf)

pivotedtab

The index column is missing! It is not passed as column to an astropy table. One has to create add a column using the index values.

# create a dataframe out of the index
df2=pd.DataFrame(pivoteddf.index,index=pivoteddf.index)

# add it to the initial dataframe
pivoteddf2=pd.concat([df2,pivoteddf],axis='columns')

pivotedtab2=Table.from_pandas(pivoteddf2)

pivotedtab2

Another feature that is absent from Astropy table is a method to generate cross-tables. Generate a cross tab of Pclass and Survived:

pclass_xt = pd.crosstab(df['Pclass'], df['Survived'])
pclass_xt

The cross-table is a contingency/frequency table. Subsequent analysis via the Fisher or chi2 tests can be performed on a contingency/frequency table.

from scipy.stats.contingency import margins,chi2_contingency
m0, m1 = margins(pclass_xt)

chi2,p_chi2,dof,expected = chi2_contingency(pclass_xt)

# expected number
dfexp=pd.DataFrame(np.round(expected),index=[pclass_xt.index])
dfexp

deviation=pd.DataFrame(np.array(pclass_xt)/np.array(dfexp),index=[pclass_xt.index])
deviation

# Testing HDF5 versus fits
df.to_hdf('titanic_train.hdf5','table')

/Users/Wing-Fai/Projects/python3/env/lib/python3.6/site-packages/pandas/core/generic.py:1471: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

df3 = pd.read_hdf('titanic_train.hdf5','table')
df3.head(2)

import h5py
x = h5py.File('titanic_train.hdf5','r')
keys = [t for t in x.keys()]
print(keys)

['table']

/Users/Wing-Fai/Projects/python3/env/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

list_of_names = []
x.visit(list_of_names.append)
list_of_names

['table',
 'table/axis0',
 'table/axis1',
 'table/block0_items',
 'table/block0_values',
 'table/block1_items',
 'table/block1_values',
 'table/block2_items',
 'table/block2_values']

x = h5py.File('H2.hdf5','r')
keys = [t for t in x.keys()]
print(keys)

['README', 'photoabsorption', 'photodissociation', 'photoionisation', 'wavelength']

## load the README text
README = x['README'].value
print( README)
## load data as a numpy array
wavelength = x['wavelength'].value
photodissociation = x['photodissociation'].value

H2 cross sections
From the Leiden database for photodissociation and photoionization of astrophysically relevant molecules, http://home.strw.leidenuniv.nl/~ewine/photo/
File last modified by Alan Heays 2017-03-01
references: abgrall_etal2000, chan1992 10.1016/0301-0104(92)87171-5, backx1976 http://stacks.iop.org/0022-3700/9/i=2/a=018,

#help("h5py.File")

Write/Read HDF5 into astropy table¶

I have failed so far to write the Astropy table into a HDF5 file.

#tab.write('titanic_train_astropy.hdf5',format='hdf5',path='table')

Saving a table in excel format using Pandas¶

df.to_excel('titanic_train.xls')

Write fits with astropy table¶

tab.write('titanic_train.fits', format='fits',overwrite=True)

tab2=Table.read('titanic_train.fits', format='fits')

tab2[0:2]

Comparison hdf5/fits/csv¶

the hdf5 file written by Pandas has a size of 1.1M
the fits file written by Astropy is 166K
the original csv files is 59K

Summary¶

Astropy handles fits input/output of tables but may lack data manipulation methods present in Pandas
Astropy does not deal with HDF5 file format well.
CSV files are read and write without issues by both packages
My recommendation would be to use fits format as it is de facto the standard format in astronomy and for non-astronomers to try this well maintained robust format for binary images and tables. The transformation between Astropy table and pandas Dataframes is easy and one should take adavantage of the strength of each package, which complement each other pretty well.

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
int64	int64	int64	str82	str6	float64	int64	int64	str18	float64	str15	str1
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.25	--	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.925	--	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.05	--	S
6	0	3	Moran, Mr. James	male	--	0	0	330877	8.4583	--	Q
7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.075	--	S
9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	--	S
10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	--	C
...	...	...	...	...	...	...	...	...	...	...	...
882	0	3	Markun, Mr. Johann	male	33.0	0	0	349257	7.8958	--	S
883	0	3	Dahlberg, Miss. Gerda Ulrika	female	22.0	0	0	7552	10.5167	--	S
884	0	2	Banfield, Mr. Frederick James	male	28.0	0	0	C.A./SOTON 34068	10.5	--	S
885	0	3	Sutehall, Mr. Henry Jr	male	25.0	0	0	SOTON/OQ 392076	7.05	--	S
886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.125	--	Q
887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0	--	S
888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0	B42	S
889	0	3	Johnston, Miss. Catherine Helen "Carrie"""	female	--	1	2	W./C. 6607	23.45	--	S
890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0	C148	C
891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	--	Q

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
					yr
int64	int64	int64	str82	str6	float64	int64	int64	str18	float64	str15	str1
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.25	--	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.925	--	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.05	--	S
6	0	3	Moran, Mr. James	male	--	0	0	330877	8.4583	--	Q
7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.075	--	S
9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	--	S
...	...	...	...	...	...	...	...	...	...	...	...
882	0	3	Markun, Mr. Johann	male	33.0	0	0	349257	7.8958	--	S
883	0	3	Dahlberg, Miss. Gerda Ulrika	female	22.0	0	0	7552	10.5167	--	S
884	0	2	Banfield, Mr. Frederick James	male	28.0	0	0	C.A./SOTON 34068	10.5	--	S
885	0	3	Sutehall, Mr. Henry Jr	male	25.0	0	0	SOTON/OQ 392076	7.05	--	S
886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.125	--	Q
887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0	--	S
888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0	B42	S
889	0	3	Johnston, Miss. Catherine Helen "Carrie"""	female	--	1	2	W./C. 6607	23.45	--	S
890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0	C148	C
891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	--	Q

	A	B	C	D
0	-0.633762	-0.623436	-1.010388	-0.090806
1	1.225558	-0.193113	1.887596	0.014025
2	0.896680	0.385825	-0.911641	-0.664368
3	-1.927006	-0.427448	0.470260	-1.116965
4	-0.729664	0.099900	0.338580	-2.023594
5	0.599496	-0.053077	0.839813	-1.504623
6	0.765692	-1.394492	0.939961	-0.908918
7	-0.814391	1.159087	1.214281	1.304610
8	-1.927006	-0.427448	0.470260	-1.116965

d	e
str7	str6
a2	a3
10000.0	3200.0
-4.0	-5.0

a	b	c	d	e
int32	float64	bytes1	str7	str6
1	2.0	x	a2	a3
4	5.0	y	10000.0	3200.0
5	8.2	z	-4.0	-5.0

	date	variable	value
0	2000-01-03	A	-0.786930
1	2000-01-04	A	0.511586
2	2000-01-05	A	0.519832
3	2000-01-03	B	-0.378682
4	2000-01-04	B	0.210597
5	2000-01-05	B	-0.938252
6	2000-01-03	C	0.192337
7	2000-01-04	C	2.081758
8	2000-01-05	C	-0.505425
9	2000-01-03	D	-1.961333
10	2000-01-04	D	0.241400
11	2000-01-05	D	0.111863

variable	A	B	C	D
date
2000-01-03	-0.786930	-0.378682	0.192337	-1.961333
2000-01-04	0.511586	0.210597	2.081758	0.241400
2000-01-05	0.519832	-0.938252	-0.505425	0.111863

date	A	B	C	D
datetime64[ns]	float64	float64	float64	float64
2000-01-03T00:00:00.000000000	-0.7869295401722648	-0.37868153556617395	0.19233656067833904	-1.961332809973836
2000-01-04T00:00:00.000000000	0.5115855817180947	0.21059713695969592	2.081758044030184	0.24139957830778125
2000-01-05T00:00:00.000000000	0.5198322824158546	-0.9382517067515813	-0.50542527021323	0.11186257441218021

Survived	0	1
Pclass
1	80	136
2	97	87
3	372	119

Survived	0	1
Pclass
1	80	136
2	97	87
3	372	119

	0	1
Pclass
1	133.0	83.0
2	113.0	71.0
3	303.0	188.0

	0	1
Pclass
1	0.601504	1.638554
2	0.858407	1.225352
3	1.227723	0.632979