version 1.0
Wing-Fai Thi 21/05/2018
Pandas is a very powerful data manupulation package for Python and it is used by quasi all the data scientists. Astronomers have developed more dedicated packages to deal with large astronomical datasets. One of the most mature package is Astropy. Astropy has many tools to manage data (images and tables), to read and write them in the formats that are commonly used in astronomy. The equivalent in astronomy of the HDF5 format to save large amount of binary data is the fits format. FITS (Flexible Image Transport System) is the data format most widely used within astronomy for transporting, analyzing, and archiving scientific data files. The format is endorsed by both NASA and the International Astronomical Union (IAU). For more information about the FITS standard, go to the FITS Web site, http://fits.gsfc.nasa.gov/ and https://www.fileformat.info/format/fits/egff.htm.
Astropy includes powerful tools to read, manage, and write images and tables into fits file.
The fits format supports for unlimited metadata in the header, for example the sky coordinates, information about the telescope, etc in addition to the keys for tables.
In this notebook we will explore a couple of methods provided by Pandas and Astropy to deal with tabular data. It is a work in progress and is not an exhaustive overview of the methods in both packages.
This notebook requires Python 3.6 an Astropy version 3.0.1 (Astropy version >3 is only available for Python > 3.4).
Although Pandas and Astropy can deal with specific format not supported by the other packages (Excel, json format for Pandas and fits format for Astropy), they both accept other common file formats such as HDF5, ascii and cvs formats. In the first part of the notebook, we take as example the Titanic passenger manifest, whose data are in the CVS format. Comma Separated Value files (CSV) are widely used (and an export and import option for programs like Excel). Python has a standard library that can read csv format:
from __future__ import division, absolute_import, print_function
# Method 1: Standard Python
from csv import reader
def load_csv(filename):
dataset = list() # we put the dataset in a list
with open(filename,'r') as file:
csv_reader = reader(file)
for row in csv_reader:
if not row:
continue
dataset.append(row)
return dataset
The titanic dataset can be found at various places.
dataset=load_csv('titanic.csv')
dataset[0:2][:]
# Method 2: Pandas
import pandas as pd
df = pd.read_csv('titanic.csv')
df.head(5)
df.tail(5)
df.info()
df.describe()
# Method 3: astropy.table
from astropy.table import Table, unique
tab = Table.read('titanic.csv')
tab.info
tab
177 entries for age are missing
A column can have specific units for quantitative data.
from astropy import units as u
u.s.find_equivalent_units() # find time units in addition to seconds (s)
tab['Age'].unit='year'
tab
One can easy list the ages in days.
tab['Age'].to(u.day)
tab.info('stats')
pclass_xt = pd.crosstab(df['Pclass'], df['Survived'])
pclass_xt
# Pandas
sexes = sorted(df['Sex'].unique())
genders_mapping = dict(zip(sexes, range(0, len(sexes) + 1)))
genders_mapping
# Astropy
sexes = unique(tab,keys='Sex')['Sex']
genders_mapping = dict(zip(sexes, range(0, len(sexes) + 1)))
genders_mapping
# Get the unique values of Pclass using Pandas
passenger_classes = sorted(df['Pclass'].unique())
for p_class in passenger_classes:
print('M: ', p_class, len(df[(df['Sex'] == 'male') & (df['Pclass'] == p_class)]))
print('F: ', p_class, len(df[(df['Sex'] == 'female') & (df['Pclass'] == p_class)]))
# Get the unique values of Pclass using astropy.table
passenger_classes = unique(tab,keys='Pclass')['Pclass']
for p_class in passenger_classes:
print('M: ', p_class, len(tab[(tab['Sex'] == 'male') & (tab['Pclass'] == p_class)]))
print('F: ', p_class, len(tab[(tab['Sex'] == 'female') & (tab['Pclass'] == p_class)]))
df2 = tab.to_pandas()
df2.head()
df2.shape
df2.columns
tab.columns
#pd2 = pd.read_excel('Weather.xlsx ')
# SQLite
from sqlalchemy import create_engine
import numpy as np
df2 = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df2.iloc[3]
df2.append(s, ignore_index=True)
data_rows = [(1, 2.0, 'x'),
(4, 5.0, 'y'),
(5, 8.2, 'z')]
t = Table(rows=data_rows, names=('a', 'b', 'c'), meta={'name': 'first table'},
dtype=('i4', 'f8', 'S1'))
t
newrows = Table(rows=[(8, 4.0, 'l'),(-3, 9.0, 'm')], names=('a', 'b', 'c'))
newrows
from astropy.table import hstack,vstack
vstack([t,newrows])
newcols = Table(rows=[('a2','a3'),(1e4,3.2e3),(-4.,-5.)], names=('d', 'e'))
newcols
hstack([t,newcols])
Data is often stored in CSV files or databases in so-called “stacked” or “record” format.
import pandas.util.testing as tm; tm.N = 3
def unpivot(frame):
N, K = frame.shape
data = {'value' : frame.values.ravel('F'),
'variable' : np.asarray(frame.columns).repeat(N),
'date' : np.tile(np.asarray(frame.index), K)}
return pd.DataFrame(data, columns=['date', 'variable', 'value'])
unpivotdf = unpivot(tm.makeTimeDataFrame())
unpivotdf
pivoteddf=unpivotdf.pivot(index='date', columns='variable', values='value')
pivoteddf
pivotedtab=Table.from_pandas(pivoteddf)
pivotedtab
The index column is missing! It is not passed as column to an astropy table. One has to create add a column using the index values.
# create a dataframe out of the index
df2=pd.DataFrame(pivoteddf.index,index=pivoteddf.index)
# add it to the initial dataframe
pivoteddf2=pd.concat([df2,pivoteddf],axis='columns')
pivotedtab2=Table.from_pandas(pivoteddf2)
pivotedtab2
Another feature that is absent from Astropy table is a method to generate cross-tables. Generate a cross tab of Pclass and Survived:
pclass_xt = pd.crosstab(df['Pclass'], df['Survived'])
pclass_xt
The cross-table is a contingency/frequency table. Subsequent analysis via the Fisher or chi2 tests can be performed on a contingency/frequency table.
from scipy.stats.contingency import margins,chi2_contingency
m0, m1 = margins(pclass_xt)
chi2,p_chi2,dof,expected = chi2_contingency(pclass_xt)
# expected number
dfexp=pd.DataFrame(np.round(expected),index=[pclass_xt.index])
dfexp
deviation=pd.DataFrame(np.array(pclass_xt)/np.array(dfexp),index=[pclass_xt.index])
deviation
# Testing HDF5 versus fits
df.to_hdf('titanic_train.hdf5','table')
df3 = pd.read_hdf('titanic_train.hdf5','table')
df3.head(2)
import h5py
x = h5py.File('titanic_train.hdf5','r')
keys = [t for t in x.keys()]
print(keys)
list_of_names = []
x.visit(list_of_names.append)
list_of_names
x = h5py.File('H2.hdf5','r')
keys = [t for t in x.keys()]
print(keys)
## load the README text
README = x['README'].value
print( README)
## load data as a numpy array
wavelength = x['wavelength'].value
photodissociation = x['photodissociation'].value
#help("h5py.File")
I have failed so far to write the Astropy table into a HDF5 file.
#tab.write('titanic_train_astropy.hdf5',format='hdf5',path='table')
df.to_excel('titanic_train.xls')
tab.write('titanic_train.fits', format='fits',overwrite=True)
tab2=Table.read('titanic_train.fits', format='fits')
tab2[0:2]