Author Wing-Fai Thi
Version 1.1 19/3/2018 add Bonamente's book in the reference
1.2 1/4/2018 add mosaic and parallel coordinate plots and references
This notebook concerns two of the exercises in the book Modern Statistical Methods for Astronomy by E. Feigelson and J Babu. It shows that Data Science is not always about large amount of data. The results are howver very insightfull in the field of star- and protoplanetary- formation and early evolution.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import rcParams
from statsmodels.graphics.mosaicplot import mosaic
from pandas.plotting import parallel_coordinates
# get configuration file location
#print (mpl.matplotlib_fname())
# get configuration current settings
#print (mpl.rcParams)
# Change the default settings
mpl.rcParams.update({'figure.figsize': (9.0, 7.0),
'figure.dpi' : 300, # 300 dpi for print
'font.size': 14,
'legend.frameon': False,
'legend.fontsize' : 12,
'xtick.labelsize' : 16,
'ytick.labelsize' : 16,
'axes.labelsize' : 18
})
One can create a mosaic plot from a contingency table. A mosiac plot allows to visualize multivariate categorical data in a rigorous and informative way.
# Load the 2x2 contingency table
df= pd.read_csv("ProtostallarJets.csv")
The first part deals with the question: Does jets related to the multiplicity of the central object?
The table below contains in a contingency/frequency table format the result of a survey of jets.
df
# multiplicity statistics
print "General jet frequency:",10./21.
print "Single protostar jet frequency:",5./14.
print "Multiple protostar jet frequency:",5./7.
By looking at the simple frequencies, one is tempted to conclude that multiple protostars tend to show jets more often than single protostars. Is that conclusion correct? We will use the analysis of the contingency/frequency table to help us.
from scipy.stats import fisher_exact, chi2_contingency
# we extract the table as numpy array
contingency_table = df[['No','Yes']]
contingency_table = np.array(contingency_table.drop([2]))
# hypothesis test of independence of the observed frequencies
# in the contingency table
# Expected = row total x column total / all cell total
chi2,p_chi2,dof,expected = chi2_contingency(contingency_table)
We perform a Chi-square test.
print "Chi-square p-value is",p_chi2
dfm = df.set_index('Multiplicity')
dfm = dfm[['Yes','No']]
dfm.drop('Total',inplace=True)
dfm.columns.names =['Jet?']
dfm
mosaic(dfm.stack(),gap=0.01,statistic=True)
plt.show()
df_expected=pd.DataFrame(expected,columns=['No','Yes'])
df_expected=pd.concat((df['Multiplicity'].drop(2),df_expected),axis=1)
df_expected
oddsratio , p_fisher = fisher_exact(contingency_table)
# Fisher exact test is appropriate for small-n samples.
print "Fisher's test p-value is", p_fisher
The probability that we would observe this or an even more imbalanced ratio by chance is about 18.2 % (27.9% for the chi-square test). A commonly used significance level is 5%. if we adopt that, we can therefore conclude that our observed imbalance is not statistically significant: no significant effect of protostellar multiplicity on jet formation is present. The conlusion holds for the particular values in the table. The analyiss does not prevent bias in the sampling.
The standardized residual table is more meaningful and is quite simple to interpret. The standardized residuals are simply the deviation from the expected value in standard deviation. A strandar residual is the residual value divided by the square-root of the expected value.
# standardized residuals see statisticshowto.com
res = contingency_table-expected
st_res = res/np.sqrt(expected)
df_res=pd.DataFrame(res,columns=['No','Yes'])
df_res=pd.concat((df['Multiplicity'].drop(2),df_res),axis=1)
print "Resisuals"
df_res
All the standardized residual values are within 2-sigma deviation (in both directions). The higher occurence of jet for multiple protostars is most likely be due to pure chance.
# Wilson's estimator
count = 5.
total = 14.
nbar = total + 4.
pw = (count+2.)/nbar
from scipy.stats import norm
confidence = 0.95
alpha = 1.-confidence
z=norm.ppf(1-0.5*alpha)
We use the Wilson's estimator (Estimation of the ratio of binomial random variables, see Feigelson & Babu, p.80)
print "z-score=",z
interval=z*np.sqrt(pw*(1-pw)/nbar)
print pw,"+/-",interval
The Wilson estimator confirms the previous conclusion.
Aim: Test the null hypotheses that that the distribution of evolutionary classes is the same in all star-forming regions.
df2=pd.read_csv("ProtostatPopulations.csv")
df2
One can easily create by hand her/his own csv file with the values in the table above.
Type = ['O-O/I','I-flat','II','Trans','III']
RC1 = np.array(df2[Type])
from scipy.stats.contingency import margins
m0, m1 = margins(RC1)
row_proportion=pd.DataFrame(RC1/(1.*m0)*100.,columns=Type)
row_proportion=pd.concat((df2['Cloud'],row_proportion),axis=1)
row_proportion
plt.figure(figsize=(9,7))
parallel_coordinates(row_proportion,'Cloud',
color=('#556270','#4ECDC4', '#C7F464','cyan'),alpha=1.0)
plt.xlabel('Evolutionary stage')
plt.ylabel('Proportion amoung yong stellar objects (%)')
plt.show()
column_proportion=pd.DataFrame(RC1/(1.*m1)*100.,columns=Type)
column_proportion=pd.concat((df2['Cloud'],column_proportion),axis=1)
column_proportion
chi2,p_chi2,dof,expected = chi2_contingency(RC1)
We perform a Chi-square test.
print "Chi-square p-value is",p_chi2
The p-values << 1, significant differences exist.
# Expected = row total x column total / all cell total
df_expected=pd.DataFrame(expected,columns=Type)
df_expected=pd.concat((df2['Cloud'],df_expected),axis=1)
print "Expected"
df_expected
Table of expected values: Expected = row total x column total / all cell total
res = RC1-expected
df_res=pd.DataFrame(res,columns=Type)
df_res=pd.concat((df2['Cloud'],df_res),axis=1)
print "Resisuals"
df_res
The residual table is just the contngency table values - the expected values.
# standardized residuals see statisticshowto.com
st_res = res/np.sqrt(expected)
df_st_res=pd.DataFrame(st_res,columns=Type)
df_st_res=pd.concat((df2['Cloud'],df_st_res),axis=1)
print "Standardized residuals"
df_st_res
A couple of remarks can be made from the standardized residual table. The Serpens star-forming region has 7.6 sigma more class O and O-I objects than expected, while it has a strong deficit in class III objects (-3.8). eta Cha shows a siginificant excess of more evolved objects (Transition and class III objects) and a deficit of young class O - O/I objects. The Chamaeleon I and Taurus clouds have many intermediate stage class I-flat and class II objects. The standardized residual table illustrate the disk-formation stage of each cloud.
# adjusted residuals
row_total_proportion = 1.0*m0/m0.sum()
column_total_proportion = 1.0*m1/m1.sum()
adj_res = (RC1-expected)/np.sqrt(expected*
(1.-row_total_proportion)*(1.-column_total_proportion))
df_adj_res=pd.DataFrame(adj_res,columns=Type)
df_adj_res=pd.concat((df2['Cloud'],df_adj_res),axis=1)
print "Adjusted residuals"
df_adj_res
The adjusted residual table is an alternative to the residual table and carries the same information.
One does not necessarily need large amounts of data to perform insightful data analysis.
# sudo pip install FisherExact
from FisherExact import fisher_exact
pvalue = fisher_exact([[8, 2, 12], [1, 5, 2]])
pvalue
df2
df2m = df2[['Cloud','O-O/I','I-flat','II','Trans','III']]
df2m = df2m.set_index('Cloud')
fig=plt.figure(figsize=(9,7))
ax = fig.add_subplot(111)
m, rect = mosaic(df2m.stack(),ax=ax,gap=0.05,statistic=True,axes_label=False,horizontal=False)
plt.show()
df2m