Loading Data from the Web

Many of Pandas' read functions can read files directly from the internet. This is helpful if you're writing tutorials or examples and want them to work even if users don't have the data, or if you want to quickly look at data without downloading it first.

Here we have a brief example using Pandas to remotely download NASA OMNI solar wind and geomagnetic data.

Let's start by loading the required packages (only Pandas, NumPy, and os!) and setting up a dictionary which stores the column names for the 1-minute OMNI data, along with the value that marks bad data in each column.

import os
import numpy as np
import pandas as pd

cols = {'Year':None, 'DOY':None, 'Hour':None, 'Minute':None,
            'IMF_id':99, 'SW_id':99, 'IMF_pt':999, 'SW_pt':999,
            'Per_int':999, 'Timeshift':999999, 'RMS_Timeshift':999999,
            'RMS_PhaseFrontNormal':99.99, 'Time_btwn_observations':999999,
            'B':9999.99, 'Bx_GSEGSM':9999.99, 'By_GSE':9999.99, 'Bz_GSE':9999.99,
            'By_GSM':9999.99, 'Bz_GSM':9999.99, 'RMS_SD_B':9999.99,
            'RMS_SD_field_vector':9999.99, 'Vsw':99999.9, 'Vx_GSE':99999.9,
            'Vy_GSE':99999.9, 'Vz_GSE':99999.9, 'Prho':999.99,'Tp':9999999.,
            'dynP':99.99, 'Esw':999.99, 'Beta':999.99, 'AlfvenMach':999.9,
            'X(s/c), GSE':9999.99, 'Y(s/c), GSE':9999.99, 'Z(s/c), GSE':9999.99,
            'BSN location, Xgse':9999.99, 'BSN location, Ygse':9999.99, 'BSN location, Zgse':9999.99,
            'AE':99999, 'AL':99999, 'AU':99999, 'SYM_D index':99999, 'SYM_H index':99999, 
            'ASY_D index':99999, 'ASY_H index':99999, 'PC index':999.99, 'Na_Np Ratio':9.999,
            'MagnetosonicMach':99.9}

Next, build the list of web addresses for the yearly OMNI files, one file per year from 2000 through 2010.

http_path = 'https://spdf.gsfc.nasa.gov/pub/data/omni/high_res_omni/'
sdate = pd.to_datetime('2000')
edate = pd.to_datetime('2010')
d_ser = pd.date_range(start=sdate, end=edate, freq='12MS')
fn = [os.path.join(http_path,f'omni_min{x.year}.asc') for x in d_ser]
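
A quick aside on building the addresses: os.path.join is meant for filesystem paths, and it works here because http_path already ends in a forward slash. Here is a minimal sketch of the same list built with urljoin from the standard library instead. The names years and urls are just for illustration and are not used elsewhere in this post.

```python
from urllib.parse import urljoin

import pandas as pd

http_path = 'https://spdf.gsfc.nasa.gov/pub/data/omni/high_res_omni/'

# One timestamp per year start, 2000 through 2010 inclusive
years = pd.date_range(start='2000-01-01', end='2010-01-01', freq='12MS')

# urljoin always joins with forward slashes, whatever the operating system
urls = [urljoin(http_path, f'omni_min{x.year}.asc') for x in years]

print(len(urls))
print(urls[0])
```

Printing the list before reading anything is a cheap way to check that the addresses look right.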

Use Pandas concat, read_csv, and a generator expression to read all the data files and join them into a single DataFrame.

om_dat = pd.concat((pd.read_csv(f, sep=r'\s+', engine='python', names=list(cols.keys()),
                    header=None, on_bad_lines='skip') for f in fn), ignore_index=True)
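
If you want to try the read-and-concat pattern without hitting the network, here is a small offline sketch. It assumes a made-up three-column layout and two made-up "files" held in strings (names, fake_files, and demo are all hypothetical), so it only illustrates the pattern, not the real OMNI format.

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for two yearly files
names = ['Year', 'DOY', 'B']
fake_files = [
    "2000   1  5.20\n2000   2  9999.99\n",
    "2001   1  4.75\n2001   2  6.10\n",
]

# Generator expression: each "file" is parsed only when concat asks for it
frames = (
    pd.read_csv(io.StringIO(text), sep=r'\s+', names=names, header=None)
    for text in fake_files
)

# ignore_index=True gives one continuous index across all the pieces
demo = pd.concat(frames, ignore_index=True)
print(demo)
```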

Create a datetime column ('cause I like datetimes).

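dates = ['Year', 'DOY', 'Hour', 'Minute']
dv = om_dat.loc[:, dates].astype('int32')
# Zero-pad each field into a YYYYDDDHHMM string (DDD is the day of year)
dt = [f"{x['Year']:04}{x['DOY']:03}{x['Hour']:02}{x['Minute']:02}" for index, x in dv.iterrows()]

# %j parses the three-digit day of year
om_dat['DateTime'] = pd.to_datetime(dt, format='%Y%j%H%M')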
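
Looping over rows with iterrows can be slow on a frame with millions of rows. Here is one possible vectorized sketch that should give the same timestamps: parse the year and day of year once, then add the hours and minutes as timedeltas. The sample rows and the names day and stamp are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows in the same Year/DOY/Hour/Minute layout
dv = pd.DataFrame({'Year': [2000, 2000, 2004],
                   'DOY': [1, 32, 366],
                   'Hour': [0, 12, 23],
                   'Minute': [0, 30, 59]})

# Combine year and day of year into strings like '2000001', parsed with %Y%j
day = pd.to_datetime((dv['Year'] * 1000 + dv['DOY']).astype(str), format='%Y%j')

# Add the time of day as timedeltas
stamp = (day
         + pd.to_timedelta(dv['Hour'], unit='h')
         + pd.to_timedelta(dv['Minute'], unit='min'))
print(stamp)
```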

Finally, use the cols dictionary to replace the values that mark bad data in the OMNI dataset with NaN.

for k, kval in cols.items():
    if kval is None:
        continue
    om_dat.loc[om_dat[k] == kval, k] = np.nan
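
As an alternative to the .loc assignment, here is a small sketch of the same replacement using Series.mask, which returns a new Series with NaN wherever the condition is true and upcasts integer columns to float as needed. The fills dictionary and demo frame are a made-up subset, just to show the idea.

```python
import pandas as pd

# Hypothetical subset of the dictionary: column name -> value marking bad data
fills = {'Year': None, 'IMF_id': 99, 'B': 9999.99}

demo = pd.DataFrame({'Year': [2000, 2000, 2000],
                     'IMF_id': [51, 99, 71],
                     'B': [5.2, 9999.99, 4.8]})

for col, fill in fills.items():
    # Columns mapped to None have nothing to replace
    if fill is None:
        continue
    # mask() puts NaN wherever the column equals its bad-data value
    demo[col] = demo[col].mask(demo[col] == fill)

print(demo)
```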