Astrophysics Dataset Creation


This tutorial uses the arXiv Dataset. Specifically, we use the abstracts of all papers classified under the astrophysics category, i.e., papers whose categories value contains astro-ph.

This notebook contains code to create the dataset.

Download the dataset from the link above and store it locally.

from tqdm import tqdm
import pandas as pd
# Edit the file path to point to where you downloaded archive.zip
file_path = "<add_file_path>/archive.zip"
column_lst = ["id", "title", "abstract"]
category_name = "astro-ph"
chunked_df = pd.read_json(file_path, lines=True, chunksize=10000)
astro_ph_df = []

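for chunk in tqdm(chunked_df):
    # Keep rows whose categories string contains the target category
    # (a plain substring match, so subcategories are included as well).
    astro_ph_df.append(
        chunk[chunk.categories.str.contains(category_name, regex=False)][
            column_lst
        ].reset_index(drop=True)
    )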
242it [01:20,  3.02it/s]
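The filter in the loop above is a substring match on the categories column. As a minimal sketch of what that implies, assuming categories is a space-separated string of category codes, the toy frame below (hypothetical rows, not taken from the dataset) shows that subcategories and cross-listed papers are matched too:

```python
import pandas as pd

# Hypothetical rows mimicking a space-separated "categories" field.
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "categories": [
            "astro-ph",            # plain astro-ph
            "astro-ph.CO gr-qc",   # astro-ph subcategory, listed first
            "hep-th astro-ph.HE",  # cross-listed, astro-ph not first
            "cs.LG",               # unrelated category
        ],
    }
)

# Same kind of substring filter as in the loop above.
mask = toy["categories"].str.contains("astro-ph", regex=False)
matched_ids = toy.loc[mask, "id"].tolist()
print(matched_ids)  # ['a', 'b', 'c']
```

If only papers whose primary category is astro-ph are wanted, a stricter test on the first code in the string would be needed. That is a design choice the notebook does not make.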
# Combine the per-chunk frames; ignore_index avoids repeated index values.
astro_ph_df = pd.concat(astro_ph_df, ignore_index=True)
astro_ph_df.head()
   id        title                                              abstract
0  712.2086  On weak and strong magnetohydrodynamic turbulence  Recent numerical and observational studies c...
1  712.2103  Hilltop Curvatons                                  We study ``hilltop'' curvatons that evolve o...
2  712.211   Near-field cosmology with the VLT                  With the arrival of wide-field imagers on me...
3  712.2111  The prototype colliding-wind pinwheel WR 104       Results from the most extensive study of the...
4  712.2116  X-ray spectral evolution of TeV BL Lac objects...  Many of the extragalactic sources detected i...
astro_ph_df.shape
(331564, 3)
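The ids in the preview above (for example 712.211) look as though they were parsed as numbers, which would drop leading and trailing zeros from arXiv identifiers. One possible remedy, shown here as a sketch on two made-up JSON Lines records rather than on the real archive, is to pass a dtype mapping to pd.read_json so the id column stays a string:

```python
from io import StringIO

import pandas as pd

# Two made-up JSON Lines records with arXiv-style string ids.
sample = StringIO(
    '{"id": "0712.2086", "title": "t1"}\n'
    '{"id": "0712.2110", "title": "t2"}\n'
)

# Force the id column to str so pandas does not infer a numeric type.
ids_as_str = pd.read_json(sample, lines=True, dtype={"id": str})["id"].tolist()
print(ids_as_str)  # ['0712.2086', '0712.2110']
```

The same dtype argument could be added to the chunked pd.read_json call earlier in the notebook. The preview output would then show full identifiers instead of the shortened values.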
# Edit the file path to the location where you want to save the pickle file
astro_ph_df.to_pickle("../resources/astro-ph-arXiv-abstracts.pkl")
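To use the saved dataset later, the pickle can be loaded back with pd.read_pickle. The sketch below checks the round trip on a small stand-in frame written to a temporary directory; the file name and rows are hypothetical and stand in for the real output path:

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the filtered dataset (hypothetical rows).
toy = pd.DataFrame(
    {
        "id": ["x1", "x2"],
        "title": ["First title", "Second title"],
        "abstract": ["First abstract.", "Second abstract."],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the notebook writes to ../resources/ instead.
    pkl_path = os.path.join(tmp_dir, "toy-abstracts.pkl")
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# The restored frame should match the original exactly.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.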