Astrophysics Dataset Creation
This tutorial uses the arXiv Dataset. Specifically, we use the abstracts of all papers classified under the astrophysics category, i.e., papers whose categories value contains astro-ph.
This notebook contains the code that creates the dataset.
Download the dataset from the link above and store it locally.
from tqdm import tqdm
import pandas as pd
# Edit the file path to the location where you downloaded archive.zip
file_path = "<add_file_path>/archive.zip"
column_lst = ["id", "title", "abstract"]
category_name = "astro-ph"
chunked_df = pd.read_json(file_path, lines=True, chunksize=10000)
astro_ph_df = []
for chunk in tqdm(chunked_df):
    # Keep only the rows whose categories string mentions the target category
    astro_ph_df.append(
        chunk[chunk.categories.str.contains(category_name)][column_lst].reset_index(
            drop=True
        )
    )
242it [01:20, 3.02it/s]
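The chunked filtering pattern used above can be tried on a small in-memory sample before running it on the full archive. The following is a minimal sketch, not the notebook's own code: the three records are made up, and `filter_chunks` is a hypothetical helper introduced only for illustration.

```python
import io

import pandas as pd

# Made-up records in JSON Lines layout (one JSON object per line).
sample_jsonl = "\n".join(
    [
        '{"id": "1", "title": "A", "abstract": "a", "categories": "astro-ph.CO hep-th"}',
        '{"id": "2", "title": "B", "abstract": "b", "categories": "cs.LG"}',
        '{"id": "3", "title": "C", "abstract": "c", "categories": "astro-ph"}',
    ]
)


def filter_chunks(source, category, columns, chunksize):
    """Hypothetical helper: read JSON Lines in chunks and keep matching rows."""
    parts = []
    for chunk in pd.read_json(source, lines=True, chunksize=chunksize):
        # Plain substring match, so "astro-ph.CO" also counts as astro-ph.
        mask = chunk["categories"].str.contains(category, regex=False)
        parts.append(chunk.loc[mask, columns])
    # ignore_index=True gives one continuous index across all chunks.
    return pd.concat(parts, ignore_index=True)


sample_df = filter_chunks(
    io.StringIO(sample_jsonl), "astro-ph", ["id", "title", "abstract"], chunksize=2
)
print(sample_df)
```

Because only one chunk is held in memory at a time, the same loop scales to the full file; the chunk size mainly trades memory use against the number of iterations.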
astro_ph_df = pd.concat(astro_ph_df)
astro_ph_df.head()
| | id | title | abstract |
|---|---|---|---|
| 0 | 712.2086 | On weak and strong magnetohydrodynamic turbulence | Recent numerical and observational studies c... |
| 1 | 712.2103 | Hilltop Curvatons | We study ``hilltop'' curvatons that evolve o... |
| 2 | 712.211 | Near-field cosmology with the VLT | With the arrival of wide-field imagers on me... |
| 3 | 712.2111 | The prototype colliding-wind pinwheel WR 104 | Results from the most extensive study of the... |
| 4 | 712.2116 | X-ray spectral evolution of TeV BL Lac objects... | Many of the extragalactic sources detected i... |
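The ids in this preview appear as numbers (for example 712.2086), which suggests that pandas inferred a numeric type for the id column, so leading and trailing zeros are not preserved. If the original string identifiers matter, one option is to pass `dtype={"id": str}` to `pd.read_json`. The sketch below illustrates this on a single made-up record; the id value `0712.2110` is an assumption chosen for illustration, not taken from the dataset.

```python
import io

import pandas as pd

# A single illustrative record; the id value is assumed, not read from the archive.
record = '{"id": "0712.2110", "title": "Near-field cosmology with the VLT"}'

# Default behaviour: pandas infers the dtype of each column.
inferred = pd.read_json(io.StringIO(record), lines=True)

# Explicit dtype: keep the id column as text.
as_text = pd.read_json(io.StringIO(record), lines=True, dtype={"id": str})

print(inferred["id"].iloc[0])
print(as_text["id"].iloc[0])
```

The same `dtype` argument can be combined with `chunksize` in the chunked read, at the cost of the preview table showing string ids instead of numbers.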
astro_ph_df.shape
(331564, 3)
# Edit the file path to the location where the pickled dataset should be saved
astro_ph_df.to_pickle("../resources/astro-ph-arXiv-abstracts.pkl")
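A later step can load the saved file back with `pd.read_pickle`. The following is a small round-trip sketch on a made-up two-row frame written to a temporary directory; the file name `example-abstracts.pkl` is hypothetical and unrelated to the path used above.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the filtered abstracts frame.
df = pd.DataFrame(
    {"id": ["a1", "a2"], "title": ["T1", "T2"], "abstract": ["x", "y"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "example-abstracts.pkl")
    df.to_pickle(pkl_path)  # serialize the frame, dtypes and index included
    restored = pd.read_pickle(pkl_path)  # load it back into memory

print(restored.equals(df))
```

Pickle files preserve dtypes and the index exactly, but they should only be loaded from sources you trust.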