arxiv_astro_summarizer.astroph_summarizer

Created on Mon May 15 20:10:00 2023

@author: daniel

Module Contents

Classes

Scraper

Class object to scrape, read, and analyze arXiv papers published on a given date.

Functions

scrape_and_analyze(start_date_str, end_date_str[, ...])

Scrape and process astrophysics papers from arXiv for a range of dates (Monday to Friday only!).

replace_astronomical_terms(text)

Replaces astronomical terms and abbreviations with their expanded forms or corresponding concepts.

class arxiv_astro_summarizer.astroph_summarizer.Scraper(date, user_input=None, path=None, enforce_date=True)[source]

Class object to scrape, read, and analyze arXiv papers published on a given date.

Parameters:
  • date (str) – The date to consider for scraping, in the following format: YYYY-MM-DD, e.g. ‘2022-05-12’.

  • user_input (str, optional) – A short excerpt describing the kind of papers the user is interested in. Defaults to None.

  • path (str, optional) – The path where the file should be saved. Defaults to None, which saves the files to the local home directory.

  • enforce_date (bool) – If True, papers without the date printed on the first page will be removed. Defaults to True.

df

DataFrame containing the metadata of the scraped arXiv papers.

Type:

pandas.DataFrame

text

Processed abstract text extracted from a PDF file.

Type:

str

raw_text

Raw abstract text extracted from a PDF file.

Type:

str

filenames

List of saved file names.

Type:

list

Raises:

ValueError – If the input date is not a string or if user_input is not a string when provided.

format_date()[source]

Converts a date string from ‘YYYY-MM-DD’ format to ‘DD Month YYYY’ format. The formatted string is used as a search keyword within each paper, to check that its publication date is consistent with the input date.

Parameters:

None

Returns:

The date string in ‘DD Month YYYY’ format.

Return type:

str
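
The conversion described above can be sketched with the standard library. This is a minimal illustration, not the method's actual source: the name format_date_sketch, the month-name table, and the zero-padded day are assumptions.

```python
from datetime import datetime

# Month names are spelled out explicitly so the result does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def format_date_sketch(date_str):
    """Convert 'YYYY-MM-DD' to 'DD Month YYYY' (hypothetical helper)."""
    # strptime raises ValueError if the input does not match the format.
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    # Assumption: the day keeps two digits, as the 'DD' in the description suggests.
    return f"{parsed.day:02d} {MONTHS[parsed.month - 1]} {parsed.year}"


print(format_date_sketch("2022-05-12"))
```

If the date printed on a paper omits the leading zero for single-digit days, the padding would need to be dropped before searching.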

check_pdf_contains_text(filename, text)[source]

Check if a PDF file contains a specific text.

Parameters:
  • filename (str) – The path to the PDF file.

  • text (str) – The text to search for in the PDF.

Returns:

True if the text is found in the PDF, False otherwise.

Return type:

bool

return_arxiv_url(doi)[source]

Returns the HTTPS URL of an astro-ph paper, built from the following hardcoded format: https://arxiv.org/pdf/{doi}. Pass only the numeric identifier; do not include anything other than the numbers.

Returns:

The URL of the arXiv PDF (without the .pdf extension).
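
A small sketch of how such a URL could be built from the hardcoded format. The function name arxiv_url_sketch and the digits-dot-digits check are assumptions made for illustration; the documented method may not validate its input at all.

```python
import re

ARXIV_PDF_BASE = "https://arxiv.org/pdf/"


def arxiv_url_sketch(identifier):
    """Build the PDF URL from a numeric identifier (hypothetical helper)."""
    # Assumption: "only the numbers" means digits, a dot, then digits, e.g. '2305.09876'.
    if not re.fullmatch(r"\d+\.\d+", identifier):
        raise ValueError("identifier must contain only digits and a single dot")
    return f"{ARXIV_PDF_BASE}{identifier}"


print(arxiv_url_sketch("2305.09876"))
```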

scrape_arxiv()[source]

Compiles the metadata of the arxiv papers uploaded on a given day. Scraping date ranges is not currently supported!

Parameters:

None

Returns:

DataFrame with the following metadata columns: ‘id’, ‘title’, ‘categories’, ‘abstract’, ‘doi’, ‘created’, ‘updated’, ‘authors’.

Return type:

pandas.DataFrame

save_paper(index)[source]

Function to download and save a pdf given the DOI number. The file is saved to the specified directory, or the local home directory if path is not provided.

Parameters:

index (int or str) – Index of the paper in the meta-data, which is saved as an attribute called df after running the scrape_arxiv() class method. Can also be set to ‘all’, in which case the entire DataFrame will be looped through to save all appropriate papers.

Returns:

None. The file is either saved to the path or it is not.

extract_abstract_from_pdf(filename)[source]

Extracts the abstract from a PDF file.

Parameters:

filename (str) – The path to the PDF file.

Returns:

The extracted abstract text.

Return type:

str

process_abstract(replace_astro_terms=True)[source]

Preprocesses the abstract text by applying specific transformations.

Parameters:

replace_astro_terms (bool) – Whether to replace astronomical terms and abbreviations in the abstract text. Defaults to True.

Returns:

The preprocessed abstract text.

Return type:

str

summarize(max_length=512, min_length=30, do_sample=False)[source]

Summarizes the given document based on the specified text section. This will update the df class attribute to contain only the author and summary.

Parameters:
  • max_length (int) – The maximum length of the generated summary.

  • min_length (int) – The minimum length of the generated summary.

  • do_sample (bool) – Whether to use sampling during the summarization process.

Returns:

None

The is_related function calculates the cosine similarity between the user input and a summary text. A TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer builds TF-IDF matrices for the input and the summary, and the cosine similarity computed from these matrices is returned.

Parameters:

summary – The summary text.

Returns:

The cosine similarity score.
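
To make the scoring concrete, here is a dependency-free sketch of TF-IDF cosine similarity between two texts. It is not the module's implementation, which is described as using a TF-IDF vectorizer; the tokenizer, the smoothed IDF formula, and the name tfidf_cosine_sketch are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_cosine_sketch(user_input, summary):
    """Cosine similarity of the TF-IDF vectors of two texts (hypothetical helper)."""
    docs = [Counter(tokenize(user_input)), Counter(tokenize(summary))]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)
    # Assumption: smoothed IDF, ln((1 + n) / (1 + df)) + 1, so shared terms keep a nonzero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(1 for d in docs if term in d))) + 1
        for term in vocab
    }
    # Counter returns 0 for terms that do not occur in a document.
    vectors = [[doc[term] * idf[term] for term in vocab] for doc in docs]
    dot = sum(a * b for a, b in zip(vectors[0], vectors[1]))
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in vectors]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norms[0] * norms[1])


score = tfidf_cosine_sketch("dark matter halos", "simulations of dark matter halos")
print(round(score, 3))
```

Texts with no tokens in common score 0, and identical texts score 1.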

remove_irrelevant_papers(similarity_threshold=0)[source]

Removes the papers that have similarity scores less than or equal to the specified similarity_threshold.

Parameters:

similarity_threshold (float) – Papers with similarity scores less than or equal to this threshold will be deleted. Defaults to 0. Can be set to None to keep all papers.

Returns:

None
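
The filtering rule can be sketched on plain dictionaries. This follows the "less than or equal to" wording of the description; the record layout, the 'similarity' key, and the function name are hypothetical, and the real method operates on the df attribute instead.

```python
def filter_papers_sketch(papers, similarity_threshold=0):
    """Keep papers whose score exceeds the threshold (hypothetical helper)."""
    if similarity_threshold is None:
        # None disables filtering and keeps every paper.
        return list(papers)
    # Scores less than or equal to the threshold are removed.
    return [p for p in papers if p["similarity"] > similarity_threshold]


papers = [
    {"title": "Paper A", "similarity": 0.0},
    {"title": "Paper B", "similarity": 0.12},
    {"title": "Paper C", "similarity": 0.47},
]
print([p["title"] for p in filter_papers_sketch(papers)])
```

With the default threshold of 0, only papers that share no weighted terms with the user input are dropped.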

arxiv_astro_summarizer.astroph_summarizer.scrape_and_analyze(start_date_str, end_date_str, user_input=None, similarity_threshold=None, path=None, enforce_date=True)[source]

Scrape and process astrophysics papers from arXiv for a range of dates (Monday to Friday only!).

Example

>>> start_date_str = '2023-04-01'
>>> end_date_str = '2023-04-30'
>>> scrape_and_analyze(start_date_str, end_date_str)

Parameters:
  • start_date_str (str) – The start date in the format ‘YYYY-MM-DD’.

  • end_date_str (str) – The end date in the format ‘YYYY-MM-DD’.

  • user_input (str, optional) – A short excerpt describing the kind of papers the user is interested in. Defaults to None.

  • similarity_threshold (float, optional) – Papers with similarity scores less than or equal to this threshold will be deleted. Defaults to None, which keeps all papers.

  • path (str, optional) – The path where the file should be saved. Defaults to None, which saves the files to the local home directory.

  • enforce_date (bool) – If True, papers without the date printed on the first page will be removed. Defaults to True.

Returns:

None
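
Because only Monday to Friday dates are processed, the date range has to be reduced to weekdays before any scraping. A minimal sketch of that step follows; the helper name weekdays_between and the inclusive end date are assumptions.

```python
from datetime import date, timedelta


def weekdays_between(start_date_str, end_date_str):
    """List the Monday-to-Friday dates in a range as 'YYYY-MM-DD' strings (hypothetical helper)."""
    start = date.fromisoformat(start_date_str)
    end = date.fromisoformat(end_date_str)
    days = []
    current = start
    while current <= end:  # assumption: the end date is included
        if current.weekday() < 5:  # weekday() is 0 for Monday, 5 and 6 for the weekend
            days.append(current.isoformat())
        current += timedelta(days=1)
    return days


days = weekdays_between("2023-04-01", "2023-04-30")
print(len(days), days[0], days[-1])
```

Each returned string has the format the Scraper class expects for its date argument.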

arxiv_astro_summarizer.astroph_summarizer.replace_astronomical_terms(text)[source]

Replaces astronomical terms and abbreviations with their expanded forms or corresponding concepts.

Parameters:

text (str) – The preprocessed text.

Returns:

The text with replaced astronomical terms.

Return type:

str
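
A sketch of how abbreviation expansion of this kind can work. The term table below is purely illustrative and is not the module's actual mapping; the function name is also hypothetical.

```python
import re

# Hypothetical examples only; the real term list lives in the module source.
HYPOTHETICAL_TERMS = {
    "AGN": "active galactic nucleus",
    "SNe": "supernovae",
    "CMB": "cosmic microwave background",
}


def replace_terms_sketch(text, terms=HYPOTHETICAL_TERMS):
    """Replace whole-word abbreviations with their expanded forms."""
    # Longer keys go first so a short key never shadows a longer one.
    keys = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: terms[match.group(1)], text)


print(replace_terms_sketch("The AGN hosts SNe."))
```

The word boundaries keep the substitution from firing inside longer tokens, so a plural such as 'AGNs' is left untouched by this sketch.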