`arxiv_astro_summarizer.astroph_summarizer`

Created on Mon May 15 20:10:00 2023

@author: daniel

Module Contents

Classes

Scraper

Class object to scrape, read, and analyze arXiv papers published on a given date.

Functions

`scrape_and_analyze`(start_date_str, end_date_str[, ...])	Scrape and process astrophysics papers from arXiv for a range of dates (Monday to Friday only!).
`replace_astronomical_terms`(text)	Replaces astronomical terms and abbreviations with their expanded forms or corresponding concepts.

class arxiv_astro_summarizer.astroph_summarizer.Scraper(date, user_input=None, path=None, enforce_date=True)

Class object to scrape, read, and analyze arXiv papers published on a given date.

Parameters:

date (str) – The date to consider for scraping, in the following format: YYYY-MM-DD, e.g. ‘2022-05-12’.
user_input (str, optional) – A short excerpt describing the kind of papers the user is interested in. Defaults to None.
path (str, optional) – The path where the file should be saved. Defaults to None, which saves the files to the local home directory.
enforce_date (bool) – If True, papers without the date printed on the first page will be removed. Defaults to True.

df

DataFrame containing the metadata of the scraped arXiv papers.

Type: pandas.DataFrame

text

Processed abstract text extracted from a PDF file.

Type: str

raw_text

Raw abstract text extracted from a PDF file.

Type: str

filenames

List of saved file names.

Type: list

Raises: ValueError – If the input date is not a string or if user_input is not a string when provided.
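
The type checks behind this ValueError can be sketched as follows. This is a minimal illustration rather than the class's actual code: the helper name validate_scraper_inputs and the error messages are assumptions made for the example.

```python
def validate_scraper_inputs(date, user_input=None):
    """Hypothetical helper mirroring the documented ValueError conditions."""
    # The date must always be a string such as '2022-05-12'.
    if not isinstance(date, str):
        raise ValueError("date must be a string in the format YYYY-MM-DD")
    # user_input is optional, but must be a string when it is provided.
    if user_input is not None and not isinstance(user_input, str):
        raise ValueError("user_input must be a string when provided")


# Valid arguments pass silently; a non-string date would raise ValueError.
validate_scraper_inputs("2022-05-12", user_input="galaxy clusters")
```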

format_date()

Converts a date string from ‘YYYY-MM-DD’ format to ‘DD Month YYYY’ format. The result is used as a search keyword within each paper, to check that its publication date is consistent with the input date.

Parameters: None
Returns: The date string in ‘DD Month YYYY’ format.
Return type: str
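
The conversion can be sketched with the standard library alone. The documented method takes no arguments and presumably reads the date from the instance; the standalone function below accepts the string directly, which is an assumption made to keep the example self-contained.

```python
from datetime import datetime


def format_date(date_str):
    """Sketch: convert 'YYYY-MM-DD' to 'DD Month YYYY'."""
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    # %d is the zero-padded day and %B the full month name
    # (English under Python's default locale).
    return parsed.strftime("%d %B %Y")


print(format_date("2022-05-12"))  # 12 May 2022
```

Note that %d pads single-digit days with a zero, so a search for the resulting string only matches papers that print the date the same way.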

check_pdf_contains_text(filename, text)

Check if a PDF file contains a specific text.

Parameters:

filename (str) – The path to the PDF file.
text (str) – The text to search for in the PDF.

Returns:

True if the text is found in the PDF, False otherwise.

Return type:

bool

return_arxiv_url(doi)

Returns the HTTPS URL of an astro-ph paper, built from the hardcoded format https://arxiv.org/pdf/{doi}. The doi argument must therefore contain only the numeric identifier and nothing else.

Returns: The URL of the arXiv PDF (without the .pdf extension).
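
A minimal sketch of this URL construction, assuming the method does nothing beyond filling in the template quoted above. The identifier in the usage line is an arbitrary placeholder chosen for illustration.

```python
# Template taken from the description above.
ARXIV_PDF_URL = "https://arxiv.org/pdf/{doi}"


def return_arxiv_url(doi):
    """Sketch: build the PDF URL from a bare numeric identifier."""
    return ARXIV_PDF_URL.format(doi=doi)


print(return_arxiv_url("2305.00001"))  # https://arxiv.org/pdf/2305.00001
```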

scrape_arxiv()

Compiles the metadata of the arXiv papers uploaded on a given day. Scraping date ranges is not currently supported!

Parameters: None
Returns: DataFrame with the following metadata columns: ‘id’, ‘title’, ‘categories’, ‘abstract’, ‘doi’, ‘created’, ‘updated’, ‘authors’.
Return type: pandas.DataFrame

save_paper(index)

Downloads and saves a PDF given the paper's DOI number. The file is saved to the specified directory, or to the local home directory if path is not provided.

Parameters: index (int or str) – Index of the paper in the metadata, which is stored in the df attribute after running the scrape_arxiv() method. Can also be set to ‘all’, in which case the entire DataFrame is looped through to save all appropriate papers.
Returns: None. The file is saved to the path if possible; nothing is returned either way.

extract_abstract_from_pdf(filename)

Extracts the abstract from a PDF file.

Parameters: filename (str) – The path to the PDF file.
Returns: The extracted abstract text.
Return type: str

process_abstract(replace_astro_terms=True)

Preprocesses the abstract text by applying specific transformations.

Parameters: replace_astro_terms (bool) – Whether to also expand astronomical terms and abbreviations (see replace_astronomical_terms). Defaults to True.
Returns: The preprocessed abstract text.
Return type: str

summarize(max_length=512, min_length=30, do_sample=False)

Summarizes the given document based on the specified text section. This will update the df class attribute to contain only the author and summary.

Parameters:

max_length (int) – The maximum length of the generated summary.
min_length (int) – The minimum length of the generated summary.
do_sample (bool) – Whether to use sampling during the summarization process.

Returns:

None

is_related(summary)

Calculates the cosine similarity between the user input and a summary text. A TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer builds the TF-IDF matrices for the input and the summary, and the cosine similarity computed from these matrices is returned.

Parameters: summary – The summary text.
Returns: The cosine similarity score.
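
To make the scoring concrete, here is a from-scratch sketch of TF-IDF cosine similarity using only the standard library. It is not the module's implementation: the tokenizer, the smoothed IDF formula, and the function names are all assumptions chosen for illustration, and a library vectorizer may weight or normalize terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: n * idf[t] for t, n in Counter(tokens).items()} for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


def tfidf_cosine_similarity(user_input, summary):
    """Score how closely a summary matches the user's description."""
    vec_input, vec_summary = tfidf_vectors([user_input, summary])
    return cosine(vec_input, vec_summary)


print(tfidf_cosine_similarity("galaxy clusters lensing", "lensing by galaxy clusters"))
```

Texts with no shared terms score exactly 0, which is why a threshold of 0 is enough to discard papers that have nothing in common with the user input.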

remove_irrelevant_papers(similarity_threshold=0)

Removes the papers whose similarity scores are less than or equal to the specified similarity_threshold.

Parameters: similarity_threshold (float) – Papers with similarity scores at or below this threshold will be deleted. Defaults to 0. Can be set to None to keep all papers.
Returns: None
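
The filtering rule can be sketched on plain Python data. The class itself works on the df attribute; the list of dicts and the ‘similarity’ key below are hypothetical stand-ins used so the example runs without pandas. The comparison follows the "less than or equal to" wording of the description.

```python
def remove_irrelevant(papers, similarity_threshold=0):
    """Sketch: keep only papers scoring strictly above the threshold."""
    # None disables filtering, so every paper is kept.
    if similarity_threshold is None:
        return list(papers)
    return [p for p in papers if p["similarity"] > similarity_threshold]


# Hypothetical records; titles and scores are made up for illustration.
papers = [
    {"title": "Paper A", "similarity": 0.0},
    {"title": "Paper B", "similarity": 0.12},
    {"title": "Paper C", "similarity": 0.45},
]

print([p["title"] for p in remove_irrelevant(papers)])        # ['Paper B', 'Paper C']
print([p["title"] for p in remove_irrelevant(papers, 0.12)])  # ['Paper C']
```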

arxiv_astro_summarizer.astroph_summarizer.scrape_and_analyze(start_date_str, end_date_str, user_input=None, similarity_threshold=None, path=None, enforce_date=True)

Scrape and process astrophysics papers from arXiv for a range of dates (Monday to Friday only!).

Example

>>> start_date_str = '2023-04-01'
>>> end_date_str = '2023-04-30'
>>> scrape_and_analyze(start_date_str, end_date_str)

Parameters:

start_date_str (str) – The start date in the format ‘YYYY-MM-DD’.
end_date_str (str) – The end date in the format ‘YYYY-MM-DD’.
user_input (str, optional) – A short excerpt describing the kind of papers the user is interested in. Defaults to None.
similarity_threshold (float, optional) – Papers with similarity scores below this threshold will be deleted. Defaults to None, which keeps all papers.
path (str, optional) – The path where the file should be saved. Defaults to None, which saves the files to the local home directory.
enforce_date (bool) – If True, papers without the date printed on the first page will be removed. Defaults to True.

Returns:

None
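
The Monday-to-Friday restriction can be illustrated with a small date loop. This is a sketch of one way to enumerate the weekdays in a range, not the function's actual code; the helper name weekdays_between is an assumption.

```python
from datetime import date, timedelta


def weekdays_between(start_date_str, end_date_str):
    """Return the Monday-to-Friday dates in the inclusive range, as 'YYYY-MM-DD' strings."""
    current = date.fromisoformat(start_date_str)
    end = date.fromisoformat(end_date_str)
    days = []
    while current <= end:
        # weekday() is 0 for Monday through 6 for Sunday.
        if current.weekday() < 5:
            days.append(current.isoformat())
        current += timedelta(days=1)
    return days


days = weekdays_between("2023-04-01", "2023-04-30")
print(len(days), days[0], days[-1])  # 20 2023-04-03 2023-04-28
```

Each date in the resulting list is in the format that Scraper expects for its date argument.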

arxiv_astro_summarizer.astroph_summarizer.replace_astronomical_terms(text)

Replaces astronomical terms and abbreviations with their expanded forms or corresponding concepts.

Parameters: text (str) – The preprocessed text.
Returns: The text with replaced astronomical terms.
Return type: str
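
A dictionary-driven replacement of this kind can be sketched as below. The mapping and the function name expand_terms are illustrative assumptions; the terms the module actually replaces are not listed in this reference.

```python
import re

# Hypothetical mapping, for illustration only.
TERMS = {
    "AGN": "active galactic nucleus",
    "ISM": "interstellar medium",
    "SNe": "supernovae",
}

# Longest keys first so a longer abbreviation wins over a shorter prefix;
# \b restricts matches to whole words.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(TERMS, key=len, reverse=True)) + r")\b"
)


def expand_terms(text):
    """Replace each known abbreviation with its expanded form."""
    return _PATTERN.sub(lambda match: TERMS[match.group(1)], text)


print(expand_terms("AGN feedback heats the ISM"))
# active galactic nucleus feedback heats the interstellar medium
```

The word-boundary anchors mean that a longer token such as "AGNs" is left untouched unless it is added to the mapping itself.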
