arxiv_astro_summarizer.astroph_summarizer
Created on Mon May 15 20:10:00 2023
@author: daniel
Module Contents
Classes
Class object to scrape, read, and analyze arXiv papers published on a given date.
Functions
Scrape and process astrophysics papers from arXiv for a range of dates (Monday to Friday only!).
Replaces astronomical terms and abbreviations with their expanded forms or corresponding concepts.
- class arxiv_astro_summarizer.astroph_summarizer.Scraper(date, user_input=None, path=None, enforce_date=True)[source]
Class object to scrape, read, and analyze arXiv papers published on a given date.
- Parameters:
date (str) – The date to consider for scraping, in the following format: YYYY-MM-DD, e.g. ‘2022-05-12’.
user_input (str, optional) – A short excerpt describing the kind of papers the user is interested in. Defaults to None.
path (str, optional) – The path where the file should be saved. Defaults to None, which saves the files to the local home directory.
enforce_date (bool) – If True, papers without the date printed on the first page will be removed. Defaults to True.
- df
DataFrame containing the metadata of the scraped arXiv papers.
- Type:
pandas.DataFrame
- Raises:
ValueError – If the input date is not a string or if user_input is not a string when provided.
- format_date()[source]
Converts a date string from ‘YYYY-MM-DD’ format to ‘DD Month YYYY’ format. The result is used as a search keyword within each paper, to check that its publication date is consistent with the input date.
- Parameters:
None –
- Returns:
The date string in ‘DD Month YYYY’ format.
- Return type:
str
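The conversion described for format_date() can be sketched with the standard library. This is a minimal illustration, not the package's implementation: the function name format_date_sketch is hypothetical, and it assumes the day keeps its leading zero, as ‘DD’ suggests.

```python
from datetime import datetime


def format_date_sketch(date_str):
    """Convert 'YYYY-MM-DD' to 'DD Month YYYY' (hypothetical helper)."""
    # strptime raises ValueError if the string does not match the format.
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    # %B is the full month name in the current locale (English by default).
    return parsed.strftime("%d %B %Y")


print(format_date_sketch("2022-05-12"))
```

Because %B is locale-dependent, a non-English locale would produce a month name that may not match the text printed on the paper.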
- return_arxiv_url(doi)[source]
Returns the HTTPS URL of an astro-ph paper, built from the hardcoded format https://arxiv.org/pdf/{doi}. Pass only the numeric identifier, with nothing else in the string.
- Returns:
The URL of the arXiv PDF (without the .pdf extension).
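A small sketch of the URL construction described above. The function name arxiv_pdf_url and the identifier check are assumptions added for illustration; the page only states the URL format and that nothing but the numbers should be passed.

```python
import re


def arxiv_pdf_url(doi):
    """Build the PDF URL from a numeric arXiv identifier (hypothetical helper)."""
    identifier = str(doi).strip()
    # Assumed check: modern identifiers look like '2305.12345' (YYMM.NNNN or YYMM.NNNNN).
    if not re.fullmatch(r"\d{4}\.\d{4,5}", identifier):
        raise ValueError(f"expected a numeric identifier only, got {identifier!r}")
    return f"https://arxiv.org/pdf/{identifier}"


print(arxiv_pdf_url("2305.12345"))
```

Rejecting prefixed strings such as ‘arXiv:2305.12345’ early gives a clearer error than a failed download later.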
- scrape_arxiv()[source]
Compiles the metadata of the arxiv papers uploaded on a given day. Scraping date ranges is not currently supported!
- Parameters:
None –
- Returns:
pandas DataFrame with the following metadata columns: ‘id’, ‘title’, ‘categories’, ‘abstract’, ‘doi’, ‘created’, ‘updated’, ‘authors’.
- Return type:
pandas.DataFrame
- save_paper(index)[source]
Function to download and save a pdf given the DOI number. The file is saved to the specified directory, or the local home directory if path is not provided.
- Parameters:
index (int or str) – Index of the paper in the meta-data, which is saved as an attribute called df after running the scrape_arxiv() class method. Can also be set to ‘all’, in which case the entire DataFrame will be looped through to save all appropriate papers.
- Returns:
None. The file is either saved to the path or it is not.
- process_abstract(replace_astro_terms=True)[source]
Preprocesses the abstract text by applying specific transformations.
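The replace_astro_terms option presumably expands abbreviations before summarization. The sketch below shows one way such a substitution could work; the term table and the function name expand_astro_terms are hypothetical, since the package's actual mapping is not shown on this page.

```python
import re

# Hypothetical mapping for illustration only; not the package's real term list.
ASTRO_TERMS = {
    "AGN": "active galactic nucleus",
    "SMBH": "supermassive black hole",
    "CMB": "cosmic microwave background",
}


def expand_astro_terms(text, terms=ASTRO_TERMS):
    """Replace whole-word abbreviations with their expanded forms."""
    # Longest keys first, so a longer abbreviation is never shadowed by a shorter one.
    keys = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: terms[match.group(1)], text)


print(expand_astro_terms("The AGN hosts a SMBH."))
```

The word-boundary anchors mean that a plural such as ‘AGNs’ is left untouched in this sketch.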
- summarize(max_length=512, min_length=30, do_sample=False)[source]
Summarizes the given document based on the specified text section. This will update the df class attribute to contain only the author and summary.
- is_related(summary)[source]
Calculates the cosine similarity between the user input and a summary text. A TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer builds TF-IDF matrices for the input and the summary, and the cosine similarity computed from these matrices is returned.
- Parameters:
summary – The summary text.
- Returns:
The cosine similarity score.
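To make the similarity score concrete, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. It is a hand-rolled stand-in for a library vectorizer, not the package's code: the tokenizer, the smoothed IDF formula, and the name tfidf_cosine are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens; a real vectorizer may tokenize differently.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_cosine(user_input, summary):
    """Cosine similarity between the TF-IDF vectors of two texts."""
    docs = [Counter(tokenize(user_input)), Counter(tokenize(summary))]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)
    # Smoothed inverse document frequency (assumed form): ln((1 + n) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in d for d in docs))) + 1.0
        for term in vocab
    }
    vectors = [[doc[term] * idf[term] for term in vocab] for doc in docs]
    dot = sum(a * b for a, b in zip(*vectors))
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in vectors]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0  # an empty text shares nothing with the other one
    return dot / (norms[0] * norms[1])


print(tfidf_cosine("dark matter halos", "dark matter in dwarf galaxies"))
```

Texts with no words in common score exactly 0, which is why a threshold of 0 is enough to discard papers that share no vocabulary with the user input.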
- remove_irrelevant_papers(similarity_threshold=0)[source]
Removes the papers that have similarity scores less than or equal to the specified similarity_threshold.
- Parameters:
similarity_threshold (float) – Papers with similarity scores below this threshold will be deleted. Defaults to 0. Can be set to None to keep all papers.
- Returns:
None
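The filtering rule can be illustrated on plain records. This sketch follows the method description (scores less than or equal to the threshold are removed); the ‘similarity’ key and the function name drop_irrelevant are hypothetical, and the real method operates on the df attribute rather than on a list.

```python
def drop_irrelevant(records, similarity_threshold=0):
    """Keep records whose similarity score exceeds the threshold."""
    if similarity_threshold is None:
        # None disables filtering, so every paper is kept.
        return list(records)
    return [r for r in records if r["similarity"] > similarity_threshold]


papers = [
    {"title": "A", "similarity": 0.0},
    {"title": "B", "similarity": 0.12},
    {"title": "C", "similarity": 0.40},
]
print([p["title"] for p in drop_irrelevant(papers)])
print([p["title"] for p in drop_irrelevant(papers, similarity_threshold=0.2)])
```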
- arxiv_astro_summarizer.astroph_summarizer.scrape_and_analyze(start_date_str, end_date_str, user_input=None, similarity_threshold=None, path=None, enforce_date=True)[source]
Scrape and process astrophysics papers from arXiv for a range of dates (Monday to Friday only!).
Example
>>> start_date_str = '2023-04-01'
>>> end_date_str = '2023-04-30'
>>> scrape_and_analyze(start_date_str, end_date_str)
- Parameters:
start_date_str (str) – The start date in the format ‘YYYY-MM-DD’.
end_date_str (str) – The end date in the format ‘YYYY-MM-DD’.
user_input (str, optional) – A short excerpt describing the kind of papers the user is interested in. Defaults to None.
similarity_threshold (float, optional) – Papers with similarity scores below this threshold will be deleted. Defaults to None, which keeps all papers.
path (str, optional) – The path where the file should be saved. Defaults to None, which saves the files to the local home directory.
enforce_date (bool) – If True, papers without the date printed on the first page will be removed. Defaults to True.
- Returns:
None
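Since only Monday-to-Friday dates are processed, the range is presumably reduced to weekdays before each day is scraped. The sketch below shows one way to enumerate those dates; the helper name weekdays_between is hypothetical, and treating the end date as inclusive is an assumption.

```python
from datetime import date, timedelta


def weekdays_between(start_date_str, end_date_str):
    """List the Monday-to-Friday dates in a range as 'YYYY-MM-DD' strings."""
    start = date.fromisoformat(start_date_str)
    end = date.fromisoformat(end_date_str)
    days = []
    current = start
    while current <= end:  # end date assumed inclusive
        if current.weekday() < 5:  # 0 = Monday, ..., 4 = Friday
            days.append(current.isoformat())
        current += timedelta(days=1)
    return days


april = weekdays_between("2023-04-01", "2023-04-30")
print(len(april), april[0], april[-1])
```

Each returned string is already in the ‘YYYY-MM-DD’ format that Scraper expects for its date argument.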