Limitations
Functionality
By design, this program queries scientific papers from the arXiv’s Astrophysics (astro-ph) section only. The Scraper class can be used to query individual dates only, while the scrape_and_analyze function can be used to query papers over a continuous date range. Note that the output summary is intended solely for calculating the similarity metric and is therefore not comprehensive. The purpose of this program is to identify papers related to one or more specified topics; it is not intended as a general-purpose paper summarizer.
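To picture how a continuous date range relates to single-date queries, the range can be expanded into the individual dates it covers. The sketch below is illustrative only: `iter_dates` is a hypothetical helper that is not part of this program, and it makes no claim about how scrape_and_analyze is implemented internally.

```python
from datetime import date, timedelta


def iter_dates(start, end):
    """Yield every calendar date from start to end, inclusive.

    Hypothetical helper for illustration; not part of this program.
    """
    if end < start:
        raise ValueError("end must not precede start")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Each yielded date is one "individual date" of the kind a
# single-date query would accept.
days = list(iter_dates(date(2023, 1, 30), date(2023, 2, 2)))
```

Here `days` holds the four dates from 30 January to 2 February 2023, so a range that crosses a month boundary needs no special handling.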
Incorrect Dates
The scraping pipeline may return papers that were initially submitted to the arXiv outside the specified date range. This occurs because the arXiv routinely performs bulk updates, which alter the modification times of various submissions. Because of these bulk metadata record updates, the datestamp values may not correspond to the original submission or replacement times of older articles, and may not do so for newer articles either, owing to additional administrative and bibliographic updates.
As a workaround, the program inspects the top lines of the first page in an attempt to locate the date. If no date is present at the top of the page, it checks the left-hand side, where the arXiv record is printed (the vertical text). If the input date is not present in either of these two locations, the paper is removed from the working directory. This procedure eliminates the majority of the false positives that are scraped, but given the wide variety of article page formats, it’s possible that no extractable date is present on the first page, in which case papers outside your input date(s) may still be saved to your directory.
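The check described above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the program's actual code: the first page is assumed to be already extracted to plain text, dates are assumed to be written as "3 Jan 2023" or "3 January 2023", and the arXiv stamp is assumed to read like "arXiv:2301.01234v1 [astro-ph.GA] 3 Jan 2023". The names `find_dates`, `keep_paper`, and `top_lines` are hypothetical. The sketch also adopts one reading of the behaviour described: a paper is removed only when a date is found and does not match, and is kept when no date can be extracted at all.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Assumed date style: "3 Jan 2023" or "3 January 2023".
DATE_RE = re.compile(r"\b(\d{1,2})[ \t]+([A-Z][a-z]{2})[a-z]*[ \t]+(\d{4})\b")
# Assumed stamp style: "arXiv:<id> [<category>] <day> <month> <year>".
STAMP_RE = re.compile(
    r"arXiv:\S+[ \t]+\[[^\]]+\][ \t]+(\d{1,2}[ \t]+[A-Za-z]{3}[ \t]+\d{4})")


def find_dates(text):
    """Return the set of valid dates written in the assumed style."""
    found = set()
    for day, month_name, year in DATE_RE.findall(text):
        month = MONTHS.get(month_name)
        if month is None:
            continue  # three letters that are not a month abbreviation
        try:
            found.add(date(int(year), month, int(day)))
        except ValueError:
            continue  # impossible date such as "31 Feb 2023"
    return found


def keep_paper(first_page_text, wanted, top_lines=5):
    """Decide whether a paper stays in the working directory."""
    lines = first_page_text.splitlines()
    # Step 1: look for a date in the top lines of the page.
    dates = find_dates("\n".join(lines[:top_lines]))
    if not dates:
        # Step 2: fall back to the arXiv stamp (the vertical text).
        for match in STAMP_RE.finditer(first_page_text):
            dates |= find_dates(match.group(1))
    if not dates:
        return True  # nothing extractable, so the paper cannot be ruled out
    return wanted in dates


page = ("A Study of Galaxies\nA. Author\nAbstract text\n"
        "arXiv:2301.01234v1 [astro-ph.GA] 3 Jan 2023")
matches = keep_paper(page, date(2023, 1, 3), top_lines=3)
rejected = keep_paper(page, date(2022, 12, 1), top_lines=3)
```

In this example `matches` is true and `rejected` is false, since the only date found is the one in the stamp. A page with no recognizable date returns true, which corresponds to the remaining false positives mentioned above.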