Amazon no longer offers a simple public API for accessing its reviews. Anyone who wants to quickly build a dataset of Amazon reviews has to download them by running a web crawler on the HTML pages.
I wrote a Perl script that, given a list of Amazon product IDs, automatically downloads exactly those HTML pages that contain the reviews of those products.
Then I wrote another Perl script that, given a list of the downloaded HTML files, extracts all the reviews contained in them and outputs, for each review, a record with the following information:
- A counter of the extracted reviews so far (can be used as a unique ID for the dataset).
- Date of the review in YYYYMMDD format.
- ID of the reviewed product.
- Star rating assigned by the reviewer.
- Date of the review in human readable format.
- ID of the author of the review.
- Title of the review.
- Content of the review.
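To make the record layout concrete, here is a minimal Python sketch (not the Perl code of the extraction script) that builds one such record. The function name `format_record` and its parameters are hypothetical, and the handling of quotes inside a field follows Python's `csv` module, which may differ from what the script does. Parsing the month name assumes an English locale.

```python
import csv
import io
from datetime import datetime


def format_record(counter, product_id, rating, human_date, author_id, title, content):
    """Return one CSV line with the eight fields listed above, all quoted."""
    # Derive the sortable YYYYMMDD date from the human-readable one.
    sortable = datetime.strptime(human_date, "%B %d, %Y").strftime("%Y%m%d")
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([
        counter,            # running count, usable as a unique ID
        sortable,           # date as YYYYMMDD
        product_id,         # ID of the reviewed product
        f"{rating:.1f}",    # star rating, e.g. "5.0"
        human_date,         # date in human-readable form
        author_id,          # ID of the review's author
        title,
        content,
    ])
    return buf.getvalue().rstrip("\n")
```

Keeping both date forms in the record means the dataset can be sorted by the second field as plain text while the fifth field stays readable.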
Example
For example, given the products with IDs B0040JHVCC and B00004ZDB1, the reviews are downloaded with the command:
./downloadAmazonReviews.pl B0040JHVCC B00004ZDB1
Reviews are automatically downloaded into the ./amazonreviews/B0040JHVCC and ./amazonreviews/B00004ZDB1 directories. The script automatically adjusts the delay between download requests in order to be polite to Amazon, and it retries failed downloads (503 errors) until every page has been downloaded.
Then the reviews are extracted from the HTML files by issuing the command:
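The idea of an adaptive delay with retries on 503 can be sketched in Python as follows. This is only one possible policy (double the delay after a 503, halve it after a success); the post does not say which policy the Perl script uses. The name `fetch_politely`, its parameters, and the `fetch(url)` callable returning a `(status, body)` pair are all hypothetical.

```python
import time


def fetch_politely(fetch, urls, delay=1.0, min_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Fetch every URL, backing off on HTTP 503 and retrying until it succeeds.

    `fetch(url)` is a hypothetical callable returning (status_code, body).
    """
    pages = {}
    for url in urls:
        while True:
            status, body = fetch(url)
            if status == 200:
                pages[url] = body
                # Success: cautiously shorten the delay, then pause before the next page.
                delay = max(min_delay, delay / 2)
                sleep(delay)
                break
            if status == 503:
                # Throttled: wait longer, then retry the same page.
                delay = min(max_delay, delay * 2)
                sleep(delay)
                continue
            # Any other status is treated as a real error in this sketch.
            raise RuntimeError(f"unexpected status {status} for {url}")
    return pages
```

Passing `fetch` and `sleep` in as arguments keeps the sketch independent of any particular HTTP library and makes the back-off logic easy to try without touching the network.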
./extractAmazonReviews.pl ./amazonreviews/B0040JHVCC/* ./amazonreviews/B00004ZDB1/*
The script outputs one review per line on standard output, in CSV format:
"0","20120118","B0040JHVCC","1.0","January 18, 2012","A1E5SQ7VA3I8OI","Not worth the price","I purchased the... [removed for brevity] ...and definitely no."
"1","20120116","B0040JHVCC","5.0","January 16, 2012","A34FBZLFAU88UI","Compact version of the 7D, and neck and neck with D7000","This camera is... [removed for brevity] ...focus hunting issues."
"2",...
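Because every field is quoted, the output can be read with any standard CSV parser. As a minimal sketch, assuming the output has been saved to a file or is otherwise available as an iterable of lines, the following Python function computes the mean star rating per product. The field names in `FIELDS` and the function name are my own labels for the eight columns, not something the script emits.

```python
import csv
from collections import defaultdict

# Hypothetical labels for the eight columns, in the order they are output.
FIELDS = ["counter", "date", "product_id", "rating",
          "human_date", "author_id", "title", "content"]


def mean_rating_by_product(lines):
    """Return {product_id: mean star rating} for an iterable of CSV lines."""
    totals = defaultdict(lambda: [0.0, 0])  # product_id -> [sum, count]
    for row in csv.DictReader(lines, fieldnames=FIELDS):
        entry = totals[row["product_id"]]
        entry[0] += float(row["rating"])
        entry[1] += 1
    return {pid: total / count for pid, (total, count) in totals.items()}
```

An open file object can be passed directly as `lines`; using a CSV parser rather than splitting on commas matters here because titles such as the second one above contain commas.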
Disclaimer
I provide the tool to download the reviews, not the right to download them. You have to respect Amazon’s rights to its own data. Do not release the data you download without Amazon’s consent.
Download
Here are the two scripts. If you fix any bugs or improve them, please share your code.
Andrea, thanks for the scripts. The first one worked flawlessly. However, when I run the second one it doesn’t seem to generate the CSV file as intended. Not sure what I’m doing wrong, as I’m following your instructions to the best of my abilities.
Again, thanks for the handy scripts.
Zac
Got it. You have to put the number in place of the “*” and do it file by file. I was hoping it would lump all of the output files into one CSV file.
Thanks again.
Zac