Amazon no longer offers a simple public API to access its reviews. Anyone who wants to quickly build a dataset of Amazon reviews has to download them by running a web crawler on the HTML pages.
I wrote a Perl script that, given a top-level domain (e.g., “com”, “it”) and a list of Amazon product IDs, automatically downloads, from the Amazon server dedicated to that domain, all and only the HTML pages that contain the reviews of those products.
Then I wrote another Perl script that, given a list of the downloaded HTML files, extracts all the reviews contained in them, outputting for each review a record with the following information:
- A counter of the extracted reviews so far (can be used as a unique ID for the dataset).
- Date of the review in YYYYMMDD format (note: on non-English domains this conversion won’t work as is; edit the script to set the month names in the desired language).
- ID of the reviewed product.
- Star rating assigned by the reviewer.
- Date of the review in human readable format (will be in the language used by the specified domain).
- ID of the author of the review.
- Title of the review.
- Content of the review.
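The YYYYMMDD conversion mentioned in the list can be illustrated with a minimal Python sketch. This is not the code of the Perl script: the function name to_yyyymmdd and the MONTHS table are assumptions made for the example, and the pattern only covers the "Month day, year" layout shown on the “.com” domain. Other domains may order the parts differently, so both the table and the pattern would need adapting.

```python
import re

# Month names as shown on the ".com" domain. For another domain, replace the
# keys with the month names of that language (hypothetical table, not taken
# from the Perl script).
MONTHS = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12,
}

def to_yyyymmdd(human_date, months=MONTHS):
    """Convert a date such as 'January 18, 2012' to '20120118'.

    Returns an empty string when the layout or the month name is unknown.
    """
    match = re.fullmatch(r"\s*(\w+)\s+(\d{1,2}),\s*(\d{4})\s*", human_date)
    if match is None or match.group(1) not in months:
        return ""
    month = months[match.group(1)]
    day = int(match.group(2))
    year = int(match.group(3))
    return f"{year:04d}{month:02d}{day:02d}"
```

Returning an empty string for an unrecognized date keeps the record usable, since the human readable date is still available in its own field.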
Example
Given, e.g., the products with IDs B0040JHVCC and B00004ZDB1, the reviews from the “.com” domain are downloaded with the command:
./downloadAmazonReviews.pl com B0040JHVCC B00004ZDB1
Reviews are automatically downloaded into the ./amazonreviews/com/B0040JHVCC and ./amazonreviews/com/B00004ZDB1 directories. The script automatically adapts the delay between download requests in order to be polite to Amazon, and also retries failed downloads (503 errors) until every page is downloaded.
Then the reviews are extracted from the HTML files by issuing the command:
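The retry behaviour can be sketched in Python as follows. This is an illustration under assumptions, not a transcription of the Perl script: the names http_fetch and fetch_with_retry are hypothetical, and doubling the pause after each 503 is only one plausible way to adapt the delay (the script's actual policy is not described here).

```python
import time
import urllib.error
import urllib.request

def http_fetch(url):
    """Download url and return (status_code, body_bytes)."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx answers; report the code instead.
        return error.code, b""

def fetch_with_retry(url, fetch=http_fetch, delay=1.0, max_delay=60.0,
                     sleep=time.sleep):
    """Call fetch(url) again and again until the answer is not a 503.

    Assumed policy: wait `delay` seconds after a 503, then double the wait
    (capped at max_delay) before the next attempt. The loop never gives up,
    mirroring "until every page is downloaded".
    """
    while True:
        status, body = fetch(url)
        if status != 503:
            return status, body
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

Passing fetch and sleep as parameters keeps the retry logic separate from the network code, so it can be tried out without contacting Amazon.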
./extractAmazonReviews.pl ./amazonreviews/com/B0040JHVCC/* ./amazonreviews/com/B00004ZDB1/*
The script outputs one review per line on the standard output, in CSV format:
"0","20120118","B0040JHVCC","1.0","January 18, 2012","A1E5SQ7VA3I8OI","Not worth the price","I purchased the... [removed for brevity] ...and definitely no."
"1","20120116","B0040JHVCC","5.0","January 16, 2012","A34FBZLFAU88UI","Compact version of the 7D, and neck and neck with D7000","This camera is... [removed for brevity] ...focus hunting issues."
"2",...
Disclaimer
I provide you with the tool to download the reviews, not the right to download them. You have to respect Amazon’s rights over its own data. Do not release the data you download without Amazon’s consent.
Download
Here are the two scripts. If you fix any bug or improve them please share your code.
History
November 2012: added the possibility to download reviews from many Amazon domains, not just the com one (when last tested, it worked on European languages and Chinese, but not on Japanese).
Andrea, thanks for the scripts. The first one worked flawlessly. However, when I run the second it doesn’t seem to generate the CSV file as intended. Not sure what I’m doing wrong, as I’m following your instructions to the best of my abilities.
Again, thanks for the handy scripts.
Zac
Got it. You have to put the number in place of the “*” and do it file by file. I was hoping it would lump all of the output files into one CSV file.
Thanks again.
Zac
Are you on Windows? I have found that the Windows shell behaves differently from the bash shell I use: it does not expand ‘dir/*’ into the list of files in ‘dir’, so you have to list all the files by hand in the command.
On Windows I use win-bash (http://sourceforge.net/projects/win-bash/), which gives me the full power of bash and shell commands and avoids the issues related to the poor Windows shell.
Hi Andrea, thanks for the scripts, they are really helpful. I have no knowledge of Perl, but I’m going to use (and cite) this script for my working paper on Amazon reviews. Could you please advise how to extract the helpful votes (e.g., “5 of 5 people found the following review helpful”)? Thanks a lot in advance.
When I run the extract script, nothing happens. Do you know what I might be doing wrong?
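One way to pull out the helpful votes is to search the review HTML for the sentence quoted in the question. The Python sketch below is an assumption-laden illustration, not part of the Perl scripts: the function name helpful_votes is hypothetical, and the pattern relies on the exact English wording above (with optional thousands separators), so it would need adapting if Amazon words the sentence differently or for other domains.

```python
import re

# Pattern built from the sentence quoted in the question; the comma inside
# the numbers (e.g. "1,234") is an assumption about how large counts appear.
HELPFUL_RE = re.compile(
    r"(\d[\d,]*) of (\d[\d,]*) people found the following review helpful"
)

def helpful_votes(text):
    """Return (helpful, total) as integers, or None if the sentence is absent."""
    match = HELPFUL_RE.search(text)
    if match is None:
        return None
    helpful, total = (int(group.replace(",", "")) for group in match.groups())
    return helpful, total
```

The two numbers could then be appended as two extra fields to each output record.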
.\extractAmazonReviews.pl .\amazonreviews\BBBBBBBBB\1
By the way, I am running this from the Windows 7 command prompt.
The BBBBBBBBB product you mention in the comment does not exist on Amazon. Have you tried with an existing code, e.g., B00009R6WT?
That was just a dummy product I entered. I am actually looking to download reviews for B005FIWTMY. I can run the first script successfully, but the second one does not run at all.
On my PC the two scripts work. The first command is:
C:\amazonDownload>perl downloadAmazonReviews.pl B005FIWTMY
And the output is:
http://www.amazon.com/product-reviews/B005FIWTMY/?ie=UTF8&showViewpoints=0&pageNumber=1&sortBy=bySubmissionDateDescending GOTIT
ok B005FIWTMY 1 1
http://www.amazon.com/product-reviews/B005FIWTMY/?ie=UTF8&showViewpoints=0&pageNumber=2&sortBy=bySubmissionDateDescending GOTIT
ok B005FIWTMY 2 3
http://www.amazon.com/product-reviews/B005FIWTMY/?ie=UTF8&showViewpoints=0&pageNumber=3&sortBy=bySubmissionDateDescending GOTIT
ok B005FIWTMY 3 3
The second command is:
C:\amazonDownload>perl extractAmazonReviews.pl amazonreviews\B005FIWTMY\1 amazonreviews\B005FIWTMY\2 amazonreviews\B005FIWTMY\3
And the first line of the output is:
"0","20120702","B005FIWTMY","5.0","July 2, 2012","AULR60G25AF0B","QB Premier 2012","Quickbook Premier 2012 is an upgrade for our 2009 version. I love this version as it has everything we need for our business. Highly recommend it!!"
The output is 30 lines in total.
Do you get an error message?
Really awesome code Andrea
I edited it slightly so it fetched from Amazon UK, didn’t keep the user data and would dump the lot to a csv file.
http://pastebin.com/MEKL1zUc
Hello Andrea,
Thank you very much for your scripts.
I have executed your scripts, which were all successful. But now I am wondering how I can get “a simple CSV to standard output” from extractAmazonReviews.pl?
You can simply use Josh’s script (from the above pastebin link) and modify it to print to the console instead of printing to a file (lines 56-58).
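For readers who post-process the records in another language, here is a minimal Python sketch of printing fully quoted CSV lines to the console rather than to a file. It does not reproduce Josh’s script or the Perl code: the function names format_reviews and print_reviews are hypothetical, and doubling embedded quotes is the csv module's convention, which may differ from how the Perl script escapes them.

```python
import csv
import io
import sys

def format_reviews(records):
    """Return the records as CSV text, one fully quoted line per review."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(records)
    return buffer.getvalue()

def print_reviews(records, out=None):
    """Write the CSV text to `out`, which defaults to the standard output."""
    (out or sys.stdout).write(format_reviews(records))
```

Since the records go to the standard output, they can still be collected in a single file by redirecting the output on the command line.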
Hello Andrea,
Thank you so much for sharing your knowledge. I am working on Windows XP and when I run
“perl downloadAmazonReviews.pl B005FIWTMY”
The output is :
http://www.amazon.com/product-reviews/B005FIWTMY/?ie=UTF8&showViewpoints=0&pageNumber=1&sortBy=bySubmissionDateDescending ERROR500
Error 500 means “Internal Server Error” (see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes ). Maybe it was a temporary problem on some Amazon server, retry a few times. Right now I’m able to download the reviews using the script.
Can you access the webpage on your usual browser using the URL printed by the script?
Thank you so much Andrea. The problem was the firewall on my computer. As soon as I disabled it, it worked perfectly.
Another question, about the extraction command:
C:\amazonDownload>perl extractAmazonReviews.pl amazonreviews\B005FIWTMY\1 amazonreviews\B005FIWTMY\2 amazonreviews\B005FIWTMY\3
If there are many more files, say 45, do I have to write all of them on the command line, such as amazonreviews\B005FIWTMY\4 amazonreviews\B005FIWTMY\5 amazonreviews\B005FIWTMY\6 and so on? That seems cumbersome.
Thanks,
This is a limitation of the Windows command shell. I use the bash shell compiled for Windows: http://win-bash.sourceforge.net/
Andrea, I have another question. Have you noticed the new summaries that Amazon shows to its customers? They look like the best three sentences out of all the reviews. Is it possible to get those with your code? Thanks.
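Where installing another shell is not an option, the wildcard can be expanded by a small wrapper instead of the shell. The Python sketch below is only an illustration: the file name extract_all.py and the function page_files are hypothetical, and it assumes the downloaded pages are stored as files named 1, 2, 3, ... inside each product directory, as in the commands shown above.

```python
import glob
import os
import subprocess
import sys

def page_files(product_dir):
    """Return the page files found in product_dir, sorted by page number."""
    paths = glob.glob(os.path.join(product_dir, "*"))
    # Keep only files whose name is a page number (assumed naming scheme).
    numbered = [p for p in paths if os.path.basename(p).isdigit()]
    return sorted(numbered, key=lambda p: int(os.path.basename(p)))

if __name__ == "__main__":
    # Usage: python extract_all.py amazonreviews\B005FIWTMY [more directories]
    files = [f for d in sys.argv[1:] for f in page_files(d)]
    if files:
        # The Perl script prints to standard output, which the wrapper inherits.
        subprocess.run(["perl", "extractAmazonReviews.pl", *files], check=True)
```

Redirecting the wrapper's output (for example with > reviews.csv) collects the reviews of all the listed directories in one file.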
Thank you for pointing out this new feature, it is interesting.
From a quick look, the sentences can be retrieved by downloading the http://www.amazon.com/dp/productID/ page, and then extracting the text from the <span id="advice-quote-[012]"> spans.
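The extraction step could look like the following Python sketch, which collects the text of every span whose id matches advice-quote-0, advice-quote-1 or advice-quote-2. It is based only on the quick look described above: the class name AdviceQuoteParser and the function advice_quotes are hypothetical, and the page markup may have changed since.

```python
import re
from html.parser import HTMLParser

class AdviceQuoteParser(HTMLParser):
    """Collect the text inside <span id="advice-quote-N"> elements (N in 0..2)."""

    def __init__(self):
        super().__init__()
        self.quotes = []
        self._depth = 0  # nesting level of <span> tags inside a matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # nested span inside a quote
        elif re.fullmatch(r"advice-quote-[012]", dict(attrs).get("id") or ""):
            self._depth = 1
            self.quotes.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.quotes[-1] += data

def advice_quotes(html):
    """Return the list of quote texts found in an HTML page given as a string."""
    parser = AdviceQuoteParser()
    parser.feed(html)
    parser.close()
    return [quote.strip() for quote in parser.quotes]
```

Tracking the span nesting depth keeps the text of a quote together even when it contains inner tags.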
Andrea, this has been working great for me after I had a few problems a couple of months ago. Thank you so much! This has been very beneficial to me, has tremendously increased my productivity, and has saved me a lot of time.
I am not a developer, so I run your script as is. I was wondering if it can be enhanced to also pull the “number of people that found this useful” data?
If I want to use the first script on Windows, what changes should I make to the code?
The code should work fine on Windows, provided that you have Perl installed.
Hi, Andrea:
It’s really a great tool.
But I find that if a product doesn’t have any reviews, an error occurs.
e.g. "B0085AE4SQ", "B009WQP6U2", …
hello andrea..
I use your Perl code and run it on Windows 7, but when I run it from CMD nothing happens: the amazonreviews folder and the product code folder are created, but there is nothing inside. Did I do something wrong? I have Strawberry Perl installed.
What can I say… Probably you have a minor error on the command line, or in the Perl configuration, or the Perl version you use differs from mine (though, since this is a simple script, it should run smoothly across different Perl versions).
It is a very simple script; I advise you to play with it (e.g., just adding print commands to trace the execution) to discover the issue. The time spent making it work will be a useful experience.
My bad, I forgot to put ‘com’ in the command. As simple as that. Now your script runs perfectly on my machine. I’m now trying to find a way to extract all the crawled files, because ‘*’ didn’t work on Windows.
thank you very much andrea
Thanks Andrea, LOVE these scripts. They work 99% perfectly for me, but I have noticed that if a reviewer posts a video then the script will not work for that review and for one or two after it, before working correctly again. For the ones it doesn’t handle correctly, the script returns some tags, some blank lines, etc. I am attempting to fix it but wondered if you had already noticed that and figured it out. Again, thanks so much for sharing your really valuable work here.