Amazon reviews downloader and parser

Amazon no longer offers a simple public API to access its reviews. Anyone who wants to quickly build a dataset of Amazon reviews has to download them by running a web crawler on the HTML pages.

I wrote a Perl script that, given a top-level domain (e.g., “com”, “it”) and a list of Amazon product IDs, automatically downloads from the Amazon server dedicated to that domain all and only the HTML pages that contain the reviews of those products.

Then I wrote another Perl script that, given a list of the downloaded HTML files, extracts all the reviews contained in them, outputting for each review a record with the following information:

  • A counter of the extracted reviews so far (can be used as a unique ID for the dataset).
  • Date of the review in YYYYMMDD format (note: on non-English-speaking domains this field won’t work unless you edit the script to set the month names in the desired language).
  • ID of the reviewed product.
  • Star rating assigned by the reviewer.
  • Date of the review in human readable format (will be in the language used by the specified domain).
  • ID of the author of the review.
  • Title of the review.
  • Content of the review.
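
To illustrate the month-name issue mentioned for the YYYYMMDD field, here is a minimal sketch in Python (not the Perl code of the extraction script). The MONTHS table and the to_yyyymmdd function are hypothetical names, and the sketch only handles the English “Month day, year” layout seen on the “.com” domain. Other domains may also order the parts differently, so the pattern would need adapting along with the month names.

```python
import re

# Hypothetical lookup table: month name as printed on the page -> month number.
# For another domain, replace the keys with that language's month names.
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

def to_yyyymmdd(human_date, months=MONTHS):
    """Convert e.g. 'January 18, 2012' to '20120118'.

    Returns None when the text does not match the assumed layout
    or the month name is not in the table.
    """
    m = re.match(r"\s*(\w+)\s+(\d{1,2}),\s*(\d{4})\s*$", human_date)
    if not m:
        return None
    month = months.get(m.group(1).lower())
    if month is None:
        return None
    return f"{int(m.group(3)):04d}{month:02d}{int(m.group(2)):02d}"
```

Returning None instead of guessing makes it easy to spot a domain whose dates are not yet supported.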

Example

Given the products with IDs, e.g., B0040JHVCC and B00004ZDB1, the reviews from the “.com” domain are downloaded with the command:

./downloadAmazonReviews.pl com B0040JHVCC B00004ZDB1

Reviews are automatically downloaded into the ./amazonreviews/com/B0040JHVCC and ./amazonreviews/com/B00004ZDB1 directories. The script automatically adapts the pause between download requests in order to be polite to Amazon, and also retries failed downloads (503 errors) until every page is downloaded.
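
The general idea of a polite download loop with retries can be sketched as follows, in Python rather than Perl. This is an illustration under assumptions, not the logic of downloadAmazonReviews.pl: the function name fetch_with_retry, the doubling of the pause after each 503, and the cap on the number of attempts are all choices made for this sketch (the script itself is described as retrying until every page is downloaded).

```python
import time
import urllib.error
import urllib.request

def _default_fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()

def fetch_with_retry(url, fetch=_default_fetch, sleep=time.sleep,
                     delay=1.0, max_delay=60.0, max_tries=10):
    """Fetch url, pausing longer after each 503 before trying again.

    Returns (body, delay) so the caller can reuse the adapted delay
    as the pause before requesting the next page.
    """
    for _ in range(max_tries):
        try:
            body = fetch(url)
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise  # only "service unavailable" is treated as retryable
            delay = min(delay * 2, max_delay)  # back off (assumed policy)
            sleep(delay)
            continue
        return body, delay
    raise RuntimeError(f"giving up on {url} after {max_tries} attempts")
```

The fetch and sleep parameters exist only so the loop can be exercised without touching the network.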

Then the reviews are extracted from the HTML files by issuing the command:

./extractAmazonReviews.pl ./amazonreviews/com/B0040JHVCC/* ./amazonreviews/com/B00004ZDB1/*
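
The command above relies on the shell to expand the * wildcards, which bash does but the Windows command prompt does not. A possible workaround, sketched here in Python under the assumption that perl is on the PATH and extractAmazonReviews.pl is in the current directory, is a small wrapper that expands the patterns itself. The names expand_patterns and build_command are hypothetical.

```python
import glob
import os
import subprocess
import sys

def expand_patterns(patterns):
    """Expand wildcard patterns into a list of existing files."""
    files = []
    for pattern in patterns:
        # sorted() gives a stable (lexicographic) order within each pattern
        files.extend(f for f in sorted(glob.glob(pattern)) if os.path.isfile(f))
    return files

def build_command(patterns, script="extractAmazonReviews.pl"):
    """Build the argument list for running the extraction script."""
    return ["perl", script] + expand_patterns(patterns)

if __name__ == "__main__" and len(sys.argv) > 1:
    # The extractor's output goes to this process's standard output,
    # so it can still be redirected to a file.
    subprocess.run(build_command(sys.argv[1:]), check=True)
```

Saved under a name such as expand_and_extract.py (also hypothetical), it could be called as: python expand_and_extract.py "amazonreviews\com\B0040JHVCC\*" > reviews.csv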

The script outputs one review per line on standard output, in CSV format:

"0","20120118","B0040JHVCC","1.0","January 18, 2012","A1E5SQ7VA3I8OI","Not worth the price","I purchased the... [removed for brevity]  ...and definitely no."
"1","20120116","B0040JHVCC","5.0","January 16, 2012","A34FBZLFAU88UI","Compact version of the 7D, and neck and neck with D7000","This camera is... [removed for brevity] ...focus hunting issues."
"2",...

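Once the output has been redirected to a file, the records can be loaded with any CSV reader. Below is a minimal Python sketch. The field names in Review are labels chosen here for the eight columns listed above, and the sketch assumes that quotes inside a review are escaped in a way the standard csv module understands, which the sample output does not show.

```python
import csv
from collections import namedtuple

# Hypothetical field names for the eight columns of each record.
Review = namedtuple(
    "Review",
    "counter date product_id stars human_date author_id title content",
)

def read_reviews(lines):
    """Yield one Review per well-formed CSV line; skip anything else."""
    for row in csv.reader(lines):
        if len(row) == len(Review._fields):
            yield Review(*row)

# Usage, assuming the output was saved as reviews.csv:
#   with open("reviews.csv", newline="", encoding="utf-8") as f:
#       for review in read_reviews(f):
#           print(review.product_id, review.stars, review.title)
```

All fields are kept as strings; convert stars with float() and counter with int() as needed.
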
Disclaimer

I provide you with the tool to download the reviews, not the right to download them. You have to respect Amazon’s rights to its own data. Do not release the data you download without Amazon’s consent.

Download

Here are the two scripts. If you fix any bugs or improve them, please share your code.

History

November 2012: added the ability to download reviews from many Amazon domains, not just the com one (when last tested, it worked for European languages and Chinese, but not for Japanese).

28 thoughts on “Amazon reviews downloader and parser”

  1. Andrea, thanks for the scripts. The first one worked flawlessly. However, when I run the second it doesn’t seem to generate the CSV file as intended. Not sure what I’m doing wrong, as I’m following your instructions to the best of my abilities.

    Again, thanks for the handy scripts.

    Zac

  2. Got it. You have to put the number in place of the “*” and do it file by file. I was hoping it would lump all of the output files into one CSV file.

    Thanks. again.

    Zac

    • Are you on Windows? I have found that the Windows shell behaves differently from the bash shell I use: it does not expand ‘dir/*’ into the list of files in ‘dir’, so you have to list all the files by hand in the command.

      On windows I use win-bash (http://sourceforge.net/projects/win-bash/), which gives me the full power of bash and shell commands to avoid the issues related to the poor windows shell.

  3. hi Andrea, thanks for the scripts, it’s really helpful. I have no knowledge on perl but i’m gonna use (and cite) this script for my working paper related to amazon review. Could you please advise how to extract helpful vote (e.g: 5 of 5 people found the following review helpful) ? thanks a lot in advance

  4. when i run the extract script, nothing happens. do you know what i might be doing wrong?

    .\extractAmazonReviews.pl .\amazonreviews\BBBBBBBBB\1

    • Hello Andrea,
      Thank you very much for your scripts.
      I have executed your scripts, which were all successful. But now I am wondering how I can get “a simple CSV to standard output” from extractAmazonReviews.pl?

      • You can simply use Josh’s script (from the above pastebin link) and modify it to print to the console instead of printing to a file (line 56-58).

      • Thank you so much Andrea. The problem was the firewall on my computer. as soon as I disable it it worked perfectly.

      • Another question about the extraction command: C:\amazonDownload>perl extractAmazonReviews.pl amazonreviews\B005FIWTMY\1 amazonreviews\B005FIWTMY\2 amazonreviews\B005FIWTMY\3
        If there are many more files, say 45, do I have to write all of them on the command line, such as amazonreviews\B005FIWTMY\4 amazonreviews\B005FIWTMY\5 amazonreviews\B005FIWTMY\6
        ………? That seems cumbersome.

        Thanks,

  5. Andrea, I have another question. Have you noticed the new summaries that Amazon shows to its customers? They look like the best 3 sentences out of all the reviews. Is it possible to get those with your code? Thanks

    • Thank you for pointing out this new feature, it is interesting.
      From a quick look, the sentences can be retrieved by downloading the http://www.amazon.com/dp/productID/ page, and then extracting the text from the <span id="advice-quote-[012]"> elements.

  6. Andrea – this has been working great for me after i had a few problems a couple of months ago. Thank you so much! This has been very beneficial to me and has tremendously increased my productivity, and saved me lot of time.

    I am not a developer, so I run your script as-is. I was wondering if it can be enhanced to also pull the “number of people that found this useful” data?

  7. Hi, Andrea:

    It’s really a great tool.
    But I find that if a product doesn’t have any reviews, some error will happen.

  8. hello andrea..

    I use your perl code and run it in windows 7,but when I run on CMD,nothing happens, yet the amazonreviews folder and the product code is created,but nothing inside..did I do something wrong? I have strawberry perl installed

    • What can I say… Probably you have a minor error on the command line, or in the Perl configuration, or the Perl version you use differs from mine (though, this being a simple script, it should run smoothly across different Perl versions).
      It is a very simple script, I advise you to play with it (e.g., just adding print commands to trace the execution) trying to discover the issue. The time spent on it to make it work will be a useful experience.

      • my bad,i forgot to put ‘com’ in the command prompt,as simple as that..now your script runs perfectly on my machine,well i’m trying to find a way to extract all the crawled file because ‘*’ didn’t work on windows

        thank you very much andrea :)

  9. Thanks Andrea – LOVE these scripts. They work 99% perfect for me, but I have noticed that if a reviewer posts a video then the script will not work for that review and one or two after, before working correctly again. For the ones it doesn’t work right on, the script returns some tags and some other blanks lines etc. I am attempting to fix but wondered if you had noticed that and figured it out already. Again thanks so much for sharing your really valuable work here.
