Sentiment Analysis in Amazon Reviews Using Probabilistic Machine Learning

Callen Rain
Swarthmore College
Department of Computer Science
crain1@swarthmore.edu
Abstract

Users of the online shopping site Amazon are encouraged to post reviews of the products that they purchase. Little attempt is made by Amazon to restrict or limit the content of these reviews. The number of reviews for different products varies, but the reviews provide accessible and plentiful data for relatively easy analysis for a range of applications. This paper seeks to apply and extend the current work in the field of natural language processing and sentiment analysis to data retrieved from Amazon. Naive Bayes and decision list classifiers are used to tag a given review as positive or negative. The number of stars a user gives a product is used as training data to perform supervised machine learning. A corpus containing 50,000 product reviews from 15 products serves as the dataset of study. Top-selling and most-reviewed books on the site are the primary focus of the experiments, but the features that aid in their accurate classification are compared to those most useful in the classification of other media products. The features, such as bag-of-words and bigrams, are compared to one another in their effectiveness in correctly tagging reviews. Errors in classification and general difficulties regarding the selection of features are analyzed and discussed.
1 Introduction
As the marketplace for consumer products moves to the internet, the shopping experience changes in a way that makes much of the information regarding the use of products available online and generated by users. This contrasts with the way that product information used to be disseminated: through word of mouth and advertising. Since its creation as an online bookstore in 1994, Amazon.com has grown rapidly and been a microcosm for user-supplied reviews. Soon, Amazon opened its reviews to consumers, and eventually allowed any user to post a review for any one of the millions of products on the site. With this increase in anonymous user-generated content, efforts must be made to understand the information in the correct context and to develop methods to determine the intent of the author. Understanding what online users think of its content can help a company market its product as well as manage its online reputation.

The purpose of this paper is to investigate a small part of this large problem: positive and negative attitudes towards products. Sentiment analysis attempts to determine which features of a text are indicative of its context (positive, negative, objective, subjective, etc.) and to build systems that take advantage of these features. The problem of classifying text as positive or negative is not the whole problem in and of itself, but it offers a simple enough premise on which to build further.
Much of the work involved in sentiment analysis of content containing personal opinions has been done relatively recently. Pang et al. (2002) used several machine learning systems to classify a large corpus of movie reviews. Although Naive Bayes did not perform the best out of their strategies, it did well compared to the baseline provided by human-generated classifying words. Yessenov and Misailovic (2009) did similar work, taking comments off of social networking sites in relation to movie reviews, a somewhat more anonymous context. This idea of informality and its effect on sentiment classification has been researched by Thelwall et al. (2010). Specifically, they tackled the problem of classifying text when slang is used and documents lack the uniformity of vocabulary and spelling that a movie review database would contain.

Amazon employs a 1-to-5 star scale for all products, regardless of their category, and it becomes challenging to determine the advantages and disadvantages of different parts of a product. Problems with such rating systems are discussed by Hu and Liu (2004). They attempted to categorize opinions about different parts of a product and present these independently to give readers more information than overall positive/negative sentiment.
2 Methodology
Amazon reviews are plentiful, but a corpus generated from the average Amazon product is generally not large enough to perform adequate supervised learning on. I chose to analyze a few highly reviewed products because the system would have a higher success rate above the most frequent sense baseline, making the relative effect of each of the extracted features more apparent. Additionally, the logistical question of selecting truly random products from such a diverse selection seemed like a problem for another project. I opted to download and primarily analyze the reviews for the most reviewed products on the site. Most of these are books (Harry Potter, actually), and the rest are music CDs and movies. The reviews for these (books and other media) were downloaded and treated as two separate datasets. In addition, I wanted to test products from a variety of different categories on the site. I downloaded the reviews for the Amazon Kindle, which has a comparably large number of reviews to the top books and media. I then downloaded several smaller review sets from products such as Levi's jeans, the Kindle Fire, the Barnes & Noble Nook, Taylor Swift's new 'Red' album, the Apple MacBook Pro, and Bill O'Reilly's 'Killing Lincoln'. Relative to other products on the site, these items have a large number of reviews, but they do not really compare to the items in the top 100.
2.1 Data
Downloading and parsing
Although Amazon does not have an API like Twitter's for downloading reviews, it does have links for every review of every product, so one can technically traverse the site through product IDs. I used two Perl scripts written by Andrea Esuli to obtain the reviews for the Kindle and a few other products. The first script downloads the entire HTML page for the product, and the second searches the file for information about each review, such as the product ID, rating, review date, and review text.
Extracting a review
The reviews for a given product are saved in a text file that is then formatted into a list of tuples consisting of the review text and the score given by the reviewer. Each of the review texts is tokenized, and all punctuation except periods, apostrophes, and hyphens is removed. The first entry in the tuple is this list of tokens. The Amazon website only allows ratings from 1 to 5, and using this rating system directly to classify texts seemed like it would give poor results because of the lack of distinction between reviews receiving similar scores (e.g., 4 vs. 5). Instead, reviews receiving a 1- or 2-star rating were given a score of '0' in the data, whereas reviews receiving 5 stars were given a score of '1'.
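As a rough illustration, the following Python sketch shows this preprocessing step. The one-review-per-line, tab-separated file layout is an assumption of mine (the paper does not describe the stored format), but the tokenization rule and the star-to-label mapping follow the description above.

```python
# Hypothetical loader for the parsed review files; the "<stars>\t<text>"
# layout is assumed, not taken from the paper.
import re

def load_reviews(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            stars, _, text = line.partition("\t")
            if not stars.strip().isdigit():
                continue
            stars = int(stars)
            # strip all punctuation except periods, apostrophes, and hyphens
            cleaned = re.sub(r"[^\w\s.'-]", "", text.lower())
            tokens = cleaned.split()
            if stars in (1, 2):
                examples.append((tokens, 0))   # negative
            elif stars == 5:
                examples.append((tokens, 1))   # positive
            # 3- and 4-star reviews are left out, as described above
    return examples
```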
Positivity and Amazon
It may seem odd that 4-star reviews were not also given the positive score, but it remains a fact that the most purchased products on Amazon receive many, many more positive reviews than negative ones. It made the most sense to obtain as many reviews that could be interpreted as negative (1 and 2 stars) as possible and to restrict the number of positive reviews until the two classes were equal and the most frequent sense baseline equaled 50%. This does mean that perhaps some of the defining features in the positive reviews were lost while all of those in the negative ones were preserved, but since only the most frequent features were retrieved anyway, the crucial features in the positive set were most likely preserved. Perhaps the best classifying system would train only on negative reviews, attempt to generate specific features just for them, and then tag everything else as positive.

These observations reveal some weaknesses in the Amazon review model, which seeks to provide an impartial, homogeneous analysis of products for shoppers. This "perfect" review does not really exist, and it certainly does not come from diverse sources. Interviews with some of the top 1,000 book reviewers on the site revealed that approximately 88% said that they write mostly positive reviews, 70% are male, and half hold a graduate degree (Pinch and Kesler, 2011). Demographics aside, it remains an interesting question whether reviews can provide useful information when 95% of them are positive. Comparing different products (books vs. electronics) of different popularities on the same simple 1-to-5 scale might not be the best way to inform shopping choices. There is a decrease in usefulness when one is comparing several products that have all achieved high acclaim. If the Nook and Kindle e-readers both have thousands of positive reviews, the most accurate way to describe the comparison is "they are both really good." There is no efficient way in the current review system to compare different aspects of them, consider the advantages and disadvantages of each, and make a more informed decision.
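Returning to the balancing step described at the start of this subsection, a minimal sketch (assuming the (tokens, label) pairs produced earlier; the helper name is mine) of downsampling the positive class to a 50% baseline might look like this:

```python
import random

def balance(examples, seed=0):
    """Keep every negative review and randomly downsample the positives
    until both classes are the same size, giving a 50% baseline."""
    negatives = [ex for ex in examples if ex[1] == 0]
    positives = [ex for ex in examples if ex[1] == 1]
    random.seed(seed)
    positives = random.sample(positives, len(negatives))
    balanced = negatives + positives
    random.shuffle(balanced)
    return balanced
```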
2.2 Feature Extraction
Bag of words
A bag-of-words feature vector consists of all of the words in the review treated as independent features. In these experiments, all of the words were added to a list and only the top 2,000 most frequently occurring words were kept. Each of the words in this list was then compared to the words in the review, and a dictionary was generated that mapped each of the features to either true or false, denoting whether the feature appeared in the review. This is known as a binary feature vector. The alternative approach would be to collect all of the words in all of the reviews and obtain counts for them, but doing this is computationally intensive and did not, in some small tests, offer more accurate results. It made more sense to select different, more informative features than to fill the vectors solely with bag-of-words features.
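A sketch of this binary bag-of-words extraction, written against NLTK's usual feature-dictionary convention (the helper names are mine, not the paper's):

```python
from collections import Counter

def top_words(examples, n=2000):
    """Return the n most frequent words across all (tokens, label) pairs."""
    counts = Counter(tok for tokens, _ in examples for tok in tokens)
    return [word for word, _ in counts.most_common(n)]

def bow_features(tokens, vocabulary):
    """Binary feature vector: True/False for each of the top words."""
    present = set(tokens)
    return {"contains(%s)" % word: (word in present) for word in vocabulary}
```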
Collocations
Since the bag-of-words feature model assumes the independence of each of the words, it ignores some of the relationships between words that affect their meanings in the context of the review. For example, the phrase "low price" has a different meaning than "low" and "price" appearing independently. These relationships between words can be captured in the feature vectors for the reviews by including common bigrams as well. Using the bigram function provided by NLTK, all of the bigrams in each of the reviews are collected and sorted by frequency. The top 500 of them are included in the feature vectors for each of the reviews.
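A corresponding sketch for the bigram features, using nltk.bigrams; the 500-bigram cutoff mirrors the description above.

```python
from collections import Counter
import nltk

def top_bigrams(examples, n=500):
    """Return the n most frequent bigrams across all reviews."""
    counts = Counter(bg for tokens, _ in examples for bg in nltk.bigrams(tokens))
    return [bg for bg, _ in counts.most_common(n)]

def bigram_features(tokens, bigram_list):
    """Mark which of the common bigrams appear in this review."""
    present = set(nltk.bigrams(tokens))
    return {"contains(%s %s)" % bg: (bg in present) for bg in bigram_list}
```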
Handling negation
A more specific case of the issue presented above is the case of negation. Words that occur after a negation word in a review take on the opposite of their original meanings and introduce noise into the data if they are saved as unigrams. For example, the phrase "not like" would indicate a negative review, while "not" and "like" independently would not necessarily give the same classification. This problem is remedied by adding each of the words after a negative word as, e.g., "contains(not like)". This still does not provide any relation to the "contains(like)" feature, because it is added separately. It instead serves to make the two features more useful in classifying, because they will tend to occur in differently tagged reviews.

I found that adding all the words until the next period did not yield better results than just adding the next three or four words. If a reviewer writes "I don't like the book and I hate the plot!", the "hate" in the sentence should not be stored as a negation. I found that storing the three words after the negative word gave the best accuracy.
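A sketch of the negation features; the list of negation words is an illustrative guess, since the paper does not enumerate which words it treats as negations.

```python
# Illustrative negation-word list (not taken from the paper).
NEGATION_WORDS = {"not", "no", "never", "don't", "didn't", "doesn't",
                  "isn't", "wasn't", "can't", "won't"}

def negation_features(tokens, window=3):
    """Record the three tokens after a negation word as 'contains(not w)'."""
    features = {}
    for i, tok in enumerate(tokens):
        if tok in NEGATION_WORDS:
            for follower in tokens[i + 1:i + 1 + window]:
                features["contains(not %s)" % follower] = True
    return features
```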
Spell checking
Unlike many reviewing sites whose users are professional or well-known reviewers, the bulk of reviews on Amazon are written by anonymous individuals. This lack of accountability in their writing results in much more frequent spelling and grammatical errors in the reviews. If a user spells a key word wrong (often "disappointment"), the classifier will miss the significance of such an important word because it cannot connect it to all of the other occurrences of "disappointment" appearing frequently in other negative reviews. The aspell module was used as a spell checker to attempt to resolve some of these inconsistencies. The spell checker can check whether a given word is in its dictionary, and it can suggest a list of words if it cannot find the word. This list of returned words can be very large, but it appeared that words that were only slightly misspelled would have a smaller list of suggested words. Spelling was optionally checked as part of the bag-of-words extraction. As the program checked to see whether each word in the feature list appeared in the review document, it also checked whether the word had a suggestion that appeared in the feature list. If it did, the spell-checked word was marked as present.
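A hedged sketch of the spelling lookup, assuming the aspell Python binding (whose Speller object exposes check() and suggest()); the exact integration with the bag-of-words pass in the paper may differ, and the helper name is mine.

```python
import aspell

speller = aspell.Speller("lang", "en")

def matching_form(token, vocabulary):
    """Return the vocabulary word this token corresponds to, if any,
    trying spelling suggestions when the token itself is unknown."""
    if token in vocabulary:
        return token
    if not speller.check(token):
        for suggestion in speller.suggest(token):
            if suggestion.lower() in vocabulary:
                return suggestion.lower()
    return None
```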
Part-of-speech tags
When a user reviews a product, their adjectives often give the best clues to their opinions. Positive and negative reviews of books will contain common nouns that appear whenever anyone talks about books, but there are rarely specific adjectives that are used with both classifications. The corpus was tested with only adjectives in the feature sets, and this assumption was supported.
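A sketch of this filter using NLTK's default part-of-speech tagger (the tagger model must be installed separately); keeping JJ* and RB* tags corresponds to the adjective/adverb runs reported in the results section.

```python
import nltk  # requires the averaged perceptron tagger data to be downloaded

def adjective_adverb_tokens(tokens):
    """Keep only adjectives (JJ*) and adverbs (RB*) from a review."""
    tagged = nltk.pos_tag(tokens)
    return [word for word, tag in tagged
            if tag.startswith("JJ") or tag.startswith("RB")]
```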
Sentence length
After spending time downloading and reading many reviews, I developed the hypothesis that negative reviews would be shorter than positive ones. After reading some more, it seemed that they were not always shorter in length, but they often contained short sentences. In an attempt to be blunt and critical, users posting one-star reviews like to keep their sentences short and dramatic. I tested this assumption by calculating the average number of words per sentence in a given document and looking at the distribution to see if there was a pattern. I ended up adding a short-sentence feature for documents with an average sentence length shorter than 10 words and a similar long-sentence feature for documents with an average sentence length longer than 20.
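A sketch of these features; since the tokenizer keeps periods attached to words, sentence boundaries are approximated here by tokens ending in a period (an assumption about the implementation).

```python
def sentence_length_features(tokens):
    """Flag reviews whose average sentence length is under 10 or over 20 words."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.endswith("."):          # periods were kept during tokenization
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    if not sentences:
        return {"short sentences": False, "long sentences": False}
    avg = sum(len(s) for s in sentences) / len(sentences)
    return {"short sentences": avg < 10, "long sentences": avg > 20}
```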
2.3 Classification
Naive Bayes
Naive Bayes is a simple but robust classifier, applied using Bayes' rule, that yields very useful results. It assumes the independence of each of the features in the vector and, for each feature, calculates the probability that it will appear given the class. The probability of a class given the feature set is simply the product of the probability that the class will occur and the probabilities of each of the features. This process is repeated for each of the possible classes and the text is classified according to the maximum probability. The formal definition is given as:
\[
\hat{s} = \operatorname*{argmax}_{s}\; P(s) \prod_{j=1}^{n} P(f_j \mid s)
\]
where
\[
P(s_i) = \frac{\mathrm{count}(s_i, w_j)}{\mathrm{count}(w_j)}
\qquad
P(f_j \mid s) = \frac{\mathrm{count}(f_j, s)}{\mathrm{count}(s)}
\]
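The paper does not say which Naive Bayes implementation was used; since NLTK is already in play for the features, a plausible sketch of training and scoring over the feature dictionaries built above is:

```python
import nltk

def evaluate_naive_bayes(featuresets, split=0.9):
    """featuresets: list of (feature_dict, label) pairs."""
    cut = int(len(featuresets) * split)
    train_set, test_set = featuresets[:cut], featuresets[cut:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return nltk.classify.accuracy(classifier, test_set)
```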
Decision list
The decision list is a rule-based tagger that has the advantage of being human-readable. One of the major problems with Naive Bayes is the difficulty of identifying which probabilities are causing certain classifications. One can look at the errors made, but there is no way to actually know whether some particular feature is at the root of the problem. The format of a decision list alleviates this problem by making the classification rule-based. The classifier need only determine the existence of the feature at each level and tag appropriately if the feature is in the document. This makes it very easy to identify which rules the classifier thinks are important, and aids in the removal of features that are causing inaccurate rules.
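The paper does not give the rule-scoring details, so the following is only a minimal sketch of a decision list under my own assumptions: each rule fires on the presence of one feature, rules are ordered by a smoothed log-odds score, and a random tag is used when none of the first max_rules rules match, as described in the results section.

```python
from collections import defaultdict
import math
import random

def build_decision_list(featuresets, max_rules=20):
    """Order (feature, label) rules by how strongly each feature favors a class."""
    counts = defaultdict(lambda: defaultdict(int))
    for features, label in featuresets:
        for feat, present in features.items():
            if present:
                counts[feat][label] += 1
    rules = []
    for feat, by_label in counts.items():
        label = max(by_label, key=by_label.get)
        other = sum(v for lab, v in by_label.items() if lab != label)
        score = math.log((by_label[label] + 1) / (other + 1))  # smoothed log-odds
        rules.append((score, feat, label))
    rules.sort(reverse=True)
    return [(feat, label) for _, feat, label in rules[:max_rules]]

def classify(rules, features, labels=(0, 1)):
    """Apply the first matching rule; otherwise pick a class at random."""
    for feat, label in rules:
        if features.get(feat):
            return label
    return random.choice(labels)
```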
3 Results

3.1 Parameters
Before comparing the two algorithms directly, it was necessary to find appropriate parameters for them. The systems depend on many subtleties of their implementations, but the performance of both the decision list and Naive Bayes also depends on the number of features that are considered. Additionally, the number of rules that are applied in the decision list before a tag is made will affect its performance. These parameters do also depend on the size of the test and training sets, but these tests are simply meant to provide an estimate so that the systems can be tested better in other ways.

It was found that the decision list worked best with 100 of the top occurring features considered, along with a 20-rule limit before the class is picked randomly. Naive Bayes had the highest accuracy when it was used with 800 features.
Decision list parameters
Features   Rules   Accuracy
50         2000    0.58914729
90         2000    0.74418605
100        2000    0.7751938
200        2000    0.72093023
500        2000    0.65116279
1000       2000    0.59689922
2000       2000    0.59689922
100        10      0.78294574
100        15      0.78294574
100        20      0.79844961
100        25      0.79069767
100        30      0.78294574
100        35      0.78294574
100        40      0.78294574

Naive Bayes parameters
Features   Accuracy
500        0.7751938
700        0.80620155
800        0.86821705
900        0.81395349
1000       0.78294574
1500       0.81395349
3.2 Naive Bayes vs. Decision List

Decision list vs. Naive Bayes
Data set    Decision list   Naive Bayes
All-books   0.79844961      0.84496124
Kindle      0.74666667      0.84
All-media   0.6828479       0.79935275
The Naive Bayes classifier performed better than the decision list classifier on all three of the data sets. Both algorithms performed the most poorly with the media dataset. I assume this is because the data for that corpus is not as specific as the other two. Beyond all of them being books, three of the books in the books data set are Harry Potter books, so some shared features do determine the classifications made. The Kindle corpus has the similar advantage of being centered around one single product.
3.3 Feature Analysis

Features and Naive Bayes
Feature               Books    Media    Kindle
BOW                   0.7829   0.8220   0.8733
BOW/neg.              0.7906   0.8446   0.86
BOW/num.              0.7829   0.8220   0.8666
BOW/sent. len.        0.7906   0.8220   0.8733
BOW/coll.             0.7829   0.8155   0.8666
BOW/spell             0.7829   0.8155   0.8666
BOW/neg./sent. len.   0.7829   0.8187   0.8666
Adj./adv.             0.6279   0.6990   0.7466
Bag of words ended up being the best feature extraction method, so it seemed appropriate to compare all the others against it to see if they could raise the accuracy it achieved. Entering words that follow negation words as negated features showed an improvement as well. Considering the length of sentences in the Kindle data set was the only additional feature that did not reduce the accuracy from the baseline set by the unigrams. Tagging the parts of speech for the unigram features and removing all but the adjectives and adverbs still performed well above the random baseline, which is still significant considering how many fewer features were in those vectors.

Most informative bag-of-words features
Decision list          Naive Bayes
Pos.      Neg.         Pos.      Neg.
easy      back         carry     did
love      which        perfect   sent
reader    after        awesome   connect
great     will         loves     return
really    they         easy      bad
read      buy          value     returned
books     screen       eyes      months
much      no           love      try
reading   touch        pocket    canada
price     out          lighter   support
3.4 Cross-Product Results

Train on Kindle
Test       Reviews   Dec. list   Naive Bayes
Kindle     6316      0.7467      0.84
Fire       3198      0.6734      0.8394
Nook       182       0.7415      1
MacBook    192       0.6379      0.8824
T. Swift   210       0.8342      0.6523
Levi       573       0.5524      0.7027
Lincoln    3895      0.6639      0.7388
Here, the classifiers were trained on the Kindle dataset and tested on each of the smaller datasets. It is interesting to note that moving further from the Kindle in terms of related categories makes the accuracy go down. This is because the feature vectors for the Kindle contain words like "screen", "read", and "lightweight", which would not apply to the Taylor Swift album but would still marginally apply to the MacBook and would definitely apply to the Nook. It should also be noted that the datasets with fewer reviews are easier to tag. The Naive Bayes classifier tagged the Nook reviews at 100% accuracy, but failed to tag the Taylor Swift album at better than 66% even though that dataset was also small. To be tagged at extreme levels of accuracy, a small dataset must be categorically related to the product that the system was trained on.
4 Analysis
Finding optimal parameter values for the two classifiers made it apparent how drastically the parameters can change the accuracy of the algorithms. The evidence that the number of features is so important also gives a sense of how important the composition of the feature vectors is. One thing that I think was missing from my implementation was a greater consideration of how many features from each category I was using. The bag-of-words features outperformed the rest by a large margin. Part of this could have been the fact that I had close to 800 bag-of-words features in the vectors for Naive Bayes, but only as many of the other features as I had documents. This would naturally favor the unigram model. I could have implemented some of the auxiliary features I was testing with more occurrences per document. For example, I only assigned a sentence-length feature occasionally, and when I did, it was the only one of its kind in the vector. Perhaps it just did not get assigned to vectors often enough in the data to yield measurable results comparable to bag of words.

Regardless of this, bag of words is clearly a robust method of feature extraction. Better implementation of the other features would only have added to its accuracy. The most frequently used bag-of-words features for each of the classifiers definitely resemble logical positively and negatively connotated words. The negative words are definitely less accurate, with words like "after" and "try" being the worst of them. They can be seen working in some sentences, like "I try every day to read, but I hate it!", but the words do not have meanings as strong as some of the others. The fact that the unigram model did do well means that much of how a user feels can be approximated by considering each of the words that they write independently of one another. This may not hold for classification beyond simple positive or negative sentiment, where more sophisticated methods of feature extraction would have to be pursued, but it does mean that very accurate systems could be built on a large scale to detect simple sentiment like this.

The results of the cross-product training and testing tell us that narrowing training to a specific category of products will greatly increase performance. It became clear that training on electronics and testing on books and clothing would be very inaccurate on a larger scale. The basic words that users write to express their simple satisfaction or dissatisfaction with a product would yield a certain amount of success, but in order to gain any deeper information, one would have to consider more about the physical properties of the objects and what a typical user might want to comment on.
5 Conclusion
Generally, the results of this experiment were very successful. The classifiers managed to accurately tag a large amount of user-generated data, well beyond the random baseline of 50%. Most of the new features that I tested were relatively unsuccessful, but that is most likely due to their implementation relative to the bag of words. I think the theory behind their use is sound, and they could be implemented together in a more successful way in later experiments.

As mentioned above, this work could be extended to make the system of numbered star ratings more useful to users. The success of the bag-of-words feature extraction could be used to build systems that analyze more diverse sets of data, but it may have more use in smaller datasets. The systems performed reasonably well on small data sets even when they were trained and tested on products that were completely different. This could be applied not to the testing of different products, but instead to the testing of different features of a product. Something missing from a quick glance at a product page is knowledge of what the best features of that product are. The classifying systems here could be used to determine whether the screen of the Kindle is better than that of the Nook, or which has a nicer keyboard. These questions are more useful to readers than simple stars, and the necessary features are in the text. Users do reflect on specific components of products when they review, but that information is lost in a way when so many reviews are gathered together.
References

M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM, 2004.

S. Mukherjee and P. Bhattacharyya. Feature specific sentiment analysis for product reviews. Computational Linguistics and Intelligent Text Processing, pages 475–487, 2012.

B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, pages 79–86. Association for Computational Linguistics, 2002.

T. Pinch and F. Kesler. How Aunt Ammy gets her free lunch: A study of the top-thousand customer reviewers at Amazon.com, 2011.

M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558, 2010.

K. Yessenov and S. Misailovic. Sentiment analysis of movie review comments. Methodology, pages 1–17, 2009.