I. Introduction
As a movie lover, I enjoy watching movies in both China and the US. The two largest movie markets in the world have a lot in common, but they still differ in many ways. In this project, I investigated movie ratings on IMDb and Douban, two popular movie review websites from the US and China. I'm interested in how the favorite movies of American and Chinese audiences differ. Do Chinese viewers love martial arts movies more than Americans do? Do Americans love superhero movies more? Let's take a look!
The data covers the top 250 rated movies on each website.
Without further ado, let’s get started.
II. Scraping the Websites
In this section, I’ll start from scratch. Skip this part if you are familiar with web scraping.
The basic process is as follows:
- Build a robust web scraper
- Scrape douban.com
  - Get the urls of the movies on each content page
  - Enter each movie page. Fortunately, douban provides the corresponding IMDb link under the movie info, so only the rating needs to be scraped on douban. The rest of the info is scraped from IMDb to keep it consistent.
  - Enter the corresponding IMDb page and scrape the remaining features of interest: directors, stars, countries of origin, genres, and runtime.
- Scrape IMDb.com
  - Get the urls of the movies on the content page
  - Enter each movie page and get all the info. A separate ratings page must also be visited to get the rating distribution.
1. Build Robust Web Scraper
First, build a robust scraper that will:
- Randomly choose a request header
- Retry a specified number of times when a request fails
Randomly choosing a header helps bypass anti-scraping measures by disguising the scraper as a regular user. It works best combined with rotating IP addresses, but that feature is still under development. Since this project doesn't need to scrape much data, this isn't a concern here.
```python
import requests
import random
import time


class download():
    def __init__(self):
        self.user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]

    def get(self, url, timeout=None, num_retries=6):
        # randomly choose a UA from the list to disguise as a regular browser
        UA = random.choice(self.user_agent_list)
        headers = {'User-Agent': UA}
        try:
            return requests.get(url, headers=headers, timeout=timeout)
        except requests.exceptions.RequestException:
            if num_retries > 0:
                time.sleep(10)
                print('Failed to open website. Retries left:', num_retries)
                return self.get(url, timeout, num_retries - 1)
            else:
                print('Try using a proxy')


myrequest = download()
```
2. Scrape douban.com
The following part contains tons of HTML parsing. Thanks to BeautifulSoup and its powerful CSS selectors, it's all made easy. This is a reference for CSS selectors if you are new to them. The features I'm interested in for each movie are:
- Title
- Ratings
- Ratings distribution
- Director
- Stars
- Release year
- Runtime
- Genres
- Countries
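Before diving into the full scraper, here is a tiny, self-contained illustration of the two BeautifulSoup patterns the parsing code below relies on: `select()` with a CSS selector, and `find_next()` to jump to the nearest following tag. The HTML snippet is made up for demonstration; it only mimics the label/link layout of a douban info block.

```python
from bs4 import BeautifulSoup

# a made-up snippet mimicking a "label followed by links" info block
html = """
<div id="info">
  <span class="pl">Director</span><a href="/d/1">Frank Darabont</a>
  <span class="pl">Genre</span><a href="/g/1">Drama</a><a href="/g/2">Crime</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# '#info .pl' selects every element with class "pl" inside the element with id "info"
labels = [tag.string for tag in soup.select("#info .pl")]
print(labels)    # ['Director', 'Genre']

# find_next('a') jumps to the first <a> tag that follows a given element
director = soup.select("#info .pl")[0].find_next("a").string
print(director)  # Frank Darabont
```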
Every row of data is stored in a dictionary. All dictionaries are appended to a list, which is converted to a DataFrame at the end. The data is saved as a .csv file for further analysis.
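Stripped of all the scraping logic, the storage pattern described above can be sketched as follows (the two rows here are made-up placeholders, not real scraped data):

```python
import pandas as pd

info_list = []
for title, rating in [("Movie A", 9.2), ("Movie B", 8.7)]:
    movie_info = {"Title": title, "Rating": rating}  # one row of data per movie
    info_list.append(movie_info)

df = pd.DataFrame(info_list)            # list of dicts -> DataFrame
df.to_csv("example.csv", index=False)   # persist for later analysis
```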
```python
import re
import time

from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

from download import myrequest


def GetHTML(url):
    res = myrequest.get(url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'lxml')
        print('Successfully Getting URL')
        return soup
    print(res.status_code)
    print('Error!!!!!')
    return None


# sub function to find the director(s)
def _Director(res_IMDB):
    try:
        ind = [i for i, j in enumerate(res_IMDB.select('.credit_summary_item h4.inline'))
               if 'Director' in j.string][0]
        directors = res_IMDB.select('.credit_summary_item h4.inline')[ind].parent.select('a')
        return ','.join([i.string for i in directors])
    except (ValueError, IndexError):
        return np.nan


# sub function to find the first 3 stars
def _Stars(res_IMDB):
    try:
        ind = [i.h4.string for i in res_IMDB.select('.credit_summary_item')].index('Stars:')
        stars = res_IMDB.select('.credit_summary_item')[ind].select('a')[:3]
        return ','.join([i.string for i in stars])
    except ValueError:
        return np.nan


def _Genres(res_IMDB):
    try:
        ind = [i.string for i in res_IMDB.select('.see-more.inline.canwrap h4.inline')].index('Genres:')
        genres = res_IMDB.select('.see-more.inline.canwrap h4.inline')[ind].parent.select('a')
        return ','.join([i.string for i in genres])
    except ValueError:
        return np.nan


def _Countries(res_IMDB):
    try:
        ind = [i for i, j in enumerate(res_IMDB.select('#titleDetails .txt-block h4.inline'))
               if 'Countr' in j.string][0]
        countries = res_IMDB.select('#titleDetails .txt-block h4.inline')[ind].parent.select('a')
        return ','.join([i.string for i in countries])
    except (ValueError, IndexError):
        return np.nan


def _Runtime(res_IMDB):
    try:
        ind = [i.string for i in res_IMDB.select('#titleDetails .txt-block h4.inline')].index('Runtime:')
        runtime = res_IMDB.select('#titleDetails .txt-block h4.inline')[ind].find_next('time').string
        return runtime.rstrip(' min')
    except ValueError:
        return np.nan


def _Budget(res_IMDB):
    try:
        ind = [i.string for i in res_IMDB.select('#titleDetails .txt-block h4.inline')].index('Budget:')
        strs = [i for i in res_IMDB.select('#titleDetails .txt-block h4.inline')[ind].parent.children
                if isinstance(i, str)]
        budget = [i for i in strs if re.search('[0-9]', i)][0].strip().replace(',', '').replace('\xa0', '')
        return budget
    except ValueError:
        return np.nan


def _Boxoffice(res_IMDB):
    try:
        ind = [i.string for i in res_IMDB.select('#titleDetails .txt-block h4.inline')].index('Cumulative Worldwide Gross:')
        strs = [i for i in res_IMDB.select('#titleDetails .txt-block h4.inline')[ind].parent.children
                if isinstance(i, str)]
        boxoffice = [i for i in strs if re.search('[0-9]', i)][0].strip().replace(',', '').replace('\xa0', '')
        return boxoffice
    except ValueError:
        return np.nan


def GetDoubanInfo(url_douban):
    res_douban = GetHTML(url_douban)
    # douban lists the corresponding IMDb link under the movie info
    url_IMDB = [i for i in res_douban.select('#info .pl')
                if 'IMDb' in i.string][0].find_next('a').attrs['href']
    res_IMDB = GetHTML(url_IMDB)
    # parse and store the info
    movieinfo = {
        'Title': list(res_IMDB.select('.title_wrapper h1')[0].children)[0].rstrip('\xa0'),
        'ReleaseYear': int(res_IMDB.find_all(attrs={'class': 'title_wrapper'})[0].h1.span.a.string),
        'Director': _Director(res_IMDB),
        'Stars': _Stars(res_IMDB),
        'Genres': _Genres(res_IMDB),
        'Countries': _Countries(res_IMDB),
        'Runtime': _Runtime(res_IMDB),
        'Rating': float(res_douban.find_all(attrs={'property': 'v:average'})[0].string),
        'Rating_Per': ','.join([i.string for i in res_douban.find_all(attrs={'class': 'rating_per'})]),
        'Budget($)': _Budget(res_IMDB),
        'BoxOffice($)': _Boxoffice(res_IMDB),
    }
    return movieinfo


def GetDoubanMovieURL(res):
    # get all movie urls on a content page (25 movies per page)
    url = [res.select('#content .grid_view li .hd a')[i]['href'] for i in range(25)]
    return url


if __name__ == '__main__':
    InfoLst = []
    for i in range(10):
        print('Scraping page %d' % i)
        url = "https://movie.douban.com/top250?start={0}".format(i * 25)
        res = GetHTML(url)
        urls = GetDoubanMovieURL(res)
        for j in range(len(urls)):
            time.sleep(np.random.uniform(1, 2))
            print('Number %d in the page' % j)
            MovieInfo = GetDoubanInfo(urls[j])
            print(MovieInfo)
            InfoLst.append(MovieInfo)
    df = pd.DataFrame(InfoLst)
    df.to_csv('Douban.csv', index=False, encoding='utf-8')
```
3. Scrape IMDb.com
```python
# -*- coding: utf-8 -*-
import time

from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

from download import myrequest
import DoubanIMDB as D


def GetIMDBURL(res):
    # get all movie urls on the Top 250 chart page
    urls = ['https://www.imdb.com' + i.attrs['href']
            for i in res.select('.lister-list .titleColumn a')]
    return urls


def GetIMDBInfo(url_movie):
    res_IMDB = D.GetHTML(url_movie)
    # a separate ratings page holds the rating distribution
    url_rating_page = 'https://www.imdb.com' + res_IMDB.select('.ratings_wrapper .imdbRating a')[0].attrs['href']
    res_rating_page = D.GetHTML(url_rating_page)

    def _rating_per_IMDB(res_rating_page):
        select = res_rating_page.select('.title-ratings-sub-page table[cellpadding="0"] .allText .topAligned')
        return ','.join([i.string.strip() for i in select])

    movieinfo = {
        'Title': list(res_IMDB.select('.title_wrapper h1')[0].children)[0].rstrip('\xa0'),
        'ReleaseYear': int(res_IMDB.find_all(attrs={'class': 'title_wrapper'})[0].h1.span.a.string),
        'Director': D._Director(res_IMDB),
        'Stars': D._Stars(res_IMDB),
        'Genres': D._Genres(res_IMDB),
        'Countries': D._Countries(res_IMDB),
        'Runtime': D._Runtime(res_IMDB),
        'Rating': res_IMDB.select('.ratings_wrapper .ratingValue strong span')[0].string,
        'Rating_Per': _rating_per_IMDB(res_rating_page),
        'Budget($)': D._Budget(res_IMDB),
        'BoxOffice($)': D._Boxoffice(res_IMDB),
    }
    return movieinfo


if __name__ == '__main__':
    InfoLst = []
    url = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
    res = D.GetHTML(url)
    urls = GetIMDBURL(res)
    for i in range(len(urls)):
        time.sleep(np.random.uniform(1, 2))
        print('Number %d in the page' % i)
        MovieInfo = GetIMDBInfo(urls[i])
        print(MovieInfo)
        InfoLst.append(MovieInfo)
    df = pd.DataFrame(InfoLst)
    df.to_csv('IMDB.csv', index=False, encoding='utf-8')
```
III. Data Analysis
Data Summary
Out of the top 250 movies, 102 are included in both lists. This is a bit unexpected; I thought the overlap would be higher. Let's look closer and see how the two lists differ.
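The overlap count itself is a simple set intersection on the `Title` column of the two scraped CSVs. A minimal sketch, using made-up titles in place of the real data:

```python
import pandas as pd

# placeholder data standing in for IMDB.csv and douban.csv
df_imdb = pd.DataFrame({"Title": ["Movie A", "Movie B", "Movie C"]})
df_douban = pd.DataFrame({"Title": ["Movie B", "Movie C", "Movie D"]})

# titles appearing on both top-250 lists
overlap = set(df_imdb["Title"]) & set(df_douban["Title"])
print(len(overlap))  # 2 for this toy data; 102 for the real lists
```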
Most Popular Genres
Drama is the most popular genre in both countries. Americans love Thriller and Crime movies much more than Chinese do, as I expected. Both countries hold an equal amount of love for Comedy and Adventure movies. Surprisingly, Chinese viewers like Romance movies better, probably because Chinese culture is more reserved.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud
from rescale_image import CreateMask

%config InlineBackend.figure_format = 'svg'

df_imdb = pd.read_csv('IMDB.csv')
df_douban = pd.read_csv('douban.csv')

# titles on both lists, IMDb only, and douban only
title_both = set(df_imdb['Title']) & set(df_douban['Title'])
title_imdb = set(df_imdb['Title']) - set(df_douban['Title'])
title_douban = set(df_douban['Title']) - set(df_imdb['Title'])

mask_both = df_imdb['Title'].isin(title_both)
df_both = pd.merge(df_imdb[mask_both].rename(columns={'Rating': 'Rating_imdb'}),
                   df_douban.rename(columns={'Rating': 'Rating_douban'})[['Title', 'Rating_douban']],
                   how='left', on='Title')
mask_imdb = df_imdb['Title'].isin(title_imdb)
df_imdb_only = df_imdb[mask_imdb]
mask_douban = df_douban['Title'].isin(title_douban)
df_douban_only = df_douban[mask_douban]

# word cloud of douban genres, shaped like a map of China
map_china = np.array(Image.open("images/China_map.png"))[:, :, 3]
genres_douban = ' '.join(df_douban['Genres'].tolist()).replace(',', ' ')
mask = CreateMask(map_china)
wc = WordCloud(background_color="white", max_words=1000, mask=mask,
               contour_width=3, contour_color='#013243',
               collocations=False, colormap='tab10')
wc.generate(genres_douban)
plt.figure(figsize=[8, 8])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

# word cloud of IMDb genres, shaped like a map of the US
map_us = np.array(Image.open("images/US_map.png"))
genres_imdb = ' '.join(df_imdb['Genres'].tolist()).replace(',', ' ')
mask = CreateMask(map_us)
wc = WordCloud(background_color="white", max_words=1000, mask=mask,
               contour_width=3, contour_color='#013243',
               collocations=False, colormap='tab10')
wc.generate(genres_imdb)
plt.figure(figsize=[8, 8])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
wc.to_file("images/US.png")
```
What do the ratings look like?
The left figure shows the rating distributions of all movies; ratings on douban.com clearly tend to be higher. But is this because of the difference in the movies chosen, or do douban users simply give higher ratings?
The right figure answers this question. It shows the joint distribution of ratings for the 102 movies that appear on both sites. Chinese users gave higher ratings for almost every movie! It seems they are not as picky as American users.
```python
sns.set(style="white", palette="muted", color_codes=True)

# left: rating distributions of the two full top-250 lists
f, axes = plt.subplots(1, 1, figsize=(7, 7))
sns.distplot(df_imdb['Rating'], color="m", ax=axes, label='IMDb')
sns.distplot(df_douban['Rating'], color="g", ax=axes, label='douban')
axes.set_title('Rating Distribution')
plt.legend()

# right: joint distribution of ratings for movies on both lists
p = sns.jointplot('Rating_imdb', 'Rating_douban', data=df_both,
                  xlim=(7.8, 10), ylim=(7.8, 10), alpha=.6)
p.set_axis_labels('IMDb', 'douban')
p.ax_joint.plot([0, 10], [0, 10], linewidth=2)  # reference line y = x
p.fig.suptitle('Joint Rating Distribution')
```
Which countries have more good movies?
Interesting fact: Chinese users love all European and Asian movies more than American users do!
```python
# sns.set_style("whitegrid")
count_imdb = pd.Series(','.join(df_imdb['Countries'].tolist()).split(',')).value_counts()
count_douban = pd.Series(','.join(df_douban['Countries'].tolist()).split(',')).value_counts()
df_count = pd.merge(pd.DataFrame(count_imdb), pd.DataFrame(count_douban),
                    how='inner', left_index=True, right_index=True)
df_count.columns = ['Counts IMDb', 'Counts Douban']

# union of each site's top ten countries
mask = set(df_count.sort_values('Counts IMDb', ascending=False).index[:10].tolist()) \
     | set(df_count.sort_values('Counts Douban', ascending=False).index[:10].tolist())
df_count_top10 = df_count[df_count.index.isin(mask)].sort_values('Counts Douban', ascending=False)

# grouped bar chart comparing the two sites
fig, ax = plt.subplots()
index = np.arange(len(df_count_top10))
bar_width = 0.35
opacity = 0.7
rects1 = ax.bar(index, df_count_top10['Counts IMDb'], bar_width,
                alpha=opacity, color='b', label='IMDb')
rects2 = ax.bar(index + bar_width, df_count_top10['Counts Douban'], bar_width,
                alpha=opacity, color='r', label='Douban')
ax.set_xlabel('Countries')
ax.set_ylabel('Numbers')
ax.set_title('Top Ten Countries')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(tuple(df_count_top10.index.tolist()), rotation=45)
ax.legend()
fig.tight_layout()
```