Favorite Movies of Chinese and American Audiences - Scraping IMDb

I. Introduction

As a movie lover, I enjoy watching movies from both China and the US. The two largest movie markets in the world have much in common but still differ a lot. In this project, I investigated movie ratings on IMDb and Douban, two popular movie review websites from the US and China, respectively. I'm interested in how the favorite movies of American and Chinese audiences differ. Do Chinese viewers love martial arts movies more than Americans do? Do Americans love superhero movies more? Let's take a look!

The data covers the top 250 rated movies on each website.

Without further ado, let’s get started.

II. Scraping the Websites

In this section, I'll start from scratch. Skip this part if you are already familiar with web scraping.

The basic process is as follows:

  1. Build a robust web scraper
  2. Scrape douban.com
    • Get the URLs of the movies on the chart pages
    • Enter each movie page. Fortunately, douban provides the corresponding IMDb link under the movie info, so only the rating needs to be scraped from douban; the rest of the information is scraped from IMDb for consistency.
    • Enter the corresponding IMDb page and scrape the remaining features of interest. I selected the following: directors, stars, countries of origin, genres, and runtime.
  3. Scrape IMDb.com
    • Get the URLs of the movies on the chart page
    • Enter each movie page and collect all the info. A separate ratings page must be visited to get the detailed rating distribution.

1. Build a Robust Web Scraper

First, build a robust scraper that will:

  1. Randomly choose a request header
  2. Retry a specified number of times when a request fails

Randomly choosing the header helps bypass anti-scraping measures by disguising the scraper as a regular user. It works best together with rotating IP addresses, but that feature is still under development. Since this project doesn't need to scrape much data, though, there is no need to worry.

import requests
import random
import time

class download():

    def __init__(self):
        self.user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]

    def get(self, url, timeout=None, num_retries=6):
        # randomly choose a UA from the UA list to disguise the request
        UA = random.choice(self.user_agent_list)
        headers = {'User-Agent': UA}

        try:
            return requests.get(url, headers=headers, timeout=timeout)
        except requests.exceptions.RequestException:
            if num_retries > 0:
                time.sleep(10)
                print('Failed to open website. Retrying', num_retries, 'more times')
                return self.get(url, timeout, num_retries - 1)
            else:
                print('Try using a proxy')

myrequest = download()

2. Scrape douban.com

The following part contains a ton of HTML parsing. Thanks to BeautifulSoup and its powerful CSS selectors, it is all made easy. This is a reference for CSS selectors if you are new to them. The features of a movie I am interested in include:

  • Title
  • Ratings
  • Ratings distribution
  • Director
  • Stars
  • Release year
  • Runtime
  • Genres
  • Countries

Every row of data is stored in a dictionary. All dictionaries are appended to a list, which is converted to a DataFrame at the end. The data is saved as a .csv file for further analysis.
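To make this pattern concrete, here is a minimal, self-contained sketch using made-up HTML (not the real IMDb markup), showing a CSS selector plus the list-of-dicts-to-DataFrame step:

```python
from bs4 import BeautifulSoup
import pandas as pd

# toy HTML mimicking the kind of block the scraper targets
html = """
<div class="credit_summary_item">
  <h4 class="inline">Director:</h4>
  <a href="#">Frank Darabont</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: an h4 with class "inline" inside the summary item,
# then the next <a> tag after it
director = soup.select(".credit_summary_item h4.inline")[0].find_next("a").string

# one dict per movie; a list of dicts becomes a DataFrame in one call
rows = [{"Title": "The Shawshank Redemption", "Director": director}]
df = pd.DataFrame(rows)
```

The real parsing functions below follow exactly this shape, just with more defensive indexing.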

import re
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from download import myrequest

def GetHTML(url):
    res = myrequest.get(url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'lxml')
        print('Successfully got URL')
        return soup

    print(res.status_code)
    print('Error!!!!!')
    return None

# sub function to find the director(s):
def _Director(res_IMDB):
    try:
        ind = [i for i, j in enumerate(res_IMDB.select('.credit_summary_item h4.inline')) if 'Director' in j.string][0]
        directors = res_IMDB.select('.credit_summary_item h4.inline')[ind].parent.select('a')
        return ','.join([i.string for i in directors])
    except (ValueError, IndexError):
        return np.nan

# sub function to find the first 3 stars:
def _Stars(res_IMDB):
    try:
        ind = [i.h4.string for i in res_IMDB.select('.credit_summary_item')].index('Stars:')
        stars = res_IMDB.select('.credit_summary_item')[ind].select('a')[:3]
        return ','.join([i.string for i in stars])
    except ValueError:
        return np.nan

def _Genres(res_IMDB):
    try:
        ind = [i.string for i in res_IMDB.select('.see-more.inline.canwrap h4.inline')].index('Genres:')
        genres = res_IMDB.select('.see-more.inline.canwrap h4.inline')[ind].parent.select('a')
        return ','.join([i.string for i in genres])
    except ValueError:
        return np.nan

def _Countries(res_IMDB):
    try:
        ind = [i for i, j in enumerate(res_IMDB.select('#titleDetails .txt-block h4.inline')) if 'Countr' in j.string][0]
        countries = res_IMDB.select('#titleDetails .txt-block h4.inline')[ind].parent.select('a')
        return ','.join([i.string for i in countries])
    except (ValueError, IndexError):
        return np.nan

def _Runtime(res_IMDB):
    try:
        ind = [i.string for i in res_IMDB.select('#titleDetails .txt-block h4.inline')].index('Runtime:')
        runtime = res_IMDB.select('#titleDetails .txt-block h4.inline')[ind].find_next('time').string
        return runtime.rstrip(' min')
    except ValueError:
        return np.nan

def _Budget(res_IMDB):
    try:
        ind = [i.string for i in res_IMDB.select('#titleDetails .txt-block h4.inline')].index('Budget:')
        strs = [i for i in res_IMDB.select('#titleDetails .txt-block h4.inline')[ind].parent.children if isinstance(i, str)]
        budget = [i for i in strs if re.search('[0-9]', i)][0].strip().replace(',', '').replace('\xa0', '')
        return budget
    except (ValueError, IndexError):
        return np.nan

def _Boxoffice(res_IMDB):
    try:
        ind = [i.string for i in res_IMDB.select('#titleDetails .txt-block h4.inline')].index('Cumulative Worldwide Gross:')
        strs = [i for i in res_IMDB.select('#titleDetails .txt-block h4.inline')[ind].parent.children if isinstance(i, str)]
        boxoffice = [i for i in strs if re.search('[0-9]', i)][0].strip().replace(',', '').replace('\xa0', '')
        return boxoffice
    except (ValueError, IndexError):
        return np.nan

def GetDoubanInfo(url_douban):
    res_douban = GetHTML(url_douban)
    # douban lists the corresponding IMDb link under the movie info
    url_IMDB = [i for i in res_douban.select('#info .pl') if 'IMDb' in i.string][0].find_next('a').attrs['href']
    res_IMDB = GetHTML(url_IMDB)
    # parse and store the info; the rating comes from douban, everything else from IMDb
    movieinfo = {'Title': list(res_IMDB.select('.title_wrapper h1')[0].children)[0].rstrip('\xa0'),
                 'ReleaseYear': int(res_IMDB.find_all(attrs={'class': 'title_wrapper'})[0].h1.span.a.string),
                 'Director': _Director(res_IMDB),
                 'Stars': _Stars(res_IMDB),
                 'Genres': _Genres(res_IMDB),
                 'Countries': _Countries(res_IMDB),
                 'Runtime': _Runtime(res_IMDB),
                 'Rating': float(res_douban.find_all(attrs={'property': 'v:average'})[0].string),
                 'Rating_Per': ','.join([i.string for i in res_douban.find_all(attrs={'class': 'rating_per'})]),
                 'Budget($)': _Budget(res_IMDB),
                 'BoxOffice($)': _Boxoffice(res_IMDB)}
    return movieinfo

def GetDoubanMovieURL(res):
    # get the URLs of the 25 movies on a chart page
    url = [res.select('#content .grid_view li .hd a')[i]['href'] for i in range(25)]
    return url

if __name__ == '__main__':
    InfoLst = []
    for i in range(10):
        print('Scraping page %d' % i)
        url = "https://movie.douban.com/top250?start={0}".format(i * 25)
        res = GetHTML(url)
        urls = GetDoubanMovieURL(res)
        for j in range(len(urls)):
            time.sleep(np.random.uniform(1, 2))
            print('Number %d in the page' % j)
            MovieInfo = GetDoubanInfo(urls[j])
            print(MovieInfo)
            InfoLst.append(MovieInfo)

    df = pd.DataFrame(InfoLst)
    df.to_csv('Douban.csv', index=False, encoding='utf-8')

3. Scrape IMDb.com

# -*- coding: utf-8 -*-
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from download import myrequest
import DoubanIMDB as D

def GetIMDBURL(res):
    # get the URLs of all 250 movies on the chart page
    urls = ['https://www.imdb.com' + i.attrs['href'] for i in res.select('.lister-list .titleColumn a')]
    return urls

def GetIMDBInfo(url_movie):
    res_IMDB = D.GetHTML(url_movie)
    # the detailed rating distribution lives on a separate ratings page
    url_rating_page = 'https://www.imdb.com' + res_IMDB.select('.ratings_wrapper .imdbRating a')[0].attrs['href']
    res_rating_page = D.GetHTML(url_rating_page)

    def _rating_per_IMDB(res_rating_page):
        select = res_rating_page.select('.title-ratings-sub-page table[cellpadding="0"] .allText .topAligned')
        return ','.join([i.string.strip() for i in select])

    movieinfo = {'Title': list(res_IMDB.select('.title_wrapper h1')[0].children)[0].rstrip('\xa0'),
                 'ReleaseYear': int(res_IMDB.find_all(attrs={'class': 'title_wrapper'})[0].h1.span.a.string),
                 'Director': D._Director(res_IMDB),
                 'Stars': D._Stars(res_IMDB),
                 'Genres': D._Genres(res_IMDB),
                 'Countries': D._Countries(res_IMDB),
                 'Runtime': D._Runtime(res_IMDB),
                 'Rating': res_IMDB.select('.ratings_wrapper .ratingValue strong span')[0].string,
                 'Rating_Per': _rating_per_IMDB(res_rating_page),
                 'Budget($)': D._Budget(res_IMDB),
                 'BoxOffice($)': D._Boxoffice(res_IMDB)}
    return movieinfo

if __name__ == '__main__':
    InfoLst = []
    url = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
    res = D.GetHTML(url)
    urls = GetIMDBURL(res)
    for i in range(len(urls)):
        time.sleep(np.random.uniform(1, 2))
        print('Number %d in the page' % i)
        MovieInfo = GetIMDBInfo(urls[i])
        print(MovieInfo)
        InfoLst.append(MovieInfo)

    df = pd.DataFrame(InfoLst)
    df.to_csv('IMDB.csv', index=False, encoding='utf-8')

III. Data Analysis

Data Summary

Out of the top 250 movies on each site, 102 appear on both lists. This is a bit unexpected: I thought the overlap would be higher. Let's look closer and see how the two lists differ.
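The overlap is just a set intersection of the two title columns (a minimal sketch on toy titles; the real frames come from the scraped CSV files):

```python
import pandas as pd

# toy stand-ins for the two scraped top-250 lists
df_imdb = pd.DataFrame({"Title": ["The Godfather", "12 Angry Men", "Pulp Fiction"]})
df_douban = pd.DataFrame({"Title": ["The Godfather", "Farewell My Concubine", "Pulp Fiction"]})

# titles that appear on both charts
overlap = set(df_imdb["Title"]) & set(df_douban["Title"])
print(len(overlap))  # 2
```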

Most Popular Genres

[Figure: wordcloud.png — genre word clouds shaped as maps of China (douban) and the US (IMDb)]

Drama is the most popular genre in both countries. Americans love Thriller and Crime movies much more than Chinese viewers do, as I expected. Both countries hold an equal amount of love for Comedy and Adventure movies. Surprisingly, Chinese viewers like Romance movies better, probably because Chinese culture is more reserved.
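Since each movie stores its genres as one comma-separated string, the frequencies behind the word clouds can be checked with a line of pandas (a sketch on toy rows in the same format as the scraped data):

```python
import pandas as pd

# toy "Genres" column, comma-separated like the scraped CSVs
genres = pd.Series(["Drama,Crime", "Drama,Romance", "Comedy,Drama"])

# flatten the comma-separated strings and count each genre
counts = pd.Series(",".join(genres).split(",")).value_counts()
print(counts.idxmax())  # Drama
```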


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud
from rescale_image import CreateMask
%config InlineBackend.figure_format = 'svg'

df_imdb = pd.read_csv('IMDB.csv')
df_douban = pd.read_csv('douban.csv')

# titles on both lists, only on IMDb, and only on douban
title_both = set(df_imdb['Title']) & set(df_douban['Title'])
title_imdb = set(df_imdb['Title']) - set(df_douban['Title'])
title_douban = set(df_douban['Title']) - set(df_imdb['Title'])

# merged frame with one rating column per site
mask_both = df_imdb['Title'].isin(title_both)
df_both = pd.merge(df_imdb[mask_both].rename(columns={'Rating': 'Rating_imdb'}),
                   df_douban.rename(columns={'Rating': 'Rating_douban'})[['Title', 'Rating_douban']],
                   how='left', on='Title')

mask_imdb = df_imdb['Title'].isin(title_imdb)
df_imdb_only = df_imdb[mask_imdb]

mask_douban = df_douban['Title'].isin(title_douban)
df_douban_only = df_douban[mask_douban]

# word cloud of douban genres, shaped as a map of China
map_china = np.array(Image.open("images/China_map.png"))[:, :, 3]
genres_douban = ' '.join(df_douban['Genres'].tolist()).replace(',', ' ')
mask = CreateMask(map_china)

wc = WordCloud(background_color="white", max_words=1000, mask=mask,
               contour_width=3, contour_color='#013243', collocations=False,
               colormap='tab10')

wc.generate(genres_douban)
plt.figure(figsize=[8, 8])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

# word cloud of IMDb genres, shaped as a map of the US
map_us = np.array(Image.open("images/US_map.png"))
genres_imdb = ' '.join(df_imdb['Genres'].tolist()).replace(',', ' ')
mask = CreateMask(map_us)

wc = WordCloud(background_color="white", max_words=1000, mask=mask,
               contour_width=3, contour_color='#013243', collocations=False,
               colormap='tab10')

wc.generate(genres_imdb)
plt.figure(figsize=[8, 8])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
wc.to_file("images/US.png")

What do the ratings look like?

The first figure shows the rating distribution of all movies: ratings on douban.com clearly tend to be higher. But is that because different movies made each list, or do douban users simply give higher ratings?

The second figure answers the question. It shows the joint distribution of ratings for the 102 movies that appear on both lists. Chinese users gave a higher rating to almost every movie! It seems they are not as picky as American users.
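The claim "almost every movie" can be quantified by comparing the two rating columns of the merged frame row by row (a sketch on toy numbers; the real `df_both` is built in the analysis code above):

```python
import pandas as pd

# toy stand-in for the merged ratings of movies on both lists
df_both = pd.DataFrame({"Rating_imdb":   [9.2, 8.9, 8.8],
                        "Rating_douban": [9.3, 9.1, 8.7]})

# fraction of shared movies rated higher on douban than on IMDb
share_higher = (df_both["Rating_douban"] > df_both["Rating_imdb"]).mean()
```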


sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(1, 1, figsize=(7, 7))
sns.distplot(df_imdb['Rating'], color="m", ax=axes, label='IMDb')
sns.distplot(df_douban['Rating'], color="g", ax=axes, label='douban')
axes.set_title('Rating Distribution')
plt.legend()

p = sns.jointplot('Rating_imdb', 'Rating_douban', data=df_both,
                  xlim=(7.8, 10), ylim=(7.8, 10), alpha=.6)
p.set_axis_labels('IMDb', 'douban')
p.ax_joint.plot([0, 10], [0, 10], linewidth=2)  # reference line y = x
p.fig.suptitle('Joint Rating Distribution')

Which countries have more good movies?

[Figure: TopTen — grouped bar chart of top ten countries by number of top-250 movies]

Interesting fact: every European and Asian country on the chart appears more often on douban's list than on IMDb's. Chinese users love European and Asian movies more than American users do!


# sns.set_style("whitegrid")
# count how many top-250 movies each country has on each site
count_imdb = pd.Series(','.join(df_imdb['Countries'].tolist()).split(',')).value_counts()
count_douban = pd.Series(','.join(df_douban['Countries'].tolist()).split(',')).value_counts()

df_count = pd.merge(pd.DataFrame(count_imdb), pd.DataFrame(count_douban),
                    how='inner', left_index=True, right_index=True)
df_count.columns = ['Counts IMDb', 'Counts Douban']
# union of each site's ten most frequent countries
mask = set(df_count.sort_values('Counts IMDb', ascending=False).index[:10].tolist()) \
     | set(df_count.sort_values('Counts Douban', ascending=False).index[:10].tolist())
df_count_top10 = df_count[df_count.index.isin(mask)].sort_values('Counts Douban', ascending=False)

# grouped bar chart, one pair of bars per country
fig, ax = plt.subplots()
index = np.arange(len(df_count_top10))
bar_width = 0.35
opacity = 0.7
rects1 = ax.bar(index, df_count_top10['Counts IMDb'], bar_width,
                alpha=opacity, color='b', label='IMDb')
rects2 = ax.bar(index + bar_width, df_count_top10['Counts Douban'], bar_width,
                alpha=opacity, color='r', label='Douban')
ax.set_xlabel('Countries')
ax.set_ylabel('Numbers')
ax.set_title('Top Ten Countries')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(tuple(df_count_top10.index.tolist()), rotation=45)
ax.legend()
fig.tight_layout()

