Introduction
In the summer of 2019, there were 23 movies in the Marvel Cinematic Universe. Several of these are among the most successful movies of all time. As you think about these movies, many questions might come to mind: How long do they tend to be? How much money do they make? What proportion of the movies feature the character Black Widow? Techniques collectively referred to as methods of 'Descriptive Statistics' can help answer questions like these.
Descriptive Statistics (also called Exploratory Data Analysis (EDA) or data summary) encompasses statistical methods for summarizing and exploring data. These techniques may be used in conjunction with formal methods of inference.
With the Marvel data, the movies in the dataset constitute the entire population. There is no desire here to generalize results to a larger universe, only to understand the characteristics of the data at hand.
The following sequence provides a useful guide for exploring data with descriptive statistics.
Data relevant to the Marvel CU movies are shown in the table. Looking at it in this form, it's easy to pick out facts about individual movies but difficult to draw conclusions about the movies overall.
In contrast to the table, patterns in the data are obvious when looking at the plots below.
- (Plot 1) Fewer than half of the Marvel movies feature Black Widow (Natasha Romanova).
- (Plot 2) The distribution of Marvel movie revenue has a long right tail, meaning that while most movies brought in around $\$$800 million, a few of the movies brought in much more.
- (Plot 3) All of the Marvel movies are fairly long, most between 2 and 2.5 hours. One of the movies (Avengers: Endgame) was significantly longer, at just over 3 hours!
- (Plot 4) In general, Marvel movies with larger budgets tended to bring in more money, but the relationship is not perfect. The Marvel movie that brought in the most money was not the one that was most expensive to create.
Similarly, numeric summaries can convey much of the information provided by the raw data but in an easy-to-grasp format.
- About 43% of the Marvel movies feature Black Widow.
- The median revenue of a Marvel movie was $\$$854 million, which means that half of the movies made less than $\$$854 million and half made more.
- The maximum amount generated by one of these movies (Avengers: Endgame) was $\$$2.798 billion!
- The middle 50% of Marvel movies were between 119 and 137 minutes long.
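- The correlation between budget and revenue of a Marvel movie is about 0.75. Correlation is a unitless number between -1 and 1 that measures the strength and direction of a linear relationship, so a value of 0.75 indicates a fairly strong positive association: movies with larger budgets tended to bring in more revenue. It does not say how many dollars of revenue go with each extra dollar of budget (that would be the slope of a fitted line). Still, knowing what the movie budget was is fairly useful in estimating the revenue.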
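Summaries like the ones listed above can be computed in a few lines of Python. The following is a minimal sketch using only the standard library and a small made-up dataset: the values and the names `runtime_min`, `budget_musd`, `revenue_musd`, `features_widow`, and `pearson_r` are illustrative assumptions, not the actual Marvel figures or any particular course code.

```python
from math import sqrt
from statistics import mean, median, quantiles

# Hypothetical toy data (NOT the real Marvel figures), one entry per movie.
runtime_min = [112, 118, 124, 130, 135, 143, 181]       # minutes
budget_musd = [140, 150, 170, 200, 220, 250, 356]       # millions of dollars
revenue_musd = [450, 620, 710, 850, 1150, 1400, 2800]   # millions of dollars
features_widow = [False, True, False, False, True, True, True]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Proportion of movies in a category: True counts as 1, False as 0.
prop_widow = sum(features_widow) / len(features_widow)

# Median: half of the values lie below it and half above.
median_revenue = median(revenue_musd)

# Quartiles: the middle 50% of run times lies between q1 and q3.
# (Software packages use slightly different quartile conventions;
# this is the default method of statistics.quantiles.)
q1, q2, q3 = quantiles(runtime_min, n=4)

# Correlation: strength and direction of the linear relationship.
r = pearson_r(budget_musd, revenue_musd)

# Correlation has no units: expressing revenue in dollars instead of
# millions of dollars leaves it unchanged.
revenue_usd = [v * 1_000_000 for v in revenue_musd]
r_rescaled = pearson_r(budget_musd, revenue_usd)

print(f"proportion featuring Black Widow: {prop_widow:.2f}")
print(f"median revenue: {median_revenue} million")
print(f"middle 50% of run times: {q1} to {q3} minutes")
print(f"correlation (budget, revenue): {r:.2f}")
```

Because the correlation is the same whether revenue is recorded in dollars or in millions of dollars, it describes how tightly the points follow a line rather than how steep that line is.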
Data Types
Most common data are either categorical (composed of words or categories) or numeric (consisting of numbers). In the dataset, the variables 'BlackWidow' and 'AcademyAward' are categorical (or qualitative) variables, while the variables 'RunTime' and 'Revenue' are numeric (or quantitative) variables. The tools we use to describe or summarize the data depend on the data type.
Categorical data: (qualitative) data composed of words or categories.
Numeric data: (quantitative) data composed of numbers.
Some categorical variables that describe people are: eye color, gender, main mode of transportation.
Numeric variables include: age, height, number of scars.
The distinction between categorical and numeric data is not always as clear-cut as determining whether or not the values are expressed as numbers. Consider, for example, the variable 'Phase' in the Marvel dataset. Though the values of this variable are numbers, the numbers are ordered labels rather than quantities. These values could easily be replaced with labels such as 'first', 'second', or 'third' without substantially changing their meaning. The same is not true for the values of the variable 'Revenue'.
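One way to see the difference is to check which summaries still make sense after relabeling. The sketch below uses made-up values and a hypothetical `labels` mapping (assumptions for illustration, not the actual dataset): counting how many movies fall in each phase works equally well with numbers or words, whereas an average is only meaningful for a true quantity such as revenue.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy values (NOT the real Marvel data).
phase = [1, 1, 2, 2, 2, 3, 3]
revenue_musd = [450, 620, 710, 850, 1150, 1400, 2800]  # millions of dollars

# Replace the numeric phase codes with word labels.
labels = {1: "first", 2: "second", 3: "third"}
phase_words = [labels[p] for p in phase]

# Frequency counts, the natural summary for categorical data,
# are unaffected by the relabeling.
counts_codes = Counter(phase)
counts_words = Counter(phase_words)

# The order of the categories can be kept explicitly.
for name in ["first", "second", "third"]:
    print(name, counts_words[name])

# An average is meaningful for a quantity such as revenue.
print("mean revenue:", mean(revenue_musd), "million")

# mean(phase) would run, but its result has no clear interpretation
# because the codes are only labels, not amounts of anything.
```

Revenue values cannot be swapped for words in the same way without losing the information that the summaries rely on.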