In God we trust, all others bring data.

Introduction

In the summer of 2019, there were 23 movies in the Marvel Cinematic Universe. Several of these are among the most successful movies of all time. As you think about these movies, many questions might come to mind: How long do they tend to be? How much money do they make? What proportion of the movies feature the character Black Widow? Techniques collectively referred to as 'Descriptive Statistics' can help answer questions like these.

Marvel Super heroes

Descriptive Statistics (also called Exploratory Data Analysis (EDA) or data summary) encompasses statistical methods for summarizing and exploring data. These techniques may be used in conjunction with formal methods of inference.

With the Marvel data, the movies in the dataset constitute the entire population. There is no desire here to generalize results to a larger universe, only to understand the characteristics of the data at hand. The following sequence provides a useful guide for exploring data with descriptive statistics.

  1. Start simple: look at variables individually first, then look at relationships between them.
  2. Use a variety of tools, including both numerical and graphical summaries.
  3. Look for patterns and anomalies in the data.


Data relevant to the Marvel CU movies are shown in the table. Looking at it in this form, it's easy to pick out facts about individual movies but difficult to draw conclusions about the movies overall.

Data about Avengers movies

In contrast to the table, patterns in the data are obvious when looking at the plots below.

Plots of Avengers movie data: a barplot, histogram, boxplot, and scatterplot

Similarly, numeric summaries can convey much of the information provided by the raw data, but in an easy-to-grasp format.
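As a minimal sketch of what such numeric summaries might look like, the Python snippet below computes a few summaries of run time and the proportion of movies that feature a character. The values are made-up placeholders, not the actual Marvel data, and the names `run_times`, `black_widow`, `summarize_numeric`, and `proportion` are hypothetical, chosen only for illustration.

```python
from statistics import mean, median

# Placeholder values for illustration only; these are NOT the real Marvel data.
run_times = [120, 130, 115, 140, 125]            # run time in minutes (hypothetical)
black_widow = ["yes", "no", "no", "yes", "no"]   # character appears? (hypothetical)


def summarize_numeric(values):
    """Return a few basic numeric summaries of a list of numbers."""
    return {
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }


def proportion(values, category):
    """Return the fraction of entries equal to the given category."""
    return sum(1 for v in values if v == category) / len(values)


print(summarize_numeric(run_times))
print(proportion(black_widow, "yes"))
```

A handful of numbers like these stand in for the whole column of raw values, which is what makes them easier to take in than the table.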


Data Types

Most data are either categorical (composed of words or categories) or numeric (consisting of numbers). In the dataset, the variables 'BlackWidow' and 'AcademyAward' are categorical (or qualitative) variables, while 'RunTime' and 'Revenue' are numeric (or quantitative) variables. The tools we use to describe or summarize the data depend on the data type.

Categorical data: (qualitative) data composed of words or categories.

Numeric data: (quantitative) data composed of numbers.
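To illustrate how the choice of summary tool can depend on the data type, here is a small Python sketch. It assumes the analyst declares each variable's type; the function name `describe` and the sample values are hypothetical and are not taken from the Marvel dataset.

```python
from collections import Counter
from statistics import mean, median


def describe(values, kind):
    """Summarize values using a tool suited to the declared data type."""
    if kind == "categorical":
        counts = Counter(values)
        n = len(values)
        # Frequency table: (count, proportion) for each category.
        return {cat: (count, count / n) for cat, count in counts.items()}
    if kind == "numeric":
        return {"mean": mean(values), "median": median(values)}
    raise ValueError(f"unknown kind: {kind!r}")


# Placeholder values for illustration only.
print(describe(["yes", "no", "no", "yes"], "categorical"))
print(describe([2.0, 4.0, 9.0], "numeric"))
```

Categorical values are counted, while numeric values are averaged; asking for the mean of "yes" and "no" would not make sense.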


Example: Think of variables to describe the people you know. Can you name 3 variables that are categorical? 3 that are numeric?

Some categorical variables that describe people are: eye color, gender, main mode of transportation.

Numeric variables include: age, height, number of scars.

The distinction between categorical and numeric data is not always as clear-cut as determining whether or not the values are expressed as numbers. Consider, for example, the variable 'Phase' in the Marvel dataset. Though the values of this variable are numbers, the numbers are ordered labels rather than quantities. These values could easily be replaced with labels such as 'first', 'second', or 'third' without substantially changing their meaning. The same is not true for the values of the variable 'Revenue'.
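The point about 'Phase' can be sketched in Python: replacing the numbers with word labels leaves a frequency table unchanged, which suggests the numbers were acting as labels all along. The phase values below are placeholders, not the real dataset, and the names `phases`, `labels`, and `relabel` are hypothetical.

```python
from collections import Counter

# Placeholder Phase values for illustration only; NOT the real Marvel data.
phases = [1, 1, 2, 2, 2, 3]

# Hypothetical mapping from phase numbers to ordered word labels.
labels = {1: "first", 2: "second", 3: "third"}


def relabel(values, mapping):
    """Replace each value with its label from the mapping."""
    return [mapping[v] for v in values]


counts_by_number = Counter(phases)
counts_by_label = Counter(relabel(phases, labels))

# The frequency table is the same whichever labels are used.
for number, label in labels.items():
    print(label, counts_by_number[number], counts_by_label[label])
```

No comparable relabelling exists for 'Revenue', where sums, differences, and averages of the values carry meaning.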