This project was created using R in RStudio. The code can be found in my GitHub repository.


Are the batters’ age, BMI or country of birth characteristics that impact their performance in the game?

Introduction

Baseball is one of the oldest and most popular sports. It is played in many parts of the world and is especially popular in the United States. The game is played with a ball and a bat between two teams of nine players each, and the team that scores the most runs over nine innings wins.

There are two main roles in this sport. The pitcher throws the ball toward his team’s catcher, and the batter, who plays for the opposing team, tries to hit the ball into the field of play. Players score runs by hitting the ball and advancing around the bases until they reach home plate without being put out.

Baseball is a physically demanding sport. It requires reflexes to read the direction of the ball, speed to reach the next base before the ball gets there, arm strength to throw the ball as far as possible, and flexibility, which supports other physical abilities such as strength, speed, and endurance. All of these skills can be developed with training, but they can also decline over the years, either because a player stops training or simply because of the natural wear and tear of the human body.

Therefore, the objective is to determine whether older players perform worse in the game or, on the contrary, perform better because they have more years of experience. I also want to determine whether the force a player exerts to hit the ball, or his speed when running the bases, is linked to his weight and/or height. Last but not least, I want to know whether country of birth has any impact on results in the game. All of this leads to the question this project is intended to answer: Are the batters’ age, BMI or country of birth characteristics that impact their performance in the game?

To answer this question, I am using a data set downloaded from the official MLB website (The Official Site of Major League Baseball). The data set corresponds to the statistics table for 132 batters during the 2021 season.

On the official site, the data set can be customized using different criteria. For this project, I am using the following variables:

Variable Description
Last name Batter’s last name
First name Batter’s first name
Team Batter’s team
Country of Birth Batter’s country of birth
Height Batter’s height
Weight Batter’s weight
Player id Batter’s ID
Player age Batter’s age
Total hits (H) A hit occurs when a batter strikes the baseball into fair territory and reaches base without doing so via an error or a fielder’s choice. (MLB Definition)
Single (S) A single occurs when a batter hits the ball and reaches first base without the help of an intervening error or attempt to put out another baserunner. (MLB Definition)
Double (D) A batter is credited with a double when he hits the ball into play and reaches second base without the help of an intervening error or attempt to put out another baserunner. (MLB Definition)
Triple (TR) A triple occurs when a batter hits the ball into play and reaches third base without the help of an intervening error or attempt to put out another baserunner. (MLB Definition)
Home Run (HR) A home run occurs when a batter hits a fair ball and scores on the play without being put out or without the benefit of an error. (MLB Definition)
Batting average (AVG) Batting average is determined by dividing a player’s hits by his total at-bats for a number between zero (shown as .000) and one (1.000). (MLB Definition)
Total bases (TB) Total bases refer to the number of bases gained by a batter through his hits. (MLB Definition)
Total caught stealing (CS) A caught stealing occurs when a runner attempts to steal but is tagged out before reaching second base, third base or home plate. (MLB Definition)
Total stolen base (SB) A stolen base occurs when a baserunner advances by taking a base to which he isn’t entitled. (MLB Definition)


Data Analysis Plan

Variables

For the development of this project I will use the following variables:

  • As explanatory variables, the age and body mass index (BMI) of the players will be used. BMI is a new column calculated from the height and weight of each player. Country of birth is analyzed as a categorical explanatory variable.

  • For the response variables, total hits, batting average, total bases, total caught stealing and total stolen bases will be used.
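The BMI column described above can be derived in one step. The sketch below assumes that height is stored in centimetres and weight in kilograms (the values in the cleaned data set shown later are consistent with that) and that BMI is rounded to whole numbers. The data frame name `batters` is a placeholder, and the four rows are the first four rows of the cleaned data set.

```r
# First four rows of the cleaned data set (Height in cm, Weight in kg).
batters <- data.frame(
  last_name = c("Cabrera", "Cruz Jr.", "Peralta", "Blackmon"),
  Height    = c(193, 188, 185, 190),
  Weight    = c(121, 104, 95, 100)
)

# BMI = weight (kg) / height (m)^2.
# Rounding to whole numbers is an assumption based on the printed BMI column.
batters$BMI <- round(batters$Weight / (batters$Height / 100)^2)

print(batters)
```

For these four players the computed values agree with the BMI column of the cleaned data set (32, 29, 28, 28).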


Exploratory Data Analysis

To understand the data set in detail, especially the variables I will use to answer the main question of this project, I will carry out an exploratory data analysis, starting with univariate analysis of some numerical and categorical variables.


Univariate Analysis - Numerical Variables


Batters age

The age of the players is one of the main variables required to answer the question of this project. Therefore, it is important to understand the age range of the players in this data set, so that when comparing it with other characteristics, it can be inferred what is the relationship between age and the response variables.


Let’s see what the distribution of this variable looks like in this data set.

The age distribution of the batters in this data set is approximately normal and unimodal (one peak). As can be seen, the mean (29.2 years) is very close to the median (29 years). The youngest player is 22 years old and the oldest is 41 years old.

## [1] "The skewness is: 0.49"

A skewness of 0.49 indicates only a mild right skew, so the distribution is close to symmetric. This is convenient for the analysis, because comparisons between age and the response variables will not be dominated by a long tail of very young or very old players.

## [1] "The Standard deviation of this variable is: 3.67"

The standard deviation is low relative to the 19-year age range (22 to 41), which means that the data are clustered around the mean.
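For reference, the two summary figures above can be reproduced with base R. This is a sketch under assumptions: the original code may have used a package function for skewness, and the moment-based formula below (third central moment divided by the 1.5 power of the second) is only one common definition. `skewness_moment` is a hypothetical helper name, and the ages are the first 14 values visible in the cleaned data set, not the full 132, so the output will differ from the figures reported above.

```r
# Moment-based sample skewness: m3 / m2^(3/2).
skewness_moment <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)   # deviations from the mean
  m2 <- mean(d^2)     # second central moment
  m3 <- mean(d^3)     # third central moment
  m3 / m2^1.5
}

# First 14 ages shown in the cleaned data set (a subset only).
age <- c(38, 41, 34, 35, 35, 37, 38, 33, 35, 37, 34, 37, 32, 32)

cat("The skewness is:", round(skewness_moment(age), 2), "\n")
cat("The standard deviation of this variable is:", round(sd(age), 2), "\n")
```

A positive result means the right tail is longer than the left. As a rough rule of thumb, values between -0.5 and 0.5 are often treated as approximately symmetric.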


Batters BMI

The body mass index (BMI) is the weight in kilograms divided by the square of the height in meters. It is used in the field of nutrition to assess whether a person is within an ideal weight range. What does the BMI distribution of the batters in this data set look like?

The distribution of the batters’ body mass index has a small skew to the right: the mean is greater than the median by 0.5 points. This is a unimodal distribution (one peak).

## [1] "The skewness is: 0.63"

The above confirms that this distribution has a skew to the right. However, there are no extreme outliers, which makes this variable very useful when combining it with the response variables.

Looking at where the data are concentrated, that is, between quartiles 1 and 3, we can see that most players have a BMI between 26 and 29. Compared with the “ideal BMI”, the players tend to be “overweight” (the ideal BMI is at most 24). It is important to keep in mind that a high BMI can reflect not only body fat but also the weight of bones and muscles.

Now the question is: considering the above, could a player’s weight influence his speed when running? This will be one of the questions to be answered in this project.


Hits

A hit occurs when a batter hits the baseball into fair territory and reaches base without doing so via an error or a fielder’s choice. There are four types of hits (single, double, triple and home run), depending on how many bases the batter reaches. The outcome depends on factors such as the force and speed with which the pitcher throws the ball and, of course, the force with which the batter hits it.

Does the force a batter needs to hit a home run depend on his age, his BMI, or neither? We will see this later. For now, let’s look at the distribution of hits in this data set.


As can be seen in the plot, players in this data set typically recorded more than 50 singles, while triples were scarce (approximately 80 or more records show few or no triples).

Let’s look at the statistical measures:

Description   Min      Q1  Median    Mean      Q3  Max
Total hits     99  123.00   138.0  140.45  156.25  195
Single         46   73.75    86.5   87.05   98.00  136
Double         13   24.00    28.0   28.01   32.25   42
Triple          0    1.00     2.0    2.17    3.00    8
Home Run        2   15.00    23.0   23.22   31.00   48

The distributions of the different hit types are roughly symmetric with a small skew to the right (all means are greater than the medians). Singles are the most common hits, with players recording between 46 and 136. It also appears easier to hit a home run than a triple: the middle half of the data (Q1 to Q3) lies between 1 and 3 for triples but between 15 and 31 for home runs.
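A table like the one above can be produced with `quantile()`, `median()` and `mean()`. The following is a minimal sketch, assuming a data frame with one column per hit type. `hit_summary` is a hypothetical helper, and the values are the first eight rows visible in the cleaned data set, so the output differs from the full-sample table.

```r
# First eight rows of the hit columns from the cleaned data set (a subset only).
hits <- data.frame(
  H    = c(121, 136, 126, 139, 107, 148, 119, 121),
  `1B` = c(90, 82, 80, 97, 55, 99, 59, 91),
  `2B` = c(16, 21, 30, 25, 24, 22, 23, 25),
  `3B` = c(0, 1, 8, 4, 1, 0, 1, 2),
  HR   = c(15, 32, 8, 13, 27, 27, 36, 3),
  check.names = FALSE  # keep column names such as `1B` unchanged
)

# Six summary measures for one numeric vector.
hit_summary <- function(x) {
  c(Min    = min(x),
    Q1     = unname(quantile(x, 0.25)),
    Median = median(x),
    Mean   = mean(x),
    Q3     = unname(quantile(x, 0.75)),
    Max    = max(x))
}

# One row per hit type, one column per measure.
print(round(t(sapply(hits, hit_summary)), 2))
```

Comparing Q1 and Q3 across rows gives the "data concentration" ranges discussed in the text.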


Other numerical variables

The following plot shows the distribution of four of the response variables that I will use in the planned analysis. The objective is to understand how these variables are represented individually in the data set and how they can be used to answer the main question.

  • Batting Average

    It is determined by dividing a player’s hits by his total at-bats. Batting average doesn’t take into account the number of times a batter reaches base via walks or hit-by-pitches. And it doesn’t take into account hit type (with a double, triple or home run being more valuable than a single). (MLB definition)

    The distribution of this variable appears approximately normal, considering that the mean (0.26) is very close to the median (0.27). It also indicates that, on average, players get a hit in about 26% of their at-bats. A higher batting average can depend on several factors, including good movement coordination and fast reactions. Do these factors depend on the player’s age or body mass? We will see this in the write-up of the project.


  • Total Bases

    The distribution of this variable has a small skew to the right. The data are concentrated between 212 and 270 bases. Total bases are directly related to hits, so the higher the number of bases, the better the player’s performance in the game.


  • Total Caught Stealing

    A caught stealing occurs when a runner attempts to steal but is tagged out before reaching second base, third base or home plate. This is a negative indicator for a player: it may mean that he was not fast enough to reach the next base before being tagged. The distribution of this variable is skewed to the right. It is not very common for players to have high numbers of caught stealing. The data are concentrated between 2 and 4.


  • Total Stolen Base

    Unlike caught stealing, a stolen base is a positive outcome for a player: the runner successfully advances to a base on his own initiative, which usually reflects speed and good timing. The distribution of this variable is skewed to the right and the data are concentrated between 2 and 13 records.
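Two of the counts in this data set are tied together by simple arithmetic, which makes a useful sanity check: total hits are the sum of the four hit types, and total bases weight each hit by the number of bases it earns. The sketch below checks both identities on the first two rows of the cleaned data set. The function and variable names are hypothetical.

```r
# H = 1B + 2B + 3B + HR
total_hits <- function(singles, doubles, triples, homers) {
  singles + doubles + triples + homers
}

# TB = 1B + 2 * 2B + 3 * 3B + 4 * HR
total_bases <- function(singles, doubles, triples, homers) {
  singles + 2 * doubles + 3 * triples + 4 * homers
}

# First two rows of the cleaned data set.
singles <- c(90, 82)
doubles <- c(16, 21)
triples <- c(0, 1)
homers  <- c(15, 32)

# Compare with the H column (121, 136) and the TB column (182, 255).
print(total_hits(singles, doubles, triples, homers))
print(total_bases(singles, doubles, triples, homers))
```

Because total bases are built from the hit counts, a strong relationship between the two variables is expected by construction.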


Univariate Analysis - Categorical Variable

The batters in this data set were born in different countries. The objective is to know the distribution of country of birth so that the write-up of the project can evaluate whether there is any relationship between a player’s country of birth and his performance in the game.


As expected, most of the players in this data set were born in the United States. However, there is a significant number of players who were born in other countries such as Venezuela, the Dominican Republic and Cuba. Do players born in these last three countries perform better in the game than players from the United States? This will be evaluated later.






Regression Model

To complement the previous analyses, I will calculate the correlation coefficient between each explanatory variable (age and BMI) and each response variable (total hits, batting average, total bases, total caught stealing and total stolen bases).

The objective is to determine the relationship between these variables and to estimate the slope and intercept with a regression model, in order to understand which of these variables are useful or significant for answering the question of this project. For each pair, I will choose the better of a linear model and a spline model.

Finally, and only if a relationship is significant for the analysis, the fitted model will be used to compute predictions.
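One way to carry out this plan is sketched below. It runs on synthetic data (generated with a fixed seed purely so the example is runnable), so none of its numbers relate to the real batters. Judging the "best" model with `anova()` and `AIC()` is an assumption, not necessarily the criterion used in this project.

```r
library(splines)  # ships with R; provides ns()

# Synthetic data: NOT the real batters.
set.seed(1)
n   <- 132
sim <- data.frame(player_age = rep(22:41, length.out = n))
sim$H <- round(150 - 0.8 * (sim$player_age - 22) + rnorm(n, sd = 20))

# Pearson correlation between the explanatory and the response variable.
r <- cor(sim$player_age, sim$H)
cat("Correlation:", round(r, 2), "\n")

# Candidate models: a straight line and a natural cubic spline with 5 df.
fit_linear <- lm(H ~ player_age, data = sim)
fit_spline <- lm(H ~ ns(player_age, 5), data = sim)

# Slope and intercept of the linear model.
print(coef(fit_linear))

# The straight line is a special case of the natural spline, so the two
# models are nested and can be compared with an F-test, or with AIC.
print(anova(fit_linear, fit_spline))
print(AIC(fit_linear, fit_spline))
```

If the F-test shows no significant improvement and the AIC of the linear model is lower, the simpler linear model would usually be preferred.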


Data Set

This is the data set after data cleaning and transformation have been applied.

## Rows: 132
## Columns: 18
## $ last_name     <chr> "Cabrera", "Cruz Jr.", "Peralta", "Blackmon", "McCutchen~
## $ first_name    <chr> " Miguel", " Nelson", " David", " Charlie", " Andrew", "~
## $ Team          <chr> "Detroit Tigers", "Tampa Bay Rays", "Arizona Diamondback~
## $ country_birth <chr> "Venezuela", "Dominican Republic", "Venezuela", "USA", "~
## $ Height        <int> 193, 188, 185, 190, 180, 180, 188, 183, 180, 183, 188, 1~
## $ Weight        <int> 121, 104, 95, 100, 88, 91, 99, 95, 97, 81, 94, 97, 85, 8~
## $ player_id     <int> 408234, 443558, 444482, 453568, 457705, 457759, 458015, ~
## $ player_age    <int> 38, 41, 34, 35, 35, 37, 38, 33, 35, 37, 34, 37, 32, 32, ~
## $ H             <int> 121, 136, 126, 139, 107, 148, 119, 121, 121, 112, 146, 1~
## $ `1B`          <int> 90, 82, 80, 97, 55, 99, 59, 91, 87, 70, 106, 123, 89, 80~
## $ `2B`          <int> 16, 21, 30, 25, 24, 22, 23, 25, 15, 28, 29, 31, 30, 26, ~
## $ `3B`          <int> 0, 1, 8, 4, 1, 0, 1, 2, 0, 0, 3, 0, 3, 5, 2, 3, 2, 1, 2,~
## $ HR            <int> 15, 32, 8, 13, 27, 27, 36, 3, 19, 14, 8, 15, 9, 28, 15, ~
## $ AVG           <dbl> 0.256, 0.265, 0.259, 0.270, 0.222, 0.278, 0.266, 0.243, ~
## $ TB            <int> 182, 255, 196, 211, 214, 251, 252, 159, 193, 182, 205, 2~
## $ CS            <int> 0, 0, 1, 0, 2, 0, 0, 2, 0, 0, 1, 1, 3, 0, 6, 0, 0, 3, 1,~
## $ SB            <int> 0, 3, 2, 3, 6, 3, 1, 13, 2, 0, 1, 1, 13, 1, 14, 0, 12, 5~
## $ BMI           <dbl> 32, 29, 28, 28, 27, 28, 28, 28, 30, 24, 27, 29, 25, 27, ~


Write Up

In baseball, as in any other sport, achieving the best results as a player depends on many variables that constantly change, such as the field of play, the combination of players, the umpires, the weather and even the player’s mood during the game. In other words, it can be difficult to predict exactly what the result of a future game will be. However, thanks to statistics and the data collected over the sport’s history, it is possible to establish patterns that identify players’ strengths and weaknesses, and to estimate the probability of certain events based on specific variables.

Considering the above, the objective of this project is to identify if the country of birth of a batter, his age or his body mass index are factors that impact the performance in the game, evaluating the number of hits, the batting average, the total bases, among other variables.


Batters’ Country of Birth Analysis

It may seem strange to think that a player’s country of birth has an impact on his performance in the game. To understand whether this is possible, it helps to know a little about the relationship each country has with the sport. As we saw in the exploratory data analysis within the project proposal, the players in this data set were born in eleven countries, with the United States, as expected, contributing the most players. The next three countries with the most players are the Dominican Republic, Venezuela, and Cuba.


- In the Dominican Republic, baseball has been part of the culture for more than 100 years. It is not known exactly how the sport came to the country, but some historians suggest that it arrived in 1880.

- In Venezuela, the sport took off when American workers settled in the Zulia region to work in the oilfields. The Americans brought bats and balls with them and introduced the sport to the locals. The first baseball team was established in 1895, but it was not until 1941 that the game became a national obsession.

- In Cuba, the sport began to spread in 1860, brought by American migrants who arrived in the country. Today it is very important in Cuba. It is believed that Cubans fleeing a war in their country introduced the sport to the Dominican Republic and other regions of the Caribbean.

- As for the United States, little needs to be said: this is the country where the sport was born and, as noted in the previous paragraphs, the country that influenced its adoption elsewhere.


With this context in mind, let’s analyze these countries:

This graph shows the distribution of total hits by the batters’ country of birth. The green line represents the median of hits for the entire data set which, according to the earlier analysis in the project proposal, is 138 hits. Batters born in Venezuela have a lower median (124) than the overall median, while batters born in the Dominican Republic have the highest median of the four countries.

Batters born in the other two countries also show a median higher than the overall median. If performance in terms of hits were measured only by the median, it could be said that players born in Cuba, the United States and the Dominican Republic are good at hitting the ball.


Once the lm model was fitted to the data set, the following predictions were obtained:

Country of Birth     Prediction
Cuba                        139
Dominican Republic          141
USA                         142
Venezuela                   130


Although batters born in the Dominican Republic have a higher median than those born in the United States, the model predicts slightly more hits for a batter from the United States (by one hit).

Similarly, although players born in Venezuela have a median close to 125 hits, the model predicts about 130 hits for a player born in that country. The predictions need not match the medians, because a linear model estimates mean values rather than medians.
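The gap between medians and predictions noted above has a simple explanation: when a linear model has a single categorical predictor, its fitted value for each category is that category's mean. The toy example below uses made-up numbers and hypothetical country labels A and B. It assumes the model had the form `lm(H ~ country_birth)`, which the text does not state explicitly.

```r
# Toy data with made-up numbers: NOT the real batters.
toy <- data.frame(
  country_birth = c("A", "A", "A", "B", "B", "B"),
  H             = c(100, 110, 150, 120, 125, 130)
)

# Linear model with country of birth as the only predictor.
fit <- lm(H ~ country_birth, data = toy)

# One prediction per country.
new_countries <- data.frame(country_birth = c("A", "B"))
pred <- predict(fit, newdata = new_countries)

# Group means and medians for comparison.
group_means   <- tapply(toy$H, toy$country_birth, mean)
group_medians <- tapply(toy$H, toy$country_birth, median)

print(cbind(prediction = pred, mean = group_means, median = group_medians))
```

For country A the prediction matches the mean (120) rather than the median (110), because one high value pulls the mean up. A right-skewed hit distribution within a country has the same effect.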

Now, let’s analyze the relationship between the players’ country of birth and four new variables. The green line represents the median of the total data, which may be different for each country of birth. The red diamonds represent the prediction for each country once the lm model has been applied.

This distribution shows that batters born in the Dominican Republic have a median batting average and median total bases higher than the general median, which indicates good performance on these two variables. Players born in Venezuela, in contrast, have medians below the general median for both.

In terms of caught stealing and stolen bases, players born in the Dominican Republic also have medians higher than the general median. This is a mixed result: more stolen bases means more bases gained, but more caught stealing means more outs for the team.


Now let’s analyze the predictions produced by the lm model:

Country of Birth       BA   TB  CS  SB
Cuba                0.266  235   3   5
Dominican Republic  0.272  247   4  14
USA                 0.265  243   2   9
Venezuela           0.253  221   2   5

According to the model, players born in the Dominican Republic have a predicted batting average of 0.272, which is very good considering that the league-wide batting average has generally hovered around 0.250, as MLB indicates on its official website. The same holds for players born in Cuba and the USA, who do not have the highest predicted batting average but are still above that level. Venezuelan-born batters, on the other hand, have the lowest predicted batting average.


A similar pattern can be seen in the Total Bases predictions: Dominican-born batters lead the list with the highest value, while the prediction for Venezuelan-born batters (221) is the lowest and falls below both the overall mean (242) and the overall median (237).

Turning to the predictions for caught stealing and stolen bases, players from the Dominican Republic again have the highest numbers in both variables. These two results point in opposite directions: a high stolen-base count is favorable, because each one is a base gained, while a high caught-stealing count is unfavorable, because each one is an out for the team.

In conclusion, players born in the Dominican Republic are good at hitting the ball and gaining bases, but they also tend to be caught stealing more often.


Batters’ Age Analysis

The human body changes constantly, and there are periods when these changes are more noticeable, such as childhood, adolescence, and old age. During the first stages of life, bones and muscles strengthen and attention and flexibility increase, among other things, while in the last stage these changes begin to reverse.

There is also an intermediate stage in which the changes are not significant under ordinary conditions, but in an environment such as sports, training can extend many of those strengths for longer.

Is age a characteristic that impacts batters’ performance in the game? The batters in this data set are between 22 and 41 years old. Do young players perform better simply because they are young? Or, on the contrary, do older players perform better because they have spent more time on the field and developed skills that younger players do not yet have? To answer these questions, I will apply a regression model to assess whether age should be considered when evaluating how well a player can perform in the game.


Relationship between Batters’ age and Total Hits

To determine whether there is any relationship between the age of the players and total hits, I created a scatterplot and added a fitted line to reveal any pattern. The line comes from a natural cubic spline with five degrees of freedom, fitted with: lm(H ~ ns(player_age, 5), data = baseball)


The correlation coefficient for this relationship is:

## [1] -0.19

According to the plot and the calculated correlation coefficient, the relationship between these two variables is very weak, with a negative trend: older players appear to get slightly fewer hits.

Following the fitted curve (green line), there is a dip at age 30: the number of hits decreases between ages 29 and 30 and then increases slightly between 30 and 31.

Let’s look at the analysis of the model in detail:

## 
## Call:
## lm(formula = H ~ ns(player_age, 5), data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.105 -14.427  -2.457  13.851  53.895 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         157.587     10.705  14.721   <2e-16 ***
## ns(player_age, 5)1  -15.062     11.220  -1.342   0.1819    
## ns(player_age, 5)2  -23.164     13.181  -1.757   0.0813 .  
## ns(player_age, 5)3   -8.174     12.061  -0.678   0.4992    
## ns(player_age, 5)4  -38.905     26.138  -1.488   0.1391    
## ns(player_age, 5)5  -26.516     16.965  -1.563   0.1206    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.44 on 126 degrees of freedom
## Multiple R-squared:  0.0503, Adjusted R-squared:  0.01262 
## F-statistic: 1.335 on 5 and 126 DF,  p-value: 0.2538


  • Intercept: 157.587. Because the default ns() basis is zero at its lower boundary (the minimum age in the data), the intercept is the fitted number of hits for the youngest players, aged 22. It is not a value at age zero, which would lie far outside the data.

  • Spline coefficients: Because the model is a spline, the five remaining estimates are weights on the spline basis functions, not knots or per-year slopes, so a value such as -15.062 cannot be read as “15 fewer hits per additional year of age”. The shape of the relationship is best read from the fitted curve in the plot. None of the five coefficients is significant at the 5% level (the smallest p-value is 0.0813).

  • Multiple R-Squared: This value, 0.0503, means that age explains only about 5% of the variation in hits, and the overall F-test is not significant (p-value 0.2538). In this data set there is therefore no evidence that age determines a batter’s performance, as far as hits are concerned.

  • Because the relationship between these two variables is so weak, I do not consider it necessary to perform a residual analysis.
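Spline coefficients are awkward to interpret directly, so it can help to query the fitted curve with `predict()` instead. The sketch below uses synthetic data (fixed seed, unrelated to the real batters) with the same model form as above. It illustrates two points: the default `ns()` basis is zero at the lower boundary knot, so the intercept equals the fitted value at the youngest age in the data, and R-squared is the share of variance the model explains.

```r
library(splines)  # ships with R; provides ns()

# Synthetic data: NOT the real batters.
set.seed(42)
n   <- 132
sim <- data.frame(player_age = rep(22:41, length.out = n))
sim$H <- round(145 - 0.5 * (sim$player_age - 22) + rnorm(n, sd = 20))

fit <- lm(H ~ ns(player_age, 5), data = sim)

# Row of the spline basis for the youngest age: all entries are zero.
basis        <- unclass(ns(sim$player_age, 5))
basis_at_min <- basis[which.min(sim$player_age), ]
print(basis_at_min)

# Read the fitted curve at a few ages instead of reading coefficients.
ages         <- data.frame(player_age = c(22, 25, 30, 35, 41))
fitted_curve <- predict(fit, newdata = ages)
print(cbind(ages, fitted_H = round(fitted_curve, 1)))

# The intercept equals the fitted value at the minimum age (22).
print(c(intercept = unname(coef(fit)[1]), fitted_at_22 = unname(fitted_curve[1])))

# R-squared by hand: 1 - residual sum of squares / total sum of squares.
rss <- sum(residuals(fit)^2)
tss <- sum((sim$H - mean(sim$H))^2)
r2  <- 1 - rss / tss
print(c(by_hand = r2, from_summary = summary(fit)$r.squared))
```

Differences between consecutive fitted values give the change in expected hits over an age interval, which is the quantity a per-year "slope" reading is usually after.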


Relationship between Batters’ age and Batting Average

To determine whether there is any relationship between the age of the players and the batting average, I created a scatterplot and added a fitted line to reveal any pattern. The line comes from a natural cubic spline with five degrees of freedom, fitted with: lm(AVG ~ ns(player_age, 5), data = baseball)

The correlation coefficient for this relationship is:

## [1] -0.14

According to the plot and the calculated correlation coefficient, the relationship between these two variables is very weak, with a negative trend: older players appear to have a slightly lower batting average.

Following the fitted curve (green line), there is a marked decrease in the batting average between ages 22 and 25. The decline then continues, but more slowly. After age 32, the average slowly increases again.

Let’s look at the analysis of the model in detail:

## 
## Call:
## lm(formula = AVG ~ ns(player_age, 5), data = baseball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06485 -0.01599  0.00082  0.01536  0.06415 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.29094    0.01271  22.882   <2e-16 ***
## ns(player_age, 5)1 -0.02571    0.01333  -1.929   0.0559 .  
## ns(player_age, 5)2 -0.03276    0.01566  -2.093   0.0384 *  
## ns(player_age, 5)3 -0.01602    0.01433  -1.119   0.2655    
## ns(player_age, 5)4 -0.04921    0.03105  -1.585   0.1154    
## ns(player_age, 5)5 -0.01075    0.02015  -0.533   0.5948    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02546 on 126 degrees of freedom
## Multiple R-squared:  0.05254,    Adjusted R-squared:  0.01494 
## F-statistic: 1.397 on 5 and 126 DF,  p-value: 0.2297


  • Intercept: 0.290. Because the default ns() basis is zero at its lower boundary (the minimum age in the data), the intercept is the fitted batting average for the youngest players, aged 22. It is not a value at age zero, which would lie far outside the data.

  • Spline coefficients: Because the model is a spline, the five remaining estimates are weights on the spline basis functions, not knots or per-year slopes, so a value such as -0.02571 cannot be read as a drop of 0.025 per additional year of age. The shape of the relationship is best read from the fitted curve in the plot. One coefficient (the second, p-value 0.0384) falls below the 5% level on its own, but the model as a whole is not significant.

  • Multiple R-Squared: This value, 0.0525, means that age explains only about 5% of the variation in batting average, and the overall F-test is not significant (p-value 0.2297). In this data set there is therefore no evidence that age determines a batter’s performance, as far as batting average is concerned.

  • Because the relationship between these two variables is so weak, I do not consider it necessary to perform a residual analysis.


What happens if we include the country of birth in this analysis?

Now we are going to see the previous relationships by adding a categorical variable, the players’ country of birth. For this analysis, I am including the four countries with the most players, these are: United States (US), Dominican Republic (DO), Cuba (CU) and Venezuela (VE).

The model used in this analysis is lm(y ~ x1 * x2, data), where y is the response variable (total hits or batting average), x1 is the continuous variable age and x2 is the categorical variable country of birth. The * operator includes both main effects and their interaction, so each country gets its own intercept and its own slope for age.



Now we can see a positive relationship between age and total hits for players born in Cuba, with a correlation coefficient of 0.34. In other words, in this data set older Cuban-born players tend to have more hits. Players born in the Dominican Republic and the United States, in contrast, tend to have fewer hits as they get older: the correlation is -0.20 for players born in the United States and -0.24 for those born in the Dominican Republic.

The same pattern is observed when comparing age, batting average and country of birth: players from Cuba show a positive relationship between age and batting average.
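The interaction model and the per-country correlations can be sketched as follows. The data are synthetic: two hypothetical groups, labelled CU and US only for readability, with slopes chosen by hand, so the output says nothing about real players.

```r
# Synthetic data: NOT the real batters.
set.seed(7)
sim <- data.frame(
  player_age    = rep(22:41, times = 2),
  country_birth = rep(c("CU", "US"), each = 20)
)
true_slope <- ifelse(sim$country_birth == "CU", 1.5, -1.0)
sim$H <- 140 + true_slope * (sim$player_age - 30) + rnorm(nrow(sim), sd = 5)

# Age, country, and their interaction: one intercept and slope per country.
fit <- lm(H ~ player_age * country_birth, data = sim)
print(coef(fit))

# CU is the reference level, so its slope is the player_age coefficient.
# The US slope adds the interaction term to it.
slope_cu <- unname(coef(fit)["player_age"])
slope_us <- slope_cu + unname(coef(fit)["player_age:country_birthUS"])
print(c(CU = slope_cu, US = slope_us))

# Correlation between age and hits within each country.
by_country <- sapply(split(sim, sim$country_birth),
                     function(d) cor(d$player_age, d$H))
print(round(by_country, 2))
```

The per-country slopes from the interaction model are the same as those from fitting a separate simple regression to each country. With only a handful of players in a group, such slopes and correlations are very uncertain.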


Batters’ BMI Analysis

In some sports, such as athletics, body mass matters because, by Newton’s second law, a lighter body accelerates faster under the same force. In the context of baseball, and considering the variables that we are evaluating, the objective is to determine whether there is any relationship between the BMI of the players and the number of hits or the batting average. Is it possible that a higher BMI means more arm strength to achieve a hit, or better precision when hitting the ball? Let’s analyze these variables by applying regression models.


Relationship between Batters’ BMI and Total Hits

To determine whether there is any relationship between the BMI of the players and total hits, I created a scatterplot and added a fitted line to reveal any pattern. The line comes from a natural cubic spline with five degrees of freedom, fitted with: lm(H ~ ns(BMI, 5), data = baseball)


The correlation coefficient for this relationship is:

## [1] -0.15

According to the plot and the calculated correlation coefficient, the relationship between these two variables is very weak, with a negative trend: the higher the BMI, the fewer hits players appear to get.

Following the fitted curve (red line), hits decrease as the players’ BMI increases from 25 to 28, but when the BMI goes from 28 to 30, hits slowly increase again.

Let’s look at the analysis of the model in detail:

## 
## Call:
## lm(formula = H ~ ns(BMI, 5), data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.293 -15.154  -0.586  14.818  47.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  149.926     16.453   9.112 1.57e-15 ***
## ns(BMI, 5)1  -12.449     16.042  -0.776    0.439    
## ns(BMI, 5)2  -18.647     18.017  -1.035    0.303    
## ns(BMI, 5)3   -5.343     13.668  -0.391    0.696    
## ns(BMI, 5)4   -5.720     36.806  -0.155    0.877    
## ns(BMI, 5)5  -18.670     18.371  -1.016    0.311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.27 on 126 degrees of freedom
## Multiple R-squared:  0.06522,    Adjusted R-squared:  0.02812 
## F-statistic: 1.758 on 5 and 126 DF,  p-value: 0.1262


  • Intercept: 149.9 (about 150 hits) is the constant term of the model. Because the ns() basis terms are all zero at the lower boundary knot, which by default is the smallest BMI in the data (22), the intercept is the predicted number of hits for a player at that minimum BMI. It is not a prediction for a BMI of zero, a value that would make no sense and lies far outside the data.

  • Slope: Because this model is not linear in BMI, it has no single slope. The five ns(BMI, 5) coefficients are weights on the spline basis functions, so they cannot be read as a change in hits per unit of BMI. All five are negative (from -5.3 to -18.7), but none is statistically significant (all p-values are above 0.3), and the overall F-test (p = 0.126) gives no evidence of a relationship.

  • Multiple R-Squared: This value, 0.0652, means that BMI explains only about 6.5% of the variation in the number of hits (2.8% after adjusting for the number of terms in the model). BMI therefore does not appear to be a meaningful factor in a batter's performance, as far as hits are concerned.

  • Because the relationship between these two variables is so weak, I do not consider it necessary to perform a residual analysis.
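A quick way to check what the intercept represents in a model built on ns() is sketched below, again on simulated data (the data frame d and its values are assumptions). With the default settings, the lower boundary knot is the smallest observed BMI and every spline basis column is zero there, so the intercept should match the fitted value at the minimum BMI rather than at a BMI of zero.

```r
library(splines)

# Simulated stand-in data (assumption: not the real data set)
set.seed(3)
d <- data.frame(BMI = runif(132, min = 22, max = 33))
d$H <- 150 - 1.2 * (d$BMI - 22) + rnorm(132, sd = 20)

fit <- lm(H ~ ns(BMI, 5), data = d)

# Evaluate the spline basis at the smallest observed BMI:
# all five columns are (numerically) zero at the lower boundary knot
basis_at_min <- predict(ns(d$BMI, 5), newx = min(d$BMI))
max(abs(basis_at_min))

# Hence the intercept equals the fitted value at the minimum BMI
pred_at_min <- predict(fit, newdata = data.frame(BMI = min(d$BMI)))
all.equal(unname(coef(fit)[1]), unname(pred_at_min))
```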


Relationship between Batters’ BMI and Batting Average

To determine whether there is any relationship between the players' BMI and their batting average, I created a scatterplot and added a fitted line to reveal any pattern in this relationship. The line comes from a model that is non-linear in BMI, a natural cubic spline with five degrees of freedom: lm(AVG ~ ns(BMI, 5), data = baseball)

The correlation coefficient for this relationship is:

## [1] -0.15

Based on the plot and the calculated correlation coefficient, the relationship between these two variables is very weak, with a negative trend: the higher the BMI, the lower the batting average appears to be.

Following the direction of the model (red line), the batting average decreases markedly for BMI values between 22 and 28. Above a BMI of 30, the average slowly increases again.

Let’s look at the analysis of the model in detail:

## 
## Call:
## lm(formula = AVG ~ ns(BMI, 5), data = baseball)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.063987 -0.014040  0.000622  0.016616  0.058628 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.29122    0.01960  14.855   <2e-16 ***
## ns(BMI, 5)1 -0.03092    0.01911  -1.618    0.108    
## ns(BMI, 5)2 -0.03285    0.02147  -1.530    0.128    
## ns(BMI, 5)3 -0.02151    0.01629  -1.321    0.189    
## ns(BMI, 5)4 -0.03889    0.04386  -0.887    0.377    
## ns(BMI, 5)5 -0.01883    0.02189  -0.860    0.391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02534 on 126 degrees of freedom
## Multiple R-squared:  0.06145,    Adjusted R-squared:  0.0242 
## F-statistic:  1.65 on 5 and 126 DF,  p-value: 0.1516


  • Intercept: 0.291 is the constant term of the model. Because the ns() basis terms are all zero at the lower boundary knot, which by default is the smallest BMI in the data (22), the intercept is the predicted batting average for a player at that minimum BMI. It is not a prediction for a BMI of zero, a value that would make no sense and lies far outside the data.

  • Slope: Because this model is not linear in BMI, it has no single slope. The five ns(BMI, 5) coefficients are weights on the spline basis functions, so they cannot be read as a change in batting average per unit of BMI. All five are negative (from -0.019 to -0.039), but none is statistically significant (all p-values are above 0.1), and the overall F-test (p = 0.152) gives no evidence of a relationship.

  • Multiple R-Squared: This value, 0.0615, means that BMI explains only about 6.1% of the variation in batting average (2.4% after adjusting for the number of terms in the model). BMI therefore does not appear to be a meaningful factor in a batter's performance, as far as batting average is concerned.

  • Because the relationship between these two variables is so weak, I do not consider it necessary to perform a residual analysis.
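To make the meaning of Multiple R-Squared concrete, the sketch below computes it by hand as the share of variance explained, and derives the adjusted value from it. The data are simulated (the data frame d and its values are assumptions); only the last line uses numbers taken from the model output above.

```r
library(splines)

# Simulated stand-in data (assumption: not the real data set)
set.seed(4)
d <- data.frame(BMI = runif(132, min = 22, max = 33))
d$AVG <- 0.29 - 0.002 * (d$BMI - 22) + rnorm(132, sd = 0.025)

fit <- lm(AVG ~ ns(BMI, 5), data = d)

# R-squared = 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((d$AVG - mean(d$AVG))^2)
r2  <- 1 - rss / tss

# Adjusted R-squared penalizes the number of predictors p
n <- nrow(d)
p <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both should agree with what summary() reports
all.equal(r2, summary(fit)$r.squared)
all.equal(adj_r2, summary(fit)$adj.r.squared)

# Same formula applied to the reported output: R-squared 0.06145,
# 132 observations, 5 spline terms, giving roughly 0.0242
reported_adj <- 1 - (1 - 0.06145) * (132 - 1) / (132 - 5 - 1)
reported_adj
```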


What happens if we include the country of birth in this analysis?

Now we are going to revisit the previous relationships by adding a categorical variable, the players' country of birth. For this analysis, I am including the four countries with the most players: United States (US), Dominican Republic (DO), Cuba (CU) and Venezuela (VE).

The model used in this analysis is the following: lm(y ~ x1 * x2, data), where y is the response variable (total hits or batting average), x1 is the continuous variable BMI and x2 is the categorical variable country of birth.
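A minimal sketch of an interaction model of this form, with BMI as the continuous predictor, is given here on simulated data. The column names BMI, country and AVG, and all values, are assumptions for illustration. The * operator expands to both main effects plus their interaction, so each country gets its own intercept and its own BMI slope.

```r
# Simulated stand-in data (assumption: not the real data set)
set.seed(5)
countries <- c("US", "DO", "CU", "VE")
d <- data.frame(
  BMI     = runif(200, min = 22, max = 33),
  country = factor(sample(countries, 200, replace = TRUE), levels = countries)
)
d$AVG <- 0.29 - 0.002 * (d$BMI - 22) + rnorm(200, sd = 0.025)

# BMI * country is shorthand for BMI + country + BMI:country
fit_int <- lm(AVG ~ BMI * country, data = d)
names(coef(fit_int))

# Comparing against the additive model tests whether the BMI slope
# differs between countries
fit_add <- lm(AVG ~ BMI + country, data = d)
anova(fit_add, fit_int)
```

The first factor level (US here) acts as the reference, so the BMI:country terms are differences in slope relative to US-born players.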


In the first graph, a pattern similar to the one seen in the analysis of the players' age is observed: for Cuban-born batters, the higher the BMI, the more hits they tend to have.

Regarding batting average, for players born in Cuba and Venezuela there is no apparent relationship between the two variables; that is, an increase in BMI does not significantly affect the batting average. However, for players born in the Dominican Republic and the USA the relationship is negative, with correlation coefficients of -0.44 and -0.22 respectively, indicating that the higher the BMI, the lower the batting average tends to be.
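Per-country correlation coefficients like the ones quoted above could be obtained along these lines. This is only a sketch on simulated data (the column names and values are assumptions), not the code used in the project.

```r
# Simulated stand-in data (assumption: not the real data set)
set.seed(6)
countries <- c("US", "DO", "CU", "VE")
d <- data.frame(
  BMI     = runif(200, min = 22, max = 33),
  country = sample(countries, 200, replace = TRUE)
)
d$AVG <- 0.29 - 0.002 * (d$BMI - 22) + rnorm(200, sd = 0.025)

# Split the rows by country and compute one correlation per group
by_country <- sapply(split(d, d$country), function(g) cor(g$BMI, g$AVG))
round(by_country, 2)

# Group sizes matter: a correlation from a handful of players is unstable
table(d$country)
```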

Conclusion and discussion


The purpose of this project was to identify whether the age, body mass index, or country of birth of batters influences their performance in the game. Once the different analyses were carried out, it can be concluded that neither the age nor the BMI of the players has a close relationship with their performance in the game, at least as measured by the number of hits or the batting average.

As the Multiple R-Squared results show, every relationship has a value of roughly 6.5% or lower, which indicates that neither age nor BMI explains more than a small share of the variation in the response variables (hits and batting average).

On the other hand, considering the correlation coefficient, all the overall relationships analyzed are weak, since the coefficient is below 0.2 in absolute value. One visible pattern is that the coefficients are all negative, which indicates that hits and batting average tend to decrease as the players' age and BMI increase.

Analyzing the players' country of birth, it can be concluded that players from the Dominican Republic have the best performance in the game, with predicted values above the average, while players born in Venezuela tend to have the lowest predicted values.

Regarding the methods used for this analysis, different models were applied to identify the best possible prediction. However, the relationship between the explanatory and response variables was very weak, so the Multiple R-Squared was low in every model. I would have liked to visualize the residuals of the predictions, but with such low explanatory power that analysis would have added little.

Regarding the data set, it can be concluded that the analyzed variables do not present outliers that affect the prediction result. In fact, both age and BMI, which were the explanatory variables, had a distribution very close to being normal.

If I could start the project again, I would look for other variables that might have a stronger relationship with performance. For example, it would be interesting to analyze the players' training hours, type of diet, number of games played and hours of sleep, among other variables, to find out which characteristics actually affect their performance in the game.


Resources

The following sources, some of which also provided images, were consulted for the development of this project:


  • What is my BMI? (2020, September 17). Centers for Disease Control and Prevention.

  • Why are Venezuelans so crazy about beisbol? (2016, January 2). Caracas Chronicles.

  • DR1.com. (n.d.). DR1.com - Dominican Republic News & Travel Information Service.

  • Cuban baseball - A guide to baseball in Cuba. (2018, February 28). Cuban Travel Business.

  • Baseball in America: A history. (2017, February 11).