Which skill, once improved, has the greatest impact on a player’s overall soccer ability? Perhaps you think shooting matters more than the standing tackle. Or perhaps you think dribbling is the single biggest factor in a player’s skill.
FIFA Ultimate Team is a game mode in the popular soccer game FIFA, in which users collect player cards, build their own teams and compete with one another. The game includes a database of players, to whom the publisher EA has assigned skill values in various categories such as speed, acceleration, shooting power, etc. For instance, Cristiano Ronaldo’s card looks like this below. On top of the six values on the face of the card, each player also has various other traits with associated values. In this blog, I will refer to these as player abilities. In case you are unaware, all of these values are on a scale of 1-100, where 100 is the best.
Having discovered such a complete data set of the statistics of over 10,000 soccer players, I decided to write a simple Python script to answer the above question.
I began by extracting the useful data. The data set has many columns, but I only need the ones from "crossing" through "sliding tackle", which hold the skill values of all those players in the various ability categories.
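The column selection described above can be sketched with pandas label-based slicing. The tiny DataFrame below is invented for illustration; only the column names `crossing`, `sliding_tackle` and `overall_rating` mirror the ones used in this post, and the other columns are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real data set (all values invented).
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "overall_rating": [80, 70, 60],
    "crossing": [75, 60, 50],
    "finishing": [82, 65, 40],
    "sliding_tackle": [30, 55, 62],
    "gk_diving": [10, 12, 9],
})

# .loc slices by label and includes BOTH endpoints,
# so every column from "crossing" through "sliding_tackle" is kept.
abilities = toy.loc[:, "crossing":"sliding_tackle"]
print(list(abilities.columns))
```

Because the slice is label-based, it depends on the column order in the file: anything that sits between the two endpoint columns is included, whether it is an ability or not.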
Then, by examining the overall trends in the data, I decided to utilize the most basic regression algorithm - linear regression - since all "player ability vs overall rating" graphs display an upward, linear trend. Want to know how linear regression works using gradient descent? Check out our earlier blog!
To answer the question, I focused on the slope of the best-fit line for each ability: the steeper the slope, the bigger the increase in overall rating for a given increase in that ability. For instance, suppose the best-fit line of shooting vs overall rating has the steepest slope of all the relationships. This means that an improvement in shooting brings the greatest gain in the player's overall rating, making shooting the most important skill. Think of it like specific heat capacity in chemistry: the amount of heat required to raise the temperature of 1 kilogram of a substance by 1 kelvin. Here, raising a player's overall rating by 1 would require the smallest increase in shooting ability.
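To make the slope idea concrete, here is a minimal sketch of the closed-form slope of a single-feature least-squares line, which is the covariance of the ability and the rating divided by the variance of the ability. The numbers are invented for illustration and the helper name `slope` is my own, not something from a library.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope of y on x for a single feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    return float(np.sum(dx * (y - y.mean())) / np.sum(dx * dx))

# Invented example: five players' shooting values and overall ratings.
shooting = [40, 50, 60, 70, 80]
overall = [50, 56, 62, 68, 74]

m = slope(shooting, overall)
print("rating points gained per +1 shooting:", round(m, 3))
print("shooting points needed per +1 rating:", round(1 / m, 3))
```

The second printed number is the "specific heat capacity" reading of the slope: its reciprocal is how much of the ability it takes to move the overall rating by one point.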
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

soccer_data = pd.read_csv('fifa.csv')
coef = []
names = []

# Ability columns (features) and the overall rating (target)
stats = soccer_data.loc[:, 'crossing':'sliding_tackle']
rating = soccer_data['overall_rating'].values

# Fit one single-feature linear regression per ability and record its slope
for i in np.arange(0, len(stats.columns)):
    col_name = stats.columns[i]
    names.append(col_name)
    X = stats.iloc[:, i].values.reshape(-1, 1)
    lin_reg = LinearRegression()
    lin_reg.fit(X, rating)
    coef.append(lin_reg.coef_[0])

# Pair each ability name with its slope and print the pair with the largest slope
result = np.array([names, coef]).transpose()
print(result[np.array(coef).argmax()])
Out[0]: ['reactions' '0.6566267597553436']
The last line of the above script prints the ability with the largest "coef_" value, which is the slope of that ability's regression line. It turns out that reactions is the most important skill according to FIFA.
The ability with the smallest slope coefficient is balance.
Out[1]: ['balance' '0.056336173296759225']
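The lookup of the largest and smallest slope can be sketched as follows. The two slope values for reactions and balance are the ones printed above; the third entry, `example_skill`, is a hypothetical placeholder added only so that the sort has something in between.

```python
import numpy as np

names = ["reactions", "balance", "example_skill"]
# First two slopes come from the output above; the last one is invented.
coef = [0.6566267597553436, 0.056336173296759225, 0.3]

# argsort returns the indices that would sort the slopes in ascending order.
order = np.argsort(coef)
weakest = names[order[0]]
strongest = names[order[-1]]

print("smallest slope:", weakest, coef[order[0]])
print("largest slope:", strongest, coef[order[-1]])
```

Keeping the names and the slopes in separate sequences leaves the slopes as floats. Stacking them into one NumPy array, as the script above does, converts every slope to a string, which is why the printed values appear in quotes.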
Being an avid FIFA fan myself, I am quite skeptical of these players’ “stats”, the values that represent their soccer ability. When you begin to compare players, the numbers look even more absurd: when one player is clearly a lot faster than another, the stats on these FIFA cards often say otherwise. This experiment was purely for me to practice using data analysis tools such as NumPy and the machine learning library scikit-learn. Though the ability to react swiftly to different situations in a game is definitely important, I do not think the conclusion drawn from this experiment is accurate, as I did not take into account the correlation of the data or whether the model under-fitted.
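One way to see why the slope alone can mislead is to compare it with the correlation. The sketch below uses synthetic data, not the FIFA data set, and all names in it are hypothetical. It builds two abilities that carry exactly the same information about the rating, one of them simply spread over twice the range, and shows that their slopes differ while their correlations do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ratings and one synthetic ability that tracks them with noise.
rating = rng.normal(65, 7, size=500)
narrow = 50 + 0.5 * (rating - 65) + rng.normal(0, 2, size=500)
# Same ability stretched to twice the spread (a pure change of scale).
wide = 2 * narrow - 50

def slope_and_r(x, y):
    """Least-squares slope of y on x and the Pearson correlation."""
    m = np.polyfit(x, y, 1)[0]
    r = np.corrcoef(x, y)[0, 1]
    return m, r

m_narrow, r_narrow = slope_and_r(narrow, rating)
m_wide, r_wide = slope_and_r(wide, rating)

print("narrow: slope", round(m_narrow, 3), "r", round(r_narrow, 3))
print("wide:   slope", round(m_wide, 3), "r", round(r_wide, 3))
```

The stretched ability gets half the slope even though it predicts the rating equally well. For a single-feature linear fit, the squared correlation equals R², so checking it for each ability would also give a first indication of how well each line fits.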
You can find the dataset that I used here