If there's a place in analytics where eyeballs still have some edge over computers, it's probably in figuring out playing styles. Watching a game, it's easy enough to see whether a basketball team plays mostly man or zone defense, whether a college football team runs a pro-style or spread offense, etc. But figuring that out from box scores, or even play-by-play data alone? Not nearly as easy.
In tennis, the same issue applies. Watching matches can easily tell us that Rafael Nadal is an aggressive baseliner, John Isner has a booming serve, and Fabio Fognini is a counterpuncher, but how can we infer that from stats alone?
There's no shortage of metrics to measure playing style in tennis. However, disaggregating the effects of surfaces on those stats and adjusting for quality of competition is a complicated task in itself. The alternative is a simpler, more indirect measure of playing style: how well players do on each type of surface (clay, hard, indoor hard, and grass). There's plenty of variation between players on surface-specific performance alone, and those surface factors lend themselves to some pretty neat graphing techniques. The graph below is largely inspired by a Sloan presentation that used a similar analysis to redefine NBA positions.
I took the top 300 men's players according to Advanced Baseline and looked at their surface factors for each of the four major surfaces. I ran a k-nearest neighbors analysis to connect each player to the 5 other players whose surface factors are most similar. The closer their surface factors, the higher the weight assigned to their connection. Once the nearest-neighbor network was constructed, I ran a community detection algorithm to group the players according to distinct playing styles. Below is a force-directed graph of the resulting network. Each color represents a playing style, and each player's node is scaled according to their AB base rank.
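For readers who want to try something similar, here is a minimal sketch of the nearest-neighbor and grouping steps. It runs on made-up surface factors for six hypothetical players, not on the real Advanced Baseline data. The post doesn't say which distance measure, weighting formula, or community detection algorithm was used, so Euclidean distance, a 1 / (1 + distance) weight, and a simple label propagation routine are stand-ins chosen for illustration.

```python
import math

# Hypothetical surface factors (clay, hard, indoor hard, grass); not real data.
players = {
    "clay_a": (1.30, 0.90, 0.85, 0.80),
    "clay_b": (1.28, 0.92, 0.83, 0.82),
    "clay_c": (1.32, 0.88, 0.86, 0.79),
    "hard_a": (0.85, 1.20, 1.15, 1.00),
    "hard_b": (0.83, 1.22, 1.13, 1.02),
    "hard_c": (0.86, 1.18, 1.16, 0.99),
}


def knn_graph(factors, k):
    """Connect each player to its k nearest neighbors; closer pairs get heavier edges."""
    edges = {}
    for name, vec in factors.items():
        dists = sorted(
            (math.dist(vec, other_vec), other_name)
            for other_name, other_vec in factors.items()
            if other_name != name
        )
        for d, other_name in dists[:k]:
            key = tuple(sorted((name, other_name)))
            edges[key] = 1.0 / (1.0 + d)  # assumed weighting, not from the post
    return edges


def label_propagation(nodes, edges, max_rounds=100):
    """Stand-in community detection: each node adopts the heaviest label among its neighbors."""
    adj = {n: {} for n in nodes}
    for (a, b), w in edges.items():
        adj[a][b] = w
        adj[b][a] = w
    labels = {n: n for n in nodes}  # every node starts in its own group
    for _ in range(max_rounds):
        changed = False
        for n in sorted(nodes):
            totals = {}
            for m, w in adj[n].items():
                totals[labels[m]] = totals.get(labels[m], 0.0) + w
            if not totals:
                continue
            # Sort first so ties are broken the same way on every run.
            best = max(sorted(totals), key=lambda lab: totals[lab])
            if best != labels[n]:
                labels[n] = best
                changed = True
        if not changed:
            break
    return labels


# The post uses k=5 on 300 players; k=2 suits this six-player toy set.
edges = knn_graph(players, k=2)
labels = label_propagation(list(players), edges)
for name in sorted(players):
    print(name, "->", labels[name])
```

On this toy data the three clay-leaning players end up sharing one label and the three hard-court players another. With real data, a graph library's community detection and force-directed layout would be the more natural tools.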
Looks pretty, right? I think the clustering is pretty close to how we would separate playing styles by the eyeball test. The graph by itself doesn't say explicitly what playing style each color represents, but we can reverse-engineer the coloring process by seeing how the surface factors within each color are similar.
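One way to do that reverse-engineering is to average the surface factors within each color and see which surface stands out. The sketch below assumes a hypothetical four-player data set and hypothetical color assignments; the names and numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

SURFACES = ("clay", "hard", "indoor hard", "grass")

# Hypothetical surface factors and color assignments; not real data.
factors = {
    "player_1": (1.30, 0.90, 0.85, 0.80),
    "player_2": (1.28, 0.92, 0.83, 0.82),
    "player_3": (0.85, 1.20, 1.15, 1.00),
    "player_4": (0.83, 1.22, 1.13, 1.02),
}
colors = {
    "player_1": "purple",
    "player_2": "purple",
    "player_3": "orange",
    "player_4": "orange",
}


def color_profiles(factors, colors):
    """Average each surface factor over the players sharing a color."""
    grouped = defaultdict(list)
    for name, vec in factors.items():
        grouped[colors[name]].append(vec)
    return {
        color: {s: mean(col) for s, col in zip(SURFACES, zip(*vecs))}
        for color, vecs in grouped.items()
    }


profiles = color_profiles(factors, colors)
for color, profile in profiles.items():
    best = max(profile, key=profile.get)
    print(color, "best surface:", best)
```

A group whose average is high on one surface and low everywhere else reads as a specialist group, while a flatter profile suggests all-surface players.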
Red: The All-Surface Players
Most of these players have high surface factors on at least two or three surfaces. This explains why players like Djokovic and Murray are in here. The ability to do well on multiple surfaces is a distinct enough trait that it got its own grouping.
Green: Clay Court Players, Non-Specialist
So how come Nadal isn't in that red group, when he can clearly do well on all surfaces? Because he's really good on clay. And being really good on clay, not just good, trumps being all-surface for classification purposes. Everyone here has clay as their best surface far and away, and they don't have many glaring weaknesses on other surfaces, unlike the following group.
Purple: Clay Court Players, Specialist
Pretty much everyone here can only thrive on clay. If you want a cheat sheet for your French Open land mines, look no further.
Orange: Hard Court Players
Interestingly enough, there's no big distinction between hard-court specialists and non-specialists like there is among clay-court players. I don't know why yet, but I suspect it points to an important difference between hard and clay surfaces.
Also, if you were looking for the Americans, they're pretty much all here.
The groupings aren't perfect when you compare actual playing styles to computer groupings, but as a starting point, I think they do pretty well. The real value beyond making pretty graphs is using these groupings as the basis for aging curves. The more accurately you can categorize players as similar, the more specific the aging curve you can assign them based on other similar players. In part two, I'll show the same graph on the women's side.