I'm not sure what exactly you meant by a pre-machine-learning perspective there, but don't you think it's possible to reverse-engineer the process? Say you pull the top 100 search results each for a couple of keywords. Build a cluster profile (with attributes like % nofollow, article length, social attributes, etc.) that is OK with big G, and try to mimic that with some random distortions.
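Something like this is what I'm picturing, in Python with made-up numbers and attribute names (in practice you'd scrape the attributes from the real top results yourself):

```python
# Rough sketch of the "mimic the cluster profile" idea above.
# All values here are invented placeholders, not real SERP data.
import numpy as np

rng = np.random.default_rng(42)

# Pretend these rows were scraped from the top 100 results for a keyword:
# [% nofollow links, article length (words), social shares]
serp_attributes = rng.normal(
    loc=[0.35, 1800, 120],   # hypothetical population means
    scale=[0.10, 600, 80],   # hypothetical population spreads
    size=(100, 3),
)

# "Cluster profile": per-attribute mean and spread of pages that rank,
# i.e. pages Google apparently considers OK.
profile_mean = serp_attributes.mean(axis=0)
profile_std = serp_attributes.std(axis=0)

def sample_target_attributes(distortion=0.5):
    # Mimic the profile with some random distortion so every page you
    # build doesn't land on exactly the same point.
    return profile_mean + rng.normal(0, distortion * profile_std)

print(sample_target_attributes())
```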
You really need to stop analyzing SEO from a pre-machine-learning perspective. At this point the Google ranking algorithm takes in a bunch of variables, builds a bunch of graphs, makes some comparisons, and decides how to rank.
If the algorithm detects some attribute of a page as an outlier (e.g. % nofollow links) compared to some population, it can compare it to other pages that are also outliers and cluster them all together. Within the cluster, each page can be labeled spam/not-spam by a combination of manual labeling and machine learning. If a cluster of N pages exists where every page is an outlier on the same attribute, then most likely any page with that attribute is spam.
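A toy version of that flow, just to show the mechanics (data, attributes, and thresholds are all invented):

```python
# Outlier-then-cluster sketch: flag pages that are statistical outliers,
# then group the outliers so pages that are weird in the same way land
# in the same cluster, ready for spam/not-spam labeling.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical pages described by [% nofollow links, % exact-match anchors]
normal_pages = rng.normal([0.30, 0.05], [0.08, 0.03], size=(500, 2))
spammy_pages = rng.normal([0.02, 0.60], [0.01, 0.05], size=(30, 2))
pages = np.vstack([normal_pages, spammy_pages])

# Step 1: flag outliers per attribute with a simple z-score test.
z = np.abs((pages - pages.mean(axis=0)) / pages.std(axis=0))
outliers = pages[(z > 3).any(axis=1)]

# Step 2: cluster the outliers; -1 means "noise", everything else is a
# group of pages that are outliers in the same way.
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(outliers)

# Step 3: if a few pages in a cluster get manually labeled spam, the whole
# cluster (and future pages matching its profile) can inherit that label.
for cluster_id in set(labels) - {-1}:
    members = outliers[labels == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} pages, "
          f"mean attributes {members.mean(axis=0).round(2)}")
```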
I'm sure there are manual tweaks to the algorithm, but I would assume that for the most part, the machine learning algorithms decide what to consider important for ranking, based on the context of the search. That might include percent nofollow links, it might include bounce rate, it might include percent of American users who click the link. Who the fuck knows. It doesn't matter.
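If it helps to picture it, here's the sort of thing I mean, with stand-in features, made-up labels, and an off-the-shelf model (the real signals and model are anyone's guess):

```python
# The point: a trained model, not a human, ends up deciding how much
# weight each signal gets. Everything below is a placeholder.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 2000

# Hypothetical per-page signals: [% nofollow, bounce rate, % US clicks]
X = rng.uniform(0, 1, size=(n, 3))
# Made-up relevance labels that happen to depend mostly on bounce rate.
y = (X[:, 1] + 0.1 * rng.normal(size=n) < 0.5).astype(int)

model = GradientBoostingClassifier().fit(X, y)

for name, weight in zip(["% nofollow", "bounce rate", "% US clicks"],
                        model.feature_importances_):
    print(f"{name}: {weight:.2f}")
```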
If you want to avoid the spam algorithm, minimize outlier attributes that would cluster you with other spam pages.
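Practically, that just means profiling your own page against pages that already rank and fixing anything that sticks out. A rough check with made-up numbers:

```python
# Before publishing, see whether any attribute of your page is an outlier
# relative to pages that already rank. Population values below are invented;
# plug in your own scraped data.
import numpy as np

population = np.array([
    # [% nofollow, article length, social shares] for pages that rank
    [0.40, 1500, 90], [0.25, 2200, 150], [0.35, 1700, 60],
    [0.30, 1900, 200], [0.45, 1600, 110], [0.28, 2100, 80],
])
my_page = np.array([0.02, 400, 5000])  # suspiciously unlike the others

z_scores = (my_page - population.mean(axis=0)) / population.std(axis=0)
for name, z in zip(["% nofollow", "length", "social"], z_scores):
    flag = "OUTLIER" if abs(z) > 2 else "ok"
    print(f"{name}: z = {z:+.1f} ({flag})")
```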