Measuring what linguistic information is encoded in continuous representations of language has become a popular area of research. To do this, researchers train “probes”: supervised models designed to extract linguistic structure from embeddings. The line between what constitutes a probe and a model designed to achieve a particular task is often blurred. To fully understand what we are learning about the target language representation (or, for that matter, about the instrument with which we perform the measurement), we would do well to compare probes to classic parsers. As a case study, we consider the structural probe (Hewitt and Manning, 2019), designed to quantify the presence of syntactic information. We create a simple parser that improves upon the performance of the structural probe by 11.4% on UUAS, despite having an identical lightweight parameterization. Under a second, less common metric, however, the structural probe outperforms traditional parsers. This raises the question: why should some metrics be preferred for probing and others for parsing?