In addition to an SPL meter, you will need a pink noise signal. As Jax2 mentioned, using music to level match is nearly impossible.
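If you don't have a pink noise track handy, here is a rough sketch of how one could be generated. It uses the Voss-McCartney method, which only approximates a pink spectrum, and the function names, sample rate, and peak level are my own assumptions rather than anything recommended in this thread.

```python
import random
import struct
import wave


def pink_noise(n_samples, n_rows=16, seed=None):
    """Approximate pink noise (Voss-McCartney), values in [-1.0, 1.0]."""
    rng = random.Random(seed)
    rows = [rng.uniform(-1.0, 1.0) for _ in range(n_rows)]
    samples = []
    for i in range(1, n_samples + 1):
        # The count of trailing zero bits in i selects the row to refresh,
        # so row k changes every 2**(k+1) samples (octave-spaced rates).
        k = (i & -i).bit_length() - 1
        if k < n_rows:
            rows[k] = rng.uniform(-1.0, 1.0)
        samples.append(sum(rows) / n_rows)
    return samples


def write_wav(path, samples, rate=44100, peak=0.5):
    """Write mono 16-bit PCM, normalized so the loudest sample hits `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0) or 1.0
    scale = peak * 32767 / loudest
    frames = struct.pack("<%dh" % len(samples),
                         *(int(s * scale) for s in samples))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit
        wav.setframerate(rate)
        wav.writeframes(frames)
```

Something like `write_wav("pink.wav", pink_noise(44100 * 30))` should give about 30 seconds of noise. Play it through each component in turn and adjust the volume until the SPL meter reads the same at the listening position.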
I agree it takes time to fully evaluate a single component. For comparisons between two or more, direct A/B is necessary. You can't rely on memory, not even for seconds, because the differences are usually very slight. A selection of at least 10 to 12 recordings covering completely different styles of music and recording techniques is helpful. You might notice a difference on one recording, yet nothing on another.
Then you have the real problem: which do I *like* better, and which *is* better?