Theoretically it works, but in practice the result can be disappointing.
Usually we do not take single pixels as training sites, but groups of pixels. When the software converts training sites to signatures, it computes the statistics (e.g., mean DN) of the spectral responses within each training class and within each band. These statistics are computed across the sites and their neighboring pixels. If you take many more training sites for class A than for class B, the overall statistics will be biased towards A. How is that?
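As a rough illustration of what that signature step looks like, here is a minimal sketch in Python/NumPy (not the API of any particular classification package): it groups the training pixels by class label and computes the per-band mean and standard deviation of their DN values. The array shapes, class labels, and DN values are assumptions for the example.

```python
import numpy as np

def signature_stats(pixels, labels):
    """pixels: (n_pixels, n_bands) array of DN values from training sites.
    labels: (n_pixels,) array of class labels for those pixels.
    Returns {class: (mean_per_band, std_per_band)}."""
    stats = {}
    for cls in np.unique(labels):
        dn = pixels[labels == cls]          # all training pixels of this class
        stats[cls] = (dn.mean(axis=0), dn.std(axis=0))
    return stats

# Example with made-up DN values in a single band:
pixels = np.array([[6], [7], [8],           # class A training pixels
                   [10], [11], [12]])       # class B training pixels
labels = np.array(["A", "A", "A", "B", "B", "B"])
for cls, (mean, std) in signature_stats(pixels, labels).items():
    print(cls, "mean DN:", mean, "std:", std)
```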
Suppose class A has mean DNs ranging from 6 to 8, while class B ranges from 10 to 12. Too much variation within A's training data can push its range out to 4-10, leaving class B squeezed into 11-12. That means confused class boundaries and misplaced class definitions (a mixed classification).
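The same toy numbers can be put in code. The sketch below uses made-up DN values and simplifies the class "range" to min-max (real software works with mean/covariance statistics, but the effect is the same): adding more, noisier training pixels for class A widens its range until it overlaps class B.

```python
import numpy as np

balanced_a    = np.array([6, 7, 8])                    # few, clean A pixels
oversampled_a = np.array([6, 7, 8, 4, 5, 9, 10, 10])   # many A sites, more variation
class_b       = np.array([10, 11, 12])

def dn_range(dn):
    return dn.min(), dn.max()

print("A (balanced):   ", dn_range(balanced_a))     # (6, 8)  -- no overlap with B
print("A (oversampled):", dn_range(oversampled_a))  # (4, 10) -- now touches B's 10
print("B:              ", dn_range(class_b))        # (10, 12)
```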
So, for an optimum or near-optimum result, use an equal number of training sites for every class, with the same (or a similar) number of pixels in each.
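One way to enforce that balance, sketched here under the assumption that your training pixels and labels are already in NumPy arrays (the function name is illustrative, not from any specific GIS package): randomly subsample every class down to the size of the smallest one, so each class contributes the same number of pixels to the signatures.

```python
import numpy as np

def balance_training_pixels(pixels, labels, seed=0):
    """Randomly subsample each class to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n = counts.min()                                # target pixels per class
    keep = []
    for cls in classes:
        idx = np.flatnonzero(labels == cls)         # indices of this class's pixels
        keep.append(rng.choice(idx, size=n, replace=False))
    keep = np.concatenate(keep)
    return pixels[keep], labels[keep]

# Usage: balanced_pixels, balanced_labels = balance_training_pixels(pixels, labels)
```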