I've gotten intelligible speech at about 6 or 8ksps with only two or three bits per sample, which was all you got from early talking toys. (This assumes the speech is already compressed into a very narrow dynamic range.) The toys usually did not go to the expense of an output reconstruction (anti-imaging) filter, though, so there were some strange effects which you'll avoid if you do use one. Even for music, I don't think you'll need nearly as many bits per sample as you probably think you do, if the music doesn't have the dynamic range of classical music.
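To give a feel for how coarse two or three bits really is, here is a minimal Python sketch of a uniform quantizer applied to a short test tone. The function name `quantize`, the 8ksps rate, the 440Hz tone and the 0.9 amplitude are all just illustrative assumptions, not what any particular toy did, and it does nothing about the dynamic-range compression mentioned above.

```python
import math

def quantize(sample, bits):
    """Round a sample in [-1.0, 1.0] to one of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    # Pick the step the sample falls into, clamping at both ends of the range.
    index = min(levels - 1, max(0, int(math.floor((sample + 1.0) / step))))
    # Return the midpoint of that step.
    return -1.0 + (index + 0.5) * step

RATE = 8000  # samples per second (assumed for illustration)
tone = [0.9 * math.sin(2 * math.pi * 440 * n / RATE) for n in range(RATE // 100)]
coarse = [quantize(s, 3) for s in tone]  # at most 8 distinct output values
```

With 3 bits, every sample lands on one of only eight levels, which is why the input needs to be squeezed into a narrow dynamic range first.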
The sound quality of high-end audio cassettes, just before CDs took over, was pretty darned good. You can get almost the same S/N ratio with just 8 bits, plus much better frequency response if your sampling rate is up to the job, much more consistent high-frequency output, and no flutter. The first time I heard digital audio was at an Audio Engineering Society (AES) convention. Although it sounded really good and mostly really clean, there was something strange about it. I figured later that their output (anti-imaging) filters must have been inadequate, and that a small amount of intermodulation distortion, probably from the speakers acting on the ultrasonic images, was producing artifacts at audible frequencies.
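For a rough check on the 8-bit claim, the usual textbook figure for an ideal uniform quantizer driven by a full-scale sine wave is about 6.02 dB per bit plus 1.76 dB. The short Python sketch below just evaluates that formula; the function name is made up for illustration, and real converters and real program material will come out somewhat worse than this ideal.

```python
import math

def ideal_quantization_snr_db(bits):
    """Ideal SNR of a full-scale sine wave against uniform quantization noise."""
    # 20*log10(2**bits) is about 6.02 dB per bit; 10*log10(1.5) is about 1.76 dB.
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)

for bits in (2, 3, 8, 16):
    print(f"{bits:2d} bits: {ideal_quantization_snr_db(bits):5.1f} dB")
```

Eight bits works out to roughly 50 dB under these assumptions, against roughly 98 dB for 16 bits.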
Here are a couple of my other posts that I think will be helpful:
viewtopic.php?p=30252#p30252
viewtopic.php?p=16393#p16393
See also other posts in the same topics. (Wow—I can't believe these are about a decade old already!)