Is there an easy way to change the timeline logic? I think it would make the switch from other tools easier. Here is what I am looking for:
Vn..
V4
V3
V2
V1
-
A1
A2
A3
A4
An..
Video and audio should always be displayed separately. This way, you have a clear stack of video where V2 always stays on top of V1. If you want to make a picture-in-picture, you drop the background on V1 and the foreground picture on V2, and then apply the effect to the clip on V2. The tracks stack the same way you stack a pile of books on a table, and you view the stack from above.
This could be a setting, so that you can set it and forget it, and those who want to keep things as they are now do not have to think about it. There should also be a setting that makes sure audio is never displayed as part of the video track. Again - this is for those who come from other tools where this is the normal way of doing things.
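
To make the requested layout concrete, here is a rough Python sketch of the display order. The function name `timeline_rows` and its parameters are made up for illustration; they are not taken from any editor's API.

```python
def timeline_rows(video_tracks, audio_tracks, separator="-"):
    """Return track labels in top-to-bottom display order (hypothetical helper)."""
    # Video tracks count upwards: the highest-numbered track is drawn at the top.
    video = [f"V{n}" for n in range(video_tracks, 0, -1)]
    # Audio tracks count downwards: A1 sits directly under the separator.
    audio = [f"A{n}" for n in range(1, audio_tracks + 1)]
    return video + [separator] + audio


print("\n".join(timeline_rows(4, 4)))
```

With four video and four audio tracks this prints V4 down to V1, the separator, and then A1 down to A4, which is the stack shown above.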

I am not too good with sarcasm, so I am going to answer as truthfully as I can on this (my first language is not English).
As video is not accounting, I guess spreadsheets are not too important :-) The logic in most other applications in this field is that video builds up the same way as a stack of books that you look at from above. So if the top book covers the whole view, none of the layers underneath show. This should also be very understandable for people from the graphics and photo area. Both Photoshop and GIMP use this way of layering.
As for audio, every clip and track normally has its own volume and is placed in the stereo field by panning it to the right or left. There are a couple of models to choose from there, but I have a feeling you know a bit about that.
Regarding what goes where when you edit, there actually needs to be a way to choose. The first tracks are easy: video goes on track 1 (track 0 does not exist in any program I know of, and I would have a VERY hard time explaining that to my non-geek drinking buddy) and audio goes on tracks 1 and 2. If video goes on track 2, you can usually do one of two things: move the audio down to the first free tracks, or have a mechanism to choose what should happen - a "patch" system. Or even both.
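
As a rough illustration of those two options, here is a minimal Python sketch. Everything in it (`route_audio`, `occupied`, `patch`) is a hypothetical name I made up; it only shows the idea of "first free tracks unless the user has patched something else", not how any real editor does it.

```python
def route_audio(channels, occupied, patch=None):
    """Pick 1-based audio track numbers for a clip's audio channels.

    channels: number of audio channels in the clip being dropped
    occupied: set of audio track numbers already in use at the drop position
    patch:    optional dict {source channel: target track} chosen by the user
    """
    if patch is not None:
        # Explicit "patch" system: the user decides where each channel goes.
        missing = [ch for ch in range(1, channels + 1) if ch not in patch]
        if missing:
            raise ValueError(f"no patch target for channels {missing}")
        return [patch[ch] for ch in range(1, channels + 1)]

    # Default: move the audio down to the first free tracks.
    targets = []
    track = 1
    while len(targets) < channels:
        if track not in occupied:
            targets.append(track)
        track += 1
    return targets


print(route_audio(2, occupied=set()))            # empty timeline
print(route_audio(2, occupied={1, 2}))           # A1 and A2 already taken
print(route_audio(2, occupied={1, 2}, patch={1: 5, 2: 6}))
```

The first call lands on tracks 1 and 2, the second moves down to 3 and 4, and the third follows the patch to 5 and 6.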
It is not too often that you end up with a lot of both video and audio tracks. Very often, building more video layers does not add any audio tracks at all, as it is just another picture in the mix, or a text or some other kind of graphic. From experience, most videos will be done with fewer than eight tracks of audio: two tracks of sound from the camera, two for music, two for sound effects and one for voice-over/narration. This assumes all tracks are mono (so stereo is split into two panned mono tracks) and that an audio crossfade can be done on a single track.
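
Adding up that typical layout in a few lines of Python (the dictionary is just my own breakdown from above, nothing official):

```python
# Typical mono audio tracks per source, as listed above.
audio_budget = {
    "camera": 2,
    "music": 2,
    "sound effects": 2,
    "voice-over": 1,
}

total = sum(audio_budget.values())
print(total)  # 7, so fewer than eight audio tracks
```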
Again - implementing what is the norm in this field of work is actually not so bad. Some systems have tried other ways of doing things. They do not exist any more.