zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Sat Sep 02 2000 - 03:45:41 EST

On Sat, 2 Sep 2000, Dan Maas wrote:

> There are various other tricks that can be done to speed up network
> servers, like passing files directly from the buffer cache to the
> network card. This one is currently frowned upon by the Linux
> community, [...]

FYI, the TUX patch (released yesterday) includes a lightweight zero-copy TCP implementation for the 2.4 Linux kernel. The interface is not yet exported to user-space (simply because TUX uses it from kernel-space so the user-space bits were not needed), but the network driver framework and TCP-stack bits are there, so the hard part is done. The two most widely used gigabit drivers are 'converted' to support zero-copy, the SysKonnect and the Acenic driver (the modifications are well tested). I plan to add the user-space bits in the near future.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

Re: zero-copy TCP
From: Jes Sorensen (jes@linuxcare.com)
Date: Sat Sep 02 2000 - 16:20:48 EST

>>>>> "Ingo" == Ingo Molnar <mingo@elte.hu> writes:

Ingo> FYI, the TUX patch (released yesterday) includes a lightweight
Ingo> zero-copy TCP implementation for the 2.4 Linux kernel. The
Ingo> interface is not yet exported to user-space (simply because TUX
Ingo> uses it from kernel-space so the user-space bits were not
Ingo> needed), but the network driver framework and TCP-stack bits are
Ingo> there, so the hard part is done. The two most widely used
Ingo> gigabit drivers are 'converted' to support zero-copy, the
Ingo> SysKonnect and the Acenic driver (the modifications are well
Ingo> tested). I plan to add the user-space bits in the near future.

Could you comment a bit on the design you used or do I have to go read the code? Some of us had a good chat at OLS about how to do zero copy TCP xmits by kiobufifying the skb's.

Jes

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 16:25:48 EST

The entire Linux Network subsystem needs an overhaul. The code copies data all over the place. I am at present pulling it apart and porting it to MANOS, and what a mess indeed. In NetWare, the only time data ever gets copied from incoming packets is:

1. A copy to userspace at a stream head.
2. An incoming write that gets copied into the file cache.

Reads from cache are never copied. In fact, the network server locks a file cache page and sends it unaltered to the network drivers and DMA's directly from it. Since NetWare has WTD's these I/O requests get processed at the highest possible priority. In networking, the enemy is LATENCY for fast performance. That's why NetWare can handle 5000 users and Linux barfs on 100 in similar tests. Copying increases latency, and the long code paths in the Linux Network layer.

Jeff

Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 16:35:11 EST

> to MANOS, and what a mess indeed. In NetWare, the only time data ever
> gets copied from incoming packets is:
>
> 1. A copy to userspace at a stream head.
> 2. An incoming write that gets copied into the file cache.

Sounds like Linux - one DMA and one copy to user space.

> Reads from cache are never copied. In fact, the network server locks a
> file cache page and sends it unaltered to the network drivers and DMA's
> directly from it. Since NetWare has WTD's these I/O requests get

Doesn't work with IP - you have to be able to checksum the data. For the recent cards that can handle this, have a look at TUX. The work is there, ready for 2.5.

Alan

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 16:45:42 EST

Alan Cox wrote:

> > to MANOS, and what a mess indeed. In NetWare, the only time data ever
> > gets copied from incoming packets is:
> >
> > 1. A copy to userspace at a stream head.
> > 2. An incoming write that gets copied into the file cache.
>
> Sounds like Linux - one DMA and one copy to user space.

Alan, Please. I'm in your code and there are copies all over the place. I agree you have a "fast path" for most stuff, but there's all kinds of handles lookups, linear list searching like

    while (x)
    {
        x = x->next;
    }

all over the place that increases latency. Not to mention the overhead of the type of interrupt and trap gates that suck up about 50 clocks to fetch the IDT, PDE, and GDT tables for every interrupt. NetWare copies nothing in TCPIP except at the stream head. Why do you need to copy data anyway to checksum an IP packet anyway? I noticed you do the right thing and keep the headers and data as separate fragments during header construction, so why do you need to copy data for checksumming?

Jeff

Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 17:10:25 EST

> > Sounds like Linux - one DMA and one copy to user space.
>
> Alan, Please. I'm in your code and there are copies all over the
> place. I agree you have a "fast path" for most stuff, but there's all

There aren't copies all over the place for the paths that occur, like 99.999% of the time. Fragmented packets don't happen except for NFS (which is a rather broken protocol anyway). One DMA, one copy to user space.

> kinds of handles lookups, linear list searching like
>
> while (x)
> {
>     x = x->next;
> }

Timers are constructed to be close to O(1), the TCP hash isn't a linear lookup, and the socket operations from user space use file-> dereferences, not a lookup.

> nothing in TCPIP except at the stream head. Why do you need to copy
> data anyway to checksum an IP packet anyway? I noticed you do the right
> thing and keep the headers and data as separate fragments during header
> construction, so why do you need to copy data for checksumming?

We don't copy for checksumming. We fold the single user space copy and the checksum operation into one path, because on any modern CPU it costs precisely the same to copy as to copy/checksum.

I don't think you've actually sat and instrumented the TCP code.

Alan
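
A minimal sketch of what "folding the checksum into the copy" means, assuming a plain user-space C helper rather than the kernel's own routine: the loop moves each byte from source to destination and accumulates the 16-bit ones'-complement Internet checksum (the algorithm described in RFC 1071) in the same pass, so the data is touched only once. The name copy_and_checksum is hypothetical and is not taken from the Linux source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the kernel's routine): copy len bytes from src
 * to dst and return the Internet checksum of the data, reading each source
 * byte exactly once. */
uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint64_t sum = 0;
    size_t i = 0;

    /* Treat the data as a sequence of big-endian 16-bit words. */
    for (; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }

    /* An odd trailing byte is padded with a zero low byte. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    /* The checksum is the ones' complement of the folded sum. */
    return (uint16_t)~sum;
}
```

A real implementation would work a machine word at a time and handle user-space faults; the point here is only that the checksum arithmetic rides along with loads the copy has to perform anyway.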

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 17:20:58 EST

Alan Cox wrote:

> > > Sounds like Linux - one DMA and one copy to user space.
> >
> > Alan, Please. I'm in your code and there are copies all over the
> > place. I agree you have a "fast path" for most stuff, but there's all
>
> There aren't copies all over the place for the paths that occur, like 99.999%
> of the time. Fragmented packets don't happen except for NFS (which is a rather
> broken protocol anyway).

There are.

> One DMA, one copy to user space
>
> > kinds of handles lookups, linear list searching like
> >
> > while (x)
> > {
> >     x = x->next;
> > }
>
> Timers are constructed to be close to O(1), the TCP hash isn't a linear lookup,
> and the socket operations from user space use file-> dereferences, not a lookup.

It is if there's a hash collision.

> > nothing in TCPIP except at the stream head. Why do you need to copy
> > data anyway to checksum an IP packet anyway? I noticed you do the right
> > thing and keep the headers and data as separate fragments during header
> > construction, so why do you need to copy data for checksumming?
>
> We don't copy for checksumming. We fold the single user space copy and the
> checksum operation into one path, because on any modern CPU it costs precisely
> the same to copy as to copy/checksum.
>
> I don't think you've actually sat and instrumented the TCP code

In Linux, no, in NetWare, yes. I'm in your TCP code now and it's fairly large.

Jeff

Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 17:21:13 EST

> > There aren't copies all over the place for the paths that occur, like 99.999%
> > of the time. Fragmented packets don't happen except for NFS (which is a rather
> > broken protocol anyway).
>
> There are.

You forgot to cite them.

> > the socket operations from user space use file-> dereferences, not a lookup
>
> It is if there's a hash collision.

So you want to compute a perfect hash from unknown data, which may also be a hostile attack on your hash function. If you can do that, stop off and claim a PhD.

Alan
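
The disagreement above can be made concrete with a small sketch of a chained hash table, assuming an illustrative layout that is not the one used by the Linux TCP code: the common case is one bucket index plus one comparison, and a list walk only happens among entries that collide in the same bucket. All names here (struct conn, bucket_of, conn_insert, conn_lookup, NBUCKETS) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 256 /* illustrative table size, must be a power of two */

/* One connection, identified by its address/port 4-tuple. */
struct conn {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    struct conn *next; /* chain of entries that share a bucket */
};

/* Toy hash over the 4-tuple; a real stack would use a stronger mix. */
unsigned bucket_of(uint32_t saddr, uint32_t daddr,
                   uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);

    h ^= h >> 16;
    h ^= h >> 8;
    return h & (NBUCKETS - 1);
}

/* Push a connection onto the front of its bucket's chain. */
void conn_insert(struct conn *table[NBUCKETS], struct conn *c)
{
    unsigned b = bucket_of(c->saddr, c->daddr, c->sport, c->dport);

    c->next = table[b];
    table[b] = c;
}

/* Index one bucket, then walk only the entries that collided there. */
struct conn *conn_lookup(struct conn *table[NBUCKETS],
                         uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport)
{
    struct conn *c = table[bucket_of(saddr, daddr, sport, dport)];

    while (c) {
        if (c->saddr == saddr && c->daddr == daddr &&
            c->sport == sport && c->dport == dport)
            return c;
        c = c->next;
    }
    return NULL;
}
```

With a reasonable hash and a table sized to the load, chains stay short, which is why the walk inside conn_lookup is not the same thing as a linear search over every connection.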

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 17:28:18 EST

Alan Cox wrote:

> We don't copy for checksumming. We fold the single user space copy and the
> checksum operation into one path, because on any modern CPU it costs precisely
> the same to copy as to copy/checksum.

You stated in an earlier message you copied the data when you calculated the TCPIP checksum? Now you say you don't. Perhaps I misunderstood.

> I don't think you've actually sat and instrumented the TCP code

The TCPIP stack in Wolf Mountain has my name as the author, and it was one of the nastiest projects I've ever done. OSPF routing is a bitch BTW. Try again.

Re: zero-copy TCP
From: Alan Cox (alan@lxorguk.ukuu.org.uk)
Date: Sat Sep 02 2000 - 17:30:19 EST

> You stated in an earlier message you copied the data when you calculated
> the TCPIP checksum? Now you say you don't. Perhaps I misunderstood.

We do a single copy/checksum from user space. You have to do the copy because the packet may not be DMAable, may not be aligned for most PCI hardware, and numerous other things. Since that copy costs as much as the checksum, it's effectively free in the checksum computation. It also avoids considerable complexity on the TCP paths when you need to retransmit.

> > I don't think you've actually sat and instrumented the TCP code
>
> The TCPIP stack in Wolf Mountain has my name as the author, and it was

The Linux TCP code..

> one of the nastiest projects I've ever done. OSPF routing is a bitch
> BTW. Try again.

OSPF is a matter of getting the graph theory right.

Re: zero-copy TCP
From: Andi Kleen (ak@suse.de)
Date: Sat Sep 02 2000 - 17:39:38 EST

On Sat, Sep 02, 2000 at 04:28:18PM -0600, Jeff V. Merkey wrote:

> Alan Cox wrote:
> >
> > We don't copy for checksumming. We fold the single user space copy and the
> > checksum operation into one path, because on any modern CPU it costs precisely
> > the same to copy as to copy/checksum.
>
> You stated in an earlier message you copied the data when you calculated
> the TCPIP checksum? Now you say you don't. Perhaps I misunderstood.

Linux always does a single copy for TCP, and the checksum is folded into that. Doing just the checksum alone wouldn't be much less costly.

[Note this is only true for 2.4 in the fast path; 2.2 RX usually does the checksum and the copy-to-user separately, unless you have hardware RX checksumming.

For TX we always do a single copy/checksum out of user space, or out of the page cache when you use sendfile or mmap.]

-Andi

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Sat Sep 02 2000 - 17:47:33 EST

Andi Kleen wrote:

> Linux always does a single copy for TCP, and the checksum is folded into
> that. Doing just the checksum alone wouldn't be much less costly.
>
> [Note this is only true for 2.4 in the fast path, 2.2 RX usually does
> checksum and copy-to-user separated, unless you have hardware RX checksumming
>
> For TX we always do a single copy checksum out of user space or out of the
> page cache when you use sendfile or mmap]

This makes sense.

Jeff

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Sun Sep 03 2000 - 03:29:50 EST

On Sat, 2 Sep 2000, Jeff V. Merkey wrote:

> while (x)
> {
>     x = x->next;
> }
>
> all over the place that increases latency. [...]

I challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If it's all over the place and if it increases latency, you certainly can show at least one such place.

Ingo

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 05:14:10 EST

Ingo,

When I have time to do this exercise, I will. I've finished merging Alan's code into MANOS (completed last night). Most of the cases I saw where there were copies were not fast path. It takes some time to go through all this code you guys have written. It is actually looking good.

Jeff

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 05:39:03 EST

On Tue, 5 Sep 2000, Jeff V. Merkey wrote:

> > > while (x)
> > > {
> > >     x = x->next;
> > > }
> > >
> > > all over the place that increases latency. [...]
> >
> > i challenge you to show one such place in the 2.4.0-test8-pre2 kernel. If
> > it's all over the place and if it increases latency, you certainly can
> > show at least one such place.
>
> When I have time to do this exercise, I will. [...]

Well, your original claim (quoted above) shows that you have identified numerous such places already, so you don't have to do any additional 'exercise'. The "all over the place" code shouldn't be too hard to find again - please just say filename and line number in any kernel version of your choice and we'll look into it.

Ingo

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 05:58:10 EST

Alright Ingo, you asked for it. I am going through it now and going over ALL my notes. I will catalog ALL of them and post it. Is this what you really want?

:-)

Jeff

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 06:15:25 EST

On Tue, 5 Sep 2000, Jeff V. Merkey wrote:

> Alright Ingo, you asked for it. I am going through it now and going
> over ALL my notes. I will catalog ALL of them and post it. Is this
> what you really want?

Yes, this would be the best indeed, to get those places fixed. But if you don't want to spend your time on that, then it's enough to just post a single incident of such inefficiency and list-walking that impacts latency like you claim.

Ingo

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 06:09:10 EST

The origin of this comment was related to a comparison of the MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's code handles fast paths well and from what I've seen is comparable to NetWare. The areas I saw were sideband cases and issues of fragment re-assembly. It's as good as what's in NetWare.

Jeff

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 06:41:05 EST

On Tue, 5 Sep 2000, Jeff V. Merkey wrote:

> The origin of this comment was related to a comparison of the
> MSM/TSM/CSM layer in NetWare and Linux. I've already said that Alan's
> code handles fast paths well and from what I've seen is comparable to
> NetWare. [...]

Can we thus take this as a retraction of your three derogatory comments quoted below?

" The entire Linux Network subsystem needs an overhaul. "

" In networking, the enemy is LATENCY for fast performance. That's why NetWare can handle 5000 users and Linux barfs on 100 in similar tests. Copying increases latency, and the long code paths in the Linux Network layer. "

" Alan, Please. I'm in your code and there are copies all over the place. I agree you have a "fast path" for most stuff, but there's all kinds of handles lookups, linear list searching like

    while (x)
    {
        x = x->next;
    }

all over the place that increases latency. "

Ingo

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 06:41:35 EST

Ingo Molnar wrote:

> can we thus take this as a retraction of your below quoted three
> derogatory comments?
>
> " The entire Linux Network subsystem needs an overhaul. "

To support the performance metrics of NetWare, there are some changes I will make that will allow Alan's code to beat native NetWare:

- Allowing pre-scan protocol stacks to exist.
- A WTD optimization to allow Alan's code to tag pages in the page cache and post them with a preemptive IO WTD.
- Moving ALL of the routing code into the kernel space.
- Consolidation of bottom and top halves to allow a single interrupt thread to run all the way into the router and out without the need to schedule.
- Moving the NCP server into the kernel.
- Enabling "gang" tagging and release of a single cache page by hundreds or thousands of users at one time for incoming reads.

The list is very long.

> " In networking, the enemy is LATENCY for fast performance. That's why
> NetWare can handle 5000 users and Linux barfs on 100 in similar tests.
> Copying increases latency, and the long code paths in the Linux Network
> layer. "
>
> " Alan, Please. I'm in your code and there are copies all over the
> place. I agree you have a "fast path" for most stuff, but there's all
> kinds of handles lookups, linear list searching like
>
> while (x)
> {
>     x = x->next;
> }
>
> all over the place that increases latency. "

I already said this code is more than suitable, and better yet, it's something folks are familiar with in Linux. Alan and I went over some of this off line. Sorry you missed it.

Jeff

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Tue Sep 05 2000 - 06:16:19 EST

btw., - the maintainers of the 2.4 networking and TCP/IP code are Alexey Kuznetsov and David S. Miller - please direct your findings towards them, not me :-)

Ingo

Re: zero-copy TCP
From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Tue Sep 05 2000 - 06:10:28 EST

You opened your mouth.

:-)

Jeff

Ingo Molnar wrote:

> btw., - the maintainers of the 2.4 networking and TCP/IP code are Alexey
> Kuznetsov and David S. Miller - please direct your findings towards them,
> not me :-)

Re: zero-copy TCP
From: Ingo Molnar (mingo@elte.hu)
Date: Sun Sep 03 2000 - 03:28:18 EST

On Sat, 2 Sep 2000, Jeff V. Merkey wrote:

> Alan, Please. I'm in your code and there are copies all over the
> place. I agree you have a "fast path" for most stuff, but there's
> all kinds of handles lookups, linear list searching like

Have you ever bothered actually measuring the impact? I have. Is the Linux kernel perfect? Not at all. I don't understand why you take this as a personal insult - you are certainly free to add your improvements; no insults or patronizing are necessary, this is a technical forum.

Ingo

Re: zero-copy TCP
From: Andi Kleen (ak@suse.de)
Date: Sat Sep 02 2000 - 17:02:27 EST

On Sat, Sep 02, 2000 at 10:35:11PM +0100, Alan Cox wrote:

> > to MANOS, and what a mess indeed. In NetWare, the only time data ever
> > gets copied from incoming packets is:
> >
> > 1. A copy to userspace at a stream head.
> > 2. An incoming write that gets copied into the file cache.
>
> Sounds like Linux - one DMA and one copy to user space.

Granted, for NFS over UDP it is usually more, because of the defragmentation pass. That will be fixed in 2.5 and the code is already written, it just wants to be ported to kiobufs. 2.4 NFSD at least receives directly into the page cache, unlike 2.2 (so it'll do two copies, three usually on alpha).

Samba probably does more copies though. I don't think it receives directly into a mmap'ed buffer (so there are at least two copies to write something to disk).

-Andi

Re: zero-copy TCP
From: Jes Sorensen (jes@linuxcare.com)
Date: Sat Sep 02 2000 - 16:40:18 EST

>>>>> "Jeff" == Jeff V Merkey <jmerkey@timpanogas.com> writes:

Jeff, could you start by learning to quote email and not send a full copy of the entire email you reply to (read rfc1855).

Jeff> The entire Linux Network subsystem needs an overhaul. The code
Jeff> copies data all over the place. I am at present pulling it apart
Jeff> and porting it to MANOS, and what a mess indeed. In NetWare, the
Jeff> only time data ever gets copied from incoming packets is:

Try and understand the code before you make such bold statements.

Jeff> 1. A copy to userspace at a stream head. 2. An incoming write
Jeff> that gets copied into the file cache.

Jeff> Reads from cache are never copied. In fact, the network server
Jeff> locks a file cache page and sends it unaltered to the network
Jeff> drivers and DMA's directly from it. Since NetWare has WTD's
Jeff> these I/O requests get processed at the highest possible
Jeff> priority. In networking, the enemy is LATENCY for fast
Jeff> performance. That's why NetWare can handle 5000 users and Linux
Jeff> barfs on 100 in similar tests. Copying increases latency, and
Jeff> the long code paths in the Linux Network layer.

You can't DMA directly from a file cache page unless you have a network card that does scatter/gather DMA and, surprise surprise, 80-90% of the cards on the market don't support this. Besides that, you need to do copy-on-write if you want to be able to do zero copy on write() from user space. Marking data copy-on-write is *expensive* on x86 SMP boxes since you have to modify the TLB on all processors. On top of that you have to look at the packet size: for small packets a copy is often a lot cheaper than modifying the page tables, even on UP systems, so you need a copy/break scheme here.

As for your statement on latency, it's nice to see that you don't know what you are talking about. Latency is one issue in fast networking, but it's far from the only one. Latency is important for message-passing type applications; for bulk data transfers, however, it's less relevant, since you really want deep pipelining here and properly written applications. If your TCP window is too small, even zero latency will only buy you so much on a really fast network.

Jes
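
The copy/break scheme Jes mentions can be sketched roughly as a per-packet decision, assuming a hypothetical threshold parameter and capability flag (neither is taken from the TUX patch or any real driver): below the threshold the data is simply copied, and the zero-copy path is only chosen when the card can do scatter/gather DMA and the packet is large enough to repay the page-table work.

```c
#include <stddef.h>

/* The two transmit strategies discussed in the thread. */
enum tx_path {
    TX_COPY,      /* copy the payload into a driver-owned buffer */
    TX_ZERO_COPY  /* hand the page to the card for scatter/gather DMA */
};

/* Hypothetical policy function: len is the payload size in bytes,
 * nic_has_sg says whether the card supports scatter/gather DMA, and
 * copybreak is an assumed tunable threshold in bytes. */
enum tx_path choose_tx_path(size_t len, int nic_has_sg, size_t copybreak)
{
    /* Without scatter/gather the card cannot DMA straight from the page. */
    if (!nic_has_sg)
        return TX_COPY;

    /* For small packets, copying is cheaper than touching page tables. */
    if (len < copybreak)
        return TX_COPY;

    return TX_ZERO_COPY;
}
```

For example, with an assumed 256-byte threshold, a 64-byte packet would be copied even on a scatter/gather card, while a 4096-byte payload would take the zero-copy path. The right threshold depends on the machine and would have to be measured.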

Re: zero-copy TCP
From: Jamie Lokier (lk@tantalophile.demon.co.uk)
Date: Sat Sep 02 2000 - 22:22:44 EST

Jes Sorensen wrote:

> You can't DMA directly from a file cache page unless you have a
> network card that does scatter/gather DMA and surprise surprise,
> 80-90% of the cards on the market don't support this. Besides that you
> need to do copy-on-write if you want to be able to do zero copy on
> write() from user space, marking data copy on write is *expensive* on
> x86 SMP boxes since you have to modify the tlb on all
> processors. On top of that you have to look at the packet size, for
> small packets a copy is often a lot cheaper than modifying the page
> tables, even on UP systems so you need a copy/break scheme here.

I just thought I'd mention that you can do zero copy TCP in and out *without* any page marking schemes. All you need is a network card with quite a lot of RAM and some intelligence. An Alteon could do it, with extra RAM or an impressively underloaded network.

(for example) http://www.digital.com/info/DTJS05/

-- Jamie
Re: zero-copy TCP From: Linus Torvalds (torvalds@transmeta.com) Date: Sun Sep 03 2000 - 01:33:27 EST In article <20000903052244.B15788@pcep-jamie.cern.ch>, Jamie Lokier <lk@tantalophile.demon.co.uk> wrote: > >I just thought I'd mention that you can do zero copy TCP in and out >*without* any page marking schemes. All you need is a network card with >quite a lot of RAM and some intelligence. An Alteon could do it, with >extra RAM or an impressively underloaded network. > >(for example) http://www.digital.com/info/DTJS05/ The thing is, that at least historically it has always been a bad bet to bet on special-purpose hardware over general-purpose stuff. What I'm saying is that basically you should not design your TCP layer around the 0.1% of cards that have tons of intelligence, when you have a general-purpose CPU that tends to be faster in the end. The smart cards can actually have higher latency than just doing it the "stupid" way with the CPU. Yes, they'll offload some of the computation, and may make system throughput better, but at what cost? [ Same old example: just calculate how quickly you can get your packet on the wire with a smart card that does checksumming in hardware, and do the same calculations with a CPU that does the checksums. Take into account that the checksum is at the _head_ of the packet. The CPU will win. Proof: the data to be sent out is in RAM. In fact, often it is cached in the CPU these days. In order to start sending out the packet, the smart card has to move all of the data from RAM/cache over the bus to the card. It can only start actually sending after that. Cost: bus speed to copy it over. In contrast, if you do it on the CPU, you can basically start feeding the packet out on the net after doing a CPU checksum that is limited by RAM/cache speeds. Bus speed isn't the limiting factor any more on packet latency, as you can send out the start of the packet on the network before the whole packet has even been copied over the internal bus! 
] So. Smart cards are not necessarily better for latency. They are certainly not cheaper. They _are_ better for throughput, no question about that. But so is adding another CPU. Or beefing up your memory subsystem. Or any number of other things that are more generic than some smart network card - and often cheaper because they are "standard components", useful regardless of _what_ you do. End result: smart cards only make sense in systems that are really pushing the performance envelope. Which, after all, is not that common, as it's usually easier to just beef up the machine in other ways until the network is not the worst bottle-neck. Very few places outside benchmark labs have networks _that_ studly. Right now gigabit is heavy-duty enough that it is worth smart cards. The same used to be true about the first generation of 100Mbit cards. The same will be true of 10Gbps cards in another few years. But basically, they'll probably always end up being the exception rather than the rule, unless they become so cheap that it doesn't matter. But "cheap" and "pushing the performance envelope" do not tend to go hand in hand. Linus
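The latency argument in this message can be put into rough numbers. The following is a minimal sketch, not a measurement: the bandwidth figures (the theoretical 133 MB/s peak of a 32-bit/33 MHz PCI bus, and an assumed 800 MB/s for a CPU checksum pass over RAM) and the function names are illustrative assumptions, and per-packet setup costs are ignored.

```python
# Rough model of the delay before the first byte of a packet can go on the wire.
# All bandwidth figures are illustrative assumptions, not measurements.

PCI_BYTES_PER_SEC = 133e6   # assumed: theoretical peak of 32-bit/33 MHz PCI
MEM_BYTES_PER_SEC = 800e6   # assumed: CPU checksum pass limited by RAM speed


def first_byte_delay_card_checksum(packet_bytes, bus_bps=PCI_BYTES_PER_SEC):
    """Card computes the checksum: the checksum sits in the header, so the
    whole packet must cross the bus before transmission can start."""
    return packet_bytes / bus_bps


def first_byte_delay_cpu_checksum(packet_bytes, mem_bps=MEM_BYTES_PER_SEC):
    """CPU computes the checksum: one pass over the data at memory speed,
    after which sending can begin while the rest still crosses the bus."""
    return packet_bytes / mem_bps


if __name__ == "__main__":
    for size in (64, 1500, 9000):
        card = first_byte_delay_card_checksum(size)
        cpu = first_byte_delay_cpu_checksum(size)
        print(f"{size:5d} bytes: card {card * 1e6:7.2f} us, cpu {cpu * 1e6:7.2f} us")
```

Under these assumed figures a 1500-byte packet waits roughly 11 microseconds in the card-checksum case and under 2 microseconds in the CPU-checksum case; the ratio is simply the ratio of the two bandwidths, so different hardware would shift the result.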
Re: zero-copy TCP From: Jamie Lokier (lk@tantalophile.demon.co.uk) Date: Sun Sep 03 2000 - 15:46:54 EST Linus Torvalds wrote: > Proof: the data to be sent out is in RAM. In fact, often it is cached > in the CPU these days. In order to start sending out the packet, the > smart card has to move all of the data from RAM/cache over the bus to > the card. It can only start actually sending after that. Cost: bus > speed to copy it over. > > In contrast, if you do it on the CPU, you can basically start feeding > the packet out on the net after doing a CPU checksum that is limited > by RAM/cache speeds. Bus speed isn't the limiting factor any more on > packet latency, as you can send out the start of the packet on the > network before the whole packet has even been copied over the internal > bus! Nice point! Only valid for TCP & UDP though. When people want _real_ low latency, they don't use TCP or UDP, and they certainly don't put data checksums at the start. They still aim for zero copies. That pass, even over cached data, is still significant. > Right now gigabit is heavy-duty enough that it is worth smart cards. > The same used to be true about the first generation of 100Mbit cards. > The same will be true of 10Gbps cards in another few years. But > basically, they'll probably always end up being the exception rather > than the rule, unless they become so cheap that it doesn't matter. But > "cheap" and "pushing the performance envelope" do not tend to go hand in > hand. Fair enough. Please read my description of a zero-copy scheme that doesn't require much intelligence on the card though. I think it's a neat kernel trick that might just pay off. Sometimes, maybe. -- Jamie
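For reference, the "pass" over the data discussed in this message is the 16-bit one's-complement Internet checksum used by TCP and UDP (RFC 1071). A minimal Python sketch follows; it is illustrative only (real stacks use heavily optimized routines), and the function name is made up for this example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into the low 16 bits
    return ~total & 0xFFFF                       # one's complement of the folded sum
```

Every byte is read once, and because the result is stored in the header that precedes the data, the header cannot be completed until the whole segment has been summed.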
Re: zero-copy TCP From: Linus Torvalds (torvalds@transmeta.com) Date: Sun Sep 03 2000 - 16:03:03 EST On Sun, 3 Sep 2000, Jamie Lokier wrote: > > Nice point! Only valid for TCP & UDP though. Yeah. But "we need oxygen" is only a valid point for carbon-based life-forms. You might as well argue that oxygen is not a valid criterion for being livable, because it's only valid for the particular kind of creatures we are. Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really make a big selling point any more. > When people want _real_ low latency, they don't use TCP or UDP, and they > certainly don't put data checksums at the start. They still aim for > zero copies. That pass, even over cached data, is still significant. I disagree. Look at history. Exercise 1: name a protocol that did something like that (yes, I know, there are multiple). Exercise 2: name one of them that is still relevant today. See? Performance, in the end, is very much secondary. It doesn't matter one whit if you perform better than everybody else, if you cannot _talk_ to everybody else. I think the RISC vendors found that out. And I think most network vendors find that out. (Yes, I know, you're probably talking about things like the networking protocols for clusters etc. I'm just saying that historically such special-purpose stuff always tends to end up being not as good as the "real thing".) > Fair enough. Please read my description of a zero-copy scheme that > doesn't require much intelligence on the card though. I think it's a > neat kernel trick that might just pay off. Sometimes, maybe. We could certainly try to do better. 
But some of the schemes I've seen have implied a lot of complexity for gains that aren't actually real in the end (eg playing expensive games with memory mapping in order to avoid a copy that ends up happening anyway because the particular card you're using doesn't do scatter-gather: you'd perform a lot better if you just did the copy outright and forgot about the expensive games - which is what Linux does). Linus
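The trade-off described here (copy outright unless avoiding the copy actually pays off) is often expressed as a "copy-break" threshold, as Jes Sorensen's earlier message suggests. A minimal sketch follows, assuming a hypothetical 256-byte threshold and hypothetical names; it does not describe how any particular Linux driver decides.

```python
COPYBREAK_BYTES = 256  # hypothetical threshold; a real value would be tuned per system


def choose_tx_path(length, nic_has_scatter_gather, copybreak=COPYBREAK_BYTES):
    """Return "copy" or "zero-copy" for an outgoing buffer of `length` bytes."""
    if not nic_has_scatter_gather:
        # Without scatter-gather DMA the data ends up copied into one
        # contiguous buffer anyway, so page-table tricks only add cost.
        return "copy"
    if length < copybreak:
        # For small packets a plain copy is cheaper than remapping pages.
        return "copy"
    return "zero-copy"
```

For example, `choose_tx_path(1500, False)` returns `"copy"`: on a card without scatter-gather the mapping games buy nothing, which is the case the message singles out.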
Re: zero-copy TCP From: Jeff V. Merkey (jmerkey@timpanogas.com) Date: Tue Sep 05 2000 - 05:36:05 EST Linus Torvalds wrote: > > > > Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really > make a big selling point any more. > > Linus, IPX is a really good LAN protocol (but totally sucks for internet). A full blown NCP server in-kernel that's tightly coupled to the page cache running over IPX would make flames shoot out of the back of a Linux server, and make NT look like an old lady hobbling down the street. There's no need to configure client addresses with it, and for file and print, it's the best. Jeff
Re: zero-copy TCP From: Henning P. Schmiedehausen (hps@tanstaafl.de) Date: Tue Sep 05 2000 - 08:34:02 EST jmerkey@timpanogas.com (Jeff V. Merkey) writes: >Linus Torvalds wrote: >> >> >> >> Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really >> make a big selling point any more. >> >> >Linus, >IPX is a really good LAN protocol (but totally sucks for internet). A >full blown NCP server in-kernel that's tightly coupled to the page >cache running over IPX would make flames shoot out of the back of a >Linux server, and make NT look like an old lady hobbling down the >street. There's no need to configure client addresses with it, and for >file and print, it's the best. And it would be a good bit of necrophilia, too. Jeff, Netware is dead. Please leave it there. IP won. The number of new Netware Installations (as compared to existing or just upgrades) is close (really close) to nil. Regards Henning -- Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer INTERMETA - Gesellschaft fuer Mehrwertdienste mbH hps@intermeta.de Am Schwabachgrund 22 Fon.: 09131 / 50654-0 info@intermeta.de D-91054 Buckenhof Fax.: 09131 / 50654-20
Re: zero-copy TCP From: Dan Hollis (goemon@anime.net) Date: Tue Sep 05 2000 - 13:25:12 EST On 5 Sep 2000, Henning P. Schmiedehausen wrote: > jmerkey@timpanogas.com (Jeff V. Merkey) writes: > >IPX is a really good LAN protocol (but totally sucks for internet). A > Jeff, Netware is dead. Please leave it there. IP won. The number of > new Netware Installations (as compared to existing or just upgrades) > is close (really close) to nil. I think you mean IPX is dead. Netware *could* work over TCP or UDP. IP is definitely king. Even micro$haft gave up on NetBEUI. -Dan
Re: zero-copy TCP From: Henning P. Schmiedehausen (hps@tanstaafl.de) Date: Tue Sep 05 2000 - 14:32:46 EST On Tue, Sep 05, 2000 at 11:25:12AM -0700, Dan Hollis wrote: > On 5 Sep 2000, Henning P. Schmiedehausen wrote: > > jmerkey@timpanogas.com (Jeff V. Merkey) writes: > > >IPX is a really good LAN protocol (but totally sucks for internet). A > > Jeff, Netware is dead. Please leave it there. IP won. The number of > > new Netware Installations (as compared to existing or just upgrades) > > is close (really close) to nil. > > I think you mean IPX is dead. Netware *could* work over TCP or UDP. > IP is definitely king. Even micro$haft gave up on NetBEUI. Yep, that's what I meant. Sorry that I was not clearer. But I think that even with NetWare on IP there are not many new installations. There is lots of migration of existing servers and keeping existing systems alive, but new rollouts? But then again, maybe with MANOS and OpenNetWare, everything will be different. Regards Henning -- Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer INTERMETA - Gesellschaft fuer Mehrwertdienste mbH hps@intermeta.de Am Schwabachgrund 22 Fon.: 09131 / 50654-0 info@intermeta.de D-91054 Buckenhof Fax.: 09131 / 50654-20
Re: zero-copy TCP From: Chris Wedgwood (cw@f00f.org) Date: Tue Sep 05 2000 - 14:20:31 EST On Tue, Sep 05, 2000 at 03:34:02PM +0200, Henning P. Schmiedehausen wrote: > And it would be a good bit of necrophilia, too. > > Jeff, Netware is dead. Please leave it there. IP won. The number of > new Netware Installations (as compared to existing or just upgrades) > is close (really close) to nil. Sadly neither of these comments is true -- there are still a great many NetWare installations and many of the existing installations are far from dead as they move to IP... --cw