SFP modules on a board running Linux

We recently worked on Linux support for a custom hardware platform based on the Texas Instruments AM335x system-on-chip, with a somewhat special networking setup: each of the two ports of the AM335x Ethernet MAC was connected to a Microchip VSC8572 Ethernet PHY, which in turn gave access to an SFP cage. In addition, the I2C buses connected to the SFP cages, which are used at runtime to communicate with the inserted SFP modules, were connected to the I2C controller embedded in the VSC8572 PHYs instead of being connected to an I2C controller of the system-on-chip, as they usually are.

The below diagram depicts the overall hardware layout:

Our goal was to use Linux and to offer dynamic runtime reconfiguration of the networking links based on the SFP module plugged in. To achieve this, we used and extended a combination of Linux kernel internal frameworks, such as Phylink and the SFP bus support, and of networking device drivers. In this blog post, we’ll share some background information about these technologies, the challenges we faced and our current status.

Introduction to the SFP interface

The small form-factor pluggable (SFP) is a hot-pluggable network interface module. Its electrical interface and its form factor are well specified, which allows industry players to build platforms that can host SFP modules, and be sure that they will be able to use any available SFP module on the market. It is commonly used in the networking industry as it allows connecting various types of transceivers to a fixed interface.

An SFP cage provides, in addition to the data signals, a number of control signals:

  • a Tx_Fault pin, for transmitter fault indication
  • a Tx_Disable pin, for disabling optical output
  • a MOD_Abs pin, to detect the absence of a module
  • an Rx_LOS pin, to denote a receiver loss of signal
  • 2-wire data and clock lines, used to communicate with the modules

Modules plugged into SFP cages can be direct attached cables, in which case they do not have any built-in transceiver, or they can include a transceiver (i.e. an embedded PHY), which transforms the signal into another format. This means that in our setup, there can be two PHYs between the Ethernet MAC and the physical medium: the Microchip VSC8572 PHY and the PHY embedded into the SFP module that is plugged in.

All SFP modules embed an EEPROM, accessible at a standardized I2C address and with a standardized format, which allows the host system to discover which SFP modules are connected and what their capabilities are. In addition, if the SFP module contains an embedded PHY, it is also accessible through the same I2C bus.
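
As an illustration of that standardized format, here is a minimal Python sketch that decodes a few identification fields from the first 96 bytes read at I2C address 0x50. The byte offsets follow the SFF-8472 base ID layout as we understand it. The function names, the dictionary keys and the sample EEPROM image are made up for this example.

```python
def parse_sfp_base_id(eeprom: bytes) -> dict:
    """Decode a few base ID fields of an SFP EEPROM (I2C address 0x50)."""
    if len(eeprom) < 96:
        raise ValueError("need at least the first 96 bytes of the EEPROM")

    def text(start: int, end: int) -> str:
        # Text fields are ASCII, padded on the right with spaces.
        return bytes(eeprom[start:end]).decode("ascii", errors="replace").rstrip()

    return {
        "identifier": eeprom[0],                  # 0x03 identifies an SFP module
        "connector": eeprom[2],
        "nominal_bitrate_mbd": eeprom[12] * 100,  # stored in units of 100 MBd
        "vendor_name": text(20, 36),
        "vendor_pn": text(40, 56),
        "vendor_rev": text(56, 60),
        "vendor_sn": text(68, 84),
        # CC_BASE (byte 63) holds the low 8 bits of the sum of bytes 0..62.
        "checksum_ok": (sum(eeprom[0:63]) & 0xFF) == eeprom[63],
    }


def make_sample_eeprom() -> bytes:
    """Build a made-up EEPROM image, for demonstration only."""
    image = bytearray(96)
    image[0] = 0x03                       # SFP identifier
    image[2] = 0x07                       # LC connector
    image[12] = 13                        # 1300 MBd
    image[20:36] = b"ACME".ljust(16)      # hypothetical vendor
    image[40:56] = b"SFP-DEMO-1G".ljust(16)
    image[56:60] = b"A".ljust(4)
    image[68:84] = b"SN0001".ljust(16)
    image[63] = sum(image[0:63]) & 0xFF   # CC_BASE checksum
    return bytes(image)


info = parse_sfp_base_id(make_sample_eeprom())
print(info["vendor_name"], info["vendor_pn"], info["checksum_ok"])
```

On real hardware the buffer would come from an I2C read of the module rather than from `make_sample_eeprom()`.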

Challenges

We had to overcome a few challenges to get this setup working, using a mainline Linux kernel.

As we discussed earlier, having SFP modules means the whole MAC-PHY-SFP link has to be reconfigured at runtime, as the PHY in the SFP module is hot-pluggable. To solve this issue, a framework called Phylink was introduced in mid-2017 to represent networking links, allowing their components to share state and to be reconfigured at runtime. For us, this meant we first had to convert the CPSW MAC driver to use the Phylink framework. For a detailed explanation of what composes Ethernet links and why Phylink is needed, see the talk we gave at the Embedded Linux Conference Europe in 2018. While we were working on this, and after we had first moved the CPSW MAC driver to Phylink, this driver was rewritten and a new CPSW MAC driver was sent upstream (CONFIG_TI_CPSW vs CONFIG_TI_CPSW_SWITCHDEV). We are still using the old driver for now, which is why we did not send our patches upstream: we think it does not make sense to convert a driver that is now deprecated.
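
To give a feel for what "reconfigured at runtime" means, here is a small Python sketch of a link whose host-side interface mode follows the module that is plugged in. It is a simplified model, not kernel code and not the Phylink API. The bit positions are the Ethernet compliance bits from byte 6 of the SFP EEPROM as we understand the SFF-8472 layout, and the `Link` class and the selection policy are assumptions made for the example.

```python
from typing import Optional

# Ethernet compliance bits, byte 6 of the SFP EEPROM (SFF-8472 layout).
E1000_BASE_SX = 1 << 0
E1000_BASE_LX = 1 << 1
E1000_BASE_CX = 1 << 2
E1000_BASE_T = 1 << 3
E100_BASE_LX = 1 << 4
E100_BASE_FX = 1 << 5


def select_interface(compliance: int) -> str:
    """Pick a host-side interface mode for a module (illustrative policy)."""
    if compliance & E1000_BASE_T:
        # Copper modules embed a PHY, assumed here to talk SGMII to the host.
        return "sgmii"
    if compliance & (E1000_BASE_SX | E1000_BASE_LX | E1000_BASE_CX):
        return "1000base-x"
    if compliance & (E100_BASE_LX | E100_BASE_FX):
        return "100base-x"
    raise ValueError(f"unsupported compliance bits: {compliance:#04x}")


class Link:
    """Hypothetical MAC-PHY-SFP link whose mode follows the plugged module."""

    def __init__(self) -> None:
        self.interface: Optional[str] = None

    def module_inserted(self, compliance: int) -> None:
        # Every component of the link would be reconfigured at this point.
        self.interface = select_interface(compliance)

    def module_removed(self) -> None:
        self.interface = None


link = Link()
link.module_inserted(E1000_BASE_LX)
print(link.interface)   # 1000base-x
link.module_removed()
link.module_inserted(E1000_BASE_T)
print(link.interface)   # sgmii
```

The point of the model is that the mode is not fixed at boot: it is recomputed on every insertion, which is the kind of state sharing a link-level framework has to coordinate between the MAC, the PHY and the SFP cage.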

A second challenge was to integrate the 2-wire capability of the VSC8572 PHY into the networking PHY and SFP common code, as the I2C bus of our SFP modules is connected to the PHY and not to an I2C controller of the system-on-chip. We decided to expose this 2-wire capability of the PHY as an SMBus controller, as the functionality offered by the PHY does not make it a fully I2C-compliant controller.
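
The practical consequence of being limited to SMBus-style transfers can be sketched as follows. This is a Python model with a fake adapter, assuming a controller that can only read one register byte per transaction. The class and function names are hypothetical and do not correspond to the kernel API or to our driver.

```python
class FakeSmbusAdapter:
    """Stand-in for a controller limited to SMBus "read byte data" transfers."""

    def __init__(self, devices: dict) -> None:
        self.devices = devices      # I2C address -> register contents
        self.transfers = 0          # number of bus transactions issued

    def read_byte_data(self, address: int, register: int) -> int:
        self.transfers += 1
        return self.devices[address][register]


def read_block(adapter: FakeSmbusAdapter, address: int,
               start: int, length: int) -> bytes:
    """Read `length` consecutive registers, one byte per transaction."""
    return bytes(adapter.read_byte_data(address, start + offset)
                 for offset in range(length))


# Fake module EEPROM at address 0x50, filled with its own offsets.
adapter = FakeSmbusAdapter({0x50: bytes(range(256))})
vendor_field = read_block(adapter, 0x50, 20, 16)
print(vendor_field.hex(), adapter.transfers)
```

A full I2C controller could fetch the same 16 bytes in a single sequential read. With byte-sized transfers the data is still reachable, at the cost of one transaction per byte.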

Outcome

The challenges described above made the project quite complex overall, but we were able to get SFP modules working, and to dynamically switch modes depending on the capabilities of the one currently plugged in. We tested with both direct attached cables and a wide variety of SFP modules of different speeds and functionality. At the moment, only a few patches have been sent upstream, but we’ll contribute more over time.

For an overview of some of the patches we made and used, we pushed a branch on GitHub (be aware that those patches aren’t upstream yet and will need some further work to be acceptable upstream). Here are the details of the patches:

In terms of Device Tree representation, we first have a description of the two SFP cages. They describe the different GPIOs used for the control signals, as well as the I2C bus that goes to each SFP cage. Note that gpio_sfp is a GPIO expander, itself on I2C, rather than GPIOs provided directly by the system-on-chip.

/ {
       sfp_eth0: sfp-eth0 {
               compatible = "sff,sfp";
               i2c-bus = <&phy0>;
               los-gpios = <&gpio_sfp 3 GPIO_ACTIVE_HIGH>;
               mod-def0-gpios = <&gpio_sfp 4 GPIO_ACTIVE_LOW>;
               tx-disable-gpios = <&gpio_sfp 5 GPIO_ACTIVE_HIGH>;
               tx-fault-gpios = <&gpio_sfp 6 GPIO_ACTIVE_HIGH>;
       };

       sfp_eth1: sfp-eth1 {
               compatible = "sff,sfp";
               i2c-bus = <&phy1>;
               los-gpios = <&gpio_sfp 10 GPIO_ACTIVE_HIGH>;
               mod-def0-gpios = <&gpio_sfp 11 GPIO_ACTIVE_LOW>;
               tx-disable-gpios = <&gpio_sfp 13 GPIO_ACTIVE_HIGH>;
               tx-fault-gpios = <&gpio_sfp 12 GPIO_ACTIVE_HIGH>;
       };
};
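
The GPIO_ACTIVE_LOW and GPIO_ACTIVE_HIGH flags tell the kernel how to translate the electrical level of each pin into a logical "asserted" state. The following Python sketch models that translation with the polarities of the sfp_eth0 node above. The `cage_status` helper and its status strings are our own illustration, not something the SFP driver exposes.

```python
ACTIVE_HIGH = 0
ACTIVE_LOW = 1

# Polarity of each control signal, copied from the sfp_eth0 node.
SFP_ETH0_POLARITY = {
    "los": ACTIVE_HIGH,
    "mod-def0": ACTIVE_LOW,
    "tx-disable": ACTIVE_HIGH,
    "tx-fault": ACTIVE_HIGH,
}


def logical_value(raw_level: int, polarity: int) -> bool:
    """Return True when the signal is asserted, given its electrical level."""
    return bool(raw_level) != (polarity == ACTIVE_LOW)


def cage_status(raw_levels: dict, polarity: dict) -> str:
    """Summarize the state of a cage from raw pin levels (illustrative)."""
    signals = {name: logical_value(raw_levels[name], polarity[name])
               for name in polarity}
    if not signals["mod-def0"]:
        return "empty"          # no module detected
    if signals["tx-fault"]:
        return "tx fault"
    if signals["los"]:
        return "no signal"      # receiver reports loss of signal
    return "ready"


# mod-def0 is active low: a low level means a module is present.
print(cage_status({"los": 0, "mod-def0": 0, "tx-disable": 0, "tx-fault": 0},
                  SFP_ETH0_POLARITY))   # ready
print(cage_status({"los": 0, "mod-def0": 1, "tx-disable": 0, "tx-fault": 0},
                  SFP_ETH0_POLARITY))   # empty
```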

Then the MAC is described as follows:

&mac {
       pinctrl-names = "default";
       pinctrl-0 = <&cpsw_default>;
       status = "okay";
       dual_emac;
};

&cpsw_emac0 {
       status = "okay";
       phy = <&phy0>;
       phy-mode = "rgmii-id";
       dual_emac_res_vlan = <1>;
};

&cpsw_emac1 {
       status = "okay";
       phy = <&phy1>;
       phy-mode = "rgmii-id";
       dual_emac_res_vlan = <2>;
};

So we have both ports of the MAC enabled, each with an RGMII interface to its PHY. Finally, the MDIO bus of the system-on-chip is described as follows. We have two sub-nodes, one for each VSC8572 PHY, at addresses 0x0 and 0x1 respectively on the CPSW MDIO bus. Each PHY is connected to its respective SFP cage node (sfp_eth0 and sfp_eth1) and provides access to the SFP EEPROMs as regular EEPROMs.

&davinci_mdio {
       pinctrl-names = "default";
       pinctrl-0 = <&davinci_mdio_default>;
       status = "okay";

       phy0: ethernet-phy@0 {
               #address-cells = <1>;
               #size-cells = <0>;

               reg = <0>;
               fiber-mode;
               vsc8584,los-active-low;
               sfp = <&sfp_eth0>;

               sfp0_eeprom: eeprom@50 {
                       compatible = "atmel,24c02";
                       reg = <0x50>;
                       read-only;
               };

               sfp0_eeprom_ext: eeprom@51 {
                       compatible = "atmel,24c02";
                       reg = <0x51>;
                       read-only;
               };
       };

       phy1: ethernet-phy@1 {
               #address-cells = <1>;
               #size-cells = <0>;

               reg = <1>;
               fiber-mode;
               vsc8584,los-active-low;
               sfp = <&sfp_eth1>;

               sfp1_eeprom: eeprom@50 {
                       compatible = "atmel,24c02";
                       reg = <0x50>;
                       read-only;
               };

               sfp1_eeprom_ext: eeprom@51 {
                       compatible = "atmel,24c02";
                       reg = <0x51>;
                       read-only;
               };
       };
};

Conclusion

While we are still working on pushing all of this work upstream, we’re happy to have been able to work on these topics. Do not hesitate to reach out to us if you have projects that involve Linux and SFP modules!

Feedback from the Netdev 2.1 conference

At Bootlin, we regularly work on networking topics as part of our Linux kernel contributions, and thus we decided to attend our very first Netdev conference this year in Montreal. With the recent evolution of the network subsystem and of its drivers’ capabilities, the conference was a very good opportunity to stay up-to-date, thanks to lots of interesting sessions.

Eric Dumazet presenting “Busypolling next generation”

The speakers and the Netdev committee did an impressive job by offering such a great schedule, and the recorded talks are already available on the Netdev YouTube channel. We particularly liked a few of those talks.

Distributed Switch Architecture – slides – video

Andrew Lunn, Vivien Didelot and Florian Fainelli presented DSA, the Distributed Switch Architecture, by giving an overview of what DSA is and then presenting its design. They completed their talk by discussing the future of this subsystem.

DSA in one slide

The goal of the DSA subsystem is to support Ethernet switches connected to the CPU through an Ethernet controller. The distributed part comes from the possibility to have multiple switches connected together through dedicated ports. DSA was introduced nearly 10 years ago but was mostly quiet and only recently came back to life thanks to contributions made by the authors of this talk, its maintainers.

The main idea of DSA is to reuse the available internal representations and tools to describe and configure the switches. Ports are represented as Linux network interfaces to allow userspace to configure them using common tools; the Linux bridging concept is used for interface bridging, and the Linux bonding concept for port trunks. A switch handled by DSA is not seen as a special device with its own control interface but rather as a hardware accelerator for specific networking capabilities.

DSA has its own data plane, where the switch ports are slave interfaces and the Ethernet controller connected to the SoC is the master one. Tagging protocols are used to direct the frames to a specific port when coming from the SoC, as well as when received by the switch. For example, the RX path has an extra check after netif_receive_skb() so that if DSA is used, the frame can be tagged and reinjected into the network stack RX flow.
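
The role of a tagging protocol can be illustrated with a toy example. The tag format below is entirely made up for this sketch (a 4-byte tag inserted after the MAC addresses, using an IEEE local experimental EtherType as marker) and does not match any real switch vendor's protocol. The interface names are hypothetical too.

```python
import struct

# Made-up tag: 2-byte marker followed by a 2-byte port number.
TOY_TAG_MARKER = 0x88B5     # IEEE "local experimental" EtherType
MAC_ADDRESSES_LEN = 12      # destination + source MAC addresses
TOY_TAG_LEN = 4


def add_tag(frame: bytes, port: int) -> bytes:
    """TX path: tell the switch which port the frame must leave from."""
    tag = struct.pack("!HH", TOY_TAG_MARKER, port)
    return frame[:MAC_ADDRESSES_LEN] + tag + frame[MAC_ADDRESSES_LEN:]


def strip_tag(frame: bytes) -> tuple:
    """RX path: recover the ingress port and the original frame."""
    tag_end = MAC_ADDRESSES_LEN + TOY_TAG_LEN
    marker, port = struct.unpack("!HH", frame[MAC_ADDRESSES_LEN:tag_end])
    if marker != TOY_TAG_MARKER:
        raise ValueError("frame does not carry the toy tag")
    return port, frame[:MAC_ADDRESSES_LEN] + frame[tag_end:]


def receive(frame: bytes, slave_interfaces: dict) -> tuple:
    """Hand an untagged frame to the slave interface matching its port."""
    port, untagged = strip_tag(frame)
    return slave_interfaces[port], untagged


original = bytes(12) + struct.pack("!H", 0x0800) + b"payload"
tagged = add_tag(original, port=1)
interface, frame = receive(tagged, {0: "lan0", 1: "lan1"})
print(interface, frame == original)
```

The master interface only ever sees tagged frames. It is the tag that lets the stack present each switch port as a separate network interface.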

Finally, they talked about the relationship between DSA and Switchdev, and cross-chip configuration for interconnected switches. They also presented the upcoming changes in DSA as well as long-term goals.

Memory bottlenecks – slides

As part of the network performance workshop, Jesper Dangaard Brouer presented memory bottlenecks in the allocators caused by specific network workloads, and how to deal with them. The SLAB/SLUB baseline performance is found to be too slow, particularly when using XDP. One way for a driver to work around this issue is to implement a custom page recycling mechanism, and that’s what all high-speed drivers do. He then displayed some data to show why this mechanism is needed when targeting the 10G network budget.
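
To get an order of magnitude for that budget, the per-packet time at line rate can be computed directly. This is a back-of-the-envelope Python calculation, assuming the budget refers to minimum-size (64-byte) Ethernet frames and counting the preamble, start-of-frame delimiter and inter-frame gap as per-frame overhead on the wire.

```python
# Per-frame overhead on the wire: preamble (7), start-of-frame
# delimiter (1) and inter-frame gap (12), in bytes.
ETHERNET_OVERHEAD_BYTES = 7 + 1 + 12


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate of a link for a given frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def budget_ns(link_bps: float, frame_bytes: int) -> float:
    """Time available to process one frame at line rate, in nanoseconds."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)


pps = packets_per_second(10e9, 64)
print(f"{pps / 1e6:.2f} Mpps, {budget_ns(10e9, 64):.1f} ns per packet")
# prints "14.88 Mpps, 67.2 ns per packet"
```

With only a few tens of nanoseconds per packet, the cost of going through the general-purpose allocator for every packet becomes a significant part of the budget.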

Jesper is working on a generic solution called page pool and sent a first RFC at the end of 2016. As mentioned in the cover letter, it’s still not ready for inclusion and was only sent for early review. He also gave a short overview of his implementation.
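
The general idea behind page recycling can be shown with a very small model. This Python sketch is not Jesper's page pool, nor any driver's implementation. It only illustrates the principle of keeping released pages on a free list so that the next packet can reuse them instead of going back to the allocator. All names are hypothetical.

```python
class PagePool:
    """Toy recycling pool: reuse released pages before allocating new ones."""

    def __init__(self, page_size: int = 4096, capacity: int = 256) -> None:
        self.page_size = page_size
        self.capacity = capacity        # maximum number of cached pages
        self._free = []                 # pages waiting to be reused
        self.fresh_allocations = 0      # trips to the "real" allocator

    def alloc(self) -> bytearray:
        if self._free:
            return self._free.pop()     # fast path: recycled page
        self.fresh_allocations += 1
        return bytearray(self.page_size)    # stands in for the page allocator

    def recycle(self, page: bytearray) -> None:
        if len(self._free) < self.capacity:
            self._free.append(page)
        # Otherwise the page is dropped, i.e. returned to the allocator.


pool = PagePool()
for _ in range(1000):
    page = pool.alloc()     # receive a packet into a page
    pool.recycle(page)      # packet processed, page released
print(pool.fresh_allocations)   # 1
```

In this steady-state loop only the very first packet pays for a fresh allocation. Every following one takes the recycled page.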

DDoS countermeasures with XDP – slides #1, slides #2 – video #1, video #2

These two talks were given by Gilberto Bertin from Cloudflare and Martin Lau from Facebook. While they were not talking about device driver implementation or improvements in the network stack directly related to what we do at Bootlin, it was nice to see how XDP is used in production.

XDP, the eXpress Data Path, provides a programmable data path at the lowest point of the network stack by processing RX packets directly out of the drivers’ RX ring queues. It’s quite new and is an answer to the many userspace-based solutions such as DPDK. Gilberto and Martin showed excellent results, confirming the usefulness of XDP.

From a driver point of view, some changes are required to support it: RX hooks must be added, some API changes are needed, and the driver’s memory model often needs to be updated. So far, in v4.10, only a few drivers support XDP.
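
To make the idea of a programmable data path more concrete, here is a Python simulation of the kind of decision an XDP program takes for each received frame. Real XDP programs are eBPF programs, typically written in restricted C, and run inside the kernel. This sketch only mimics the logic of a simple source-address filter, and the frame builder and addresses are made up for the example.

```python
import ipaddress
import struct

XDP_DROP = "XDP_DROP"
XDP_PASS = "XDP_PASS"
ETH_P_IP = 0x0800
ETH_HEADER_LEN = 14
IPV4_SRC_OFFSET = ETH_HEADER_LEN + 12   # source address inside the IPv4 header


def verdict(frame: bytes, blocked_sources: set) -> str:
    """Decide what to do with a raw frame, as early as possible."""
    if len(frame) < ETH_HEADER_LEN + 20:
        return XDP_PASS                 # too short for Ethernet + IPv4
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETH_P_IP:
        return XDP_PASS                 # only IPv4 is filtered here
    source = ipaddress.IPv4Address(frame[IPV4_SRC_OFFSET:IPV4_SRC_OFFSET + 4])
    return XDP_DROP if source in blocked_sources else XDP_PASS


def make_ipv4_frame(source: str) -> bytes:
    """Build a minimal, made-up Ethernet + IPv4 frame for the demo."""
    ethernet = bytes(12) + struct.pack("!H", ETH_P_IP)
    ip_header = bytearray(20)
    ip_header[0] = 0x45                 # IPv4, 20-byte header
    ip_header[12:16] = ipaddress.IPv4Address(source).packed
    return ethernet + bytes(ip_header)


blocked = {ipaddress.IPv4Address("192.0.2.1")}
print(verdict(make_ipv4_frame("192.0.2.1"), blocked))     # XDP_DROP
print(verdict(make_ipv4_frame("198.51.100.7"), blocked))  # XDP_PASS
```

Dropping unwanted traffic at this stage, before any socket buffer is allocated, is what makes the approach attractive for DDoS mitigation.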

XDP MythBusters – slides – video

David S. Miller, the maintainer of the Linux networking stack and drivers, gave an interesting keynote about XDP and eBPF. The eXpress Data Path clearly was the hot topic of this Netdev 2.1 conference, with lots of talks related to the concept, and David gave a good overview of what XDP is, its purposes, advantages and limitations. He also quickly covered eBPF, the extended Berkeley Packet Filter, which is used in XDP to filter packets.

This presentation was a comprehensive introduction to the concepts introduced by XDP and its different use cases.

Conclusion

Netdev 2.1 was an excellent experience for us. The conference was well organized, the single-track format allowed us to see every session on the schedule, and meeting with attendees and speakers was easy. The content was highly technical and an excellent opportunity to stay up-to-date with the latest changes of the networking subsystem in the kernel. The conference hosted talks about both in-kernel topics and their use in userspace, which we think is a very good approach: it avoids focusing only on the kernel side and keeps us aware of the users’ needs and use cases.

Bootlin at the Netdev 2.1 conference

Netdev 2.1 is the fourth edition of the technical conference on Linux networking. This conference is driven by the community and focuses on both the kernel networking subsystems (device drivers, net stack, protocols) and their use in user-space.

This edition will be held in Montreal, Canada, from April 6 to 8. The schedule has been posted recently, featuring among other things a talk giving an overview and the current status of the Distributed Switch Architecture (DSA), and a workshop about how to enable drivers to cope with heavy workloads and improve performance.

At Bootlin, we regularly work on networking-related topics, especially as part of our Linux kernel contributions for the support of Marvell or Annapurna Labs ARM SoCs. Therefore, we decided to attend our first Netdev conference to stay up-to-date with the capabilities of the network subsystem and network drivers, and to learn about the community’s latest developments.

Our engineer Antoine Ténart will be representing Bootlin at this event. We’re looking forward to being there!