In this installment of Real World Troubleshooting, I would like to tell you about a time when I was working as the wireless engineer for a hospital, as in, I was an actual employee for once and not a contractor. This was the time the maternity ward was having a lot of issues with the Stanley HUGS Infant Protection System, colloquially known as the 'Baby LoJack'. The system creates a geofenced area where the babies are supposed to be, and anything outside that area triggers alarms that not only alert the staff but also set off a series of events to further safeguard the infants, such as shutting and locking the doors to the unit, disabling the elevators to the unit, and calling the police, who immediately have to run up four flights of stairs.
The event would occur randomly, with no infants being abducted, misplaced, stapled, folded, or bent. It is a show-stopper, to say the least, and impacts not only workflow but patient safety as well. Stanley was brought in to evaluate their equipment and found nothing wrong with it. This is when the network team, and I specifically, got involved.
The very first thing I did was collect anecdotal data from the nurses and other staff who were impacted. The next thing I did was check the logs to see if there were any outages or anything else I might need to know about. The data collected from the nurses was the usual: "it just happens for no reason", "it's not any one device, we checked all the bracelets"... etc. One nurse mentioned that "it's always in this room right here".
Wait, what..?
The event, random as it seemed, always involved an infant in one particular room. Now I had something to work with! I began checking the access points near the room and started looking at uptimes, signal strength, etc. The closest access point to this room was right outside the door in the corner of the hall. This room is built out from the 90 degree angle that is the corner of the building. So imagine a giant square room that got lego'ed to the corner of a building. The signal strength in the furthest corner of the room with the door shut was -67 dBm, which is sufficient primary coverage for the things we were doing at the time. That access point had been up for a very long time and had recent and current associations to it; essentially, there was nothing wrong with it. I sat in that room for quite a while running a continuous ping to see if there were any drops in the link.
Finally, I asked the nurse where the crib goes in the room. Because the room is sized as a birthing suite, the crib could be located in one of two areas depending on what or who all else was in the room: near the mother's bed or near the window. The signal near the window was the weakest at -67 dBm and the area by the bed was -64 dBm, so the room had sufficient signal.
I took two laptops and set them up in the two spots to run a continuous ping to my office computer and let them sit for a few hours in an attempt to see if anything dropped connectivity, which up to now hadn't been happening. I came back after lunch and checked on the laptops. The one nearest the bed was still associated to the access point in the hall outside the door, but the other one had roamed to an access point upstairs on the 5th floor, with no dropped pings.
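For anyone who wants to repeat this kind of soak test, here is a minimal sketch in Python of how the two laptop logs could be summarized afterwards. It is not what I ran at the time. The `Sample` record, the access point labels ("hall-4F", "upstairs-5F"), and the one-minute sampling interval are all made up for illustration, and it assumes you already have some way of recording which access point is serving the laptop at each sample (on a controller-based system that information typically comes from the controller).

```python
from dataclasses import dataclass


@dataclass
class Sample:
    seconds: int   # time since the test started
    ap: str        # access point serving the laptop (hypothetical labels)
    ping_ok: bool  # True if the ping to the office computer got a reply


def summarize(samples):
    """Return (roams, dropped) for one laptop's log.

    roams is a list of (seconds, old_ap, new_ap) tuples;
    dropped is the number of pings that got no reply.
    """
    roams = []
    dropped = 0
    previous = None
    for sample in samples:
        if not sample.ping_ok:
            dropped += 1
        if previous is not None and sample.ap != previous.ap:
            roams.append((sample.seconds, previous.ap, sample.ap))
        previous = sample
    return roams, dropped


# Illustrative data only: the bed laptop never moves, the window laptop roams.
bed_log = [Sample(0, "hall-4F", True), Sample(60, "hall-4F", True)]
window_log = [
    Sample(0, "hall-4F", True),
    Sample(60, "hall-4F", True),
    Sample(120, "upstairs-5F", True),
]

bed_roams, bed_dropped = summarize(bed_log)
window_roams, window_dropped = summarize(window_log)
```

The point of tracking roams separately from drops is that a ping test alone would have shown a perfectly healthy link in both spots.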
Raise your hand if you see where this is going...😉
All of a sudden it made sense. The reason it was so random was that the babies were not always in the same spot in the room. Whenever the baby was placed near the window, the bracelet could (but did not always) associate to the access point on the floor above, which was outside the geofenced area, and that set off the chain of events that leads to the Great Unpleasantness.
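The failure mode can be sketched as a toy model in Python. This is a deliberate simplification, not how the HUGS system or the wireless controller actually makes its decision: it assumes the bracelet simply ends up on whichever access point is heard loudest, and that the alarm fires whenever that access point is not on an allowed list. The -64 dBm and -67 dBm readings for the hall access point are the ones I measured; the 5th floor readings and the access point labels are invented for illustration.

```python
# Hypothetical allowed list: access points inside the protected area.
ALLOWED_APS = {"hall-4F"}


def serving_ap(rssi_dbm):
    """Pick the access point with the strongest signal (closer to 0 dBm is stronger)."""
    return max(rssi_dbm, key=rssi_dbm.get)


def triggers_alarm(rssi_dbm, allowed=ALLOWED_APS):
    """True if the serving access point is outside the geofenced area."""
    return serving_ap(rssi_dbm) not in allowed


# Hall readings are from the survey; the 5th floor readings are made up.
near_bed = {"hall-4F": -64, "upstairs-5F": -70}
near_window = {"hall-4F": -67, "upstairs-5F": -66}

bed_alarm = triggers_alarm(near_bed)
window_alarm = triggers_alarm(near_window)
```

With numbers this close, a swing of a dB or two near the window is enough to flip the result, which fits with the bracelet only sometimes ending up on the wrong floor.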
Now that I know what's wrong, how do I fix it?
In this case the hospital was suffering with a Meru wireless system, which is a single-channel architecture (SCA), the kind your CWNA book tells you to replace with a multi-channel architecture as a valid remediation of SCA issues. Transmit power was at the full 20 dBm and every access point was on channel 36, because the controller is what controls the clients' roaming, not the client. Adding another client-serving access point inside the room would have been overkill, with two access points that close together. Moving the access point from the hall into the room would have fixed the one problem but caused another further down the hall. I decided the best way to resolve this was to have another access point installed in the room that was set to monitor mode only, which increased the signal inside the room, making the 5th floor access point less attractive, but not adding any client functionality that could have interfered with the rest of the environment.
Immediately after placing the access point, there were no more instances of vanishing babies and life was relatively good again... crisis averted for now.
