Tropo suffered two incidents in the last week: a bug that left users without international dialing permissions unable to dial US numbers unless they included the US country code, and a partial outage late Friday night. We share incident reports with affected customers and post brief reports to our status page, but as part of our commitment to transparency, we want to highlight these two incidents, how they happened, and what we’re doing to prevent them in the future.
10 Digit Dialing Failure
In the first incident, some applications that dialed US numbers in 10-digit form (for example, call('4155551212')) received error messages indicating they did not have sufficient permissions to dial that number. Accounts with permissions to dial outside the United States did not experience this issue.
First, some background information. Tropo is used both in our hosted service (the Tropo.com site you’re looking at now) and by telephone carriers within their own networks, powering their own developer programs and internal product development. These telcos are located around the world, and Tropo’s validation of outgoing phone numbers was US-centric. A recent software release added a feature that allows fine-grained blacklists and whitelists instead of a simplistic determination of “US or not US.” It also added a configuration option that transforms telephone numbers to match the local format. Thus, a Tropo installation in the UK can allow people to dial without the +44 UK country code, and Tropo will format the number automatically.
A bug in how these two features worked together caused phone numbers that did not include a country code to be incorrectly compared to the blacklist pattern. As a result, all calls to US numbers dialed without a country code were incorrectly blocked. Accounts with international permissions had a broader whitelist of allowable numbers, which let their calls pass through.
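To make this class of bug concrete, here is a minimal Python sketch. It is not Tropo's actual code: the function names, the default country code setting, and the whitelist pattern are all hypothetical, chosen only to illustrate the idea. The point it shows is that a number should be normalized to one canonical form before it is compared to any permission pattern, so that every way of dialing the same destination gets the same answer.

```python
import re

# Hypothetical per-installation settings; names and patterns are illustrative only.
DEFAULT_COUNTRY_CODE = "1"           # assumption: a US installation
ALLOWED_PATTERNS = [r"^\+1\d{10}$"]  # assumption: whitelist for a US-only account


def normalize(number, country_code=DEFAULT_COUNTRY_CODE):
    """Return the number in +<country code><national number> form."""
    digits = re.sub(r"[^\d+]", "", number)  # drop spaces, dashes, parentheses
    if digits.startswith("+"):
        return digits  # already carries a country code
    if country_code == "1" and len(digits) == 11 and digits.startswith("1"):
        return "+" + digits  # 11-digit US form such as 14155551212
    return "+" + country_code + digits  # local form: prepend the default code


def is_allowed(number, patterns=ALLOWED_PATTERNS):
    """Check permissions against the normalized form, never the raw input."""
    normalized = normalize(number)
    return any(re.match(pattern, normalized) for pattern in patterns)


# Every way of writing the same US number should be treated identically.
for form in ["4155551212", "14155551212", "+14155551212", "(415) 555-1212"]:
    assert is_allowed(form)

# A UK number stays blocked for this US-only account.
assert not is_allowed("+44 20 7123 4567")
```

The loop at the end doubles as a small regression test: it exercises the 10-digit, 11-digit, and full international forms of one number rather than only a single format.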
Tropo’s production environment was updated with this release at 10:58 PM Pacific on October 8. A small number of calls began failing, but only for applications that were in production, used 10-digit dialing, and belonged to accounts with restrictions on which countries could be dialed. Because call volume is low at that time of night, and calls matching those conditions are relatively rare, neither we nor affected customers noticed the issue until the morning. As a temporary measure, we granted expanded permissions to affected accounts so that their calls would work again, and we implemented a workaround for Tropo as a whole at 12:09 PM Pacific on October 9. The next release will contain a more permanent fix.
Our QA team is located outside the US and tests using telephone numbers local to them. When verifying the configuration of the software, they made a number of outgoing calls, but only to their local numbers, which required the full country code to be dialed. Our active monitoring also failed to detect the issue, because it occurred only when several specific conditions were all present at once.
To prevent this from happening again, our tests are being extended to include all valid forms of US and non-US numbers. We will run these tests from the US or provide the QA group with US telephones. We are also investigating additional passive monitoring to better detect increases in the rate of outgoing calls that fail for any reason, so that we can catch issues like this sooner.
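As a rough illustration of what passive failure-rate monitoring could look like, the Python sketch below tracks outgoing call outcomes in a sliding time window and flags when the share of failures crosses a threshold. It is a sketch under stated assumptions, not a description of Tropo's monitoring: the class name, window length, minimum call count, and threshold are all hypothetical.

```python
from collections import deque


class FailureRateMonitor:
    """Track outgoing call outcomes in a sliding time window (hypothetical design)."""

    def __init__(self, window_seconds=900, min_calls=20, threshold=0.10):
        self.window_seconds = window_seconds  # how far back to look
        self.min_calls = min_calls            # ignore windows with too few calls
        self.threshold = threshold            # alert above this failure fraction
        self.events = deque()                 # (timestamp, failed) pairs, oldest first

    def record(self, timestamp, failed):
        """Record one call outcome and drop events that left the window."""
        self.events.append((timestamp, failed))
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def failure_rate(self):
        """Fraction of calls in the current window that failed."""
        if not self.events:
            return 0.0
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events)

    def should_alert(self):
        """True when there are enough calls and too many of them failed."""
        return len(self.events) >= self.min_calls and self.failure_rate() > self.threshold


# Usage: timestamps are seconds; values here are made up for illustration.
monitor = FailureRateMonitor(window_seconds=60, min_calls=5, threshold=0.2)
for t in range(4):
    monitor.record(t, failed=False)
monitor.record(4, failed=True)
monitor.record(5, failed=True)
assert monitor.should_alert()  # 2 failures out of 6 calls exceeds 20%
```

The minimum call count guards against alerting on one or two failures, but it also means a low-volume period like the one in this incident could stay silent, so in practice that floor would need careful tuning or a separate check per error type.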
In the second incident, a large number of Tropo servers in our Orlando datacenter became inaccessible from outside our network. Our monitoring caught the issue quickly, but it took some time to discover its source and extent.
At 11:20pm Pacific on Oct 11, monitoring discovered that some servers handling SMS were not responding from outside the Tropo network. As engineers were attempting to determine the cause, alerts on other servers started appearing, with the majority starting between 11:45pm and 11:50pm Pacific. All servers hosting api.tropo.com were affected, so requests to our HTTP APIs, such as session control and number provisioning, failed. The Tropo web site was up, but because it was unable to reach internal services, account management did not function. Our overnight support staff attempted to update status.tropo.com and found that the login credentials did not work.
At 12:39am the cause was confirmed to be an error related to network maintenance by our Orlando datacenter provider that caused a network segment to lose access to the internet. While we have the ability to migrate all services to alternate data centers when this sort of work occurs, the provider had failed to notify us in advance of this work. We began datacenter failover efforts, but before we could implement them the hosting provider informed us that all issues were resolved. All services were tested and confirmed to be working by 2:25am Pacific October 12.
In response to this outage, we are working with data center management to understand why we did not receive notice of major maintenance, and we are reviewing their procedures for maintenance that has the potential to cause partial or full outages. We are looking at ways to improve our ability to transfer affected services to different locations. We already had plans to make certain services more geo-redundant with automatic failover, and we’ll be implementing those soon. Finally, we’re looking into why we were unable to post to the status page. We host this offsite on Tumblr so that a catastrophic failure of all of our systems will not affect the status page, but in this case it appears that Tumblr may have had an issue at the same time Tropo did.
Our commitment to you is to provide the highest performing, most reliable means for your critical business communications. We take the continued uptime and performance of Tropo very seriously. If you have further concerns about these issues, please talk to Tropo Support. We’re here to help.