It’s become a tradition in the SQL blogs to start a month-long series on a certain topic and try to blog every day. Today I want to start my first series, on SSD technologies. It has been over a year since I started speaking on this topic, and because it is a hot topic there are new technologies that I feel are not well explained even on specialty blogs.
I will keep this as a master post and will add links to each of the posts as I publish them.
One of the main drawbacks of SSDs has been reliability. Every NAND cell has a prescribed number of Program/Erase (P/E) cycles, and once data is written to disk, chances are it will remain unchanged for weeks or months. That means the cells storing that data keep the same wear level (used P/E cycles) for as long as the data is unchanged. This becomes a problem because the remaining free cells absorb all the new writes, so they are taxed even more and could reach their end of life early, making the entire drive read-only or even failing it completely.
I discovered this technology while I was trying to explain the degraded performance on my new OCZ Vertex 3 SSD. I ran a bunch of tests using SQLIO, based on Jonathan Kehayias’s (Blog|Twitter) post about Parsing SQLIO Output to Excel Charts using Regex in PowerShell, with a 6GB file and I got some good results. I started using the drive and installed a few VMs until 50% of the drive was full. At that point I kept running SQLIO and CrystalDiskMark tests, only to see the performance sinking more and more.
Little did I know that the OCZ Vertex 3, which is based on the SandForce 2281 chipset, implements an intelligent Static Data Rotation algorithm as part of DuraClass (SandForce’s set of technologies to increase the reliability of the drive). This means that during idle periods the SSD controller actively moves static data out of lightly worn cells and into cells that have been used more intensively, so that the lightly worn cells become available for new writes and the drive’s wear leveling can work at its best. But what happens when you stress test the disk and write about 3 times the size of the drive worth of data in a couple of hours while half of the drive is full? The SandForce DuraClass algorithm will kick in and start moving data around even when the drive is not idle, and the user will see a decrease in performance until the wear level is stabilized.
Essentially, Static Data Rotation is there to make sure that you can use the drive for the MTTF prescribed by the manufacturer and to prevent premature wear on the cells that store hot data.
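To make the idea concrete, here is a minimal Python sketch of static data rotation. It is not SandForce’s actual algorithm, which is not public. The `Block` class, the `rotate_static_data` function and the `threshold` parameter are hypothetical names invented for illustration, and the sketch assumes the usual direction for this kind of rotation: static data leaves a lightly worn block and lands on a heavily worn free block, so the lightly worn block can take new writes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    pe_count: int = 0            # P/E cycles used so far (counted at erase time)
    data: Optional[str] = None   # None means the block is free
    is_static: bool = False      # True if the data has not changed for a long time


def rotate_static_data(blocks: List[Block], threshold: int) -> bool:
    """Move static data from the least-worn block to the most-worn free block.

    `threshold` is a hypothetical tuning knob: the minimum wear gap that
    justifies the extra copy. Returns True if a rotation happened.
    """
    static = [b for b in blocks if b.is_static and b.data is not None]
    free = [b for b in blocks if b.data is None]
    if not static or not free:
        return False
    cold = min(static, key=lambda b: b.pe_count)   # barely worn, holds static data
    worn = max(free, key=lambda b: b.pe_count)     # heavily worn, currently free
    if worn.pe_count - cold.pe_count < threshold:
        return False
    # Copy the static data onto the heavily worn block...
    worn.data, worn.is_static = cold.data, True
    # ...then erase the lightly worn block so it can absorb future writes.
    cold.data, cold.is_static = None, False
    cold.pe_count += 1
    return True


blocks = [Block(pe_count=5, data="os image", is_static=True),
          Block(pe_count=900),
          Block(pe_count=400)]
moved = rotate_static_data(blocks, threshold=100)
print(moved, blocks[1].data, blocks[0].data)   # True os image None
```

The extra copy is exactly the background work that shows up as lower benchmark numbers when it has to run while the drive is busy.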
UPDATE: Nitin Salgar (b|t) has asked a very good question on Twitter after reading my post:
“Is Static Data Rotation in SSD a common phenomenon across all manufacturers?”
The answer is no. This is one of the strong selling points for the newer SandForce SSD controllers that implement DuraClass. Newer Intel controllers have this technology as well, but older ones do not. I would like to think that any enterprise-class controller has its own implementation of a Static Data Rotation algorithm.
It’s time to dissect the two main types of flash chips in order to understand why not all SSDs are created equal. What, after all, is the physical difference between SLC and MLC?
SLC stands for Single-Level Cell and, just like the name suggests, can store one bit per NAND gate, hence an SLC cell has two states:
0 or 1, based on the charge of the NAND gate.
MLC on the other hand stands for Multi-Level Cell and uses multiple voltage threshold levels in order to store 2 or even 3 bits (the 3-bit variant is also called TLC, Triple-Level Cell) in the same NAND gate. This is done by coding 4 or even 8 states (in the case of 3-bit TLC) on the same gate, so a 2-bit MLC cell will typically be in one of the following states:
11, 10, 01 or 00. The benefit over SLC is the increased capacity per chip (2 or 3 times more), but at the same time the voltage reference levels are a lot tighter, which leads to more rapid degradation of the cell after a lot of P/E (Program/Erase) cycles. Once the MLC NAND gate has degraded, the reads are no longer predictable because the stored value overlaps reference levels. In this case the memory will report an error or, if the controller supports it, it will retire the cell and replace it with one from the reserve capacity.
The typical number of write cycles is pretty solid at around 100K for SLC and floats around 10K for MLC (different dies can have very different quality and will wear differently). For MLC this number is still high enough for a consumer lifecycle: if the entire memory is programmed 5 times daily it lasts about 5 years, and up to 50 years for SLC under the same usage.
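The lifetime figures above are simple arithmetic, and a few lines of Python make them easy to replay with other numbers. This is only a back-of-the-envelope sketch: the function name `endurance_years` is made up for illustration, and it assumes perfectly even wear with no write amplification, which a real drive will not achieve.

```python
def endurance_years(pe_cycles: int, full_writes_per_day: float) -> float:
    """Years until the rated P/E cycles are used up, assuming perfectly even wear."""
    return pe_cycles / full_writes_per_day / 365


mlc_years = endurance_years(10_000, 5)     # MLC at roughly 10K cycles
slc_years = endurance_years(100_000, 5)    # SLC at roughly 100K cycles
print(f"MLC: {mlc_years:.1f} years, SLC: {slc_years:.1f} years")
# MLC: 5.5 years, SLC: 54.8 years
```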
Table columns: Type of flash cell | Typical capacity/chip | Endurance (P/E cycles) | Performance over time. Surviving entry: thumb drives, camera cards.
In the case of MLC the program cycle takes 2 or 3 times longer than for SLC, since the programming signal has to be a lot more precise to code 4 states in the space of 2. This leads to higher speed and an increased number of IOPS (I/O Operations Per Second) for SLC memory compared to MLC.
One of the limitations of flash memory is that while it can be read or programmed a byte or a word at a time in a random access fashion just like regular RAM, it can only be erased a “block” at a time. This will set all bits in the block to 1 which is the default state for NAND memory.
Writing a byte in flash memory involves 2 steps: Program and Erase (P/E). The new data is programmed into a free location and the old block then needs to be erased.
The programming can be done at cell level (setting it to the “0” state) via a process called tunneling, in which the floating gate is flooded with high voltage using the on-chip charge pumps.
Erasing can be done only on an entire block (resetting it to the “1” state), through a high negative voltage pulling the electrons off the floating gate via a process called quantum tunneling. Flash memory is divided into erase segments (often called blocks or sectors).
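The asymmetry between programming and erasing is easier to see in a toy model. The sketch below is an illustration only: `FlashBlock` is a hypothetical class, each "page" is a single byte, and real NAND geometry is far larger. It shows the two rules from the text: programming can only turn 1 bits into 0 bits, and only a whole-block erase brings them back to 1.

```python
class FlashBlock:
    """Toy model of one erase block; each page is a single byte here."""

    def __init__(self, pages: int = 4) -> None:
        self.pages = [0xFF] * pages   # erased state: every bit is 1
        self.erase_count = 0

    def program(self, page: int, value: int) -> None:
        # Programming can only clear bits (1 -> 0), so AND with the old content.
        self.pages[page] &= value & 0xFF

    def erase(self) -> None:
        # Erase works on the whole block and resets every bit to 1.
        self.pages = [0xFF] * len(self.pages)
        self.erase_count += 1


block = FlashBlock()
block.program(0, 0b1010_1010)
block.program(0, 0b0000_1111)      # overwrite without erase: bits can only drop
print(f"{block.pages[0]:08b}")     # 00001010
block.erase()
print(f"{block.pages[0]:08b}")     # 11111111
```

The second `program` call shows why an in-place overwrite is not possible: the result is the AND of the old and new values, not the new value.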
RAISE is a technology developed by SandForce that stands for Redundant Array of Independent Silicon Elements. It is based on RAID (Redundant Array of Independent Disks) technology and is used to protect against write errors. SandForce controllers that implement this technology work in a way that is very similar to RAID 5. Every chip contains a number of dies (typically 8 or 16). Each die is the equivalent of an HDD in a RAID 5 array and data is spread across multiple dies. To recover from a failure in a sector, page or entire block, the missing data is calculated from parity and the write is performed again in the same block. For more information read the article on the SandForce website.
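The parity idea borrowed from RAID 5 can be sketched in a few lines of Python. This is not the RAISE implementation, whose on-flash layout is not described here; the helper `xor_parity` and the three-dies-plus-parity layout are assumptions made for illustration. The point is only that XOR parity lets any single missing chunk be rebuilt from the others.

```python
from functools import reduce
from typing import List


def xor_parity(chunks: List[bytes]) -> bytes:
    """XOR equal-length chunks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# Data striped across three hypothetical dies, with parity stored separately.
dies = [b"SQL ", b"on a", b" SSD"]
parity = xor_parity(dies)

# The middle die returns an unreadable page: rebuild it from the survivors plus parity.
rebuilt = xor_parity([dies[0], dies[2], parity])
print(rebuilt)   # b'on a'
```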
You can use CrystalDiskInfo to check the number of errors that RAISE recovered from.
Wear leveling is a technology used in Solid State Drive controllers to prolong the service life of flash memory. As mentioned in the 2nd post of this blog series, What’s the difference between SLC and MLC?, flash memories have limited endurance, measured by the number of P/E cycles that the memory can perform before becoming degraded. Wear leveling ensures that all cells get the same number of P/E cycles (even wear) so that you do not have just a few cells on the drive receiving the majority of the writes and wearing out early. Uneven wear might cause the drive to fail way ahead of the prescribed service life, while most of the memory on it is still usable.
Memory wear-out concerns are unique to flash-based memory. Hard disks store data by magnetizing a thin film of ferromagnetic material on a disk. DRAM is volatile memory (the memory stores data only while it is powered on). Flash memory stores data inside the NAND cell via a process called tunneling, in which the floating gate is flooded with high voltage. This leaves a charge in the NAND cell and that charge can be read over and over. Because of this invasive writing method and the corresponding erasing method, the flash cells degrade over time.
The wear leveling algorithm basically stores the P/E count of each cell and writes the next block to the “least used available cell”, so that cells that were intensively used are put at the end of the queue until all cells have an even wear.
One caveat is that new disks will have much better performance than the ones that were used intensively, because all cells are good candidates for writing. Once the disk has been used, there is a performance degradation due to the need to erase the cell selected by wear leveling. So good advice would be not to test a brand new drive, but to first write 2-3 times the capacity of the drive before starting tests (e.g. a 240GB SSD should have 500-750GB of lifetime writes) in order to simulate a real production scenario instead of just benchmarking a brand new disk. One other thing that affects this performance is Static Data Rotation, which is discussed in the first post of the series. Lifetime writes can be queried using CrystalDiskInfo and other free tools.
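The “least used available cell” rule maps naturally onto a priority queue. The following Python sketch is an assumption-laden illustration, not any vendor’s firmware: the `WearLeveler` class and its `allocate` and `release` methods are hypothetical, and it tracks wear per block because that is the unit that gets erased.

```python
import heapq
from typing import List, Tuple


class WearLeveler:
    """Pick the free block with the lowest P/E count for each new write."""

    def __init__(self, block_count: int) -> None:
        # Heap of (pe_count, block_id); the least-worn free block is on top.
        self.free: List[Tuple[int, int]] = [(0, b) for b in range(block_count)]
        heapq.heapify(self.free)

    def allocate(self) -> Tuple[int, int]:
        """Return (pe_count, block_id) of the least-worn free block."""
        return heapq.heappop(self.free)

    def release(self, pe_count: int, block_id: int) -> None:
        """Erase a block and return it to the free pool with one more cycle."""
        heapq.heappush(self.free, (pe_count + 1, block_id))


leveler = WearLeveler(block_count=4)
for _ in range(10):                 # rewrite the same logical data ten times
    pe, block_id = leveler.allocate()
    leveler.release(pe, block_id)

print(sorted(pe for pe, _ in leveler.free))   # [2, 2, 3, 3]
```

Even though the same logical data was rewritten ten times, no block ends up more than one cycle ahead of any other.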
CacheCade is a technology developed by LSI for its MegaRAID storage controllers. CacheCade software allows you to mix inexpensive SATA or even SAS hard disk drives with up to 512GB of solid state storage capacity distributed over a few SSDs, to provide a substantial performance boost without adding additional SATA HDDs or moving to an all-SSD RAID volume to achieve performance requirements.
This combination of HDDs with SSDs as secondary cache is especially well suited to random-read-intensive applications, where hot data can be moved to SSD storage in order to take advantage of the low latency and high IOPS characteristics of SSD at a reasonable price.
This technology is available on LSI MegaRAID 9260 and 9280 controller series as well as on re-badged RAID controllers like Dell PERC H700 and H800 with 1GB Cache.
While CacheCade version 1.0 offers only read cache (the only one supported by Dell), CacheCade 2.0 offers read and write caching for impressive results. This technology requires an inexpensive hardware license ($300).
In previous posts I talked about the wear leveling and RAISE algorithms implemented by SSD controllers. One of the inevitable issues with solid state memory is how to gracefully deal with bad blocks.
Bad blocks are flash memory blocks that contain one or more invalid bits whose reliability is not guaranteed because of faulty dies, over-charge leaks or wear-out. Bad blocks may exist even on a new disk.
Bad Block Management or Intelligent Block Management is an algorithm that monitors and maintains bad blocks within the NAND device. The controller maintains tables of known bad blocks and can replace new bad blocks with spare blocks that are reserved for this use. Typically 4-10% of the usable capacity is reserved for Bad Block Management. This practice further enhances the overall SSD lifespan and ensures that a few bad blocks will not affect the integrity of the drive, so the SSD device can still operate.
Bad blocks are mapped as “do not use” and substituted with known good blocks from the percentage set aside for spare blocks. The percentage of bad blocks that can be accommodated is a product marketing decision, and that is the reason why a 128GB SSD device will only present 120GB as available to use. The spare blocks come from over-provisioning inside the SSD, using capacity that is invisible to the user.
If the number of bad blocks exceeds the remaining spare blocks for any reason, the SSD fails, because the controller can no longer substitute a good block for a newly failed one, and this can result in data loss.
The measure for Bad Block Management is “Retired Block Count”, and you should expect that count to go up as “SSD Life Left” nears 5-10%.
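The bookkeeping behind bad block substitution can be sketched as a small remap table in Python. Everything here is hypothetical: the `BadBlockManager` class, its method names and the block numbers are made up for illustration, and a real controller keeps this table in flash and handles many more cases.

```python
from typing import Dict, List


class BadBlockManager:
    """Remap retired blocks to spares until the reserve runs out."""

    def __init__(self, spare_blocks: List[int]) -> None:
        self.spares = list(spare_blocks)     # over-provisioned, hidden from the user
        self.remap: Dict[int, int] = {}      # bad block -> replacement block

    def retire(self, block_id: int) -> int:
        """Mark a block "do not use" and return its replacement."""
        if not self.spares:
            raise RuntimeError("no spare blocks left")
        replacement = self.spares.pop(0)
        self.remap[block_id] = replacement
        return replacement

    def resolve(self, block_id: int) -> int:
        """Translate a block address, following remaps (a spare can fail too)."""
        while block_id in self.remap:
            block_id = self.remap[block_id]
        return block_id

    @property
    def retired_block_count(self) -> int:
        return len(self.remap)


mgr = BadBlockManager(spare_blocks=[1000, 1001])
mgr.retire(7)        # block 7 wears out and is replaced by spare 1000
mgr.retire(1000)     # the spare itself fails later and is replaced by 1001
print(mgr.resolve(7), mgr.retired_block_count)   # 1001 2
```

A third call to `retire` would raise `RuntimeError`, which stands in for the point where the drive has no spare blocks left and can no longer operate safely.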
In the 2nd post of this series we explored the differences between SLC and MLC and saw that the main issue with MLC is endurance, which in the past prevented its use for enterprise applications. Because the increased capacity of MLC is a good fit for enterprise use, flash memory manufacturers looked for a way to increase the endurance of MLC memories. When analyzing why MLC cells fail sooner than SLC, the main culprit was the tight reference voltage: after a number of flash write cycles, the actual charge left in the cell overlaps the reference level, leading to an incorrect value being read from that cell. When that happens a few times the cell is marked as bad.
The solution was to make the programming cycles more precise in order to widen the interval around the reference levels and allow more room for error when the memory cell degrades. Also, the silicon dies are tested and only the ones with better endurance characteristics are selected for enterprise use. This flash memory has been marketed as eMLC or MLC-HET (high endurance).
This memory has improved endurance over consumer-grade MLC, with one downside: programming time increases in order to allow for more precise reference levels.
The average number of write cycles for this type of memory is between 10K and 30K.
This post comes to you from the shores of Lake Michigan in Muskegon where we are spending the weekend at a cottage in the company of good friends. In order to continue this SSD saga I found myself forced to write this using the WordPress iPhone app much like the character played by Robin Williams in the movie RV. Please excuse any spelling errors that you might find.