POSTED IN .NET, C#, DEVELOPMENT, OPENXML, XML
One of the shortcomings of Microsoft’s OpenXml SDK is its support for large files. The specification for OpenXml itself allows for a spreadsheet with 16,384 columns by 1,048,576 rows. If all of these elements were to be stored in memory, even on high end systems, an out of memory exception would occur.
Recently, I was asked to write a system that would handle writing these types of file. OpenXml was my first choice in libraries, but how to get it to handle this much data without storing it in memory. In this post I hope to demonstrate how to use an OpenXmlReader and an OpenXmlWriter to get the job done with a low memory footprint. The code that I will demostrate will accept a SQL query and produce an Excel document of that data with only the OpenXml specs as a limitation (actually the file will be produced with any number of records, but Excel will not read the file if the limits are exceeded).
To get started we need a template document. Since I don’t want my application to be dependent on a file existing, I am going to write my template file as needed. This code is going to use the regular DOM elements for OpenXML, in other words it is type save and will be created in memory.
public static void CreateSpreadsheetWorkbook(string filepath) { // Create a spreadsheet document by supplying the filepath. // By default, AutoSave = true, Editable = true, and Type = xlsx. SpreadsheetDocument spreadsheetDocument = spreadsheetDocument.Create(filepath, SpreadsheetDocumentType.Workbook); // Add a WorkbookPart to the document. WorkbookPart workbookpart = spreadsheetDocument.AddWorkbookPart(); workbookpart.Workbook = new Workbook(); //This section may be needed at a later time, if the users want to style //The XLSX file that is created by the application //Dim workbookStylesPart As WorkbookStylesPart = workbookpart.AddNewPart(Of WorkbookStylesPart)() //workbookStylesPart.Stylesheet = CreateStylesheet() //workbookStylesPart.Stylesheet.Save() // Add a WorksheetPart to the WorkbookPart. WorksheetPart worksheetPart = workbookpart.AddNewPart<WorksheetPart>(); worksheetPart.Worksheet = new Worksheet(new SheetData()); // Add Sheets to the Workbook. Sheets sheets = spreadsheetDocument.WorkbookPart.Workbook.AppendChild<Sheets>(new Sheets()); // Append a new worksheet and associate it with the workbook. Sheet sheet = new Sheet(); sheet.Id = spreadsheetDocument.WorkbookPart.GetIdOfPart(worksheetPart); sheet.SheetId = 1; sheet.Name = "ExportData"; sheets.Append(sheet); workbookpart.Workbook.Save(); // Close the document. spreadsheetDocument.Close(); }
To create this template, we need to create a SpreadsheetDocument class; add a workbookpart to it; add a worksheet part to that; add a sheets collection to that that; finally add an empty sheet; save; and close. This approach will ensure that we always have a correctly formatted document to start with.
After our template has been created we can create our method that generates a document based upon a database command. Because of the nature of the request from my end user, the example that I am going to show connects to a Teradata server to get its data. This could be replaced with any connection that you desire, or it could be extended to support any connection and/or query. As always, I am going to throw the code into the wild then go back and explain it.
public static bool GetDataResults(string ConnectionString, string SomeId, string filePath, string templatePath) { if ((!File.Exists(templatePath))) { //If we do not have a template file, we need to create one CreateSpreadsheetWorkbook(templatePath); } //Copy the template file to the output file location File.Copy(templatePath, filePath, true); //Open the copied file using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filePath, true)) { //Navigate to the workbook part WorkbookPart workbookPart = myDoc.WorkbookPart; //open the first worksheet WorksheetPart worksheetPart = workbookPart.WorksheetParts.First(); //Get the id of this sheet. We need this because we are going to add a new //worksheet to this workbook, then we are going to delete this worksheet //This ID will tell us which one to delete string origninalSheetId = workbookPart.GetIdOfPart(worksheetPart); //Add the new worksheet WorksheetPart replacementPart = workbookPart.AddNewPart<WorksheetPart>(); //This is the ID of the new worksheet string replacementPartId = workbookPart.GetIdOfPart(replacementPart); //We are going to read from the original worksheet to get the //templated items that we added to the worksheet in the traditional way OpenXmlReader reader = OpenXmlReader.Create(worksheetPart); //We are goint to copy the items from the original worksheet by using //an XML writer, this overcomes the memory limitations of having //an extremely large dataset. OpenXmlWriter writer = OpenXmlWriter.Create(replacementPart); //The template does not have any data so we will be creating new rows and cells //Then writing them using XML. Row r = new Row(); Cell c = new Cell(); while ((reader.Read())) { //This iterates through the sheet data and copies it if ((object.ReferenceEquals(reader.ElementType, typeof(SheetData)))) { //Exit the loop if we hit a sheetdata end element if ((reader.IsEndElement)) break; // TODO: might not be correct. Was : Exit While //We create a new sheetdata element (basically this is the reoot container for a sheet) writer.WriteStartElement(new SheetData()); //Create a connection to teradata using (TdConnection con = new TdConnection(ConnectionString)) { //Create a command that runs the macro using (TdCommand cmd = new TdCommand("EXEC Get_Data_Results('" + SomeId + "')", con)) { cmd.Connection = con; cmd.CommandType = CommandType.Text; cmd.CommandTimeout = 600; con.Open(); //We use a data reader because of the memory usage of creating a data set //from an extremely large set of data. TdDataReader dr = cmd.ExecuteReader(CommandBehavior.CloseConnection); int rowIndex = 1; r.RowIndex = rowIndex; //Start the first row. writer.WriteStartElement(r); //Iterate through the columns to write a row conaining the column names for (int columnIndex = 0; columnIndex <= dr.FieldCount - 1; columnIndex++) { c = CreateCell(columnIndex, dr.GetName(columnIndex), rowIndex); writer.WriteElement(c); } //End first row writer.WriteEndElement(); //Subsequent rows contain the data (the first row is just for column names) while ((dr.Read())) { rowIndex = rowIndex + 1; r.RowIndex = rowIndex; //write the start of the row writer.WriteStartElement(r); //iterate through the columns to write their data for (int columnIndex = 0; columnIndex <= dr.FieldCount - 1; columnIndex++) { if ((Information.IsDBNull(dr.Item(columnIndex)))) { //If we have null data, we write the empty string c = CreateCell(columnIndex, string.Empty, rowIndex); } else { //Otherwise we write the data c = CreateCell(columnIndex, dr.GetString(columnIndex), rowIndex); } writer.WriteElement(c); } //write the end of the row writer.WriteEndElement(); } dr.Close(); } } writer.WriteEndElement(); } else if ((reader.IsStartElement)) { //Start elements are directly copied writer.WriteStartElement(reader); } else if ((reader.IsEndElement)) { //End elements are directly copied writer.WriteEndElement(); } } //Close the reader and writer reader.Close(); writer.Close(); //Get the newly created sheet (same id as the old sheet, but it is the first one) Sheet sheet = workbookPart.Workbook.Descendants<Sheet>().Where(s => s.Id.Value.Equals(origninalSheetId)).First(); //Assign it the new sheet id sheet.Id.Value = replacementPartId; //remove the old sheet workbookPart.DeletePart(worksheetPart); //Done myDoc.Close(); } }
After creating a copy of the template generated earlier, this method opens the copy (SpreadsheetDocument), adds a new WorksheetPart (where we store the date), pulls the original WorksheetPart as a reader (where we read the structure). It keeps references to all of these items because in the end it will replace the the original item with the new item. Using the OpenXmlReader it begins to read the original items structure and writing this structure to the new item.
} else if ((reader.IsStartElement)) { //Start elements are directly copied writer.WriteStartElement(reader); } else if ((reader.IsEndElement)) { //End elements are directly copied writer.WriteEndElement(); }
If however the element read is of type SheetData and we have not reached the end of the document, then we can start doing our work.
while ((reader.Read())) { //This iterates through the sheet data and copies it if ((object.ReferenceEquals(reader.ElementType, typeof(SheetData)))) { //Exit the loop if we hit a sheetdata end element if ((reader.IsEndElement)) break; //We create a new sheetdata element (basically this is the reoot container for a sheet) writer.WriteStartElement(new SheetData());
We begin by connecting to our database, if you want to have a header row with names, you need to first iterate over the columns in the returned set and write them as columns in a new row.
r.RowIndex = rowIndex; //Start the first row. writer.WriteStartElement(r); //Iterate through the columns to write a row conaining the column namesfor (int columnIndex = 0; columnIndex <= dr.FieldCount - 1; columnIndex++) { c = CreateCell(columnIndex, dr.GetName(columnIndex), rowIndex); writer.WriteElement(c); } //End first row writer.WriteEndElement();
Next we can iterate over the datareader that we open from the query executed. This allows us to write cells using a writer from a datareader (we are also still have the OpenXmlReader open, but we are waiting for our data to finish to complete its task of adding the rest of the document elements).
//Subsequent rows contain the data (the first row is just for column names) while ((dr.Read())) { rowIndex = rowIndex + 1; r.RowIndex = rowIndex; //write the start of the row writer.WriteStartElement(r); //iterate through the columns to write their data for (int columnIndex = 0; columnIndex <= dr.FieldCount - 1; columnIndex++) { if ((Information.IsDBNull(dr.Item(columnIndex)))) { //If we have null data, we write the empty string c = CreateCell(columnIndex, string.Empty, rowIndex); } else { //Otherwise we write the data c = CreateCell(columnIndex, dr.GetString(columnIndex), rowIndex); } writer.WriteElement(c); } //write the end of the row writer.WriteEndElement(); }
Once we have finished writing all of our data, we need to close everything down (making sure to write Xml closing tags for all open tags) and replace the old workbookPart with the new one.
//Close the reader and writer reader.Close(); writer.Close(); //Get the newly created sheet (same id as the old sheet, but it is the first one) Sheet sheet = workbookPart.Workbook.Descendants<Sheet>().Where(s => s.Id.Value.Equals(origninalSheetId)).First(); //Assign it the new sheet id sheet.Id.Value = replacementPartId; //remove the old sheet workbookPart.DeletePart(worksheetPart); //Done myDoc.Close();
Here are some helper methods that were used in the code above.
private static string GetExcelColumnName(int columnNumber) { //This gets the column name (ie. A1, B4, AA21, etc...) int dividend = columnNumber; string columnName = String.Empty; int modulo = 0; while (dividend >= 0) { //twenty-six letters in the alphabet modulo = (dividend) % 26; //Append letters until we have accounted for all digits columnName = Convert.ToChar(65 + modulo).ToString() + columnName; //Get the remainder dividend = Convert.ToInt32((dividend - modulo) / 26) - 1; } return columnName; }
private static Cell CreateCell(int columnNumber, string value, int index) { Cell formattedCell = new Cell(); formattedCell.DataType = CellValues.InlineString; formattedCell.CellReference = GetExcelColumnName(columnNumber) + index; //formattedCell.StyleIndex = 4 InlineString inlineString = new InlineString(); Text t = new Text(); t.Text = value; inlineString.AppendChild(t); formattedCell.AppendChild(inlineString); //End If return formattedCell; }