Design document for lossless compressing/decompressing data in Lua tables in /data
subpages in the Module namespace. Discuss here: WARFRAME Wiki talk:Data Compression.
- Last updated: Thu, 04 May 2023 21:41:45 +0000 (UTC) by User:Cephalon Scientia
Rationale[]
- Lua tables in Module namespace are not designed for storing large amounts of data. Lossless data compression may be needed to reduce the memory footprint of data that is stored on the wiki and to improve performance when accessing data.
- All articles have a maximum limit of 2,048 kibibyte (2,097,152 bytes) as defined by MediaWiki's $wgMaxArticleSize.
- Ideally, we should store this data in an actual database system (e.g. MongoDB) for performance reasons, but wiki editors will not be able to freely edit these like they would when editing articles for security reasons. An alternative solution would be to utilize a MediaWiki extension that interfaces with database backends like Semantic MediaWiki, but we are at the behest of the Fandom platform to implement such as feature.
- If we really have to, the only solution is to migrate the whole wiki to another platform with more control over the wiki backend just so we can store more data and have better performance when accessing and manipulating that data. This comes with its own set of issues that need to be resolved before a proper migration can be made.
- We can partition databases into smaller subpages like with Module:Weapons/data, but in the far future, this will suffer from the same article size limitation regardless of how one horizontally partitions the database contents. Data compression can help alleviate this issue, but even then you can only compress so much using human-readable UTF-8 characters and non-reserved Lua tokens.
Encoding Data Draft[]
The simplest compression algorithm would involve encoding keys to shorter names.
- Compression would involve:
- Translating unique symbols to a smaller text equivalent (aka encoding using human-readable UTF-8 characters and non-reserved Lua tokens).
- So only letter ASCII characters and underscore for encoding keys since they do not require additional syntax to be recognized as keys (i.e. no need for
["A"]="content"
, justA="content"
).- That means we can have up to 53 unique single letter keys (26 uppercase ASCII letters, 26 lowercase ASCII letters, and an underscore).
\65 to \90
(uppercase letters),\95
(underscore),\97 to \122
(lowercase letters)
- That means we can have up to 53 unique single letter keys (26 uppercase ASCII letters, 26 lowercase ASCII letters, and an underscore).
- We can also create 2,809 unique two-letter keys (two bytes each)
- We can use escape sequence
\ddd
within[""]
(between 6 to 8 bytes each) to use full ASCII and Extended ASCII character set (256 unique keys). - We can use emojis as key names within
[""]
(8 bytes each, single emoji takes up 4 bytes) for a maximum of 4,096 unique keys. As of September 2021, there are 3,633 emojis in Unicode Standard.- No simple way to programmatically denote emojis in Lua (e.g. via code points like U+2139 for ℹ or U+231B for ⌛)
- Extremely difficult to edit any source code with emojis due to the online editor not taking into account the spacing it takes up.
- So only letter ASCII characters and underscore for encoding keys since they do not require additional syntax to be recognized as keys (i.e. no need for
- Removing any whitespace that does not contribute to data (newlines, tabs, spaces).
- Creating a dictionary/map that maps encoded keys to their canonical name (e.g.
A="Name"
).
- Translating unique symbols to a smaller text equivalent (aka encoding using human-readable UTF-8 characters and non-reserved Lua tokens).
- No data decompression. Instead update how keys are accessed by using included dictionary to index canonical key names to their encoded ones.
- If possible, update
__index
metamethod to data to change how keys are indexed. That way we do not have to update every module's code, but the issue is that we have to still replacemw.loadData()
withrequire()
to import tables with metatables.
- If possible, update
- Uncompressed data will still be stored on the wiki in separate subpages for editors to edit; compressed data page should be a replacement to the original database but data access would be different because of encoded keys.
Can take a step further and encode duplicated values such as booleans and string "enums".
- Encoding can be as simple as storing a local variable that contains the value.
- (Global) variable declaration takes up at least 5 bytes before we go into two-letter variable names or storing strings (e.g.
a="";
).- We can declare multiple variables in one line, so at least 3 bytes for each declaration (variable name, two commas) (e.g.
a,b,c,d,e,f=true,false,"Common","Uncommon","Rare","Legendary"
).
- We can declare multiple variables in one line, so at least 3 bytes for each declaration (variable name, two commas) (e.g.
- Lua program will still use same memory since primitives are not stored like references, but the source code memory footprint will be smaller.
- (Global) variable declaration takes up at least 5 bytes before we go into two-letter variable names or storing strings (e.g.
- Booleans, item rarities, polarities, and mod type can be treated as enums.
- Booleans can be represented as 1 and 0, but Lua does not evaluate 0 as false unlike C++ or Python. Have to keep
true
andfalse
whether or not we encode booleans.
Example[]
Uncompressed (853 bytes):
return {
["Combat Reload"] = {
BaseDrain = 4,
Description = "If 5 pellets are head shots, increase reload speed by +120% for 3s.",
Image = "CombatReloadMod.png",
InternalName = "",
Introduced = "31.2",
CodexSecret = false,
IsWeaponAugment = true,
Link = "Combat Reload",
MaxRank = 5,
Name = "Combat Reload",
Polarity = "Naramon",
Rarity = "Rare",
Tradable = true,
Transmutable = false,
Type = "Tigris"
},
["Range Advantage"] = {
BaseDrain = 4,
Description = "+300% damage if no enemies are within 10m.",
Image = "RangeAdvantageMod.png",
InternalName = "",
Introduced = "31.2",
CodexSecret = false,
IsWeaponAugment = true,
Link = "Range Advantage",
MaxRank = 5,
Name = "Range Advantage",
Polarity = "Madurai",
Rarity = "Rare",
Tradable = true,
Transmutable = false,
Type = "Akjagara"
},
}
Only keys compressed w/ dictionary (679 bytes, 1.26:1 compression ratio w/ dictionary and 1.84:1 w/o; will get a better ratio with more data with repeated content):
return{["Combat Reload"]={A=4,B="If 5 pellets are head shots, increase reload speed by +120% for 3s.",C="CombatReloadMod.png",D="",E="31.2",F=false,G=true,H="Combat Reload",I=5,J="Combat Reload",K="Naramon",L="Rare",M=true,N=false,O="Tigris"},["Range Advantage"]={A=4,B="+300% damage if no enemies are within 10m.",C="RangeAdvantageMod.png",D="",E="31.2",F=false,G=true,H="Range Advantage",I=5,J="Range Advantage",K="Madurai",L="Rare",M=true,N=false,O="Akjagara"},_Dictionary={BaseDrain="A",Description="B",Image="C",InternalName="D",Introduced="E",CodexSecret="F",IsWeaponAugment="G",Link="H",MaxRank="I",Name="J",Polarity="K",Rarity="L",Tradable="M",Transmutable="N",Type="O"}}
LZW and Huffman Encoding Draft[]
We can serialize Lua tables into a string and apply LZW lossless compression and Huffman encoding to reduce file sizes.
- Issue is that we have to decompress and deserialize the data before being able to access it which adds onto CPU time.
- If we want "readable" compressed data (in order to be saved in Module namespace), we have to print out unprintable characters by escaping them through character codes (e.g.
\255
). Debug console does not print control characters (\000
through\031
), delete character (\127
), and extended ASCII (\128
to\255
).- Alternatively, encode compressed data in base64 so content is printable and savable on wiki.
Uncompressed data > LZW/Huffman compression > base64 encoding > compressed data
Compressed data > base64 decoding > LZW/Huffman decompression > uncompressed data
- Alternatively, encode compressed data in base64 so content is printable and savable on wiki.
- Compression algorithm can be implemented in Lua, JavaScript, or any other language (former two recommended so script can be hosted on the wiki in Module or MediaWiki namespace). Client will have to manually run the script and persist the output to the appropriate database page.
- If you want to automate the process, then you will have to have a bot or cron job watch changes to raw data, fetch data using MediaWiki's Action API, compress that data, encode the data to base64, and persist changes to the appropriate database page.
- May need access to private watchlist keys for this to work. Otherwise, you have to scrape the site which is not recommended.
- Can also create a webhook trigger that calls a cloud function whenever someone submits an edit to a database page. Cloud function would run a microservice to perform these series of actions to automate data compression. This trigger will likely be custom JS in the MediaWiki namespace and needs wiki admin and Fandom approval for usage.
- Requires third-party cloud hosting service and long-term maintenance. If the person maintaining the service is gone and does not pass the knowledge/authorization to future wiki admins/editors then the wiki loses this feature. Requires someone who is very familiar with cloud services which very few wiki editors have this area of expertise (since it is in a professional field).
- If you want to automate the process, then you will have to have a bot or cron job watch changes to raw data, fetch data using MediaWiki's Action API, compress that data, encode the data to base64, and persist changes to the appropriate database page.
- Decompression algorithm has to be implemented in Lua because that is the only way Lua modules can read compressed data. More focus on fast performance to lessen allocated CPU time usage.
Removing "Repeated" Data / Minimal Schema[]
Some properties of items can be derived from other properties if, and only if, we enforce certain data definitions. This will result in a smaller database schema that will sacrifice longer (marginal) computation time for smaller memory footprint.
For example, Image
and Link
values can be derived from Name
if:
- We standardize image names to be in the form of
<Name><Item>.png
. For example,Name = "Range Advantage"
can be manipulated to"RangeAdvantageMod.png"
.<Name>.png
could also be used, but you may encounter rare scenarios where two things share the same name like Scorch and Scorch (Mod). We can probably implement a fallback system where if a file like"Scorch.png"
doesn't exist, depending on the entry type, we can guess the file name to be"ScorchEnemy.png"
or"ScorchMod.png"
. But this means we have to blacklistFile:Scorch.png
as a page name so vandalism won't happen.
- Create proper disambiguation pages for articles that share the same name if
Name
values directly map toLink
. This will only affect a small number of articles and would just result in readers having to click an additional link to get to page content (bad for user experience, but this is a sacrifice we have to make). An example of an article name conflict would be Scorch the enemy vs Scorch (Mod) the mod.- A worse solution would be to create new redirect pages for all database entries and have
Name
be manipulated to match that redirect page name to get the canonicalLink
. For example,Name = "Range Advantage"
would be manipulated to be"Range Advantage (Mod)"
.
- A worse solution would be to create new redirect pages for all database entries and have
Further Reading[]
- Data compression on Wikipedia