Angband.oook.cz
Angband.oook.cz
AboutVariantsLadderForumCompetitionComicScreenshotsFunniesLinks

Go Back   Angband Forums > Angband > Development

Reply
 
Thread Tools Display Modes
Old March 26, 2016, 04:44   #21
takkaria
Veteran
 
takkaria's Avatar
 
Join Date: Apr 2007
Posts: 1,936
Donated: $40
takkaria is on a distinguished road
Quote:
Originally Posted by AnonymousHero View Post
For the love of all that's holy, don't do that. It's hugely disruptive to anyone who currently has a clone somewhere and you have absolutely no idea how many people do. You should only even be considering rewriting (published) history in extreme scenarios such as: Someone accidentally published the nuclear launch codes (and the codes themselves cannot easily be changed), or there's some kind of legal threat where removing files at the "master" revision is not acceptable to the other party.

Just use a separate repository + the "submodule" functionality of git. That would mean that fetching the sounds is a simple "git submodule init && git submodule update" command away.
Thanks for this, I had a vague intuition it would be a bad idea but it's nice to have it laid out. The thing is that moving the mp3s into a submodule won't remove them when cloning the repository, which means that removing them is sort of pointless. They might as well stay there with a separate download of oggs for Linux users.
__________________
takkaria whispers something about options. -more-
takkaria is offline   Reply With Quote
Old March 26, 2016, 04:52   #22
Nick
Vanilla maintainer
 
Nick's Avatar
 
Join Date: Apr 2007
Location: Canberra, Australia
Age: 54
Posts: 7,835
Donated: $60
Nick will become famous soon enough
Quote:
Originally Posted by AnonymousHero View Post
(Btw, I'm not sure if you're already familiar with it, but the tool is probably named after the BFG 9000 from Doom. That should tell you something about when it's appropriate to use it.)
IIRC that was always, as soon as you acquired it
__________________
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
Nick is offline   Reply With Quote
Old March 26, 2016, 04:53   #23
AnonymousHero
Veteran
 
AnonymousHero's Avatar
 
Join Date: Jun 2007
Posts: 1,367
AnonymousHero is on a distinguished road
Quote:
Originally Posted by calris View Post
The AVL tree code I shamelessly stole from here:

http://piumarta.com/software/tree/tree-1.0/
(EDIT: Nevermind, seems to be a BSD/MITish license, but it really needs to be checked.)

Quote:
Originally Posted by calris View Post
Basically, all operations (insert, find, delete) are O(LogN). Anywhere we currently used linked-list to store large amounts of data (especially when we are constantly adding and removing data) with a single search key can be replaced by a tree - they are much faster.
Firstly, you almost never want to use linked lists. Not even if you don't care about performance. (There are plausible use cases for so-called "intrusive" linked lists, but that's an aside.) EDIT: I realize you didn't advocate for linked lists, just pointing it out for those reading along .

Secondly, "lower algorithmic complexity" does not automatically translate to "faster". (Thus, it's important to not conflate the two!). Constant factors of course also matter a lot, but even more importantly: The tradeoff between RAM latency vs. throughput on modern CPUs means that any kind of pointer-chasing is penalized to an extreme degree. (There are so-called cache-oblivious algorithms/data structures which can alleviate many of these things, but often at a cost of greatly increased implementation complexity.)

Thirdly, while you're right about the algorithmic complexity of (AVL) trees, this is not the end of the story.
Unless you have huge[0] numbers of items and/or truly need the range support of trees, you're almost always
better of with a simple flat array of as-flat-as-possible structs. For searching simply do a linear search[1]. The reason this is almost always better is simple: CPUs these days a) have huge caches, and b) do prefetching like you wouldn't believe. (Those two things mean that any kind of intensive pointer-hopping -- like that required by trees -- incurs a huge latency penalty wrt. accessing RAM.)

I believe the scenario you're describing here is: a) data is only loaded/inserted once, and b) you're looking at <= 10000 entries (so "not huge"), and c) you don't need huge amounts of in-line data for each entry.

I would recommend something like the following:

Code:
   struct entry {
       uint32_t key_hash;   // Some simple hash of the string in "key"
       const char *key;   
       // ...
       uint8_t *sound_data;    // I'm assuming loaded sound data would go here
   }
Load all your data into a simple array of "struct entry". When you need to find something simply hash your lookup key, do a linear search for a matching "key_hash", making sure the real key actually matches (otherwise, skip to next).

I'm basing the use of "key_hash" on the assumption that you want to look things up by name and the fact that you'd need to do pointer-chasing to compare anything against "key" (plus strcmp is linear-time, so you want to avoid it in an inner loop). If that's not the case and you have some simple constant-size key, just use the key directly.

Since the sound data is huge, it doesn't matter that we access that through a pointer indirection. (Heck, you could probably load it from file every time you need it.)

Fourthly, Most non-trivial tree algorithms are much more complicated implementation-wise than a simple linear search. Unless you have a good test harness (e.g. QuickCheck or something like that), you're likely to have bugs in such an implementation.

[0] Obviously what's "huge" varies and you'd really need benchmarks to pinpoint the exact number in this exact case. Plus it may depend on cache sizes, etc. However, a good ball-park estimate might be anything from 10000-1000000 depending on the size of the "struct" you need to store.

[1] Even sorting the array and doing a binary search would likely be slower than linear search simply due to the pointer-chasing that this incurs.

EDIT: I should say... none of this probably matters very much on modern machines for the number of elements we're talking about (and because lookups would be so rare, comparatively speaking), so I'd say the most important thing here is: Do the simplest thing possible (i.e. avoid bugs)... which coincidentally is also an array .

Last edited by AnonymousHero; March 26, 2016 at 06:00.
AnonymousHero is offline   Reply With Quote
Old March 26, 2016, 04:54   #24
AnonymousHero
Veteran
 
AnonymousHero's Avatar
 
Join Date: Jun 2007
Posts: 1,367
AnonymousHero is on a distinguished road
Quote:
Originally Posted by Nick View Post
IIRC that was always, as soon as you acquired it
Heh! Are you channelling debo, right now?

(I guess when you're alone against the world, the collateral damage doesn't matter that much, but when you do care who else gets hit... watch out!)

Last edited by AnonymousHero; March 26, 2016 at 05:04.
AnonymousHero is offline   Reply With Quote
Old March 26, 2016, 04:59   #25
AnonymousHero
Veteran
 
AnonymousHero's Avatar
 
Join Date: Jun 2007
Posts: 1,367
AnonymousHero is on a distinguished road
Quote:
Originally Posted by takkaria View Post
Thanks for this, I had a vague intuition it would be a bad idea but it's nice to have it laid out. The thing is that moving the mp3s into a submodule won't remove them when cloning the repository, which means that removing them is sort of pointless. They might as well stay there with a separate download of oggs for Linux users.
Right, but removing/moving them in the main repo would mean that you could clone with "--max-depth=1" and avoid downloading them. (Plus, it would mean that further revisions don't take up space in the main repository.)

I would think that's enough motiation to do it even if it would mean that people actively working on sound (few and far between it seems) would effectively have to download them twice[1].

[1] Well, they'd have to download all revisions pre-move (unless using --max-depth) (x1), the exact revisions that were moved (x2), and all revisions post-move (x1).

Last edited by AnonymousHero; March 26, 2016 at 05:05.
AnonymousHero is offline   Reply With Quote
Old March 26, 2016, 06:09   #26
takkaria
Veteran
 
takkaria's Avatar
 
Join Date: Apr 2007
Posts: 1,936
Donated: $40
takkaria is on a distinguished road
Quote:
Originally Posted by AnonymousHero View Post
Right, but removing/moving them in the main repo would mean that you could clone with "--max-depth=1" and avoid downloading them. (Plus, it would mean that further revisions don't take up space in the main repository.)

I would think that's enough motiation to do it even if it would mean that people actively working on sound (few and far between it seems) would effectively have to download them twice[1].

[1] Well, they'd have to download all revisions pre-move (unless using --max-depth) (x1), the exact revisions that were moved (x2), and all revisions post-move (x1).
Aaah. I didn't know about the depth setting. That makes tons more sense now, thanks.
__________________
takkaria whispers something about options. -more-
takkaria is offline   Reply With Quote
Old March 26, 2016, 06:30   #27
calris
Adept
 
Join Date: Mar 2016
Posts: 194
calris is on a distinguished road
Quote:
Originally Posted by AnonymousHero View Post
(Assuming you linked to the correct place and unless you wrote that code yourself: ) DANGER! That code does not appear to have any kind of license and that's hugely problematic.
It does have a license, and it is as permissive as they come:

Code:
/* Copyright (c) 2005 Ian Piumarta
 * 
 * All rights reserved.
 * 
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the 'Software'), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, and/or sell copies of the
 * Software, and to permit persons to whom the Software is furnished to do so,
 * provided that the above copyright notice(s) and this permission notice appear
 * in all copies of the Software and that both the above copyright notice(s) and
 * this permission notice appear in supporting documentation.
 *
 * THE SOFTWARE IS PROVIDED 'AS IS'.  USE ENTIRELY AT YOUR OWN RISK.
 */
Quote:
Secondly, "lower algorithmic complexity" does not automatically translate to "faster". (Thus, it's important to not conflate the two!). Constant factors of course also matter a lot, but even more importantly: The tradeoff between RAM latency vs. throughput on modern CPUs means that any kind of pointer-chasing is penalized to an extreme degree. (There are so-called cache-oblivious algorithms/data structures which can alleviate many of these things, but often at a cost of greatly increased implementation complexity.)

Thirdly, while you're right about the algorithmic complexity of (AVL) trees, this is not the end of the story.
Unless you have huge[0] numbers of items and/or truly need the range support of trees, you're almost always
better of with a simple flat array of as-flat-as-possible structs. For searching simply do a linear search[1]. The reason this is almost always better is simple: CPUs these days a) have huge caches, and b) do prefetching like you wouldn't believe. (Those two things mean that any kind of intensive pointer-hopping -- like that required by trees -- incurs a huge latency penalty wrt. accessing RAM.)
Yep - well aware that reducing algorithmic (Big-O) complexity isn't the be all and end all. You will notice that I did use a fairly simple array that I use realloc() to append fixed-size chunks to to map sound id's to sound data. I _could_ have used a tree or a list, but the array is by far the best option beacuse:
a) Elements are never removed until shutdown
b) Element id's increment linearly from 0 as new sounds are identified

Quote:
I believe the scenario you're describing here is: a) data is only loaded/inserted once, and b) you're looking at <= 10000 entries (so "not huge"), and c) you don't need huge amounts of in-line data for each entry.

I would recommend something like the following:

Code:
   struct entry {
       uint32_t key_hash;   // Some simple hash of the string in "key"
       const char *key;   
       // ...
       uint8_t *sound_data;    // I'm assuming loaded sound data would go here
   }
Load all your data into a simple array of "struct entry". When you need to find something simply hash your lookup key, do a linear search for a matching "key_hash", making sure the real key actually matches (otherwise, skip to next).

I'm basing the use of "key_hash" on the assumption that you want to look things up by name and the fact that you'd need to do pointer-chasing to compare anything against "key" (plus strcmp is linear-time, so you want to avoid it in an inner loop). If that's not the case and you have some simple constant-size key, just use the key directly.
The issues are:
  • Sounds are defined by a string name that we have to keep referring to while we are processing the sounds preferences. Every time we process a sound entry in the pref file, we need to match it to an id (or create a new id for the sound). The sound id is what is used to store which sounds are used for given messages (i.e. we do not store the 'name' of the sound, otherwise we would end up with a ton of allocated strings to hold the names.
  • We don't come across sound names in the pref files in alphabetical order, so if we store them in an array, we will be doing a lot of linear searches using strcmp()

Quote:
Since the sound data is huge, it doesn't matter that we access that through a pointer indirection. (Heck, you could probably load it from file every time you need it.)
The sound data itself is stored in a simple array and referenced by 'sound id'. Remember, the 'sounds tree' maps each sound's textual name to it's associated id. It is only hit when we process a pref file. Remember, we can process a pref file at ANY time during game play - it's not a one-off thing. We need the sound name/id map to stick around in case with mess with the message/sound map.

Quote:
Fourthly, Most non-trivial tree algorithms are much more complicated implementation-wise than a simple linear search. Unless you have a good test harness (e.g. QuickCheck or something like that), you're likely to have bugs in such an implementation.
Hence why I 'stole' the code

Quote:
[0] Obviously what's "huge" varies and you'd really need benchmarks to pinpoint the exact number in this exact case. Plus it may depend on cache sizes, etc. However, a good ball-park estimate might be anything from 10000-1000000 depending on the size of the "struct" you need to store.
It's more than just how much data you're storing - it's also about how often you need to perform operations on the data (find, insert, delete)

Quote:
[1] Even sorting the array and doing a binary search would likely be slower than linear search simply due to the pointer-chasing that this incurs.
Don't forget that we are dealing with arbitrary, randomly ordered, variable length, string keys here. So if we use a binary search on an array of struct containing the key/data pair, the keys are going to be splattered all over memory anyway because of the extra level of indirection need for the char *sound_name

Quote:
EDIT: I should say... none of this probably matters very much on modern machines for the number of elements we're talking about (and because lookups would be so rare, comparatively speaking), so I'd say the most important thing here is: Do the simplest thing possible (i.e. avoid bugs)... which coincidentally is also an array .
The message<->sound mapping used to be in an array of size [MAX_MESSAGES][16] (i.e. a maximum of 16 randomised sounds per message). This is kind of OK if we consider that MOST messages had a sound allocated to it. It still hasted a lot of memory, but it wasn't THAT bad considering the reduced code complexity.

But now Nick is thinking forward to the possibility of moving message strings into a data file - the number of messages if going to grow to about 350 or so. That's going to be something like 20k of MOSTLY unused memory for the message/sound map.

NOTE: I have a background in embedded development - for me, every byte counts. I'm more than happy to ditch the message<->sound tree in favour of a fixed 20kB array, but I'll need to be convinced on the sound name<->sound id map.
calris is offline   Reply With Quote
Old March 26, 2016, 09:43   #28
AnonymousHero
Veteran
 
AnonymousHero's Avatar
 
Join Date: Jun 2007
Posts: 1,367
AnonymousHero is on a distinguished road
(Oh, great -- forum doesn't nest quotes properly when replying. Well, I hope get this right...)

Quote:
Originally Posted by calris View Post
It does have a license, and it is as permissive as they come
Yup, noted in an edit of my original. I guess you might have been replying while I was editing .

Is that wording standard MIT license? The important bit here is that it's actually GPL-compatible and known to be.

Quote:
Originally Posted by calris View Post
The issues are:
  • Sounds are defined by a string name that we have to keep referring to while we are processing the sounds preferences. Every time we process a sound entry in the pref file, we need to match it to an id (or create a new id for the sound). The sound id is what is used to store which sounds are used for given messages (i.e. we do not store the 'name' of the sound, otherwise we would end up with a ton of allocated strings to hold the names.
  • We don't come across sound names in the pref files in alphabetical order, so if we store them in an array, we will be doing a lot of linear searches using strcmp()
Use a hash-of-the-string for a quick rejection check when searching -- that's constant time and you don't incur the pointer indirection forced by strcmp() unless you need to. (Which I noted in my message.)

Quote:
Originally Posted by calris View Post
Remember, we can process a pref file at ANY time during game play - it's not a one-off thing. We need the sound name/id map to stick around in case with mess with the message/sound map.
So you need to possibly add entries... so you might need to resize the array, but other than that I don't really how this affects the use of an array. (See below.)

Quote:
Originally Posted by calris View Post
Hence why I 'stole' the code
Indeed. (And that's a good, thing, don't get me wrong.)

Quote:
Originally Posted by calris View Post
It's more than just how much data you're storing - it's also about how often you need to perform operations on the data (find, insert, delete)
Indeed, but my point was that all of those operations are faster on an array than on a tree -- assuming a non-huge number of elements. ('Insert' needs to resize sometimes, but that can be amortized by using exponentially increasing increments, and 'delete' just needs to mark elements as deleted rather than moving elements around. Obviously 'find' and 'insert' need to account for that marker.)

Aside: If you want a tree which is more cache friendly, etc. you'll probably want something more like a B+ tree -- or a cache-oblivious tree. (Yes, B+ trees were originally for use with disks, but these days you CPU cache hierarchy has very similar properties to disk I/O. The latencies are obviously lower overall, but there's an order-of-magnitude thing going on, and prefetching means that the larger nodes are basically free.)

Quote:
Originally Posted by calris View Post
Don't forget that we are dealing with arbitrary, randomly ordered, variable length, string keys here.
Doesn't matter -- if you use a hash of the key in each entry for a quick rejection check, as suggested.

Quote:
Originally Posted by calris View Post
So if we use a binary search on an array of struct containing the key/data pair, the keys are going to be splattered all over memory anyway because of the extra level of indirection need for the char *sound_name
You shouldn't use a binary search on the array! (Which I believe I also mentioned, but it may have been in an edit.)

The point is to avoid jumping around in memory and a binary search means that prefetching goes out the window. (There are also some surprising cache-related subtleties around binary search, btw.)

The throughput of the CPU-RAM (given prefetching) is such that a linear search is faster. (Again, assuming non-huge numbers of items.)

Quote:
Originally Posted by calris View Post

But now Nick is thinking forward to the possibility of moving message strings into a data file - the number of messages if going to grow to about 350 or so. That's going to be something like 20k of MOSTLY unused memory for the message/sound map.

NOTE: I have a background in embedded development - for me, every byte counts. I'm more than happy to ditch the message<->sound tree in favour of a fixed 20kB array, but I'll need to be convinced on the sound name<->sound id map.
Your phone probably has >1G. End of story regarding memory usage, AFAICS .

(I don't mean to be disparaging. You'd definitely be right if we were talking "real"[1] embedded, but we aren't.)

[1] As opposed to the "faux" embedded I used to do where "embedded" really just meant a Linux running on a machine with about a third of what a regular PC of the time had... and stuffing said (small form-factor) machine into some enclosure along with a few other bits of peripheral hardware.
AnonymousHero is offline   Reply With Quote
Old March 26, 2016, 10:02   #29
AnonymousHero
Veteran
 
AnonymousHero's Avatar
 
Join Date: Jun 2007
Posts: 1,367
AnonymousHero is on a distinguished road
Since you mentioned it: On the subject of realloc(), I think mem_realloc() is buggy. I you see L76: This seems incorrect in that the addition here happens before the NULL check... thus ensuring that the "NULL" case can never be triggered. (It's usually not a problem on Linux due to overcommit, but malloc() can sometimes return NULL even if overcommit is allowed.)

It also seems to be attempting some sort of "poisoning"/size tracking to protect against corruption, but this is empatically not what one should be doing these days. Just use UBSAN and ASAN. They do this fully automatically and work for all memory accesses, including regular array indexing.

EDIT: Fixed URL

Last edited by AnonymousHero; March 26, 2016 at 10:26.
AnonymousHero is offline   Reply With Quote
Old March 26, 2016, 10:04   #30
calris
Adept
 
Join Date: Mar 2016
Posts: 194
calris is on a distinguished road
Quote:
Originally Posted by AnonymousHero View Post
Use a hash-of-the-string for a quick rejection check when searching -- that's constant time and you don't incur the pointer indirection forced by strcmp() unless you need to. (Which I noted in my message.)
Ah, now I get it - Hash the string into, say, a u32 and just shove it into an array (which we can realloc be some fixed chunk size) and forget about sorting, just do a linear search. So all we need is either
  • A hash algorithm which guarantees no duplicate keys for non-equal strings, or
  • Some way to manage potential duplicate keys.

I'm open to any ideas you have to offer.

I can immediately replace the message<->sound id tree with a fixed size array - It makes my skin crawl to waste the memory, but performance is king here
calris is offline   Reply With Quote
Reply


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Reworking the entire magic system (yay)! TricksterWolf Development 8 September 20, 2015 21:22
Mk build system? CJNyfalt Development 11 March 25, 2015 08:25
Resist system Jungle_Boy Development 65 August 30, 2011 04:10
Combat System Sirridan Development 9 July 14, 2009 08:11
RFC: Middle Earth map for Un in ASCII; is it readable? Bandobras Variants 13 November 25, 2007 04:15


All times are GMT +1. The time now is 22:36.


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2019, vBulletin Solutions Inc.