Word-boundary matching only works as intended in English and languages
that use similar word-breaking characters; it doesn't work so well in
(say) Japanese, Chinese, or Thai. It's unacceptable to have a feature
that doesn't work as intended for some languages. (Moreso especially
considering that it's likely that the largest contingent on the Mastodon
bit of the fediverse speaks Japanese.)
There are rules specified in Unicode TR29[1] for word-breaking across
all languages supported by Unicode, but the rules deliberately do not
cover all cases. In fact, TR29 states
For example, reliable detection of word boundaries in languages such
as Thai, Lao, Chinese, or Japanese requires the use of dictionary
lookup, analogous to English hyphenation.
So we aren't going to be able to make word detection work with regexes
within Mastodon (or glitchsoc). However, for a first pass (even if it's
kind of punting) we can allow the user to choose whether they want word
or substring detection and warn about the limitations of this
implementation in, say, docs.
[1]: https://unicode.org/reports/tr29/https://web.archive.org/web/20171001005125/https://unicode.org/reports/tr29/
This should eventually be accessible via the API and the web frontend,
but I find it easier to set up an editing interface using Rails
templates and the like. We can always take it out if it turns out we
don't need it.
The intent of the previous concatenation was to minimize object
allocations, which can end up being a slow killer. However, it turns
out that under MRI 2.4.x, the shove-strings-in-an-array-and-join method
is not only arguably more common but (in this particular case) actually
allocates *fewer* objects than the string concatenation.
Or, at least, that's what I gather by running this:
words = %w(palmettoes nudged hibernation bullish stockade's tightened Hades
Dixie's formalize superego's commissaries Zappa's viceroy's apothecaries
tablespoonful's barons Chennai tollgate ticked expands)
a = Account.first
KeywordMute.transaction do
words.each { |w| KeywordMute.create!(keyword: w, account: a) }
GC.start
s1 = GC.stat
re = String.new.tap do |str|
scoped = KeywordMute.where(account: a)
keywords = scoped.select(:id, :keyword)
count = scoped.count
keywords.find_each.with_index do |kw, index|
str << Regexp.escape(kw.keyword.strip)
str << '|' if index < count - 1
end
end
s2 = GC.stat
puts s1.inspect, s2.inspect
raise ActiveRecord::Rollback
end
vs this:
words = %w( palmettoes nudged hibernation bullish stockade's tightened Hades Dixie's
formalize superego's commissaries Zappa's viceroy's apothecaries tablespoonful's
barons Chennai tollgate ticked expands
)
a = Account.first
KeywordMute.transaction do
words.each { |w| KeywordMute.create!(keyword: w, account: a) }
GC.start
s1 = GC.stat
re = [].tap do |arr|
KeywordMute.where(account: a).select(:keyword, :id).find_each do |m|
arr << Regexp.escape(m.keyword.strip)
end
end.join('|')
s2 = GC.stat
puts s1.inspect, s2.inspect
raise ActiveRecord::Rollback
end
Using rails r, here is a comparison of the total_allocated_objects and
malloc_increase_bytes GC stat data:
total_allocated_objects malloc_increase_bytes
string concat 3200241 -> 3201428 (+1187) 1176 -> 45216 (44040)
array join 3200380 -> 3201299 (+919) 1176 -> 36448 (35272)
It would also have been valid to get rid of the attr_reader, but I like
being able to reach inside KeywordMute::Matcher without resorting to
instance_variable_get tomfoolery.
A matcher object that builds a match from KeywordMute data and runs it
over text is, in my view, one of the easier ways to write examples for
this sort of thing.
Gist of the proposed keyword mute implementation:
Keyword mutes are represented server-side as one keyword per record.
For each account, there exists a keyword regex that is generated as one
big alternation of all keywords. This regex is cached (in Redis, I
guess) so we can quickly get it when filtering in FeedManager.
On desktop, the compose text box grows to accommodate the content. On
mobile, the text box does not grow to accommodate text context, but does
grow to accommodate images. It is possible in both cases to overflow
the available area, which makes accessing other UI elements (e.g.
visibility setttings) difficult.
This commit makes the compose area optionally scrollable, which allows
those UI elements to remain available even if they go off-screen.
Commit 6e54719474 moved the Mastodon
variables and mixins deeper in the directory hierarchy; this commit
brings the glitch components in line with that change.
* Swedish file added
* Swedish file added
* Swedish file updated
* Swedish languagefile added
* Add Swedish translation
* Add Swedish translation
* Started the Swedish translation
* Added Swedish lang settings
* Updating Swedish language
* Updating Swedish language
* Updating Swedish language
* Updating Swedish language
* Updating Swedish language
* Updating Swedish language
* Swedish language completed and added
* Swedish language Simple_form added
* Swedish language Divise added
* Swedish language doorkeeper added
* Swedish language - now all file complete
* Swedish - Typos and supplementation in sentence structure
* Update simple_form.sv.yml
* Update sv.yml
* Update sv.yml
Rearranged the alphabetical order.
Specifically, this fixes status length calculation to be same as JS side.
BTW, since this pattern used in not only preview card fetching, we
should extract it (with twitter-regex?) and write tests I think.
* Clean up reblog-tracking sets from FeedManager
Builds on #5419, with a few minor optimizations and cleanup of sets
after they are no longer needed.
* Update tests, fix multiply-reblogged case
Previously, we would have lost the fact that a given status was
reblogged if the displayed reblog of it was removed, now we don't.
Also added tests to make sure FeedManager#trim cleans up our reblog
tracking keys, fixed up FeedCleanupScheduler to use the right loop,
and fixed the test for it.
* Swedish file added
* Swedish file added
* Swedish file updated
* Swedish languagefile added
* Add Swedish translation
* Add Swedish translation
* Started the Swedish translation
* Added Swedish lang settings
* Updating Swedish language
* Updating Swedish language
* Updating Swedish language
* Updating Swedish language
* Updating Swedish language
* Updating Swedish language
* Swedish language completed and added
* Swedish language Simple_form added
* Swedish language Divise added
* Swedish language doorkeeper added
* Swedish language - now all file complete
* Keep references to all reblogs of a status on home feed
When inserting reblog: Add to set of reblogs of this status on
the feed, if original status was present in the feed, add it to
that set as well.
When removing a reblog: Remove it from that set. Take random
remaining item from the set. If one exists, re-insert it into feed,
otherwise do not re-insert anything.
Fix#4210
* When original is removed, toss out reblog references
Fix#5398
Ordering the home timeline query by account_id meant that the first
100 items belonged to a single account. There was also no reason to
reverse-iterate over the statuses. Assuming the user accesses the
feed halfway-through, it's better to have recent statuses already
available at the top. Therefore working from newer->older is ideal.
If the algorithm ends up filtering all items out during last-mile
filtering, repeat again a page further. The algorithm terminates
when either at least one item has been added, or if the database
query returns nothing (end of data reached)
We've changed un-reblogging behavior when we implement Snowflake, to insert un-reblogged status at the position reblogging status existed.
However, our API expects home timeline is ordered by status ids, and max_id/since_id filters by zset score. Due to this, un-reblogged status appears as a last item of result set, and timeline expansion may skips many statuses.
So this reverts that change...reblogged status inserted at corresponding position to its id.